Syntax files and text parsing

The idea of parsing is to process some input text using the rules defined in a syntax file (which are discussed in detail in the TextMate documentation). The text is parsed line by line, and events are sent to a processor according to how the text matches the syntax file.

Image: Overview of the parsing process.

For the impatient

Parsing a file using Textpow is as easy as 1-2-3!

  1. Load the Syntax File:
    require 'textpow'
    syntax = Textpow::SyntaxNode.load("ruby.tmSyntax")
    
  2. Initialize a processor:
    processor = Textpow::DebugProcessor.new
    
  3. Parse some text:
    syntax.parse(text, processor)
    

The gory details

Syntax files

At the heart of syntax parsing are, well, syntax files. Consider, for instance, the example syntax that appears in TextMate's documentation:

    {  scopeName = 'source.untitled';
       fileTypes = ( txt );
       foldingStartMarker = '\{\s*$';
       foldingStopMarker = '^\s*\}';
       patterns = (
          {  name = 'keyword.control.untitled';
             match = '\b(if|while|for|return)\b';
          },
          {  name = 'string.quoted.double.untitled';
             begin = '"';
             end = '"';
             patterns = (
                {  name = 'constant.character.escape.untitled';
                   match = '\\.';
                }
             );
          },
       );
    }

However, Textpow is not able to parse property lists (plists) in this text format. In practice this is not a problem, since both text and binary plists can be converted to an XML format. Indeed, all the syntaxes in the TextMate syntax repository are in XML format:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
       <key>scopeName</key>
       <string>source.untitled</string>
       <key>fileTypes</key>
       <array>
          <string>txt</string>
       </array>
       <key>foldingStartMarker</key>
       <string>\{\s*$</string>
       <key>foldingStopMarker</key>
       <string>^\s*\}</string>
       <key>patterns</key>
       <array>
          <dict>
             <key>name</key>
             <string>keyword.control.untitled</string>
             <key>match</key>
             <string>\b(if|while|for|return)\b</string>
          </dict>
          <dict>
             <key>name</key>
             <string>string.quoted.double.untitled</string>
             <key>begin</key>
             <string>"</string>
             <key>end</key>
             <string>"</string>
             <key>patterns</key>
             <array>
                <dict>
                   <key>name</key>
                   <string>constant.character.escape.untitled</string>
                   <key>match</key>
                   <string>\\.</string>
                </dict>
             </array>
          </dict>
       </array>
    </dict>
    </plist>

Of course, most people find XML both ugly and cumbersome. Fortunately, it is also possible to store syntax files in YAML format, which is much easier to read:

    ---
    fileTypes:
    - txt
    scopeName: source.untitled
    foldingStartMarker: \{\s*$
    foldingStopMarker: ^\s*\}
    patterns:
    - name: keyword.control.untitled
      match: \b(if|while|for|return)\b
    - name: string.quoted.double.untitled
      begin: '"'
      end: '"'
      patterns:
      - name: constant.character.escape.untitled
        match: \\.
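
To see what this YAML looks like once it is loaded, the sketch below reads it with Ruby's standard YAML library only; it does not use Textpow. The names SYNTAX_SOURCE, SYNTAX, KEYWORD and rule_names are made up for this illustration, and how Textpow itself loads such a file is not shown here.

```ruby
require 'yaml'

# The YAML syntax from the text. The heredoc is non-interpolating ('...'),
# so the backslashes reach the YAML parser unchanged.
SYNTAX_SOURCE = <<~'SYNTAX_END'
  ---
  fileTypes:
  - txt
  scopeName: source.untitled
  foldingStartMarker: \{\s*$
  foldingStopMarker: ^\s*\}
  patterns:
  - name: keyword.control.untitled
    match: \b(if|while|for|return)\b
  - name: string.quoted.double.untitled
    begin: '"'
    end: '"'
    patterns:
    - name: constant.character.escape.untitled
      match: \\.
SYNTAX_END

# The result is an ordinary Hash of Strings and Arrays.
SYNTAX = YAML.safe_load(SYNTAX_SOURCE)

# Hypothetical helper: walk the nested "patterns" arrays and collect
# every rule name together with its nesting depth.
def rule_names(node, depth = 0)
  (node['patterns'] || []).flat_map do |rule|
    [[depth, rule['name']]] + rule_names(rule, depth + 1)
  end
end

rule_names(SYNTAX).each { |depth, name| puts "#{'  ' * depth}#{name}" }

# Pattern values are plain regular-expression source, so they can be compiled.
KEYWORD = Regexp.new(SYNTAX['patterns'][0]['match'])
puts KEYWORD.match?('return x')
```

The escape rule appears one level deeper than the other two because it only applies between the begin and end patterns of the string rule.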

Processors

Until now we have talked about the parsing process without explaining exactly what it is. Basically, parsing consists of reading text from a string or file and applying tags to parts of the text according to what has been specified in the syntax file.
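
As a rough illustration of what "applying tags to parts of the text" means, the sketch below tags one line at a time using only the keyword rule from the example syntax. It is a deliberately simplified stand-in written in plain Ruby, not Textpow's implementation: RULES and tag_line are hypothetical names, and begin/end rules such as the string rule are left out.

```ruby
# Hypothetical rule table: tag name => regular expression.
# Only simple "match" rules are handled in this sketch.
RULES = {
  'keyword.control.untitled' => /\b(if|while|for|return)\b/
}.freeze

# Returns [name, start, stop] triples for every rule match in one line.
# Positions are character offsets relative to the start of the line.
def tag_line(line)
  tags = []
  RULES.each do |name, regex|
    line.scan(regex) do
      match = Regexp.last_match
      tags << [name, match.begin(0), match.end(0)]
    end
  end
  tags.sort_by { |_name, start, _stop| start }
end

text = "if x\nfor a; return b\n"
text.each_line.with_index do |line, number|
  tag_line(line.chomp).each do |name, start, stop|
    puts "line #{number}: #{name} from #{start} to #{stop}"
  end
end
```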

In Textpow, parsing proceeds line by line from the beginning of the text to the end, and from left to right within each line. As the text is parsed, events are sent to a processor object whenever a tag is opened or closed, a new line is processed, or parsing starts or ends. A processor is any object that implements one or more of the following methods:

    class Processor
       # A tag was opened at the given position in the current line.
       def open_tag(name, position)
       end

       # A tag was closed at the given position in the current line.
       def close_tag(name, position)
       end

       # A new line of input is about to be processed.
       def new_line(line)
       end

       # Parsing has started; name is the syntax's scope name.
       def start_parsing(name)
       end

       # Parsing has finished; name is the syntax's scope name.
       def end_parsing(name)
       end
    end

  • open_tag. Called when a new tag is opened. It receives the tag's name and its position (relative to the current line).
  • close_tag. The same as open_tag, but called when a tag is closed.
  • new_line. Called every time a new line is processed. It receives the line's contents.
  • start_parsing. Called once at the beginning of the parsing process. It receives the scope name of the syntax being used.
  • end_parsing. Called once after all the input text has been parsed. It receives the scope name of the syntax being used.
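
A minimal sketch of a processor that implements all five methods by recording each event it receives. RecordingProcessor is a hypothetical name, and the calls at the bottom are written by hand to imitate what a parser might send for the line `return "a"`; the exact events and positions Textpow would produce for this input are an assumption.

```ruby
# Hypothetical processor that stores every event as an array.
class RecordingProcessor
  attr_reader :events

  def initialize
    @events = []
  end

  def start_parsing(name)
    @events << [:start_parsing, name]
  end

  def end_parsing(name)
    @events << [:end_parsing, name]
  end

  def new_line(line)
    @events << [:new_line, line]
  end

  def open_tag(name, position)
    @events << [:open_tag, name, position]
  end

  def close_tag(name, position)
    @events << [:close_tag, name, position]
  end
end

# Hand-simulated event stream (assumed, not produced by Textpow).
processor = RecordingProcessor.new
processor.start_parsing('source.untitled')
processor.new_line('return "a"')
processor.open_tag('keyword.control.untitled', 0)
processor.close_tag('keyword.control.untitled', 6)
processor.open_tag('string.quoted.double.untitled', 7)
processor.close_tag('string.quoted.double.untitled', 10)
processor.end_parsing('source.untitled')

processor.events.each { |event| puts event.inspect }
```

Since a processor only needs to respond to these method names, an object like this could be passed wherever the quick-start example passes a DebugProcessor.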

Textpow ensures that the methods are called in parsing order. For example, if there are two consecutive calls to open_tag, the first with name="text.string", position=10 and the second with name="markup.string", position=10, the "markup.string" tag should be understood to be inside the "text.string" tag.
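
Because calls arrive in parsing order, a processor can recover nesting with a simple stack. The sketch below is one way to do that. NestingProcessor is a hypothetical name, the closing position 15 is invented for the illustration, and it is assumed that inner tags are closed before outer ones.

```ruby
# Hypothetical processor that tracks currently open tags on a stack and
# logs each event indented by its nesting depth.
class NestingProcessor
  attr_reader :log

  def initialize
    @stack = []
    @log = []
  end

  def depth
    @stack.size
  end

  def open_tag(name, position)
    @stack.push(name)
    @log << "#{'  ' * (depth - 1)}open #{name} at #{position}"
  end

  def close_tag(name, position)
    @log << "#{'  ' * (depth - 1)}close #{name} at #{position}"
    @stack.pop
  end
end

# Hand-simulated calls mirroring the example in the text.
processor = NestingProcessor.new
processor.open_tag('text.string', 10)
processor.open_tag('markup.string', 10)
processor.close_tag('markup.string', 15)
processor.close_tag('text.string', 15)

puts processor.log
```

The second open_tag is logged one level deeper than the first, even though both carry position 10, because it arrived while "text.string" was still open.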