Orc Lexical Specifications#

This page specifies Orc's processing of an input byte stream into an Orc lexical token sequence. This token sequence is the input to the Orc parsing procedure.

Input Byte Stream#

Orc can read source code input byte streams from a number of types of sources. For example, Orc 2.0 running on Java SE 6 accepts input from local files, FTP, Gopher, HTTP, and JAR files.

Orc source code input byte streams must encode a Unicode character sequence using the UTF-8 encoding form. No other encoding is supported. HTTP headers specifying other charsets are ignored.
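
As an illustration of the UTF-8-only rule, the following Python sketch decodes a byte stream strictly and rejects anything that is not well-formed UTF-8. The function name `decode_orc_source` is hypothetical and is not part of Orc; this is not the Orc implementation itself.

```python
def decode_orc_source(raw: bytes) -> str:
    """Decode source bytes as UTF-8; any externally declared charset is ignored."""
    # errors="strict" raises UnicodeDecodeError on malformed UTF-8
    # instead of silently substituting replacement characters.
    return raw.decode("utf-8", errors="strict")


if __name__ == "__main__":
    print(decode_orc_source("def 編排() = signal".encode("utf-8")))
```

Input encoded in another form, such as Latin-1 bytes for a non-ASCII character, typically fails the strict decode rather than being misread.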

Orc Lexical Tokens#

Orc lexical scanning reads Unicode characters and emits corresponding Orc lexical tokens. Orc uses seven lexical token types: identifier, keyword, operator, delimiter, integer literal, floating-point literal, and string literal. Orc comments and whitespace are scanned as separators between other tokens and are then discarded.

Identifier#

Orc scans identifiers per Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax, namely:

  • Identifiers start with "Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl)".
  • Identifiers can continue with "All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc)", plus Orc's addition of apostrophe as a "prime" mark.
  • All identifiers are normalized to Unicode Normalization Form C as they are parsed.
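
The three rules above can be sketched in Python using the standard `unicodedata` module. This is a minimal sketch, not Orc's scanner: the names `START`, `CONTINUE`, and `scan_identifier` are hypothetical, the "prime" mark is assumed to be the ASCII apostrophe U+0027, and the keyword and parenthesized-operator cases are not handled.

```python
import unicodedata

# General_Category values quoted in the rules above.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def scan_identifier(text, pos=0):
    """Return (identifier, end_index), or None if no identifier starts at pos."""
    if pos >= len(text) or unicodedata.category(text[pos]) not in START:
        return None
    end = pos + 1
    # Continue over identifier characters plus the apostrophe "prime" mark
    # (assumed here to be U+0027).
    while end < len(text) and (
        text[end] == "'" or unicodedata.category(text[end]) in CONTINUE
    ):
        end += 1
    # Normalize the scanned name to Normalization Form C.
    return unicodedata.normalize("NFC", text[pos:end]), end


if __name__ == "__main__":
    print(scan_identifier("x' + 1"))
    print(scan_identifier("_foo"))  # underscore is Pc, so it cannot start a name
```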

Examples of allowed Orc identifiers: orchestrate, iscenesætte, ενορχηστρώνω, الانسجام, 編排, 練り上げる, 관현악으로_편곡하다, оркестровать, आर्केस्ट्रा_करना. (These are all translations of "orchestrate" produced by Google Translate, which is often comically wrong.)

Mathematical letter-like characters are also allowed, such as ℤ, ℏ, ℵ0, and, of course, Greek letters.

Orc also treats an operator (defined below) placed in parentheses, such as (+), as an identifier with the name of the operator (without the parentheses).

An identifier cannot match a keyword; see the following section.

Keyword#

Any token that otherwise follows the rules for identifiers but matches an entry in the following list is treated as a keyword, not as an identifier.

true false signal stop null lambda if then else as _ val def type site class include Top Bot

Note that _ is a special case: Identifiers cannot start with an underscore, but _ scans as a keyword nonetheless.
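
A minimal Python sketch of the keyword check follows. The names `KEYWORDS` and `classify_word` are hypothetical, and the sketch assumes the match against the list is exact and case-sensitive, which the list's capitalized entries (Top, Bot) suggest but the text does not state outright.

```python
# The keyword list from this section, split on whitespace.
KEYWORDS = frozenset(
    "true false signal stop null lambda if then else as _ "
    "val def type site class include Top Bot".split()
)


def classify_word(word):
    """Classify an identifier-shaped word (or the lone underscore)."""
    # Assumption: exact, case-sensitive comparison against the list.
    return "keyword" if word in KEYWORDS else "identifier"


if __name__ == "__main__":
    for word in ("lambda", "_", "orchestrate", "Top"):
        print(word, classify_word(word))
```

Because the underscore cannot start an identifier, a scanner would need to recognize a lone _ separately before applying this check.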

Operator#

An Orc operator is a character sequence that matches one of the following. (This match is greedy: for example, ** is matched as a single operator in preference to two * operators, where possible.)

+ - * / % ** && || ~ < > = <: :> <= >= /= : . ? :=
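
The greedy rule can be sketched as a longest-match lookup. This Python sketch considers only the operator list above; the names `OPERATORS` and `match_operator` are hypothetical, and a complete scanner would also have to weigh delimiters (such as ::) and comment openers (such as --) at the same position.

```python
# Operators from the list above.
OPERATORS = [
    "+", "-", "*", "/", "%", "**", "&&", "||", "~", "<", ">", "=",
    "<:", ":>", "<=", ">=", "/=", ":", ".", "?", ":=",
]

# Trying longer operators first makes the match greedy.
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def match_operator(text, pos=0):
    """Return the longest operator starting at pos, or None."""
    for op in _BY_LENGTH:
        if text.startswith(op, pos):
            return op
    return None


if __name__ == "__main__":
    print(match_operator("**2"))   # **
    print(match_operator("* 2"))   # *
```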

Delimiter#

An Orc delimiter is a character sequence that matches one of the following.

( ) {. .} , | ; :: :!:

Numeric Literal#

Numeric literals are matched per the following regular expression: ([0-9]+)([.][0-9]+)?([Ee][+-]?([0-9]+))?

If the matched string contains a decimal point, an e, or an E, it is a floating-point literal; otherwise it is an integer literal.

All Orc numeric literals are decimal (radix 10).
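
The regular expression and the integer/floating-point distinction can be exercised with Python's `re` module. The names `NUMERIC` and `scan_number` are hypothetical; the pattern itself is copied from the text above.

```python
import re

# Pattern copied from the specification text.
NUMERIC = re.compile(r"([0-9]+)([.][0-9]+)?([Ee][+-]?([0-9]+))?")


def scan_number(text, pos=0):
    """Return (kind, lexeme) for a numeric literal at pos, or None."""
    m = NUMERIC.match(text, pos)
    if m is None:
        return None
    lexeme = m.group(0)
    # A decimal point, e, or E marks a floating-point literal.
    kind = "float" if any(c in lexeme for c in ".eE") else "integer"
    return kind, lexeme


if __name__ == "__main__":
    for src in ("42", "3.14", "1e10", "6.02E+23", "1."):
        print(src, scan_number(src))
```

Note that under this pattern a trailing point without digits, as in `1.`, is not part of the literal: only `1` is matched, as an integer literal.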

String Literal#

Double quotes demarcate a string literal. The (possibly empty) sequence of characters in a string literal may be composed of any character except an unescaped double quote or an unescaped newline character (as defined below).

The backslash escape convention is recognized as follows:

  • \f U+000C FF Form Feed
  • \n U+000A LF Line Feed
  • \r U+000D CR Carriage Return
  • \t U+0009 HT Character Tabulation
  • Other escaped characters (such as \\ or \" ) are treated as themselves, without the backslash.
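
A minimal Python sketch of string-literal scanning under these rules follows. The names `ESCAPES`, `NEWLINES`, and `scan_string` are hypothetical, and raising `ValueError` for malformed literals is an assumption of this sketch; the specification does not say how Orc reports such errors.

```python
# The four named escapes; any other escaped character stands for itself.
ESCAPES = {"f": "\f", "n": "\n", "r": "\r", "t": "\t"}
# The six newline characters listed in the Newlines section.
NEWLINES = "\r\n\x85\u2028\x0c\u2029"


def scan_string(text, pos=0):
    """Return (value, end_index) for a string literal at pos, or None."""
    if pos >= len(text) or text[pos] != '"':
        return None
    out = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            return "".join(out), i + 1
        if ch in NEWLINES:
            raise ValueError("unescaped newline in string literal")
        if ch == "\\":
            i += 1
            if i >= len(text):
                break
            out.append(ESCAPES.get(text[i], text[i]))
        else:
            out.append(ch)
        i += 1
    raise ValueError("unterminated string literal")


if __name__ == "__main__":
    print(scan_string(r'"say \"hi\"" + x'))
```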

Comment#

Orc comments take two forms:

  1. -- to end of the line (see newlines, below)
  2. {- multi-line comment body -}

A multi-line comment body is any character sequence (possibly empty) in which {- and -} have lexical significance. Multi-line comments can be nested: within a body, {- starts a nested multi-line comment, and -} ends the current multi-line comment.
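
Nesting can be handled with a depth counter, as in the following Python sketch. The function name `skip_multiline_comment` is hypothetical, and raising `ValueError` on an unterminated comment is an assumption of this sketch rather than documented Orc behavior.

```python
def skip_multiline_comment(text, pos=0):
    """Return the index just past the -} that closes the comment opened at pos.

    Returns None if text does not start with {- at pos.
    """
    if not text.startswith("{-", pos):
        return None
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith("{-", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif text.startswith("-}", i):
            depth -= 1          # leaving the current comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")


if __name__ == "__main__":
    src = "{- outer {- inner -} still outer -} x"
    print(repr(src[skip_multiline_comment(src):]))
```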

Whitespace#

Per Unicode Standard Annex #31, section 4, recommendation R3, Orc treats all Unicode Pattern_White_Space characters as whitespace.

These are:

  • U+0009 HT Character Tabulation
  • U+000B VT Line Tabulation
  • U+0020 Space
  • U+200E Left-to-Right Mark
  • U+200F Right-to-Left Mark
  • And the six newline characters (below)

Newlines#

Orc follows the newline definition of the Unicode Standard, section 5.8, Newline Guidelines, recommendation R4.

Namely, Orc stops reading a line when it encounters one of the following characters:

  • U+000D CR Carriage Return
  • U+000A LF Line Feed
  • U+0085 NEL Next Line
  • U+2028 LS Line Separator
  • U+000C FF Form Feed
  • U+2029 PS Paragraph Separator
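
The whitespace and newline sets above can be written out directly. In this Python sketch the names `NEWLINES`, `WHITESPACE`, `skip_whitespace`, and `end_of_line` are hypothetical. Each listed character is treated individually as ending a line; whether Orc treats a CR LF pair as a single line end is not stated here, so the sketch makes no attempt to model it.

```python
# The six newline characters: CR, LF, NEL, LS, FF, PS.
NEWLINES = frozenset("\r\n\x85\u2028\x0c\u2029")
# Pattern_White_Space: HT, VT, Space, LRM, RLM, plus the newlines.
WHITESPACE = NEWLINES | frozenset("\t\x0b \u200e\u200f")


def skip_whitespace(text, pos=0):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos


def end_of_line(text, pos=0):
    """Return the index of the next newline character, or len(text)."""
    while pos < len(text) and text[pos] not in NEWLINES:
        pos += 1
    return pos


if __name__ == "__main__":
    src = "-- a comment\u2028x = 1"
    print(repr(src[end_of_line(src):]))
```

A `--` comment, for example, would extend from its start to the index returned by `end_of_line`.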

This page (revision-11) was last changed on 04-Dec-2010 09:44 by JohnThywissen