10.2. Lexical Specifications

This page specifies Orc's processing of an input byte stream into an Orc lexical token sequence. This token sequence is the input to the Orc parsing procedure.

10.2.1. Input Byte Stream

Orc can read source code input byte streams from a number of types of sources. For example, Orc 2.0 running on Java SE 6 accepts input from local files, FTP, Gopher, HTTP, and JAR files. Orc source code input byte streams must encode a Unicode character sequence using the UTF-8 encoding form. No other encoding is supported. HTTP headers specifying other charsets are ignored.
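The UTF-8 requirement can be sketched as a strict decoding step. The following Python fragment is only an illustration, not Orc's implementation: the name decode_orc_source is hypothetical, and because this page does not say how malformed input is reported, raising an error is an assumption.

```python
def decode_orc_source(data: bytes) -> str:
    """Decode a source byte stream as UTF-8 and reject anything else.

    Hypothetical helper. Strict rejection is an assumption: the text only
    states that no encoding other than UTF-8 is supported.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"source is not valid UTF-8: {exc}") from exc
```

Any charset declared by the transport (for example, an HTTP header) is deliberately not consulted in this sketch, mirroring the rule that such headers are ignored.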

10.2.2. Lexical Tokens

Orc lexical scanning reads Unicode characters and emits corresponding Orc lexical tokens. Orc uses seven lexical token types: identifier, keyword, operator, delimiter, integer literal, floating-point literal, and string literal. Orc comments and whitespace are scanned as separators between other tokens and are otherwise disregarded.

Identifier

Orc scans identifiers per Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax, namely:

  • Identifiers start with "Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl)".

  • Identifiers can continue with "All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc)", plus Orc's addition of apostrophe as a "prime" mark.

  • All identifiers are normalized to Unicode Normalization Form C as they are parsed.

Examples of allowed Orc identifiers: orchestrate, iscenesætte, ενορχηστρώνω, الانسجام, 編排, 練り上げる, 관현악으로_편곡하다, оркестровать, आर्केस्ट्रा_करना.

Also, mathematical letter-like characters are allowed, such as ℤ, ℏ, ℵ0, and, of course, Greek letters.
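The identifier rules above can be sketched with Python's unicodedata module. This is a minimal illustration under stated assumptions, not Orc's scanner: the function names are hypothetical, keyword exclusion and parenthesized operators are not handled, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# General_Category values allowed at the start of an identifier.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Continuation adds marks, decimal digits, and connector punctuation.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    return unicodedata.category(ch) in START_CATEGORIES


def is_identifier_continue(ch: str) -> bool:
    # The apostrophe is Orc's "prime" mark addition.
    return ch == "'" or unicodedata.category(ch) in CONTINUE_CATEGORIES


def scan_identifier(text: str, pos: int = 0):
    """Return the NFC-normalized identifier starting at pos, or None."""
    if pos >= len(text) or not is_identifier_start(text[pos]):
        return None
    end = pos + 1
    while end < len(text) and is_identifier_continue(text[end]):
        end += 1
    return unicodedata.normalize("NFC", text[pos:end])
```

For example, scan_identifier("x' + 1") yields x', while scan_identifier("_x") yields None because the underscore (category Pc) may continue an identifier but may not start one.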

Orc also treats an operator (defined below) placed in parentheses, such as (+), as an identifier with the name of the operator (without the parentheses).

An identifier cannot match a keyword; see the following section.

Keyword

Any token that otherwise follows the rules for identifiers but matches an entry in the following list is treated as a keyword, not as an identifier.

as def else if import include lambda signal stop then type val true false null _

Note that _ is a special case: Identifiers cannot start with an underscore, but _ scans as a keyword nonetheless.
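As a rough sketch, keyword recognition amounts to a set lookup applied to a word that has already been scanned. The Python below is illustrative only; classify_word is a hypothetical name, and exact, case-sensitive matching is an assumption.

```python
# The keyword list from this section, including the special case "_".
KEYWORDS = frozenset({
    "as", "def", "else", "if", "import", "include", "lambda", "signal",
    "stop", "then", "type", "val", "true", "false", "null", "_",
})


def classify_word(word: str) -> str:
    """Classify an already-scanned word as 'keyword' or 'identifier'.

    Assumes exact, case-sensitive matching against the list above.
    """
    return "keyword" if word in KEYWORDS else "identifier"
```

Under these assumptions, val and _ classify as keywords, while value classifies as an identifier.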

Operator

An Orc operator is a character sequence that matches one of the following. This match is greedy: for example, ** is matched in preference to two * operators, if possible.

+ - * / % ** && || ~ = <: :> <= >= /= : . ? :=
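Greedy matching can be sketched by trying longer operators before shorter ones. The Python below is a minimal illustration, not Orc's scanner: match_operator is a hypothetical name, and the sketch ignores delimiters and comments, so it would read the start of a -- comment as the - operator.

```python
# The operator list from this section.
OPERATORS = [
    "+", "-", "*", "/", "%", "**", "&&", "||", "~", "=",
    "<:", ":>", "<=", ">=", "/=", ":", ".", "?", ":=",
]

# Longest first, so that "**" is tried before "*" and ":=" before ":".
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def match_operator(text: str, pos: int = 0):
    """Return the longest operator starting at pos, or None if none matches."""
    for op in _BY_LENGTH:
        if text.startswith(op, pos):
            return op
    return None
```

With this ordering, match_operator("**2") returns ** and match_operator("*2") returns *.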

Delimiter

An Orc delimiter is a character sequence that matches one of the following.

( ) [ ] {. .} , # < > | ; :: :!:

Literals

Numeric literals and character string literals are recognized per the syntax given in those sections.

Comments

Orc comments take two forms:

  1. --to end of the line (see newlines, below)

  2. {- multi-line comment body -}

A multi-line comment body is any character sequence (possibly empty) in which {- and -} have lexical significance. Orc comments can be nested: inside a body, {- starts a nested multi-line comment, and -} ends the current multi-line comment.
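Nested multi-line comments can be skipped with a depth counter. The following Python is a sketch under assumptions rather than Orc's implementation: skip_block_comment is a hypothetical name, and raising an error on an unterminated comment is an assumption, since this page does not specify that case.

```python
def skip_block_comment(text: str, start: int) -> int:
    """Return the index just past the multi-line comment opening at start.

    Each "{-" increases the nesting depth and each "-}" decreases it; the
    comment ends when the depth returns to zero.
    """
    if not text.startswith("{-", start):
        raise ValueError("no multi-line comment starts at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("{-", i):
            depth += 1
            i += 2
        elif text.startswith("-}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    # Assumed behavior: treat a comment that never closes as an error.
    raise ValueError("unterminated multi-line comment")
```

For the input "{- a {- b -} c -} rest", the function returns the index at which " rest" begins, because the inner -} closes only the nested comment.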

Whitespace

Per Unicode Standard Annex #31, section 4, recommendation R3, Orc treats all Unicode Pattern_White_Space characters as whitespace.

  • U+0009 HT Character Tabulation

  • U+000B VT Line Tabulation

  • U+0020 Space

  • U+200E Left-to-Right Mark

  • U+200F Right-to-Left Mark

  • And the six newline characters (below)
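The whitespace set listed above is small enough to write out explicitly. This Python sketch is illustrative only, and the names are hypothetical; the six newline characters are taken from the Newlines section.

```python
# The six newline characters: CR, LF, NEL, LS, FF, PS.
NEWLINES = frozenset("\r\n\u0085\u2028\u000c\u2029")

# HT, VT, Space, Left-to-Right Mark, Right-to-Left Mark, plus the newlines.
WHITESPACE = frozenset("\t\u000b \u200e\u200f") | NEWLINES


def is_orc_whitespace(ch: str) -> bool:
    """True if ch is one of the eleven whitespace characters listed above."""
    return ch in WHITESPACE
```

An explicit set is used because Python's built-in str.isspace follows a different definition: for example, it accepts U+00A0 No-Break Space, which is not in this list.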

Newlines

Orc follows the newline definition of the Unicode Standard, section 5.8, Newline Guidelines, recommendation R4.

Namely, Orc stops reading a line when it encounters one of the following characters:

  • U+000D CR Carriage Return

  • U+000A LF Line Feed

  • U+0085 NEL Next Line

  • U+2028 LS Line Separator

  • U+000C FF Form Feed

  • U+2029 PS Paragraph Separator
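Line splitting on these six characters can be sketched with a regular expression. The Python below is an illustration, not Orc's reader: split_lines is a hypothetical name, and treating a CR immediately followed by LF as a single line terminator is an assumption based on the Unicode newline guidelines rather than on this page.

```python
import re

# CR LF is tried first so the pair counts as one terminator (assumption);
# otherwise any single newline character ends the line.
_NEWLINE = re.compile("\r\n|[\r\n\u0085\u2028\u000c\u2029]")


def split_lines(text: str) -> list:
    """Split text into lines at the six newline characters."""
    return _NEWLINE.split(text)
```

Python's str.splitlines is not used here because it also breaks on characters outside this list, such as U+000B Line Tabulation.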
