Orc Lexical Specifications#
This page specifies Orc's processing of an input byte stream into an Orc lexical token sequence. This token sequence is the input to the Orc parsing procedure.
Input Byte Stream#
Orc can read source code input byte streams from a number of types of sources. For example, Orc 2.0 running on Java SE 6 accepts input from local files, FTP, Gopher, HTTP, and JAR files.
Orc source code input byte streams must encode a Unicode character sequence using the UTF-8 encoding form. No other encoding is supported. HTTP headers specifying other charsets are ignored.
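As a minimal sketch of the decoding rule, the following Python function decodes a byte stream strictly as UTF-8 and rejects anything else. The function name `decode_source` is hypothetical, and treating malformed input as an error is an assumption; this page does not say how Orc reports invalid byte sequences.

```python
def decode_source(raw: bytes) -> str:
    """Decode source bytes as UTF-8, ignoring any externally declared charset."""
    # Strict decoding raises UnicodeDecodeError on malformed input
    # (an assumption about error handling, not stated by the specification).
    return raw.decode("utf-8", errors="strict")
```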
Orc Lexical Tokens#
Orc lexical scanning reads Unicode characters and emits corresponding Orc lexical tokens. Orc uses seven lexical token types: identifier, keyword, operator, delimiter, integer literal, floating-point literal, and string literal. Orc comments and whitespace are scanned as separators between other tokens and are otherwise disregarded.
Identifier#
Orc scans identifiers per Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax, namely:
- Identifiers start with "Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl)".
- Identifiers can continue with "All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc)", plus Orc's addition of apostrophe as a "prime" mark.
- All identifiers are normalized to Unicode Normalization Form C as they are parsed.
Examples of allowed Orc identifiers: orchestrate, iscenesætte, ενορχηστρώνω, الانسجام, 編排, 練り上げる, 관현악으로_편곡하다, оркестровать, आर्केस्ट्रा_करना. (These are all translations of "orchestrate" produced by Google Translate, which is often comically wrong.)
Mathematical letter-like characters are also allowed, such as ℤ, ℏ, ℵ0, and, of course, Greek letters.
Orc also treats an operator (defined below) placed in parentheses, such as (+), as an identifier with the name of the operator (without the parentheses).
An identifier cannot match a keyword; see the following section.
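The three scanning rules listed above (start characters, continue characters, and normalization) can be sketched in Python with the standard `unicodedata` module. This is an illustration only: the name `scan_identifier` is hypothetical, the sketch uses the General_Category sets quoted on this page rather than the full UAX #31 properties, and it does not handle parenthesized operators or keyword exclusion.

```python
import unicodedata

# General categories quoted in the identifier rules above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def scan_identifier(text, pos=0):
    """Return (normalized_identifier, end_pos), or None if no identifier starts at pos."""
    if pos >= len(text) or unicodedata.category(text[pos]) not in START_CATEGORIES:
        return None
    end = pos + 1
    # Continue characters, plus Orc's apostrophe "prime" mark.
    while end < len(text) and (
        text[end] == "'" or unicodedata.category(text[end]) in CONTINUE_CATEGORIES
    ):
        end += 1
    # Normalize the scanned identifier to Normalization Form C.
    return unicodedata.normalize("NFC", text[pos:end]), end
```

Under these assumptions, `x'` scans as a single identifier, while `_foo` and `9lives` do not start an identifier, because connector punctuation and decimal digits are continue-only characters.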
Keyword#
Any token that otherwise follows the rules for identifiers but matches an entry in the following list is treated as a keyword, not as an identifier.
true false signal stop null lambda if then else as _ val def type site class include Top Bot
Note that _ is a special case: Identifiers cannot start with an underscore, but _ scans as a keyword nonetheless.
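A scanner can apply this rule as a table lookup after an identifier-shaped token has been read. The sketch below is a hypothetical helper, not Orc's implementation; it assumes the match is case-sensitive, which the capitalized entries Top and Bot suggest but this page does not state outright.

```python
# Keyword list copied from the specification above.
KEYWORDS = frozenset(
    "true false signal stop null lambda if then else as _ "
    "val def type site class include Top Bot".split()
)

def classify_word(word):
    """Classify an identifier-shaped token (or the lone underscore) by exact match."""
    # Assumption: matching is exact and case-sensitive.
    return "keyword" if word in KEYWORDS else "identifier"
```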
Operator#
An Orc operator is a character sequence that matches one of the following. (This match is greedy: for example, ** is matched as one operator in preference to two * operators, where possible.)
+ - * / % ** && || ~ < > = <: :> <= >= /= : . ? :=
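Greedy matching of this kind is commonly implemented by trying longer candidates first. The following Python sketch shows the idea for the operator list alone; `match_operator` is a hypothetical name, and a full scanner would also have to weigh delimiters such as :: and comment openers such as --, which this sketch ignores.

```python
# Operator list copied from the specification above.
OPERATORS = [
    "+", "-", "*", "/", "%", "**", "&&", "||", "~", "<", ">", "=",
    "<:", ":>", "<=", ">=", "/=", ":", ".", "?", ":=",
]
# Longest candidates first, so "**" is tried before "*".
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)

def match_operator(text, pos=0):
    """Return the longest operator starting at text[pos], or None."""
    for op in _BY_LENGTH:
        if text.startswith(op, pos):
            return op
    return None
```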
Delimiter#
An Orc delimiter is a character sequence that matches one of the following.
( ) {. .} , | ; :: :!:
Numeric Literal#
Numeric literals are matched per the following regular expression: ([0-9]+)([.][0-9]+)?([Ee][+-]?([0-9]+))?
If the matched string contains a decimal point, an e, or an E, it is a floating-point literal; otherwise, it is an integer literal.
All Orc numeric literals are decimal (radix 10).
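The regular expression and the classification rule above translate directly into Python's `re` module. The helper name `scan_number` and the returned labels are assumptions made for illustration.

```python
import re

# Regular expression copied from the specification above.
NUMERIC = re.compile(r"([0-9]+)([.][0-9]+)?([Ee][+-]?([0-9]+))?")

def scan_number(text, pos=0):
    """Return (kind, lexeme) for the numeric literal at text[pos], or None."""
    m = NUMERIC.match(text, pos)
    if m is None:
        return None
    lexeme = m.group(0)
    # A decimal point or exponent marker makes the literal floating-point.
    kind = "float" if any(c in lexeme for c in ".eE") else "integer"
    return kind, lexeme
```

Note that the fractional part requires at least one digit after the point, so in `7.` only `7` is matched as an integer literal, and `.5` does not start a numeric literal at all.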
String Literal#
Double quotes demarcate a string literal. The (possibly empty) sequence of characters in a string literal may be composed of any character except an unescaped double quote or an unescaped newline character (as defined below).
The backslash escape convention is recognized as follows:
- \f U+000C FF Form Feed
- \n U+000A LF Line Feed
- \r U+000D CR Carriage Return
- \t U+0009 HT Character Tabulation
- Any other escaped character (such as \\ or \" ) is treated as itself, without the backslash.
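The string literal rules and the escape table above can be sketched as a small Python scanner. The name `scan_string` and the use of `ValueError` for malformed literals are assumptions; the set of newline characters is taken from the Newlines section of this page.

```python
NAMED_ESCAPES = {"f": "\f", "n": "\n", "r": "\r", "t": "\t"}
# The six newline characters listed in the Newlines section.
NEWLINE_CHARS = frozenset("\r\n\u0085\u2028\f\u2029")

def scan_string(text, pos=0):
    """Return (value, end_pos) for the string literal opening at text[pos], or None."""
    if pos >= len(text) or text[pos] != '"':
        return None
    chars = []
    i = pos + 1
    while i < len(text):
        c = text[i]
        if c == '"':
            return "".join(chars), i + 1
        if c in NEWLINE_CHARS:
            raise ValueError("unescaped newline in string literal")
        if c == "\\":
            i += 1
            if i >= len(text):
                break
            # Named escapes map to control characters; anything else is itself.
            c = NAMED_ESCAPES.get(text[i], text[i])
        chars.append(c)
        i += 1
    raise ValueError("unterminated string literal")
```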
Comment#
Orc comments take two forms:
- -- to end of the line (see newlines, below)
- {- multi-line comment body -}
A multi-line comment body is any character sequence (possibly empty) in which {- and -} have lexical significance. Multi-line comments can be nested: within a comment body, {- starts a nested multi-line comment, and -} ends the innermost open multi-line comment.
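Nesting of this kind is usually handled with a depth counter rather than a regular expression. The Python sketch below illustrates that approach; `skip_block_comment` is a hypothetical name, and raising `ValueError` for an unterminated comment is an assumption.

```python
def skip_block_comment(text, pos=0):
    """Return the index just past the comment opening at text[pos], or None."""
    if not text.startswith("{-", pos):
        return None
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith("{-", i):
            depth += 1          # opens the outer or a nested comment
            i += 2
        elif text.startswith("-}", i):
            depth -= 1          # closes the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")
```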
Whitespace#
Per Unicode Standard Annex #31, section 4, recommendation R3, Orc treats all Unicode Pattern_White_Space characters as whitespace.
These are:
- U+0009 HT Character Tabulation
- U+000B VT Line Tabulation
- U+0020 Space
- U+200E Left-to-Right Mark
- U+200F Right-to-Left Mark
- And the six newline characters (below)
Newlines#
Orc follows the newline definition of Unicode Standard section 5.8, Newline Guidelines, recommendation R4.
Namely, Orc stops reading a line when it encounters one of the following characters:
- U+000D CR Carriage Return
- U+000A LF Line Feed
- U+0085 NEL Next Line
- U+2028 LS Line Separator
- U+000C FF Form Feed
- U+2029 PS Paragraph Separator
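The whitespace and newline sets above, together with the rule that a -- comment runs to the end of the line, can be sketched as follows. The helper names are hypothetical, and the sketch treats each newline character individually; how a CR LF pair is counted for line numbering is not specified on this page.

```python
# The six newline characters listed above.
NEWLINE_CHARS = frozenset("\r\n\u0085\u2028\f\u2029")
# Pattern_White_Space: HT, VT, space, LRM, RLM, plus the newline characters.
WHITESPACE_CHARS = frozenset("\t\v \u200e\u200f") | NEWLINE_CHARS

def skip_whitespace(text, pos=0):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in WHITESPACE_CHARS:
        pos += 1
    return pos

def skip_line_comment(text, pos=0):
    """Return the index of the newline ending the -- comment at pos, or None."""
    if not text.startswith("--", pos):
        return None
    end = pos + 2
    while end < len(text) and text[end] not in NEWLINE_CHARS:
        end += 1
    return end
```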