This page specifies Orc's processing of an input byte stream into an Orc lexical token sequence. This token sequence is the input to the Orc parsing procedure.
Orc can read source code input byte streams from a number of types of sources. For example, Orc 2.0 running on Java SE 6 accepts input from local files, FTP, Gopher, HTTP, and JAR files. Orc source code input byte streams must encode a Unicode character sequence using the UTF-8 encoding form. No other encoding is supported. HTTP headers specifying other charsets are ignored.
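The UTF-8 requirement can be illustrated with a minimal Python sketch. The function name decode_orc_source is hypothetical, and raising an exception on malformed input is an assumption: this page does not say how Orc reports invalid byte sequences.

```python
def decode_orc_source(raw: bytes) -> str:
    """Decode a source byte stream as UTF-8 and nothing else.

    Any charset announced by the transport (for example an HTTP header)
    is deliberately not consulted. Raising UnicodeDecodeError on
    malformed input is an assumption of this sketch.
    """
    return raw.decode("utf-8", errors="strict")
```

Bytes produced by another encoding, such as Latin-1, are rejected rather than silently reinterpreted.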
Orc lexical scanning reads Unicode characters and emits corresponding Orc lexical tokens. Orc uses seven lexical token types: identifier, keyword, operator, delimiter, integer literal, floating-point literal, and string literal. Orc comments and whitespace act as separators between other tokens and are otherwise disregarded.
Orc scans identifiers per Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax, namely:
Identifiers start with "Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl)".
Identifiers can continue with "All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc)", plus Orc's addition of apostrophe as a "prime" mark.
All identifiers are normalized to Unicode Normalization Form C as they are parsed.
Examples of allowed Orc identifiers: orchestrate, iscenesætte, ενορχηστρώνω, الانسجام, 編排, 練り上げる, 관현악으로_편곡하다, оркестровать, आर्केस्ट्रा_करना.
Also, mathematical letter-like characters are allowed, such as ℤ, ℏ, ℵ0, and, of course, Greek letters.
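The identifier rules above can be sketched in Python with the standard unicodedata module. This is an illustration only, not Orc's scanner: the function names are hypothetical, the "prime" mark is assumed to be U+0027 APOSTROPHE, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# General_Category values allowed at the start of an identifier.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Continuation adds marks, decimal digits, and connector punctuation.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch):
    return unicodedata.category(ch) in START_CATEGORIES


def is_identifier_continue(ch):
    # Assumption: the "prime" mark is U+0027 APOSTROPHE.
    return ch == "'" or unicodedata.category(ch) in CONTINUE_CATEGORIES


def scan_identifier(text, pos=0):
    """Return (NFC-normalized identifier, end index), or None if no
    identifier starts at pos."""
    if pos >= len(text) or not is_identifier_start(text[pos]):
        return None
    end = pos + 1
    while end < len(text) and is_identifier_continue(text[end]):
        end += 1
    return unicodedata.normalize("NFC", text[pos:end]), end
```

For example, scan_identifier("x' + y") stops after the prime mark, and scan_identifier("_foo") returns None because the underscore (category Pc) may continue an identifier but not start one.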
Orc also treats an operator (defined below) placed in parentheses, such as (+), as an identifier with the name of the operator (without the parentheses).
An identifier cannot match a keyword; see the following section.
Any token that otherwise follows the rules for identifiers but matches an entry in the following list is treated as a keyword, not as an identifier.
as def else if import include lambda signal stop then type val true false null _
Note that _ is a special case: Identifiers cannot start with an underscore, but _ scans as a keyword nonetheless.
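The keyword check amounts to a set lookup on an already scanned word. The following Python sketch assumes the scanner hands over a complete lexeme; the names KEYWORDS and classify_word are hypothetical.

```python
# Keyword list copied from the table above; "_" is included even though
# it does not satisfy the identifier start rule.
KEYWORDS = frozenset(
    "as def else if import include lambda signal stop then "
    "type val true false null _".split()
)


def classify_word(word):
    """Classify a complete lexeme as 'keyword' or 'identifier'."""
    return "keyword" if word in KEYWORDS else "identifier"
```

Because the comparison is against the whole lexeme, a word such as lambdas remains an identifier even though it begins with a keyword.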
An Orc operator is a character sequence that matches one of the following. This match is greedy: for example, ** is matched in preference to two * operators, if possible.
+ - * / % ** && || ~ = <: :> <= >= /= : . ? :=
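Greedy matching can be sketched by trying longer operators before shorter ones. This Python example covers only the operator list above; how Orc resolves overlaps with delimiters or with the -- comment marker is not stated here and is not modeled. The name match_operator is hypothetical.

```python
OPERATORS = [
    "+", "-", "*", "/", "%", "**", "&&", "||", "~", "=",
    "<:", ":>", "<=", ">=", "/=", ":", ".", "?", ":=",
]

# Longest candidates first, so an operator always wins over its own prefix.
_LONGEST_FIRST = sorted(OPERATORS, key=len, reverse=True)


def match_operator(text, pos=0):
    """Return the longest operator starting at pos, or None."""
    for op in _LONGEST_FIRST:
        if text.startswith(op, pos):
            return op
    return None
```

With this ordering, match_operator("**2") yields ** rather than *, while match_operator("* 2") still yields the single *.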
An Orc delimiter is a character sequence that matches one of the following.
( ) [ ] {. .} , # < > | ; :: :!:
Numeric literals and character string literals are recognized per the syntax given in the reference sections for those literals.
Orc has two comment forms:
-- to end of the line (see newlines, below)
{- multi-line comment body -}
A multi-line comment body is any character sequence (possibly empty), where {- and -} have lexical significance. Orc comments can be nested, so {- starts a nested multi-line comment, and -} ends the current multi-line comment.
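Nesting can be tracked with a depth counter. The Python sketch below is an illustration rather than Orc's scanner; the function name is hypothetical, and raising ValueError for an unterminated comment is an assumption.

```python
def skip_multiline_comment(text, pos=0):
    """Given that text[pos:] starts with '{-', return the index just
    past the '-}' that closes that comment, honoring nesting."""
    if not text.startswith("{-", pos):
        raise ValueError("no multi-line comment starts here")
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith("{-", i):
            depth += 1          # opens a (possibly nested) comment
            i += 2
        elif text.startswith("-}", i):
            depth -= 1          # closes the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated multi-line comment")
```

For the input {- a {- b -} c -} x, the first -} closes only the inner comment, so scanning resumes at the text following the second -}.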
Per Unicode Standard Annex #31, section 4, recommendation R3, Orc treats all Unicode Pattern_White_Space characters as whitespace. These characters are:
U+0009 HT Character Tabulation
U+000B VT Line Tabulation
U+0020 Space
U+200E Left-to-Right Mark
U+200F Right-to-Left Mark
And the six newline characters (below)
Orc follows the newline definition of the Unicode Standard, section 5.8, Newline Guidelines, recommendation R4.
Namely, Orc stops reading a line when it encounters one of the following characters:
U+000D CR Carriage Return
U+000A LF Line Feed
U+0085 NEL Next Line
U+2028 LS Line Separator
U+000C FF Form Feed
U+2029 PS Paragraph Separator
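The whitespace and newline sets above can be written down directly. This Python sketch shows them together with a helper that finds where a line, and therefore a -- comment, ends. The names are hypothetical, and whether a CR LF pair counts as one line break or two is not specified on this page, so the helper only locates the first newline character.

```python
# The six newline characters listed above.
NEWLINE_CHARS = frozenset("\r\n\u0085\u2028\f\u2029")
# The remaining Pattern_White_Space characters: HT, VT, Space, LRM, RLM.
OTHER_WHITESPACE = frozenset("\t\v \u200e\u200f")
WHITESPACE = NEWLINE_CHARS | OTHER_WHITESPACE


def end_of_line(text, pos=0):
    """Return the index of the first newline character at or after pos,
    or len(text) if the line runs to the end of input."""
    for i in range(pos, len(text)):
        if text[i] in NEWLINE_CHARS:
            return i
    return len(text)
```

A scanner that sees -- at position p could discard text[p:end_of_line(text, p)] as the comment. Characters outside the set, such as U+00A0 NO-BREAK SPACE, are not treated as whitespace.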