How ought syntax highlighting to be done?
First, recognize that this is a lexing problem, not a parsing one. (In retrospect, this is true only for some languages.) It is not necessary to know the full grammar of a language. But it is imperative to recognize tokens perfectly: if even one string-literal terminator is missed, everything after it is miscolored and the attempt to highlight backfires.
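To make the string-terminator point concrete, here is a minimal Python sketch. Both patterns and the helper `first_string` are made up for illustration and are not taken from any real highlighter. The naive pattern stops at the first quote character it sees, even an escaped one; the second pattern treats a backslash as consuming the character that follows it.

```python
import re

# Naive pattern: ends the literal at the first quote, even an escaped one.
NAIVE_STRING = re.compile(r'"[^"]*"')

# Escape-aware pattern: a backslash always consumes the next character,
# so \" inside the literal does not terminate it.
STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')


def first_string(pattern, source):
    """Return the first string literal that `pattern` finds, or None."""
    match = pattern.search(source)
    return match.group(0) if match else None


source = r'puts("say \"hi\""); x = 1;'

print(first_string(NAIVE_STRING, source))  # cut short at the escaped quote
print(first_string(STRING, source))        # the whole literal
```

With the naive pattern the literal ends too early, so the lexer is out of step from that point on: later code gets colored as string and later strings as code.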
Besides splitting tokens, the only other job of a syntax highlighter is to classify and color them. Thus each language needs only a lexing file that describes a few categories of tokens. I would suggest just four universal categories. Does anyone know of a language for which these categories are inappropriate?
             | C                           | Scheme        | Haskell                       | Extended BNF
Syntax       | { } ; ( ) * & + - if struct | ( ) ` ,       | { } ; case of data type where | ::= ; [ ] { } ( )
Identifier   | foo                         | foo +         | foo Nothing                   | <foo>
Literal      | 3 "hello"                   | 3 'hi "hello" | 3 "hello"                     | "hello"
Operator     | * & +                       |               | ++ `elem`                     |
Comment      | // /* */                    | ;             | -- {- -}                      | (* *)
(Whitespace) |                             |               |                               |
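
As a rough sketch of what such a per-language "lexing file" might look like, the Python below describes a small C-like subset with one regular expression per category from the table. Everything here is hypothetical: the names `C_RULES`, `build_lexer`, `colorize` and `ANSI`, the choice of colors, and the decision to file `* & + -` under Operator rather than Syntax (the table lists them under both for C). It covers only a sliver of C and is meant to show the shape of the idea, not a finished highlighter.

```python
import re

# Hypothetical lexing file for a small C-like subset: one regex per category.
# Order matters: comments and literals are tried before anything else, and
# keywords (filed under "syntax") are tried before identifiers.
C_RULES = [
    ("comment",    r"//[^\n]*|/\*.*?\*/"),
    ("literal",    r'"(?:[^"\\\n]|\\.)*"|\d+'),
    ("syntax",     r"\b(?:if|struct)\b|[{};()]"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[*&+\-]"),
    ("whitespace", r"\s+"),
]

# Assumed color scheme (ANSI SGR codes); uncolored categories print as-is.
ANSI = {"comment": "32", "literal": "31", "syntax": "34", "operator": "35"}


def build_lexer(rules):
    """Combine the rules into one pattern and return a lex(source) function."""
    pattern = re.compile(
        "|".join(f"(?P<{name}>{rx})" for name, rx in rules), re.DOTALL
    )

    def lex(source):
        tokens, pos = [], 0
        while pos < len(source):
            match = pattern.match(source, pos)
            if match is None:
                raise ValueError(f"cannot lex at offset {pos}: {source[pos]!r}")
            # lastgroup is the name of the category whose regex matched.
            tokens.append((match.lastgroup, match.group(0)))
            pos = match.end()
        return tokens

    return lex


def colorize(tokens):
    """Wrap each token in the ANSI color assigned to its category, if any."""
    out = []
    for category, text in tokens:
        code = ANSI.get(category)
        out.append(f"\x1b[{code}m{text}\x1b[0m" if code else text)
    return "".join(out)


lex_c = build_lexer(C_RULES)
print(colorize(lex_c('if (n + 3) puts("a;b"); // "x"')))
```

Under this scheme, supporting another language would mean writing another rule list; `build_lexer` and `colorize` stay the same. Note that the `;` inside the string and the quotes inside the comment are not misclassified, because the literal and comment rules claim their whole token first.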