How ought syntax highlighting to be done?
First, recognize that this is a lexing problem, not a parsing one. (In retrospect, this is true only for some languages.) It is not necessary to know the full grammar of a language. But it is imperative to recognize tokens perfectly: if a single string literal terminator is missed, the text after it gets colored as part of the string and the attempt to highlight backfires.
Besides splitting tokens, the only other job of a syntax highlighter is to classify and color them. Thus each language needs only a lexing file that describes a few categories of tokens. I would suggest just four universal categories. Does anyone know of a language for which these categories are inappropriate?
| | C | Scheme | Haskell | Extended BNF |
|---|---|---|---|---|
| Syntax | { } ; ( ) * & + - if struct | ( ) ` , | { } ; case of data type where | ::= ; [ ] { } ( ) \| |
| Identifier | foo | foo + | foo Nothing | <foo> |
| Literal | 3 "hello" | 3 'hi "hello" | 3 "hello" | "hello" |
| Operator | * & + | | ++ `elem` | |
| Comment | // /* */ | ; | -- {- -} | (* *) |
| (Whitespace) | | | | |
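
As a rough sketch of what such a per-language lexing file might look like, here is a tiny regex-driven highlighter for a small subset of C, written in Python. Everything in it is an assumption made for illustration: the names `C_RULES`, `tokenize` and `highlight`, the choice of keywords, the decision to file `* & + -` under Operator rather than Syntax, and the HTML `span` output. It is not a complete C lexer.

```python
import html
import re

# Hypothetical "lexing file" for a small subset of C: one (category, regex)
# pair per token category. Order matters: comments and literals are tried
# first, so that "//" inside a string is not mistaken for a comment.
C_RULES = [
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("literal", r'"(?:\\.|[^"\\\n])*"' + r"|'(?:\\.|[^'\\\n])*'" + r"|\d+"),
    ("syntax", r"\b(?:if|else|struct|return|while|for)\b|[{};(),]"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"[*&+\-/=<>!]+"),  # runs of operator characters
    ("whitespace", r"\s+"),
]

# One combined pattern; each category becomes a named group.
# DOTALL lets /* ... */ comments span several lines.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in C_RULES),
    re.DOTALL,
)


def tokenize(source):
    """Yield (category, text) pairs that together cover all of source."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # Unrecognized character: pass it through alone, uncolored.
            yield ("unknown", source[pos])
            pos += 1
        else:
            yield (match.lastgroup, match.group())
            pos = match.end()


def highlight(source):
    """Wrap each non-whitespace token in a span named after its category."""
    parts = []
    for category, text in tokenize(source):
        escaped = html.escape(text)
        if category == "whitespace":
            parts.append(escaped)
        else:
            parts.append(f'<span class="{category}">{escaped}</span>')
    return "".join(parts)


if __name__ == "__main__":
    print(highlight('x = "a // b"; // done'))
```

The sample line in the `__main__` block is chosen to show why tokens must be recognized exactly: the `//` inside the string literal has to stay part of the literal, while the trailing `// done` is a comment. Supporting another language would, under this design, mean supplying a different rule list and nothing else.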