Syntax Highlighting

How ought syntax highlighting be done?

First, recognize that this is a lexing problem, not a parsing one. (In retrospect, this is true only for some languages.) It is not necessary to know the full grammar of a language. But it is imperative to recognize tokens perfectly: if a single string literal terminator is missed, the attempt to highlight backfires, because the code that follows is colored as if it were part of the string.
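To make the failure concrete, here is a minimal sketch in Python, assuming C-style double-quoted strings in which a backslash escapes the next character. The names `STRING`, `NAIVE`, and `string_literals` are made up for this illustration. The naive pattern stops at the first quote it sees, escaped or not, and everything after that is misread.

```python
import re

# Assumption: C-style strings, where a backslash escapes the next character.
STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

# A naive pattern that ignores escapes: it ends the string at the first quote.
NAIVE = re.compile(r'"[^"]*"')

def string_literals(pattern, source):
    """Return the text of every span that `pattern` would color as a string."""
    return [m.group() for m in pattern.finditer(source)]

code = r'f("a\"b"); g("c");'

correct = string_literals(STRING, code)
wrong = string_literals(NAIVE, code)

print(correct)
print(wrong)
```

With the escape-aware pattern, the two spans found are `"a\"b"` and `"c"`. With the naive one, the first string is cut short at the escaped quote, and the span `"); g("` is then colored as a string even though it is ordinary code.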

Besides splitting tokens, the only other job of a syntax highlighter is to classify and color them. Thus each language needs only a lexing file that describes a few categories of tokens. I would suggest just four universal categories. Does anyone know of a language for which these categories are inappropriate?

| Category | C | Scheme | Haskell | Extended BNF |
| --- | --- | --- | --- | --- |
| Syntax | { } ; ( ) * & + - if struct | ( ) ` , | { } ; case of data type where | ::= ; [ ] { } ( ) \| |
| Identifier | foo | foo + | foo Nothing | <foo> |
| Literal | 3 "hello" | 3 'hi "hello" | 3 "hello" | "hello" |
| Operator | * & + | | ++ `elem` | |
| Comment | // /* */ | ; | -- {- -} | (* *) |
| (Whitespace) | | | | |
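As a rough sketch of what such a lexing file might look like, here is one in Python for a small subset of C: one regular expression per category, tried in order. This is only an illustration under my own assumptions. The names `C_RULES` and `tokenize` are hypothetical, the patterns cover only the example tokens from the C column, and operators are folded into the syntax category because the C column lists them there too.

```python
import re

# Hypothetical "lexing file" for a small subset of C: one regex per category.
# Order matters: earlier rules win, so comments and literals are tried first.
C_RULES = [
    ("comment",    r'//[^\n]*|/\*.*?\*/'),
    ("literal",    r'"(?:[^"\\\n]|\\.)*"|\d+'),
    ("syntax",     r'\b(?:if|struct)\b|[{};()*&+\-]'),
    ("identifier", r'[A-Za-z_]\w*'),
    ("whitespace", r'\s+'),
]

def tokenize(source, rules):
    """Split `source` into (category, text) pairs using the given rules."""
    # Combine the rules into one pattern with a named group per category.
    # DOTALL lets a /* ... */ comment span several lines.
    pattern = re.compile(
        "|".join(f"(?P<{name}>{rx})" for name, rx in rules), re.DOTALL
    )
    tokens = []
    pos = 0
    while pos < len(source):
        m = pattern.match(source, pos)
        if m is None:
            # No rule applies: emit the character on its own so that
            # no input is ever dropped.
            tokens.append(("unknown", source[pos]))
            pos += 1
        else:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
    return tokens

src = 'if (foo) { bar("a;b"); } // done'
for category, text in tokenize(src, C_RULES):
    if category != "whitespace":
        print(f"{category:10} {text}")
```

A highlighter would then only need a color per category. Note that the `;` inside `"a;b"` is never seen as syntax, because the literal rule consumes the whole string first. That is the sense in which getting the token boundaries right is the entire job.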

Justin Pombrio
