One of my continual frustrations with most programming languages is the lack of
facilities for creating quick'n'easy parsers. C has scanf, but that's
unbelievably fragile, and not particularly useful for more complex formats. You
can use yacc, or Haskell's parsec, but all such options are rather overblown
for just parsing, say, a log file.
Take a specific example: the nginx log_format directive takes a format string
like so (this is not the full default):
$remote_addr - $remote_user [$time_local] [took $request_time ms] "$request"
This is a nice, simple structure, corresponding neatly to how one would generate the string in many scripting languages. Unfortunately, actually parsing the resulting log is much more challenging. It's perhaps easiest in Perl, but even then, you have to go to great lengths to figure out precisely which characters are permitted in which variable. (Or alternatively, you can probably hack out an entirely unreadable solution with lookaheads and whatnot. Not an attractive prospect.)
Wouldn't it be nice if you could parse a log file using that same format string?
So I've hacked up a Haskell library to do that. It's available on Hackage and on GitHub. Hopefully, it will prove useful.
Some technical details. It works by treating contiguous chunks of raw text (with no intervening variables) as delimiters, each marking the end of the text to be assigned to the preceding variable. The shortest match, therefore, is always used. This provides the maximum flexibility for the content of variables, at the cost of rigid requirements for the rest of the text. As this library is intended for parsing auto-generated text, I consider that an acceptable tradeoff. Real-world tools might want to perform some whitespace manipulations (collapsing all whitespace to a single space, for example) to allow for greater flexibility in input.
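To make the delimiter idea concrete, here is a minimal sketch of the approach in plain Haskell, using only base. It is not the library's actual code or API: the names Piece, parseFormat, match, breakOn and normalizeSpace are made up for illustration, and it assumes variable names consist of '$' followed by letters, digits or underscores.

```haskell
import Data.Char (isAlphaNum)
import Data.List (stripPrefix)

-- | A format string is a sequence of raw-text chunks and named variables.
-- (Hypothetical type, for illustration only.)
data Piece = Lit String | Var String
  deriving (Show, Eq)

-- | Assumption: a variable name is made of letters, digits and underscores.
isNameChar :: Char -> Bool
isNameChar c = isAlphaNum c || c == '_'

-- | Split a format string such as "$a - $b" into pieces.
-- A '$' that is not followed by a name is kept as literal text.
parseFormat :: String -> [Piece]
parseFormat [] = []
parseFormat ('$' : rest)
  | not (null name) = Var name : parseFormat rest'
  where
    (name, rest') = span isNameChar rest
parseFormat (c : rest) =
  case parseFormat rest of
    Lit l : ps -> Lit (c : l) : ps
    ps         -> Lit [c] : ps

-- | Find the first occurrence of a delimiter, returning the text before it
-- and the text after it. Taking the first occurrence gives the shortest match.
breakOn :: String -> String -> Maybe (String, String)
breakOn delim = go []
  where
    go acc s
      | Just rest <- stripPrefix delim s = Just (reverse acc, rest)
    go _ [] = Nothing
    go acc (c : cs) = go (c : acc) cs

-- | Match one input line against the pieces, yielding (name, value) pairs.
-- Each variable runs up to the next raw-text chunk; there is no backtracking.
match :: [Piece] -> String -> Maybe [(String, String)]
match [] input = if null input then Just [] else Nothing
match (Lit l : ps) input = stripPrefix l input >>= match ps
match [Var v] input = Just [(v, input)]
match (Var v : Lit l : ps) input = do
  (value, rest) <- breakOn l input
  ((v, value) :) <$> match ps rest
match (Var _ : Var _ : _) _ = Nothing -- adjacent variables have no delimiter

-- | Collapse runs of whitespace to single spaces (also trims both ends).
normalizeSpace :: String -> String
normalizeSpace = unwords . words

main :: IO ()
main = do
  let fmt  = "$remote_addr - $remote_user [$time_local] [took $request_time ms] \"$request\""
      line = "127.0.0.1 - alice [10/Oct/2020:13:55:36 +0000] [took 0.012 ms] \"GET / HTTP/1.1\""
  print (match (parseFormat fmt) (normalizeSpace line))
```

On the made-up log line above, this pairs remote_addr with 127.0.0.1, remote_user with alice, time_local with the full timestamp including its embedded space, request_time with 0.012, and request with GET / HTTP/1.1. The tradeoff shows up immediately: a variable can contain nearly anything, but if its value happens to contain the delimiter that follows it, the value is cut short at that first occurrence and the rest of the line will most likely fail to match.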