'Fun' with Miller

The Miller data-manipulation tool has a lot of potential, but its organic development process has left it with plenty of rough edges.

My first hint that all was not well internally came while I was hacking on PDQ output and realized that I had to type nest --implode --values --across-records --nested-fs space to get what I wanted. The manpage includes a shortcut for the nest verb’s other mode, so that --explode --values --across-records --nested-fs can be replaced with --evar, but there’s no matching --ivar. So I forked the project, added it, and sent a pull request.

The functionality was trivial, but the single-line usage description had to be added five times. That sent up a little red flag.

Still, it was easy to do, so I thought I’d see if it was feasible to fix one of the other things that bug me: the lack of quoting and/or escape characters in its native DKVP file format (delimited key-value pairs, aka foo=1,bar=2,baz=3).
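To make the quoting problem concrete, here’s a minimal Python sketch of a DKVP-style writer and reader. It is not Miller’s code: to_dkvp and from_dkvp are hypothetical helpers, and keying a field that has no "=" by its position is just this sketch’s choice. The point is only that a value containing the field separator can’t survive a round trip when the format has no quoting or escaping.

```python
def to_dkvp(record, ifs=",", ips="="):
    """Naively join key=value pairs, with no quoting or escaping."""
    return ifs.join(f"{key}{ips}{value}" for key, value in record.items())


def from_dkvp(line, ifs=",", ips="="):
    """Split on the field separator, then on the first pair separator."""
    record = {}
    for position, field in enumerate(line.split(ifs), start=1):
        key, sep, value = field.partition(ips)
        if not sep:
            # No "=" in this field: key it by position (an assumption
            # made for this sketch only).
            key, value = str(position), field
        record[key] = value
    return record


original = {"a": "1,1", "b": "2"}
line = to_dkvp(original)
roundtrip = from_dkvp(line)

print(line)       # a=1,1,b=2
print(roundtrip)  # {'a': '1', '2': '1', 'b': '2'}
```

The comma inside the value of a is indistinguishable from a field separator, so two fields go in and three come out.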

The answer turns out to be “not easily”, and I quickly learned some unpleasant things about how it handles other data-format conversions.

Given the perfectly-ordinary CSV input file foo.csv containing:


I get the following results:

# DKVP: useless as expected
% mlr --icsv cat foo.csv

# (note: internally, fields have correct values)
% mlr --icsv --ojson cut -f a,b foo.csv
{ "a": "1,1", "b": "2\n2" }

# CSV: reasonable, but strings converted to ints
% mlr --csv cat foo.csv

# Quoted CSV: better, but should be default
% mlr --csv --quote-original cat foo.csv

# JSON: reasonable, but strings converted to ints
% mlr --icsv --ojson cat foo.csv
{ "a": "1,1", "b": "2\n2", "c": 3, "d": "4\\r4", "e": "5\\n5" }

# Quoted JSON: oh, hell no
% mlr --icsv --ojson --jvquoteall cat foo.csv
{ "a": "1,1", "b": "2
2", "c": "3", "d": "4\r4", "e": "5\n5" }
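For comparison, here’s a small Python sketch of what a correctly quoted version of that record looks like, using only the standard json module. The field values are taken from the JSON output shown earlier, with c kept as a string. The broken string is the --jvquoteall output above pasted in as-is, to show that a strict parser rejects it and a lenient one silently changes the data.

```python
import json

# Field values as Miller holds them internally; d and e contain a literal
# backslash followed by a letter, not control characters.
record = {"a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5"}

encoded = json.dumps(record)
print(encoded)
# {"a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5"}

# The --jvquoteall output from above: a raw newline inside the value of b,
# and single backslashes in d and e.
broken = '{ "a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5" }'

try:
    json.loads(broken)
    broken_is_valid = True
except json.JSONDecodeError:
    # Raw control characters are not allowed inside JSON strings.
    broken_is_valid = False

# A lenient parse accepts the raw newline, but now reads \r and \n in
# d and e as escapes, so those values no longer match the input.
lenient = json.loads(broken, strict=False)

print(broken_is_valid)               # False
print(lenient["d"] == record["d"])   # False
```

So there are really two problems in that one line of output: the unescaped newline, and the backslashes that stop being escaped once quoting is forced.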

Note that the automatic num-ification has real consequences for data processing: you can’t do things like regex matches or string substitutions on a number type, so you have to explicitly coerce fields back to strings, and the error message for this is less than clear. Also, leading zeroes trigger octal conversion…
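Here’s a rough Python illustration of why that matters. The infer function is hypothetical and its rules are assumed for the example (they are not Miller’s actual inference rules), and 02134 is a made-up value standing in for something like a zip code that was quoted as a string in the source file.

```python
import re


def infer(value):
    """Hypothetical type inference, NOT Miller's actual rules."""
    if re.fullmatch(r"0[0-7]+", value):
        return int(value, 8)   # leading zero: read as octal
    if re.fullmatch(r"-?[0-9]+", value):
        return int(value)      # plain digits: read as decimal
    return value               # anything else stays a string


zip_code = infer("02134")      # a string in the file, but digits only
print(repr(zip_code))          # 1116

try:
    re.sub(r"^0", "", zip_code)    # string operation on a number
except TypeError as err:
    print("regex failed:", err)

# Coercing back to a string "works", but the original text is gone.
print(repr(str(zip_code)))     # '1116'
```

Once the value has been converted, coercing it back only gives you the string form of the number, not the text that was in the file.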

So that’s an enhancement request for escaping comma, CR, and LF in DKVP, plus a bug report on the busted output you get when you add the --jvquoteall option to avoid the num-ification of string literals. (It also bothers me that I had to explain the bug in a completely different way because one of the devs didn’t understand my sample CSV file…)
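As a sketch of what that enhancement could look like, here’s one possible backslash-escaping scheme in Python. Everything in it is an assumption for illustration (the escape table, the function names, the choice of backslash); it is not what Miller does or what the request specified, just evidence that comma, CR, and LF can round-trip through a DKVP-style line.

```python
import re

# Hypothetical escape table: backslash, both separators, CR, and LF.
ESCAPES = {"\\": "\\\\", ",": "\\,", "=": "\\=", "\r": "\\r", "\n": "\\n"}
UNESCAPES = {"\\": "\\", ",": ",", "=": "=", "r": "\r", "n": "\n"}


def escape(text):
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape(text):
    # Replace each backslash pair with the character it stands for.
    return re.sub(r"\\(.)", lambda m: UNESCAPES.get(m.group(1), m.group(1)),
                  text, flags=re.DOTALL)


def split_unescaped(text, sep):
    """Split on sep, skipping over backslash-escaped characters."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])   # keep the escape pair intact
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def dump(record):
    return ",".join(f"{escape(k)}={escape(v)}" for k, v in record.items())


def load(line):
    record = {}
    for field in split_unescaped(line, ","):
        pieces = split_unescaped(field, "=")
        record[unescape(pieces[0])] = unescape("=".join(pieces[1:]))
    return record


record = {"a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5"}
line = dump(record)
print(line)
print(load(line) == record)   # True
```

The encoded record stays on a single line, and the literal backslashes in d and e come back unchanged because the backslash itself is escaped.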

I see a massive refactoring in its future (“cover a wall with color-coded sticky notes, then break out the chainsaws and forklifts”). Oh, well, it’s still a useful tool when used with care.

