'Fun' with Miller

Thu 8/15/19 12:05am

Tools

The Miller data-manipulation tool has a lot of potential, but the organic development process has left it with a lot of rough edges.

My first hint that all was not well internally came while I was hacking on PDQ output and realized that I had to type nest --implode --values --across-records --nested-fs space to get what I wanted. The manpage includes a shortcut for the nest verb’s other mode, so that --explode --values --across-records --nested-fs can be replaced with --evar, but there was no matching --ivar. So I forked the project, added it, and sent a pull request.
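
For readers who haven't used the nest verb, here is a minimal Python sketch of what "implode values across records" amounts to: records that agree on every other field are merged, and the values of the chosen field are joined with a separator. The function name implode_values and the sample records are made up for illustration, and the sketch assumes every record contains the field; it is not Miller's implementation.

```python
def implode_values(records, field, sep=" "):
    """Merge records that agree on every field except `field`.

    The values of `field` are joined with `sep`. Hypothetical helper that
    only illustrates the idea; assumes each record contains `field`.
    """
    first_seen = {}  # group key -> copy of the first record in that group
    values = {}      # group key -> values of `field`, in input order
    for record in records:
        key = tuple((k, v) for k, v in record.items() if k != field)
        first_seen.setdefault(key, dict(record))
        values.setdefault(key, []).append(record[field])
    # Replacing the value in a copy keeps the field in its original position.
    return [{**rec, field: sep.join(values[key])}
            for key, rec in first_seen.items()]


if __name__ == "__main__":
    rows = [
        {"x": "a", "y": "1"},
        {"x": "b", "y": "1"},
        {"x": "c", "y": "2"},
    ]
    for row in implode_values(rows, "x"):
        print(row)
```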

The functionality was trivial, but the single-line usage description had to be added five times. That sent up a little red flag.

Still, it was easy to do, so I thought I’d see if it was feasible to fix one of the other things that bug me, which is the lack of quoting and/or escape characters in Miller’s native DKVP file format (delimited key-value pairs, aka foo=1,bar=2,baz=3).

The answer turns out to be “not easily”, and I quickly learned some unpleasant things about how it handles other data-format conversions.
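
To see why the missing quoting matters, here is a minimal Python sketch of a naive DKVP writer and reader, assuming plain splitting on "," and "=" with no escaping. The function names are hypothetical, and keying a field without "=" by its position is an assumption of this sketch, not a statement about Miller's internals.

```python
def write_dkvp(record, field_sep=",", pair_sep="="):
    """Serialize a record as key=value pairs with no quoting or escaping."""
    return field_sep.join(f"{k}{pair_sep}{v}" for k, v in record.items())


def read_dkvp(line, field_sep=",", pair_sep="="):
    """Parse a DKVP line by plain splitting (hypothetical naive reader)."""
    record = {}
    for position, field in enumerate(line.split(field_sep), start=1):
        key, sep, value = field.partition(pair_sep)
        if sep:
            record[key] = value
        else:
            # No pair separator found: key the field by its position
            # (an assumption made for this sketch).
            record[str(position)] = field
    return record


if __name__ == "__main__":
    original = {"a": "1,1", "b": "2"}
    line = write_dkvp(original)
    print(line)             # the comma inside the value is indistinguishable
    print(read_dkvp(line))  # so the round trip does not return `original`
```

Once a value contains the field separator, nothing on the line records where the value ends, so no reader can recover the original record.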

Given the perfectly-ordinary CSV input file foo.csv containing:

a,b,c,d,e
"1,1","2
2","3","4\r4","5\n5"

I get the following results:

# DKVP: useless as expected
% mlr --icsv cat foo.csv
a=1,1,b=2
2,c=3,d=4\r4,e=5\n5

# (note: internally, fields have correct values)
% mlr --icsv --ojson cut -f a,b foo.csv
{ "a": "1,1", "b": "2\n2" }

# CSV: reasonable, but strings converted to ints
% mlr --csv cat foo.csv
a,b,c,d,e
"1,1","2
2",3,4\r4,5\n5

# Quoted CSV: better, but should be default
% mlr --csv --quote-original cat foo.csv
a,b,c,d,e
"1,1","2
2","3","4\r4","5\n5"

# JSON: reasonable, but strings converted to ints
% mlr --icsv --ojson cat foo.csv
{ "a": "1,1", "b": "2\n2", "c": 3, "d": "4\\r4", "e": "5\\n5" }

# Quoted JSON: oh, hell no
% mlr --icsv --ojson --jvquoteall cat foo.csv
{ "a": "1,1", "b": "2
2", "c": "3", "d": "4\r4", "e": "5\n5" }
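
For comparison, here is a minimal sketch of the same CSV-to-JSON conversion using only Python's standard csv and json modules. It shows the output one would expect from an all-strings conversion: the real line break in field b is escaped, and the literal backslashes in d and e are doubled. The FOO_CSV constant reproduces the sample file, and the function name is made up for illustration.

```python
import csv
import io
import json

# The sample file: field b holds a real line break, while d and e hold a
# literal backslash followed by a letter.
FOO_CSV = 'a,b,c,d,e\n"1,1","2\n2","3","4\\r4","5\\n5"\n'


def csv_to_json_lines(text):
    """Return one JSON object string per CSV record, every value a string."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [json.dumps(row) for row in reader]


if __name__ == "__main__":
    for line in csv_to_json_lines(FOO_CSV):
        print(line)
```

Each record comes out as a single line that parses back to the original field values, which is exactly what the --jvquoteall output above fails to do.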

Note that the automated num-ification has real consequences for data processing, since you can’t do things like regex matches or string substitutions on a number type, and you have to explicitly coerce fields back to strings; the error message for this is less than clear. Also, leading zeroes trigger octal conversion…
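
The same trap is easy to reproduce outside Miller. The following Python sketch uses a made-up infer function with deliberately simplified rules (leading zero means octal, otherwise try a decimal integer, otherwise keep the string); these rules are an assumption for illustration, not Miller's actual inference logic.

```python
import re


def infer(value):
    """Hypothetical type inference: octal, then decimal int, else string."""
    if re.fullmatch(r"0[0-7]+", value):
        return int(value, 8)  # leading zero is read as octal
    try:
        return int(value)
    except ValueError:
        return value          # anything else stays a string


if __name__ == "__main__":
    zip_code = infer("010")
    print(zip_code)           # no longer the text that was in the file

    try:
        re.sub(r"1", "x", infer("3"))
    except TypeError as err:
        # String operations reject the inferred number outright.
        print("regex on a number fails:", err)

    # Coercing back to a string works, but the original spelling is gone.
    print(re.sub(r"8", "x", str(zip_code)))
```

Coercing back with str() only recovers the number's canonical spelling, so an identifier like 010 cannot be restored once it has been inferred.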

So that’s an enhancement request for escaping comma, cr, and lf in DKVP, plus a bug on the busted output when you add the --jvquoteall option to avoid the num-ification of string literals. (And it bothers me that I had to explain the bug in a completely different way because one of the devs didn’t understand my sample CSV file…)
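
As a rough idea of what such an enhancement could look like, here is a minimal Python sketch of backslash escaping for DKVP values. The escape table and function names are hypothetical and are not taken from Miller or from the actual enhancement request; the sketch also assumes keys never contain special characters.

```python
# Hypothetical escape scheme: backslash, comma, CR, and LF inside values.
ESCAPES = {"\\": "\\\\", ",": "\\,", "\r": "\\r", "\n": "\\n"}
UNESCAPES = {"\\": "\\", ",": ",", "r": "\r", "n": "\n"}


def escape_value(value):
    """Replace each special character with its backslash escape."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def write_escaped_dkvp(record):
    """Serialize a record as one line of key=value pairs with escaped values."""
    return ",".join(f"{k}={escape_value(v)}" for k, v in record.items())


def read_escaped_dkvp(line):
    """Split on unescaped commas, decoding escapes along the way."""
    fields = [[]]
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown escapes are kept verbatim (a choice made for this sketch).
            fields[-1].append(UNESCAPES.get(nxt, "\\" + nxt))
        elif ch == ",":
            fields.append([])
        else:
            fields[-1].append(ch)
    record = {}
    for field in fields:
        key, _, value = "".join(field).partition("=")
        record[key] = value
    return record


if __name__ == "__main__":
    rec = {"a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5"}
    line = write_escaped_dkvp(rec)
    print(line)
    print(read_escaped_dkvp(line) == rec)
```

Escaping the backslash itself is what keeps a literal backslash-r in the data distinct from an escaped carriage return.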

I see a massive refactoring in its future (“cover a wall with color-coded sticky notes, then break out the chainsaws and forklifts”). Oh well, it’s still a useful tool when used with care.
