The Miller data-manipulation tool has a lot of potential, but the organic development process has left it with a lot of rough edges.
My first hint that all was not well internally was when I was hacking on PDQ output and realized that I had to type nest --implode --values --across-records --nested-fs space to get what I wanted. The manpage includes a shortcut for the nest verb’s other mode, so that --explode --values --across-records --nested-fs can be replaced with --evar, but there’s no matching --ivar. So I forked the project, added it, and sent a pull request.
The functionality was trivial, but the single-line usage description had to be added five times. That sent up a little red flag.
Still, it was easy to do, so I thought I’d see if it was feasible to fix one of the other things that bugs me, which is the lack of quoting and/or escape characters in its native DKVP file format (delimited key-value pairs, aka foo=1,bar=2,baz=3).
The answer turns out to be “not easily”, and I quickly learned some unpleasant things about how it handles other data-format conversions.
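To make the DKVP problem concrete, here is a minimal Python sketch of a naive reader for the format as described above. It is an illustration only, not Miller's parser: the name parse_dkvp_naive and the positional-key fallback for chunks without an "=" are assumptions of the sketch.

```python
def parse_dkvp_naive(line, ifs=",", ips="="):
    """Split a DKVP line on the field separator, then on the first pair separator.

    Hypothetical helper for illustration; not Miller's implementation.
    """
    record = {}
    for position, field in enumerate(line.split(ifs), start=1):
        key, sep, value = field.partition(ips)
        if not sep:
            # No "=" in this chunk: use its position as the key
            # (an assumption of this sketch).
            key, value = str(position), field
        record[key] = value
    return record


# Well-behaved input parses as expected.
print(parse_dkvp_naive("foo=1,bar=2,baz=3"))

# A value containing the field separator is split into a bogus extra field.
print(parse_dkvp_naive("a=1,1,b=2"))
```

With no quoting or escape character, nothing in the second line tells a reader that the comma after "a=1" belongs to the value, which is why any fix has to change the format itself rather than just the parser.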
Given the perfectly-ordinary CSV input file foo.csv containing:
a,b,c,d,e
"1,1","2
2","3","4\r4","5\n5"
I get the following results:
# DKVP: useless as expected
% mlr --icsv cat foo.csv
a=1,1,b=2
2,c=3,d=4\r4,e=5\n5
# (note: internally, fields have correct values)
% mlr --icsv --ojson cut -f a,b foo.csv
{ "a": "1,1", "b": "2\n2" }
# CSV: reasonable, but strings converted to ints
% mlr --csv cat foo.csv
a,b,c,d,e
"1,1","2
2",3,4\r4,5\n5
# Quoted CSV: better, but should be default
% mlr --csv --quote-original cat foo.csv
a,b,c,d,e
"1,1","2
2","3","4\r4","5\n5"
# JSON: reasonable, but strings converted to ints
% mlr --icsv --ojson cat foo.csv
{ "a": "1,1", "b": "2\n2", "c": 3, "d": "4\\r4", "e": "5\\n5" }
# Quoted JSON: oh, hell no
% mlr --icsv --ojson --jvquoteall cat foo.csv
{ "a": "1,1", "b": "2
2", "c": "3", "d": "4\r4", "e": "5\n5" }
Note that the automated num-ification has real consequences for data processing, since you can’t do things like regex matches or string-substitutions on a number type, and have to explicitly coerce fields back to strings; the error message for this is less than clear. Also, leading zeroes trigger octal conversion…
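A small Python sketch of why blanket type inference is lossy. The function infer_value is a hypothetical stand-in for what an inferring reader might do, including reading a leading zero as octal as noted above; it is not Miller's code, and the sample value is made up.

```python
import re


def infer_value(text):
    """Hypothetical inference: leading-zero digit strings are read as octal,
    other digit strings as decimal ints, everything else stays a string."""
    if re.fullmatch(r"0[0-7]+", text):
        return int(text, 8)
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    return text


code = infer_value("02134")
print(code)  # the string "02134" is now the integer 1116

# String operations no longer work on the inferred value...
try:
    re.sub(r"^0", "", code)
except TypeError as err:
    print("regex on inferred number fails:", err)

# ...and coercing back to a string does not recover the original text.
print(str(code))
```

Once the field has been converted, the original text is gone, so the coercion back to a string only helps if the inference happened to be harmless.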
So that’s an enhancement request for escaping comma, cr, and lf in DKVP, plus a bug on the busted output when you add the --jvquoteall option to avoid the num-ification of string literals. (And it bothers me that I had to explain the bug in a completely different way because one of the devs didn’t understand my sample CSV file…)
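For reference, the --jvquoteall output shown earlier can be checked against Python's standard json module. This is a sketch using the field values from the sample file: a raw line break inside a string is rejected by the parser, while a conforming serializer escapes it and doubles the literal backslashes.

```python
import json

# The "b" field as printed with --jvquoteall above: a literal line break
# inside the quotes (the \n in this Python literal is a real newline).
broken = '{ "a": "1,1", "b": "2\n2" }'

try:
    json.loads(broken)
    broken_is_valid = True
except json.JSONDecodeError as err:
    broken_is_valid = False
    print("rejected:", err.msg)

# The same record with every value kept as a string. Fields d and e hold a
# literal backslash followed by a letter, as in the sample CSV.
record = {"a": "1,1", "b": "2\n2", "c": "3", "d": "4\\r4", "e": "5\\n5"}
fixed = json.dumps(record)
print(fixed)

# Round-tripping the escaped form gives back the original values.
assert json.loads(fixed) == record
```

Note that the undoubled backslashes in the --jvquoteall output for d and e are a second problem: a JSON reader would decode them as control characters rather than the two-character sequences in the input.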
I see a massive refactoring in its future (“cover a wall with color-coded sticky notes, then break out the chainsaws and forklifts”). Oh, well, useful tool when used with care.