Why is it suddenly too late for CADS?


For some time now, I’ve been writing Perl scripts to work with kanji. While Perl supports transparent conversion from pretty much any character encoding, internally it’s all Unicode, and since that’s also the native encoding for a Mac, all is well.

Except that for backwards compatibility, Perl doesn’t default to interpreting all input and output as Unicode. There are still lots of scripts out there that assume all operations work in terms of 8-bit characters, and you don’t want to silently corrupt their results.

The solution created in 5.8 was the -C option, which has half a dozen or so options for deciding exactly what should be treated as Unicode. I use -CADS, which Uni-fies @ARGV, all open calls for input/output, and STDIN, STDOUT, and STDERR.

Until today. My EEE is running Fedora 9, and when I copied over my recently-rewritten dictionary tools, they refused to run, insisting that it’s ‘Too late for “-CADS” option at lookup line 1’. In Perl 5.10, you can no longer specify Unicode compatibility level on a script-by-script basis. It’s all or nothing, using the PERL_UNICODE environment variable.

That, as they say, sucks.

The claim in the release notes is:

The -C option can no longer be used on the #! line. It wasn't working there anyway, since the standard streams are already set up at this point in the execution of the perl interpreter. You can use binmode() instead to get the desired behaviour.

But this isn’t true, since I’ve been running gigs of kanji in and out of Perl for years this way, and they didn’t work without either putting -CADS into the #! line or crufting up the scripts with explicitly specified encodings. Obviously, I preferred the first solution.

In fact, just yesterday I knocked together a quick script on my Mac to locate some non-kanji characters that had crept into someone’s kanji vocabulary list, using \p{InCJKUnifiedIdeographs}, \p{InHiragana}, and \p{InKatakana}. I couldn’t figure out why it wasn’t working correctly until I looked at the #! line and realized I’d forgotten the -CADS.

What I think the developer means in the release notes is that it didn’t work on systems with a non-Unicode default locale. So they broke it everywhere.