Sunday, August 30 2009

Shift-JIS versus CP932

If you’re on a Unicode-based OS, reading a file that’s supposedly encoded in Shift-JIS, and getting errors about a small number of illegal characters that can’t be converted to Unicode in an otherwise perfectly reasonable file, it’s not Shift-JIS; it’s CP932.

Windows Code Page 932 includes mappings for characters like 〝 and 〟, which do not exist in standard Shift-JIS.
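
For the curious, here is a minimal Python sketch of the difference, assuming Python's built-in `shift_jis` and `cp932` codecs behave like the converters described above. The `try_decode` helper is a made-up name for illustration, not part of any library.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are illegal in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The two characters mentioned above, U+301D and U+301F, as CP932 bytes.
sample = "\u301d\u301f".encode("cp932")

# Strict Shift-JIS has no mapping for these bytes, so this should be None.
as_sjis = try_decode(sample, "shift_jis")

# CP932 knows about them, so this should give the two characters back.
as_cp932 = try_decode(sample, "cp932")

print("shift_jis:", as_sjis)
print("cp932:    ", as_cp932)
```

If a file trips the first decoder on only a few characters and sails through the second, it was almost certainly CP932 all along.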

…and that’s another hour of my life that I want back.

[Update: the luit conversion tool in X11 supports Shift-JIS only, and silently discards CP932 extensions. I’m not sure what else is available for Linux users; I just do it with a Perl one-liner.]
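
This is not the Perl one-liner, just a rough Python equivalent of the same idea: read the raw bytes, decode them as CP932, and write them back out as UTF-8. The function name and file names are placeholders, and the temporary directory only exists so the sketch runs on its own.

```python
import tempfile
from pathlib import Path


def convert_cp932_to_utf8(src: Path, dst: Path) -> None:
    """Decode `src` as CP932 and write the same text to `dst` as UTF-8."""
    text = src.read_bytes().decode("cp932")
    # Write bytes rather than text so line endings are left untouched.
    dst.write_bytes(text.encode("utf-8"))


# Self-contained demonstration using throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"    # placeholder name
    dst = Path(tmp) / "output.txt"   # placeholder name
    # A sample line that includes the CP932-only quotation marks.
    src.write_bytes("\u301dテスト\u301f\n".encode("cp932"))
    convert_cp932_to_utf8(src, dst)
    round_trip = dst.read_bytes().decode("utf-8")

print(round_trip, end="")
```

Decoding is strict by default, so a file that is not really CP932 raises an error instead of silently losing characters the way luit does.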

[Oh, and there’s yet another name for this encoding: Windows-31J. There are also several other incompatible variants of Shift-JIS that require guesswork on the part of the decoder, making the continued resistance to Unicode frankly baffling (except on not-very-smartphones, where hardware and software limits have made support for multiple encodings tricky).]