Tuesday, March 23 2010

Japanese in email

Just some random info:

  • If the email doesn’t use MIME headers at all, it’s almost certainly encoded in Shift-JIS (or, more likely, the Microsoft variant CP-932, also known as Windows-31J).
  • If it has a proper Content-Type header with RFC 2047 encoding for the header lines, it’s almost certainly encoded in ISO-2022-JP, and your mailer should figure this out correctly.
  • …unless their software encoded the header in Shift-JIS but claimed it was ISO-2022-JP, which I’ve seen a few times. In one case, the Subject and body were right, but the From header was in Shift-JIS.
  • Unicode is still quite rare in Japan, and cellphones in particular will ignore the headers and simply decode the message as if it were ISO-2022-JP. If someone tells you that your email is mojibake (文字化け, literally “character corruption”), force your mailer to send in ISO-2022-JP. For Apple’s Mail.app, the command is:
       defaults write com.apple.mail NSPreferredMailCharset ISO-2022-JP
  • Oddly enough, most phones will apparently handle raw Shift-JIS; at least, spammers feel comfortable using it.
  • If the body encoding is specified but the headers are not RFC 2047 encoded, the headers are usually in the same encoding as the body, but not always. Mail.app reliably guesses wrong for these.
  • Messages are generally formatted for a fixed-width font where most characters are double-width, not just the kanji and kana. Graphics characters and emoticons are quite common, and many messages will look like crap in the wrong sort of font. This also applies to blog entries.
  • The EUC-JP encoding basically doesn’t exist outside of legacy Unix code and old data files.
  • Japanese spam tends to be polite, grammatically correct, and riddled with loanwords.
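
If you end up decoding this stuff yourself, the guessing game above can be roughed out in a few lines of Python. This is only a sketch under my own assumptions, not what any particular mailer does: the names guess_decode, decode_mail_header and FALLBACKS are made up for illustration, and the fallback order (declared charset, then ISO-2022-JP, then UTF-8, then CP932) is a guess based on the observations in the list. It leans on the fact that ISO-2022-JP is a 7-bit encoding, so Python's codec rejects raw Shift-JIS bytes and the code can fall through to the next candidate.

```python
import base64
from email.header import decode_header

# Hypothetical fallback order. The strict encodings go first because
# CP932 accepts a wide range of byte sequences and would mask errors.
FALLBACKS = ("iso-2022-jp", "utf-8", "cp932")


def guess_decode(raw, declared=None):
    """Decode raw bytes, trying the declared charset first, then FALLBACKS."""
    candidates = ([declared] if declared else []) + list(FALLBACKS)
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know.
            continue
    # Last resort: show something rather than fail outright.
    return raw.decode("cp932", errors="replace")


def decode_mail_header(value):
    """Decode an RFC 2047 header value without trusting its charset label."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, str):
            # decode_header returns str when there are no encoded words.
            parts.append(chunk)
        else:
            parts.append(guess_decode(chunk, charset))
    return "".join(parts)


# A header that claims ISO-2022-JP but actually carries Shift-JIS bytes,
# like the mislabeled ones described in the list.
sjis_bytes = "日本語".encode("cp932")
mislabeled = "=?ISO-2022-JP?B?" + base64.b64encode(sjis_bytes).decode("ascii") + "?="
subject = decode_mail_header(mislabeled)
```

Short strings can decode cleanly under more than one encoding, so treat the result as a best guess and not as proof of what the sender meant.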