Thursday, February 24 2011

Automagic JIS/ShiftJIS/EUC to UTF8

Finally got sick of constantly dealing with the variety of encoding schemes used for Japanese text files. I still convert everything to UTF-8 before any serious use, but for just looking at a random downloaded file, I wanted to eliminate a step.

less supports input filters with the LESSOPEN environment variable, but you need something to put into it. Turns out the Perl Encode::Guess module works nicely for this, and now I no longer care if a file is JIS, ShiftJIS, CP932, EUC-JP, or UTF-8. Code below the fold.

#!/usr/bin/perl
use strict;
use Encode;
use Encode::Guess;
binmode(STDOUT,":utf8");
 
my $data;
open(In,"<:raw",$ARGV[0]) or exit 1;
while (<In>) {
	$data .= $_;
	last if $. > 1000;
}
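my $enc = guess_encoding($data,qw(jis cp932 euc-jp utf8));
# On failure or ambiguity guess_encoding returns an error string, not an object.
# Exiting with no output makes less fall back to the original file.
ref($enc) or exit 1;
my $name = $enc->name;
$name = "utf8" if $name eq "ascii";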
 
print decode($name,$data);
binmode(In,":encoding($name)");
while (<In>) {
	print;
}
close(In);
exit 0;

export LESSOPEN="| inputfilter.pl %s"

The only two interesting things about this code are that it tests the first thousand lines of the file, and that I override any attempt to treat a file as “ascii”. Both are there to work around lengthy ASCII-only headers, such as XML DTDs.
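
To see why the “ascii” override matters, here is a minimal sketch that runs the same guess against an ASCII-only header. The two-line XML header is made-up sample data, not taken from any real file, and the `ref` check is my addition for the case where the guess comes back as an error string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: an ASCII-only header, like the top of an XML file.
my $header = qq{<?xml version="1.0"?>\n<!DOCTYPE doc SYSTEM "doc.dtd">\n};

# Every suspect can decode pure ASCII, so the guess settles on "ascii".
my $enc = guess_encoding($header, qw(jis cp932 euc-jp utf8));
my $name = ref($enc) ? $enc->name : "guess failed: $enc";
print "guessed: $name\n";

# Same override as the filter: ASCII is a subset of UTF-8, so reading
# the rest of the file as UTF-8 loses nothing if it really is ASCII.
$name = "utf8" if $name eq "ascii";
print "reading the rest as: $name\n";
```

Without the override, the rest of the file would be read as strict ASCII and any Japanese text after the header would be mangled. Falling back to UTF-8 is still only a guess about what follows the sampled lines.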