Finally got sick of constantly dealing with the variety of encoding schemes used for Japanese text files. I still convert everything to UTF-8 before any serious use, but for just looking at a random downloaded file, I wanted to eliminate a step.
less supports input filters with the LESSOPEN environment variable, but you need something to put into it. Turns out the Perl Encode::Guess module works nicely for this, and now I no longer care if a file is JIS, ShiftJIS, CP932, EUC-JP, or UTF-8. Code below the fold.
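#!/usr/bin/perl
use strict;
use Encode;
use Encode::Guess;

binmode(STDOUT, ":utf8");

# Read the first thousand or so lines as raw bytes to guess from.
my $data;
open(In, "<:raw", $ARGV[0]) or exit 1;
while (<In>) {
    $data .= $_;
    last if $. > 1000;
}

# guess_encoding returns an error string instead of an object when it
# cannot decide; give up without printing anything in that case.
my $guess = guess_encoding($data, qw(jis cp932 euc-jp utf8));
ref($guess) or exit 1;
my $name = $guess->name;
$name = "utf8" if $name eq "ascii";
print decode($name, $data);

# Decode the rest of the file with the guessed encoding.
binmode(In, ":encoding($name)");
while (<In>) {
    print;
}
close(In);
exit 0;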
export LESSOPEN="| inputfilter.pl %s"
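The only two interesting things about this code are that it tests the first thousand lines of the file, and that I override any attempt to treat a file as “ascii”. Both are to work around lengthy ASCII-only headers, such as XML DTDs, etc. A guess made from such a header alone comes back as “ascii”, and decoding the rest of the file that way would mangle any Japanese text further down. Since ASCII is a subset of UTF-8, treating “ascii” as “utf8” is harmless for files that really are pure ASCII.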
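Here is a minimal sketch of why the “ascii” override matters, assuming a UTF-8 body after an ASCII-only header. The sample strings and variable names are made up for illustration and are not part of the filter itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

binmode(STDOUT, ":utf8");

# Made-up sample data: an ASCII-only header, then a UTF-8 body.
my $header = qq{<?xml version="1.0" encoding="UTF-8"?>\n};
my $body   = encode("UTF-8", "\x{65E5}\x{672C}\x{8A9E}\n");  # three kanji

# Guess from the header alone, as happens when every sampled line is ASCII.
my $guess = guess_encoding($header, qw(jis cp932 euc-jp utf8));
my $name  = ref($guess) ? $guess->name : "undetermined";
print "guessed from header: $name\n";

# Decode the body both ways and count U+FFFD replacement characters.
my $as_ascii  = decode("ascii", $body);
my $as_utf8   = decode("utf8",  $body);
my $bad_ascii = () = $as_ascii =~ /\x{FFFD}/g;
my $bad_utf8  = () = $as_utf8  =~ /\x{FFFD}/g;
print "as ascii: $bad_ascii replacement characters\n";
print "as utf8:  $bad_utf8 replacement characters: $as_utf8";
```

The header-only guess should come back as “ascii”, and decoding the body under that name turns every high byte into a replacement character, while “utf8” leaves the text intact. The override only rescues UTF-8 bodies: a file with more than a thousand ASCII lines followed by EUC-JP or CP932 text would still come out wrong.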