Corpus Fun

I’m pretty sure “futanari” is not Dutch. Also “gmail”, “iphone”, “http”, “cialis”, and “jackalope”. “bewerkstelligen”, on the other hand, fits right in.

For my new random word generator, I’ve been supplementing and replacing the small language samples from Chris Pound’s site. The old ones do a pretty good job, but the new generator has built-in caching of the parsed source files, so it’s possible to use much larger samples, which gives a broader range of language-flavored words. 5,000 distinct words seems to be the perfect size for most languages.

Project Gutenberg has material in a few non-English languages, and it’s easy to grab an entire chapter of a book. Early Indo-European Online has some terrific samples, most of them easily extracted. But what looked like a gold mine was Deltacorpus: 107 different languages, all extracted with the same software and tagged for part-of-speech. And the range of languages is terrific: Korean, Yiddish, Serbian, Afrikaans, Frisian, Low Saxon, Swedish, Catalan, Haitian Creole, Irish, Kurdish, Nepali, Uzbek, Mongol, etc., each with around 900,000 terms. The PoS-tagging even made it easy to strip out things that were not native words and generate a decent-sized random subset.
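For the curious, here is roughly what that PoS-based filtering could look like. This is a minimal sketch, not the script actually used: it assumes each corpus line is whitespace-separated as index, token, tag (as in the excerpts further down), and the names `SKIP_TAGS`, `distinct_words`, and `random_subset` are made up for illustration. Which tags count as "not native words" is also a guess.

```python
import random

# Assumption: tags treated as "not an ordinary native word".
# The real exclusion list is not given in the post.
SKIP_TAGS = {"PUNCT", "NUM", "SYM", "X", "PROPN"}

def distinct_words(lines, skip_tags=SKIP_TAGS):
    """Collect distinct lowercase tokens from 'index token TAG' lines."""
    words = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or malformed record
        token, tag = fields[1], fields[2]
        if tag in skip_tags:
            continue
        if not token.isalpha():
            continue  # drops things like goods_list or 001014
        words.add(token.lower())
    return words

def random_subset(words, size=5000, seed=None):
    """Return up to `size` distinct words, chosen uniformly at random."""
    rng = random.Random(seed)
    pool = sorted(words)  # sort first so a fixed seed is reproducible
    return rng.sample(pool, min(size, len(pool)))
```

Note that a filter like this only trusts the tags, so a token such as "http" tagged VERB sails right through.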

Then I tried them out in the generator, and started to see anomalies: “jpg” is not generally found in a natural language, getting a plausible Japanese name out of a Finnish data set is highly unlikely, etc. There were a number of oddballs like this, even in languages that I had to run through a romanizer, like Korean and Yiddish.

So I opened up the corpus files and started searching through them, and found a lot of things like this:

437 바로가기    PROPN   
438 =   PUNCT   
439 http    VERB    
440 :   PUNCT   
441 /   PUNCT   
442 /   PUNCT   
443 www NOUN    
444 .   PUNCT   
445 shoop   NOUN    
446 .   PUNCT   
447 co  NOUN    
448 .   PUNCT   
449 kr  INTJ    
450 /   PUNCT   
451 shop    PROPN   
452 /   PUNCT   
453 goods   NOUN    
454 /   PUNCT   
455 goods_list  NOUN    
456 .   PUNCT   
457 php NOUN    
458 ?   DET 
459 category    NOUN    
460 =   PUNCT   
461 001014  NUM 

1   우리의  ADP 
2   예제에서    NOUN    
3   content X   
4   div에   NOUN    
5   float   VERB    
6   :   PUNCT   
7   left    VERB    
8   ;   PUNCT   

Their corpus-extraction script was treating HTML as plain text, and the pages they chose to scan included gaming forums and technology review sites. Eventually I might knock together a script to decruft the original sources, but for now I’m just excluding the obvious ones and skimming through the output looking for words that don’t belong. This is generally pretty easy, because most of them are obvious in a sorted list.


Missing some of them isn’t a big problem, because the generator uses weighted-random selection for each token: a start token that only appears once won’t be selected often, and it has few possible transitions. It’s still worth cleaning up, since they become more likely when you mix multiple language sources together.
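To make the weighting argument concrete, here is a toy version of that kind of generator. It is only a sketch under loud assumptions: the real generator's tokenization isn't described here, so this one treats each single character as a token and only looks one token back, and `build_tables`, `weighted_pick`, `generate`, and the `END` marker are all hypothetical names.

```python
import random
from collections import Counter, defaultdict

END = ""  # hypothetical end-of-word marker

def build_tables(words):
    """Count start tokens and token-to-token transitions."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for word in words:
        if not word:
            continue
        starts[word[0]] += 1
        for cur, nxt in zip(word, word[1:]):
            transitions[cur][nxt] += 1
        transitions[word[-1]][END] += 1
    return starts, transitions

def weighted_pick(counter, rng):
    """Pick one key with probability proportional to its count."""
    tokens = list(counter)
    weights = list(counter.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(starts, transitions, rng, max_len=12):
    """Walk the tables from a weighted-random start token."""
    token = weighted_pick(starts, rng)
    out = [token]
    while len(out) < max_len:
        token = weighted_pick(transitions[token], rng)
        if token == END:
            break
        out.append(token)
    return "".join(out)
```

With a sample of 99 copies of "aba" and a single stray "jpg", the start table gives "j" a weight of 1 against 99 for "a", so the stray word rarely gets a chance to begin an output. Merge in another source that also contains "jpg" and that weight goes up, which is the mixing problem in miniature.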

Ai’s Eyes

I think this is the first picture I’ve seen of Ai Shinozaki where the makeup artist didn’t try to de-emphasize her natural Asian features.

NSFW because have you seen Ai Shinozaki?


Who Needs Villains

So far, this season of Doctor Who has been… “unimpressive”. The set design mostly lacks imagination and scope. The stories feel like they were cribbed from better, or at least more ambitious, episodes in previous seasons. The Doctor’s monologs are being written by Captain Obvious and The Campus Socialist. The new companion got a decent intro, but has done little of note since. And as for the monster of the week, well, so far we’ve had:

  1. a puddle. (but an alien puddle)
  2. a swarm of nano-Legos.
  3. a strawman capitalist.
  4. a mama’s boy.
  5. strawman capitalism.

The ideas and characters in each episode are undeveloped. There’s no supporting cast to speak of, just the Doctor, Bill, and Nardole, and Nardole spends most of his time delivering ominous foreshadowing with the delicate grace of a firehose.

Dear Microsoft,

The “new, improved, consistent-across-all-devices” 4-column layout in OneNote sucks shit through a straw. Having full-height columns for notebooks, sections, and pages taking up half the window is an incredible waste of space, and a horrible navigation method. It sucks on a tablet. It sucks worse on a laptop.

Dear Amazon,

I don’t like you in that way

Cheesecake Champloo 5: Duos

Another trip through the leftover folder, this time selecting pictures with exactly two girls. For obvious reasons, some of these are NSFW, and hidden at the end.


Dear Doctor Who…

Now that I’m caught up to episode 10.5, I have only one question:

Will the entire season be written by interns copying scenes from their favorite episodes?

Mai Nishida, Buckeye


“Need a clue, take a clue,
 got a clue, leave a clue”