I’m pretty sure “futanari” is not Dutch. Also “gmail”, “iphone”, “http”, “cialis”, and “jackalope”. “bewerkstelligen”, on the other hand, fits right in.
For my new random word generator, I’ve been supplementing and replacing the small language samples from Chris Pound’s site. The old ones do a pretty good job, but the new generator has built-in caching of the parsed source files, so it’s possible to use much larger samples, which gives a broader range of language-flavored words. 5,000 distinct words seems to be the perfect size for most languages.
Project Gutenberg has material in a few non-English languages, and it’s easy to grab an entire chapter of a book. Early Indo-European Online has some terrific samples, most of them easily extracted. But what looked like a gold mine was Deltacorpus: 107 different languages, all extracted with the same software and tagged for part-of-speech. And the range of languages is terrific: Korean, Yiddish, Serbian, Afrikaans, Frisian, Low Saxon, Swedish, Catalan, Haitian Creole, Irish, Kurdish, Nepali, Uzbek, Mongol, etc, each with around 900,000 terms. The PoS-tagging even made it easy to strip out things that were not native words, and generate a decent-sized random subset.
Then I tried them out in the generator, and started to see anomolies: “jpg” is not generally found in a natural language, getting a plausible Japanese name out of a Finnish data set is highly unlikely, etc. There were a number of oddballs like this, even in languages that I had to run through a romanizer, like Korean and Yiddish.
So I opened up the corpus files and started searching through them, and found a lot of things like this:
437 바로가기 PROPN 438 = PUNCT 439 http VERB 440 : PUNCT 441 / PUNCT 442 / PUNCT 443 www NOUN 444 . PUNCT 445 shoop NOUN 446 . PUNCT 447 co NOUN 448 . PUNCT 449 kr INTJ 450 / PUNCT 451 shop PROPN 452 / PUNCT 453 goods NOUN 454 / PUNCT 455 goods_list NOUN 456 . PUNCT 457 php NOUN 458 ? DET 459 category NOUN 460 = PUNCT 461 001014 NUM 1 우리의 ADP 2 예제에서 NOUN 3 content X 4 div에 NOUN 5 float VERB 6 : PUNCT 7 left VERB 8 ; PUNCT
Their corpus-extraction script was treating HTML as plain text, and the pages they chose to scan included gaming forums and technology review sites. Eventually I might knock together a script to decruft the original sources, but for now I’m just excluding the obvious ones and skimming through the output looking for words that don’t belong. This is generally pretty easy, because most of them are obvious in a sorted list:
Missing some of them isn’t a big problem, because the generator uses weighted-random selection for each token, and if a start token only appears once, it won’t be selected often, and there are few possible transitions. Still worth cleaning up, since they become more likely when you mix multiple language sources together.
Markdown formatting and simple HTML accepted.
Sometimes you have to double-click to enter text in the form (interaction between Isso and Bootstrap?). Tab is more reliable.