Over the last 5 years, I’ve transcribed the Calvin and Hobbes comics, and tagged them manually by theme. But can I generate themes automatically?
One way is to use Amazon’s “statistically improbable phrases”. These are words that occur often in one book but rarely in others, so they give you a good feel for the topics the book is about.
Here’s how I did it:
- Transcribe Calvin & Hobbes. This is 99% of the work.
- Make a C&H word list. Just join all the words in Calvin and Hobbes. (Be careful about punctuation, and colloquialisms like “dunno”, “leggo”, etc.)
- Get an English corpus. That is, get a large sample of ordinary English text. I have some e-books, and I picked 23 megabytes’ worth of these as my corpus.
- Compare the word frequency in C&H with the corpus. That is, for each word, compare its percentage of all words in Calvin and Hobbes with its percentage of all words in the corpus.
- Display those with significantly higher frequency in C&H.
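The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code used for this post: the names `tokenize` and `improbable_words`, the add-one smoothing for words missing from the corpus, and the `min_count` cut-off are all assumptions. The default `min_ratio` of 10 mirrors the "10 times as often" threshold used for the word list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words.

    Internal apostrophes and hyphens are kept, so "don't" and
    "mild-mannered" stay whole. Curly apostrophes are normalised first.
    """
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text)


def improbable_words(source_text, corpus_text, min_ratio=10.0, min_count=2):
    """Return {word: ratio} for words far more frequent in source than corpus.

    ratio = (share of the word in source) / (share of the word in corpus).
    min_count (an assumed cut-off) drops words seen only once in the source.
    """
    source = Counter(tokenize(source_text))
    corpus = Counter(tokenize(corpus_text))
    source_total = sum(source.values())
    corpus_total = sum(corpus.values())

    results = {}
    for word, count in source.items():
        if count < min_count:
            continue
        source_freq = count / source_total
        # Add-one smoothing (an assumption): avoids dividing by zero
        # for words that never appear in the corpus.
        corpus_freq = (corpus[word] + 1) / (corpus_total + 1)
        ratio = source_freq / corpus_freq
        if ratio >= min_ratio:
            results[word] = ratio

    # Most improbable words first.
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))
```

With real data, `source_text` would be the full transcript and `corpus_text` the concatenated e-books. The result could then drive a tag cloud, for example by sizing each word by its count and shading it by its ratio.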
The list below has common Calvin & Hobbes words that occur 10 times as often as in normal text. It’s incredible how closely they relate to most of the themes.
(Big words occur more often. Dark words are more improbable.)
allowance assignment babe balloon bat bath beanie bedtime bee beep bet bike blaster boring bug bus butter calvin calvinball cartoon cent cereal cheat chew chocolate click comic cookie crunch dad dame derkins dictator-for-life dinosaur disgusting doll doomed dumb duplicate earthling explorer fang fearless ferocious flip flush frog frosted fun fuzzy genius goggle goodness goon grade gross grown-up gum hack hamburger hamster hate hero hideous hobbes homework huey insect invent jelly jerk jurassic kid leaf loot martian math mild-mannered mom monster moron motto munch mushy nickel oatmeal ouija pant peanut perspective pit playground poll porridge poster quiz recess rosalyn rotten rub sandwich santa scary sculpture scum shovel
sissy sitter sled slimy slug slushball sniff snow snowball snowman soak spaceman spiff splash spoil sport squirt steer sting stuffed stupendous sugar susie tickle tiger toy transmogrifier transmogrify tub tuna twinky tyrannosaur underwear vacation weird wham whiff worm wormwood
Summary: “Statistically improbable phrases” are a powerful tool for text analysis. You can apply them to any content and figure out what topics it talks about.
Update: Technically, these are “Statistically improbable WORDS”, not phrases. So I re-did this analysis using phrases instead of words.
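For illustration only, here is one way the same frequency comparison could be extended from single words to phrases. It is a hedged sketch, not the method used in the follow-up analysis: the names `ngrams` and `improbable_phrases`, the choice of two-word phrases, the add-one smoothing, and the thresholds are all assumptions.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the n-word phrases in the text, in order of appearance.

    Punctuation is discarded, so phrases may span sentence boundaries.
    """
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*",
                       text.lower().replace("\u2019", "'"))
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(source_text, corpus_text, n=2,
                       min_ratio=10.0, min_count=2):
    """Return phrases whose share in source is >= min_ratio times their
    share in the corpus, most improbable first."""
    source = Counter(ngrams(source_text, n))
    corpus = Counter(ngrams(corpus_text, n))
    source_total = sum(source.values())
    corpus_total = sum(corpus.values())

    scored = {}
    for phrase, count in source.items():
        if count < min_count:
            continue
        # Add-one smoothing (an assumption) handles phrases that are
        # absent from the corpus.
        ratio = (count / source_total) / ((corpus[phrase] + 1) / (corpus_total + 1))
        if ratio >= min_ratio:
            scored[phrase] = ratio
    return sorted(scored, key=scored.get, reverse=True)
```

Phrases are much sparser than single words, so a larger corpus or a lower `min_ratio` may be needed before the output becomes meaningful.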