My earlier list of statistically improbable phrases in Calvin and Hobbes is technically just a list of “Statistically Improbable Words”. I re-did the same analysis using phrases. Here are the top 20 statistically improbable phrases (2 – 4 words only):
baby sitter, chocolate frosted sugar bombs, comic books, doing homework, fearless spaceman spiff, good night, hamster huey, ice cream, miss wormwood, new year, peanut butter, really think, slimy girls, spaceman spiff, stuffed tiger, stupendous man, sugar bombs, susie derkins, watch tv, water balloon
That is, these are the 2-4 word phrases whose frequency in Calvin and Hobbes is substantially (at least 5 times) higher than in the other books I have.
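The comparison can be sketched roughly as below. This is a minimal illustration, not the code I actually ran: the names `ngrams` and `improbable_phrases`, the `min_count` filter, and the add-one smoothing for phrases that never appear in the other books are all assumptions made for the example. It also lumps 2-, 3- and 4-word phrases into one pool when computing frequencies, which is a simplification.

```python
from collections import Counter


def ngrams(words, n_min=2, n_max=4):
    """Yield every run of n_min..n_max consecutive words as one phrase."""
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def improbable_phrases(target, reference, ratio=5.0, min_count=2):
    """Return phrases whose relative frequency in `target` is at least
    `ratio` times their relative frequency in `reference`.

    Both arguments are lists of already-tokenized, lowercased words.
    """
    t_counts = Counter(ngrams(target))
    r_counts = Counter(ngrams(reference))
    t_total = sum(t_counts.values())
    r_total = sum(r_counts.values())

    scores = {}
    for phrase, count in t_counts.items():
        if count < min_count:
            continue  # assumption: ignore phrases seen only once
        t_freq = count / t_total
        # Assumption: add-one smoothing so unseen phrases do not divide by zero.
        r_freq = (r_counts[phrase] + 1) / (r_total + 1)
        score = t_freq / r_freq
        if score >= ratio:
            scores[phrase] = score

    # Most improbable first.
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `improbable_phrases(calvin_words, other_words)[:20]`, with both word lists being placeholders for your own tokenized texts, would give a top-20 list in the same spirit as the one above.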
While doing this, the single biggest problem that stumped me was: what is a word?
- Is “it’s” one word or two words?
- Is “six-year-old” one word or three words?
- How do I distinguish between abbreviations (g.r.o.s.s.) and full-stops without a space ( … homework.what’s a …)?
- Does a comma always split words? (It doesn’t in numbers, like “3,500”)
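One possible set of answers to these questions can be written down as a regular expression. The sketch below is only an assumption about reasonable choices, not a definitive tokenizer: it treats “it’s” and “six-year-old” as single words, keeps dotted single-letter abbreviations together, splits on a full stop between two ordinary words, and keeps commas inside numbers. The names `TOKEN` and `tokenize` are made up for the example.

```python
import re

# Alternatives are tried in order, so numbers and abbreviations win over plain words.
TOKEN = re.compile(
    r"""
      \d+(?:,\d{3})*(?:\.\d+)?        # numbers, with optional thousands commas: 3,500
    | (?:[a-z]\.){2,}                 # dotted single letters: g.r.o.s.s.
    | [a-z]+(?:['\u2019-][a-z]+)*     # words with internal apostrophes or hyphens
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize(text):
    """Split text into lowercase tokens using the rules in TOKEN."""
    return [m.group(0).lower() for m in TOKEN.finditer(text)]
```

Because a full stop only stays inside a token when it follows a single letter (or sits between digits), “homework.what’s” comes out as two words while “g.r.o.s.s.” stays whole. A different corpus may well need different rules.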
The other problem is that phrases with more words are more improbable. Right now, if a phrase occurs 5 times more frequently in Calvin and Hobbes than in my other books, I include it. But three-word phrases rarely occur that often, and four-word phrases even less so. Maybe I should have a lower cutoff for longer phrases.
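A length-dependent cutoff might look something like this. Only the 5x figure comes from the analysis above; the lower cutoffs for longer phrases are illustrative guesses, and `CUTOFFS` and `is_improbable` are hypothetical names.

```python
# Minimum frequency ratio required, keyed by number of words in the phrase.
# Only the 5.0 is from the analysis; 3.0 and 2.0 are illustrative guesses.
CUTOFFS = {2: 5.0, 3: 3.0, 4: 2.0}


def is_improbable(phrase, ratio, cutoffs=CUTOFFS):
    """Return True if `ratio` meets the cutoff for this phrase's word count."""
    n = len(phrase.split())
    # Phrases outside the 2-4 word range are never included.
    return n in cutoffs and ratio >= cutoffs[n]
```

With these numbers, a ratio of 4 would not be enough for a two-word phrase but would be for a three-word one. The right values would have to be tuned by looking at what each setting lets through.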
Anyway, this analysis is a crude first approximation. Clearly Amazon’s gotten much further with their system.