I often need to extract words out of sentences. It's one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn't as easy as you think.
Think about it. What is a word?
Something that has spaces around it? OK, let's start with the simplest way to get words: split by spaces. Consider this piece:
"I'd look at McDonald's," he said. "They sell over 3,000,000 burgers a day -- at $1.50 each." High-fat foods were the rage. For e.g., margins in fries were over 50%... and (except for R&M & Dyana [sic]) everyone was at ~30% net margin; growing at 25% too!
Splitting this by spaces (counting newlines, tabs, etc. as spaces too), we get the following:
"I'd look at McDonald's," ...
Now, some of these, like "I'd", are words. But "McDonald's" isn't. I mean, there's a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we do that, I'd becomes Id. So we need to be careful about which punctuation to remove. Let's take a closer look.
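To make the problem concrete, here is a minimal Python sketch (Python is my choice for illustration; the variable names are made up) that splits the first sentence of the sample on whitespace and then naively deletes every punctuation character:

```python
import string

text = '''"I'd look at McDonald's," he said.'''

# First pass: split on any run of whitespace (spaces, tabs, newlines).
tokens = text.split()
print(tokens)
# ['"I\'d', 'look', 'at', 'McDonald\'s,"', 'he', 'said.']

# Naive second pass: delete every ASCII punctuation character from each token.
strip_all = str.maketrans("", "", string.punctuation)
stripped = [token.translate(strip_all) for token in tokens]
print(stripped)
# ['Id', 'look', 'at', 'McDonalds', 'he', 'said']
```

The second list shows the damage: the apostrophes that belong to I'd and McDonald's are gone along with the unwanted quotes and commas.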
The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:
McDonald's, e.g., High-fat R&M ...
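A rough Python sketch of this step, assuming "brackets of any kind" means round, square, curly and angle brackets (the names CLEAR_SEPARATORS and split_clear are hypothetical):

```python
import re

# Clear separators: whitespace, ! ? ; brackets of any kind, and double-quotes.
CLEAR_SEPARATORS = re.compile(r'[\s!?;()\[\]{}<>"]+')

def split_clear(text):
    """Split on runs of clear separators, dropping empty strings."""
    return [word for word in CLEAR_SEPARATORS.split(text) if word]

sample = '"I\'d look at McDonald\'s," he said. (except for R&M & Dyana [sic])'
print(split_clear(sample))
# ["I'd", 'look', 'at', "McDonald's,", 'he', 'said.', 'except', 'for', 'R&M', '&', 'Dyana', 'sic']
```

The quotes and brackets are gone, but the trailing comma on McDonald's, the full-stop on said. and the lone & are still there.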
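The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions: the commas in 3,000,000, the hyphen in High-fat, the single-quote in McDonald's, the ampersand in R&M, the periods in e.g. and $1.50, and the slashes in a URL.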
Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use these characters, and it doesn't make sense to separate them.
So here are my current rules for splitting a sentence into words. (It's a Perl regular expression. Don't worry. Cooper's Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)
# Split by clear word separators
/ [\s \! \? \;\(\)\[\]\{\}\<\> " ]
# ... by COMMA, unless it has numbers on both sides: 3,000,000
| (?<=\D) ,
| , (?=\D)
# ... by FULL-STOP, HYPHEN, AMPERSAND, unless it has a letter or digit on both sides
| (?<=\W) [\.\-\&]
| [\.\-\&] (?=\W)
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
| (?<=\W) [']
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
| \s \/
| \/ \s
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
| \:(?!\/\/|\d)
/x;
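If you'd rather experiment outside Perl, the same pattern can be sketched in Python. This is my own port, not a tested equivalent: the names WORD_SEPARATOR and split_words are made up, and I'm assuming the pattern is meant to be handed to a split function with empty strings thrown away.

```python
import re

WORD_SEPARATOR = re.compile(r"""
    # Clear word separators
      [\s!?;()\[\]{}<>"]
    # COMMA, unless it has digits on both sides: 3,000,000
    | (?<=\D) ,
    | , (?=\D)
    # FULL-STOP, HYPHEN, AMPERSAND, unless a letter or digit is on both sides
    | (?<=\W) [.\-&]
    | [.\-&] (?=\W)
    # SINGLE-QUOTE, unless it follows a letter or digit (McDonald's, Holmes')
    | (?<=\W) '
    # SLASH, if it has whitespace on at least one side
    | \s /
    | / \s
    # COLON, unless it is followed by two slashes (a URL) or a digit (11:30am)
    | :(?!//|\d)
""", re.VERBOSE)

def split_words(text):
    """Split text on the separator pattern and drop the empty strings."""
    return [word for word in WORD_SEPARATOR.split(text) if word]

sample = '''"I'd look at McDonald's," he said. "They sell over 3,000,000 burgers a day -- at $1.50 each."'''
print(split_words(sample))
# ["I'd", 'look', 'at', "McDonald's", 'he', 'said', 'They', 'sell', 'over',
#  '3,000,000', 'burgers', 'a', 'day', 'at', '$1.50', 'each']

print(split_words("Visit http://www.s-anand.net/ at 11:30am"))
# ['Visit', 'http://www.s-anand.net', 'at', '11:30am']
```

Note that the URL loses its trailing slash in the second example, because that slash has a space after it and so counts as a separator.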
This doesn't even scratch the surface of the issue, though. Here are some issues:
And you thought it was easy!