Splitting a sentence into words

I often need to extract words out of sentences. It's one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn't as easy as you think.

Think about it. What is a word?

Something that has spaces around it? OK, let's start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!

Splitting this by spaces (consider new lines, tabs, etc as spaces too.), we get the following:

"I'd
look
at
McDonald's,"
...

Now, some of these like "I'd" are words. But "McDonald's" isn't. I mean, there's a full-stop and a double-quotes at the end. Clearly we need to remove the punctuation as well. But, if we do that, I'd becomes Id. So we need to be careful about which punctuation to remove. Let's take a closer look.

The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:

McDonald's,
e.g.,
High-fat
R&M
...

The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions:

Comma
Not inside numbers: 3,000,000.
Hyphen
Not for hyphenated words: High-fat.
Single-quote
Not for possessives: McDonald's. Not for joint words: I'd.
Ampersand
Not for abbreviations: R&M
Period
Not for abbreviations: O.K. Not for URLs: www.s-anand.net
Slash
Not for fractions: 3/4. Not for URLs: google.com/search

Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use these characters, and it doesn't make sense to separate them.

So here are my current rules for splitting a sentence into words. (It's a Perl regular expression. Don't worry. Cooper's Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)

# Split by clear word separators
/	[\s \! \? \;\(\)\[\]\{\}\<\> " ]
 
# ... by COMMA, unless it has numbers on both sides: 3,000,000
|	(?<=\D) ,
|	, (?=\D)
 
# ... by FULL-STOP, SINGLE-QUOTE, HYPHEN, AMPERSAND, unless it has a letter on both sides
|	(?<=\W) [\.\-\&]
|	[\.\-\&] (?=\W)
 
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|	(?<=\W) [']
 
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|	\s \/
|	\/ \s
 
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|	\:(?!\/\/|\d)
/x;

This doesn't even scratch the surface of the issue, though. Here are some issues:

And you thought it was easy!

Written on 26 Apr 2007

Comments


(not shared, not spammed)


S Anand, Infosys Consulting, London UK. +44 7957 440 260