Splitting a sentence into words

I often need to extract words from sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you might think.

Think about it. What is a word?

Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!

Splitting this by spaces (counting new lines, tabs, etc. as spaces too), we get the following:

"I'd
look
at
McDonald's,"
...
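As a rough illustration, the same whitespace split can be sketched in Python (Python rather than the Perl used later, purely for convenience; `split_on_whitespace` and `sample` are made-up names). Calling `str.split()` with no arguments treats any run of spaces, tabs or new lines as a single separator.

```python
def split_on_whitespace(text):
    """Split text on runs of whitespace (spaces, tabs, new lines)."""
    # With no arguments, str.split() also ignores leading and trailing
    # whitespace, so the result never contains empty strings.
    return text.split()


sample = '''"I'd look at McDonald's," he said.'''
print(split_on_whitespace(sample))
```

Note that the punctuation stays glued to the neighbouring letters, which is exactly the problem.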

Now, some of these, like “look” and “at”, are words. But McDonald's," isn’t. I mean, there’s a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we remove all of it, I'd becomes Id. So we need to be careful about which punctuation to remove. Let’s take a closer look.
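To see why blanket removal is too blunt, here is a small Python sketch that deletes every ASCII punctuation character from a token. The helper name `strip_all_punctuation` is hypothetical; `string.punctuation` and `str.translate` are standard library.

```python
import string

# Translation table that deletes every ASCII punctuation character.
DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def strip_all_punctuation(token):
    """Remove all ASCII punctuation from a token, wanted or not."""
    return token.translate(DELETE_PUNCTUATION)


for token in ['McDonald\'s,"', "I'd", "$1.50"]:
    print(token, "->", strip_all_punctuation(token))
```

The trailing comma and quote disappear as intended, but so do the apostrophes and the decimal point.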

The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:

McDonald's,
e.g.,
High-fat
R&M
...
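A minimal Python sketch of this step, assuming the clear separators are exactly the ones listed above (whitespace, `!`, `?`, `;`, brackets of any kind, and double-quotes). The names `CLEAR_SEPARATORS` and `split_on_clear_separators` are illustrative only.

```python
import re

# One or more clear separators in a row count as a single split point.
CLEAR_SEPARATORS = re.compile(r'[\s!?;()\[\]{}<>"]+')


def split_on_clear_separators(text):
    """Split on clear separators and drop the empty strings left at the edges."""
    return [token for token in CLEAR_SEPARATORS.split(text) if token]


print(split_on_clear_separators('''"I'd look at McDonald's," he said.'''))
print(split_on_clear_separators("(except for R&M & Dyana [sic]) everyone"))
```

The double-quotes and brackets are gone, but the comma after McDonald's and the lone & are still there.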

The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions:

  • Comma: not inside numbers, e.g. 3,000,000.
  • Hyphen: not in hyphenated words, e.g. High-fat.
  • Single-quote: not in possessives (McDonald’s) or contractions (I’d).
  • Ampersand: not in abbreviations, e.g. R&M.
  • Period: not in abbreviations (O.K.) or URLs (www.s-anand.net).
  • Slash: not in fractions (3/4) or URLs (google.com/search).

Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ contain colons (and slashes), and it doesn’t make sense to split a URL into pieces.
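What these exceptions have in common is that the same character is or isn't a separator depending on its neighbours. Regular expression lookarounds can express that. Here is a Python sketch for the comma case only. It uses negative lookarounds, which is a choice made for this sketch so that a comma at the very start or end of the text still counts as a separator. `COMMA_SEPARATOR` and `split_on_commas` are made-up names.

```python
import re

# A comma is a separator unless it has a digit on BOTH sides.
#   (?<!\d),   comma not preceded by a digit
#   ,(?!\d)    comma not followed by a digit
COMMA_SEPARATOR = re.compile(r"(?<!\d),|,(?!\d)")


def split_on_commas(text):
    """Split on separator commas only, trimming spaces around each piece."""
    pieces = (piece.strip() for piece in COMMA_SEPARATOR.split(text))
    return [piece for piece in pieces if piece]


print(split_on_commas("over 3,000,000 burgers, fries, and 1,000,"))
```

The commas inside 3,000,000 and 1,000 survive, while the others split the text.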

So here are my current rules for splitting a sentence into words. (It’s a Perl regular expression. Don’t worry. Cooper’s Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)

# Split by clear word separators
/       [\s \! \? \;\(\)\[\]\{\}\<\> " ]
 
# ... by COMMA, unless it has numbers on both sides: 3,000,000
|       (?<=\D) ,
|       , (?=\D)
 
# ... by FULL-STOP, HYPHEN, AMPERSAND, unless it has a letter or digit on both sides
|       (?<=\W) [\.\-\&]
|       [\.\-\&] (?=\W)
 
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|       (?<=\W) [']
 
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|       \s \/
|       \/ \s
 
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|       \:(?!\/\/|\d)
/x;
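For anyone who wants to experiment, below is a rough Python port of the pattern above, applied to the sample text from the start of this post. It is a sketch, not a drop-in replacement: it assumes Python's `re.VERBOSE` mode treats this pattern the same way Perl's `/x` does, and the names `WORD_SEPARATOR`, `split_words` and `SAMPLE` are made up for the example.

```python
import re

WORD_SEPARATOR = re.compile(r"""
    [\s!?;()\[\]{}<>"]      # clear word separators
  | (?<=\D) ,               # comma with a non-digit before it
  | , (?=\D)                # comma with a non-digit after it
  | (?<=\W) [.\-&]          # . - & with a non-word character before it
  | [.\-&] (?=\W)           # . - & with a non-word character after it
  | (?<=\W) '               # single-quote not following a word character
  | \s /                    # slash with whitespace before it
  | / \s                    # slash with whitespace after it
  | : (?! // | \d )         # colon, unless followed by // or a digit
""", re.VERBOSE)

SAMPLE = '''\
"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!
'''


def split_words(sentence):
    """Split on the separator pattern and drop empty strings."""
    return [word for word in WORD_SEPARATOR.split(sentence) if word]


print(split_words(SAMPLE))
```

On this sample, I'd, McDonald's, 3,000,000, $1.50, High-fat and R&M all come through intact. The abbreviation e.g. loses its final full-stop and comes out as e.g, because that full-stop is followed by a comma rather than a letter.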

This barely scratches the surface, though. Here are some open problems:

  • Lots of files split words into two at the end of a line. How do we handle that?
  • How do we handle incorrect punctuation? For instance, if someone types “done.Yet,” without leaving a space after the full-stop, I’ll think it’s an abbreviation.
  • What about other separators? Like the ± symbol or the £ symbol for instance.
  • What about other languages?!

And you thought it was easy!
