Splitting a sentence into words
I often need to extract words from sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you might think.

Think about it. What is a word? Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said. "They sell over 3,000,000 burgers a day -- at $1.50 each." High-fat foods were the rage. For e.g., margins in fries were over 50%... and (except for R&M & Dyana [sic]) everyone was at ~30% net margin; growing at 25% too!

Splitting this by spaces (treating newlines, tabs, etc. as spaces too), we get the following: ...
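The split-by-spaces step can be sketched in a few lines. This is only an illustration, assuming Python (the post doesn't say which language was used); the names `SAMPLE` and `split_on_whitespace` are made up for this example. It relies on `str.split()` with no arguments, which splits on any run of whitespace, so newlines and tabs count as spaces too.

```python
# The sample piece from above, wrapped across lines on purpose:
# newlines should behave exactly like spaces when splitting.
SAMPLE = '''"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries were
over 50%... and (except for R&M & Dyana [sic]) everyone was at
~30% net margin; growing at 25% too!'''


def split_on_whitespace(text):
    """Return the whitespace-separated chunks of text.

    str.split() with no arguments treats any run of spaces, tabs,
    or newlines as a single separator and drops empty strings.
    """
    return text.split()


if __name__ == "__main__":
    # Print one chunk per line so stray punctuation is easy to spot.
    for token in split_on_whitespace(SAMPLE):
        print(token)
```

Nothing here strips punctuation: quotes, commas, and brackets stay glued to whatever they touch, and a standalone `--` or `&` comes out as a "word" of its own.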