Shortening sentences

When writing Mixamail, I wanted tweets automatically shortened to 140 characters – but in the most readable manner.

Some steps are obvious. Removing redundant spaces, for example. And URL shortening. I use bit.ly because it has an API. I’ll switch to Goo.gl, once theirs is out.

I tried a few more strategies:

  1. Replace words with short forms. “u” for “you”, “&” for “and”, etc.
  2. Remove articles – a, an, the
  3. Remove optional punctuation – comma, semicolon, colon and quotes, in particular
  4. Replace “one” with “1”, “to” or “too” with “2”, etc. “Before” becomes “Be4”, for example
  5. Remove spaces after punctuation. So “a, b” becomes “a,b” – the space after the comma is removed
  6. Remove vowels in the middle. nglsh s lgbl wtht vwls.
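
As a rough illustration, the strategies above can be sketched as small Python functions. This is not Mixamail's actual code. The function names, the tiny word lists and the exact regular expressions are all assumptions made for the example. In particular, the vowel function drops every vowel, because that is what reproduces the "nglsh s lgbl wtht vwls" example.

```python
import re

def squeeze_spaces(text):
    # Collapse runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def short_forms(text):
    # Only the two substitutions named in the post; a real table would be larger.
    text = re.sub(r"\byou\b", "u", text, flags=re.IGNORECASE)
    return re.sub(r"\band\b", "&", text, flags=re.IGNORECASE)

def remove_articles(text):
    # Drop standalone a/an/the, then tidy the leftover spaces.
    stripped = re.sub(r"\b(?:a|an|the)\b", "", text, flags=re.IGNORECASE)
    return squeeze_spaces(stripped)

def remove_optional_punctuation(text):
    # Comma, semicolon, colon and double quotes (straight or curly).
    # Single quotes are left alone so apostrophes survive (an assumption).
    return re.sub(r"[,;:\"“”]", "", text)

def number_words(text):
    # Whole words "one", "to", "too"; "for"/"fore" anywhere inside a word,
    # which is an assumption based on "Before" turning into "Be4".
    text = re.sub(r"\bone\b", "1", text, flags=re.IGNORECASE)
    text = re.sub(r"\btoo?\b", "2", text, flags=re.IGNORECASE)
    return re.sub(r"fore?", "4", text, flags=re.IGNORECASE)

def tighten_punctuation(text):
    # Remove the spaces that follow a punctuation mark.
    return re.sub(r"([,;:.!?]) +", r"\1", text)

def drop_vowels(text):
    # Removes every vowel, matching the example sentence in the list above.
    return re.sub(r"[aeiou]", "", text, flags=re.IGNORECASE)

print(number_words("Before one goes to bed"))
print(drop_vowels("English is legible without vowels."))
```

Each function takes a string and returns a string, so they can be applied one at a time or chained in any order.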

How did they pan out? I tested these on the English sentences in the Tanaka Corpus, which has about 150,000 sentences. (No, they’re not typical tweets, but hey…) Applying each strategy independently, here is the percentage reduction in the size of the text:

2.0% Remove optional punctuation – comma, semicolon, colon and quotes
2.2% Remove spaces after punctuation. So “a, b” becomes “a,b”
3.3% Replace words with short forms. “u” for “you”, “&” for “and”, etc.
3.3% Replace “one” with “1”, “to” or “too” with “2”, etc.
6.7% Remove articles – a, an, the
18.2% Remove vowels in the middle

Touching punctuation doesn’t have much impact. There isn’t that much of it anyway. Word substitution helps, but not too much. I could’ve gone in for a wider set of substitutions, but the key is the last one: removing vowels in the middle kills a whopping 18%! That’s tough to beat with any strategy. So I decided to just stop there.
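
A per-strategy measurement like the one above could be computed along these lines. This is a minimal sketch, not the script that produced the numbers. The helper name `percent_reduction` and the two sample sentences are made up for illustration, and the real run used the Tanaka Corpus rather than this stand-in list.

```python
import re

def percent_reduction(sentences, transform):
    """Percentage of characters saved by applying `transform` to every sentence."""
    before = sum(len(s) for s in sentences)
    after = sum(len(transform(s)) for s in sentences)
    return 100.0 * (before - after) / before

def remove_articles(text):
    # Drop standalone a/an/the, then collapse the leftover spaces.
    stripped = re.sub(r"\b(?:a|an|the)\b", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", stripped).strip()

def tighten_punctuation(text):
    # Remove the spaces that follow a punctuation mark.
    return re.sub(r"([,;:.!?]) +", r"\1", text)

# Stand-in data; swap in the corpus sentences to get meaningful figures.
sample = ["The cat sat on a mat.", "Yes, I saw an owl, too."]

strategies = [
    ("Remove articles", remove_articles),
    ("Remove spaces after punctuation", tighten_punctuation),
]
for name, fn in strategies:
    print(f"{percent_reduction(sample, fn):.1f}% {name}")
```

Each strategy is measured against the original text on its own, which is why the individual percentages need not add up to the combined figure.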

The overall reduction, applying all of the above, is about 22%. So there’s a decent chance you can type in a 180-character tweet, and Mixamail.com will still tweet it intelligibly.

I had one such tweet a few days ago. I try and stay well within 140, but this one was just too long.

The Lesson: If you’re writing an app (or building anything), find a use for yourself. There’s no better motivation — and it won’t ever be a wasted effort.

That was 156 characters. It got shortened to:

Lesson If u’re writing app (or building anything) find use 4 yourself. There’s no better motivation — & it won’t ever be wasted ef4t.

Perfectly acceptable.

You may notice that Mixamail didn’t have to employ vowel shortening. It makes the most readable shortenings first, checks whether the result is within 140 characters, and tries the next one only if required.
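
That "most readable first, stop as soon as it fits" loop might look something like the sketch below. The order of the steps and the cut-down strategy functions are assumptions for illustration, since the post does not spell out Mixamail's real order or rules.

```python
import re

def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

def short_forms(text):
    text = re.sub(r"\byou\b", "u", text, flags=re.IGNORECASE)
    return re.sub(r"\band\b", "&", text, flags=re.IGNORECASE)

def remove_articles(text):
    return squeeze_spaces(re.sub(r"\b(?:a|an|the)\b", "", text, flags=re.IGNORECASE))

def drop_vowels(text):
    return squeeze_spaces(re.sub(r"[aeiou]", "", text, flags=re.IGNORECASE))

# Assumed order: gentlest change first, vowel removal as the last resort.
STEPS = [short_forms, remove_articles, drop_vowels]

def shorten(text, limit=140, steps=STEPS):
    """Apply steps in order, stopping as soon as the text fits the limit."""
    text = squeeze_spaces(text)  # always safe, so do it up front
    for step in steps:
        if len(text) <= limit:
            break
        text = step(text)
    return text  # may still exceed the limit if every step has been used

print(shorten("the cat and the dog", limit=9))
```

Because the length check happens before each step, a tweet that already fits is left alone apart from whitespace cleanup, and the harsher steps only run when the milder ones were not enough.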

If anyone has a simple, readable way of shortening Tweets further, please let me know!

4 thoughts on “Shortening sentences”

  1. 1. Are removing punctuation and removing spaces after punctuation mutually exclusive, or do you have a rule for determining which punctuation is optional?

    2. You give the example of “before” going to “be4.” It could go to “b4” with the rule that the “be” and “de” prefixes reduce to “b” and “d.”

    3. You have “you’re” going to “u’re.” You could expand “you’re” to “you are” and then reduce it to “u r.”

    4. For readability, I suggest preserving the first and last letters of words. (Compare with http://www.boingboing.net/2003/09/14/scrambled-words-are-.html) E.g., reduce “English” to “Englsh” instead of to “nglsh.”

    Finally, I think the point of the 140-char limit is to limit the scope of a tweet to a simple thought. Shortening words is cheating. But, if you’re going to cheat, I suggest an unshortener for the receiver of tweets.

    408wij

  2. I am writing something similar. Will share once done.
    But my preliminary attempts brought the character count down at least 10% further than this. For example, you got 133 chars, whereas my system gets 123. I shall implement some more NLP techniques and bring it down to an acceptable level without sacrificing the semantics.
