Tamil spelling corrector

The Internet has a lot of Tamil song lyrics written in English letters. Finding them is not easy, though. There are two problems. The lyrics are fragmented: there’s no single site to search them. And Google doesn’t help. It doesn’t know that alaipaayudhe, alaipaayuthe and alaipayuthey are the same word.

This is similar to the problem I faced with Tamil audio. The solution, as before, is to make an index, and provide a search interface that is tolerant of English spellings of Tamil words. But I want to go a step further. Is it possible to display these lyrics in Tamil?

My Tamil Transliterator does a lousy job of this. Though it’s tolerant of mistakes, it’s ignorant of spelling and grammar. So,

kanda nal muthalai kathal peruguthadi

… becomes…

kanda nal muthalai kathal peruguthadi

… when in fact we want…

kanda naaL muthalaay kaathal peruguthadi

(If you’re viewing this on an RSS reader, check my post to see what I mean.)

I need an automated Tamil spelling corrector. Having read Peter Norvig’s “How to Write a Spelling Corrector” and actually understood it, I gave spelling correction in Tamil a shot.

Norvig’s approach, in simple terms, is this:

  1. Get a dictionary
  2. Tweak the word you want to check (add a letter, delete one, swap 2 letters, etc.)
  3. Pick all tweaks that get you to a valid word in the dictionary
  4. Choose the most likely correction (most common correct word, adjusted for the probability of that mistake happening)

Making a dictionary is easy. I just need lots of Tamil literature, and then I pick out words from it. For now, I’m just using the texts in Project Madurai.
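Here is a minimal sketch of that word-counting step. It assumes the corpus has already been converted to the same English spelling scheme used in this post, and it keeps capital letters because they carry meaning (the L in naaL, for example). The function name and the sample text are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each word occurs in romanised Tamil text."""
    # Assumption: a word is a run of ASCII letters. Case is preserved,
    # since "naaL" and "naal" stand for different Tamil spellings.
    return Counter(re.findall(r"[A-Za-z]+", text))

# Made-up sample text, not taken from Project Madurai
sample = "kaadhal kadal, kaadhal naaL kal."
counts = word_counts(sample)
print(counts.most_common(2))
```

The resulting counter doubles as the dictionary: a word is valid if it has a count, and the count is its frequency.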

Tweaking the word to check is easy. Norvig’s article has a working code example.
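For reference, a sketch of single-edit tweak generation, modelled on the shape of the code in Norvig's article. The alphabet here is an assumption: plain lowercase English letters, which ignores capitals such as the L in naaL. It only produces tweaks that are one edit away from the input.

```python
def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return every string that is one edit away from `word`."""
    # Every way of cutting the word into a left and a right part
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

tweaks = edits1("kathal")
print("kaathal" in tweaks)   # an insertion of one "a"
```

Words like kaadhal are two edits away from kathal, so reaching them means applying the function a second time to each tweak.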

Picking valid tweaks is easy. Just check against the dictionary.
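As a sketch, the check is a one-line filter. The dictionary and candidate list below are made up for illustration, and `known` is a hypothetical helper name.

```python
def known(candidates, dictionary):
    """Keep only the candidate spellings that appear in the dictionary."""
    return {word for word in candidates if word in dictionary}

# Made-up word counts and candidate tweaks, for illustration only
dictionary = {"kadal": 10, "kaadhal": 6, "kal": 1}
candidates = {"kadal", "kaadhal", "kathaal", "kxthal"}
valid = known(candidates, dictionary)
print(sorted(valid))
```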

The tough part is choosing the likely correction. For each valid word, I need the probability of having made this particular error.

Let’s take an example. I’ve spelt kathal. Valid tweaks to this word include: kal, kol, kadal, kanal, and kaadhal. For each of these, I need to figure out how often the tweak occurs in the dictionary, and the probability that I typed kathal when I really meant that tweak. This is what such a calculation would look like:

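| Tweak   | Frequency | Probability of typing kathal | Product |
|---------|-----------|------------------------------|---------|
| kal     | 1         | 0.04                         | 0.04    |
| kol     | 4         | 0.02                         | 0.08    |
| kadal   | 10        | 0.1                          | 1.0     |
| kanal   | 1         | 0.01                         | 0.01    |
| kaadhal | 6         | 0.25                         | 1.50    |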

Once we have this, we can see that kaadhal is the right one — it has the maximum value (1.50) in the last column, where we multiply the frequency and the probability.
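The same calculation as a short sketch, using the frequencies and probabilities from the example above. The probabilities are the illustrative values from the table, not measured ones, and `best_correction` is a hypothetical name.

```python
# tweak -> (frequency in dictionary, probability of typing "kathal" for it)
# Values copied from the worked example; the probabilities are illustrative.
candidates = {
    "kal": (1, 0.04),
    "kol": (4, 0.02),
    "kadal": (10, 0.1),
    "kanal": (1, 0.01),
    "kaadhal": (6, 0.25),
}

def score(word):
    """Frequency multiplied by the probability of the mistake."""
    frequency, probability = candidates[word]
    return frequency * probability

def best_correction(words):
    """Pick the candidate with the highest frequency x probability."""
    return max(words, key=score)

print(best_correction(candidates))
```

Note that kadal is the most frequent word here, yet it loses to kaadhal because the mistake is far less likely. That is why the probabilities matter as much as the counts.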

(You probably realise how small my dictionary is, looking at the frequencies. Well, that’s how big Project Madurai is. But increasing the size of a dictionary is a trivial problem.)

Anyway, getting the frequency is easy. How do I get the probabilities, though? That’s what I’m working on right now.

14 thoughts on “Tamil spelling corrector”

  1. hajaroz@yahoo.com

    please give me that spelling checking software….it is very good and it is new one

  2. ரவிசங்கர்

    A good effort and a good approach.

  3. Good article. I am very interested in text analysis, phrase extraction too (and statistical machine translation, if you will 🙂). Is not probability just the number of occurrences of each of these divided by the total number of occurrences? Maybe I am missing something. Did you also consider using Vikatan or Kumudam article data in addition to Project Madurai to get contemporary word counts and probabilities? It is one of my projects to create a word frequency list for Tamil. Now that I see another active soul, I might start the project 🙂 Python or ruby I am wondering…

  4. hi i would like to know if there are any resources online which would help me to check if my spellings in tamil typing are right? i am not very comfortable with tamil so i can use the help of spell check, i would be most grateful! kindly help, thanks:)

  5. i want to know the exact spelling of thamizh selvan in tamil . whether a connecting letter will come between thamizh and selvan or not .. thank you

  6. Hi anand,
    Do you have any tamil spell checker tool online. If yes, please provide the link asap. Thanks in advance.

  7. You may do easy text processing of Tamil content using open-tamil library in Python.

    $ pip install open-tamil

    To convert your data into Tamil letters (not Unicode code-points) you can type in Python,

    >>> import tamil
    >>> letters = tamil.utf8.get_letters( data )

    Thanks,
    -Muthu
