In this post, I am going to talk about automated spelling correction. Let’s say you are writing a document on your computer, and instead of typing “morning”, you accidentally type “mornig”. If you have automated spelling correction enabled, you will probably see that “mornig” has been transformed to “morning” on its own. How does this work? How does your computer know that when you typed “mornig”, you actually meant “morning”? We are going to see how in this post.
Spelling mistakes could turn out to be real words!
Before we actually go through how spelling correction works, let’s think about the complexity of this problem. In the previous example, “mornig” was not a real word, so we knew it had to be a spelling mistake. But what if you misspelled “college” as “collage”, or you misspelled “three” as “tree”? In these cases, the word you typed incorrectly happens to be an actual word itself! Correcting these types of errors is called real word spelling correction. On the other hand, if the error is not a real word (like “mornig” instead of “morning”), correcting those errors is called non-word spelling correction. You can see that real world spelling correction seems more difficult than non-word spelling correction because every word that you type could be an error (even if it has a correct spelling). For example, the sentence “The tree threes were tail” makes no sense because every word except “the” and “were” is an error even though they are all actual words. The actual sentence should be “The three trees were tall”. In this post, I am going to talk about non-word spelling correction with a basic approach to it.
Spelling Correction as a Probability Problem
Now let’s go through our spelling correction problem and think about it as a problem of probability. First of all, let’s think about the conditional probability P(w | misspelled word), where w is any word in the dictionary. This basically answers the question – What is the probability of w being the correct word given the misspelled word? For example, P(morning | mornig) tells us the probability of morning being the correct word for the misspelled word mornig. So the answer to our spelling correction problem can be to go through all the words in the dictionary, and for each word, calculate this conditional probability. Then, we can choose the word which has the highest probability. For instance, we can calculate the following probabilities – P(apple | mornig), P( actor | mornig),… P(morning | mornig), … P(zebra | mornig)
You are probably thinking: “Won’t this take too long if we consider every word in the dictionary for any misspelled word?”. You are right. You can see that we are wasting time by calculating probabilities like P(apple | mornig), P( actor | mornig) or P(zebra | mornig) because there is almost zero chance that a person would accidentally type something like “mornig” instead of “apple”. We need to reduce the words that we actually check and a good way to do this is by using something called edit distance.
What is edit distance?
Let’s take a small detour and talk a little about edit distance. Say you have two words. The edit distance between them is the minimum number of operations you need to do to transform one word to another. Operations can be things like inserting a character, deleting a character or replacing one character by another. For example, the edit distance between “see” and “sea” is 1. Why? Because we need to replace one character ‘a’ in ‘sea’ by ‘e’ to get ‘see’. So we did one operation which was replacement. Now let’s think about the edit distance between “apple” and “mornig”. We need 5 replacements (to replace the 5 characters “morni” by “apple”) and 1 deletion (to delete the last character “g” in “mornig”). Overall we need 6 operations for this transformation from “mornig” to “apple”. So the edit distance between them is 6.
Coming back to our original problem, we can say we won’t calculate the probability P(apple | mornig) because the edit distance between the two words “apple” and “mornig” is 6 which is too high. We can say that the only words for which we will calculate the probability are the words for which the edit distance (between that word and the misspelled word) is very small (maybe 1 or 2). This will substantially reduce the number of words we need to check.
How do we calculate the probability P(w | misspelled word)?
The next question we need to answer is how to calculate the conditional probability P(w | misspelled word). We can’t estimate this value directly, so we need to use Bayes’ rule. Then, we can write this as :
= P(misspelled word | w) * P(w) ÷ P(misspelled word)
When we are trying to find the w which maximizes the above term for a particular misspelled word, we can omit the denominator because the probability of the misspelled word is going to remain the same for all w.
So we can formally write our problem as:
Correct Word = argmax P(misspelled word | w) * P(w)
where argmax is over all w in D, where D is the set of words in the dictionary with low edit distance to the misspelled word. Then for every word in D, we need to calculate P(misspelled word | w) * P(w).
We can calculate P(w) using a language model (can be from unigram up to around 4-gram). For more details about N-gram language models, you can check out my post – https://xrds.acm.org/blog/2017/10/introduction-n-grams-need/
For instance, for a unigram language model, we could find the P(w) as the number of times w appears in a corpus divided by the total number of words in the corpus. For a bigram or trigram model, we would need to find the probability of the word w in the context in which it would appear in the sentence (the sentence that contains the spelling mistake), which would give better results.
The probability P(misspelled word | w) is the probability of typing the misspelled word instead of the word w. This depends on a lot of things like which letters are close to each other on the keyboard or the speed of typing. We can estimate this probability by looking at corpuses of commonly misspelled words like http://www.dcs.bbk.ac.uk/~ROGER/corpora.html
Based on the corpus, we could estimate the number of times a character x is substituted by another character y and based on this information we could calculate the probability. Another option could be to directly use the misspelled words in the corpus to estimate their probabilities using count based approaches.
Then, once we have the two probabilities P(w) and P(misspelled word | w), we need to multiply them and go through all possible words w in D and find the word w for which the product is maximum. This word w would be the correct word for the misspelled word and your computer would replace the misspelled word with this word in your document.
This was a basic approach to how automated spelling correction works. For further reading on this, you can check out the reference below –