Typo tolerance
Example
On a movie dataset, let’s search for .
We use a prefix Levenshtein algorithm (opens new window) to check if the words match. The only difference with a Levenshtein algorithm is that it accepts every word that starts with the query words too. Therefore, words are accepted if they start with or have equal length.
- substitution of a character of M by a character other than P. (e.g. kitten → sitten)
- deletion of a character from M. (e.g. saturday → satuday)
There are some rules about what can be considered “similar”. These rules are by word and not for the whole query string.
- If the query word is between 1 and 4 characters long, therefore, no typo is allowed. Only documents that contain words that start with or are of equal length with this query word are considered valid for this request.
- If the query word is between 5 and 8 characters long, one typo is allowed. Documents that contain words that match with one typo are retained for the next steps.
- “saturday” is accepted because it is the same word.
- “sat” is not accepted because the query word is not a prefix of it (it is the opposite).
- “satuday” is accepted because it contains one typo.