Typo tolerance

    Example

    On a movie dataset, let’s search for .

    We use a prefix Levenshtein algorithm to check if the words match. It differs from the standard Levenshtein algorithm only in that it also accepts every word that starts with the query word. A word is therefore accepted if it matches the query word or starts with it, within the allowed number of typos.

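    The Levenshtein distance between two words M and P is the minimum number of the following elementary operations needed to transform M into P:

    • substitution of a character of M with a different character (e.g. kitten → sitten)
    • insertion into M of a character of P (e.g. siting → sitting)
    • deletion of a character from M (e.g. saturday → satuday)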
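
    The distance described above can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch of the standard Levenshtein distance (which counts substitutions, insertions and deletions), not the engine's actual implementation; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(m: str, p: str) -> int:
    """Minimum number of single-character edits that turn m into p."""
    # previous[j] is the distance between the prefix of m processed so far and p[:j].
    previous = list(range(len(p) + 1))
    for i, cm in enumerate(m, start=1):
        current = [i]  # turning m[:i] into the empty string takes i deletions
        for j, cp in enumerate(p, start=1):
            current.append(min(
                previous[j] + 1,               # delete cm from m
                current[j - 1] + 1,            # insert cp into m
                previous[j - 1] + (cm != cp),  # substitute cm by cp (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitten"))     # 1 (one substitution)
print(levenshtein("saturday", "satuday"))  # 1 (one deletion)
```

    Only two rows of the table are kept at a time, so memory use grows with the length of P rather than with the product of both lengths.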

    There are some rules about what can be considered “similar”. These rules apply to each query word individually, not to the whole query string.

    • If the query word is between 1 and 4 characters long, no typo is allowed. Only documents that contain words that start with or exactly match this query word are considered valid for this request.
    • If the query word is between 5 and 8 characters long, one typo is allowed. Documents that contain words that match with one typo are retained for the next steps.
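
    For example, the query word “saturday” is 8 characters long, so one typo is allowed:

    • “saturday” is accepted because it is the same word.
    • “sat” is not accepted because the query word is not a prefix of it (it is the opposite: “sat” is a prefix of the query word).
    • “satuday” is accepted because it contains one typo.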
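
    The rules and the “saturday” examples can be reproduced with a short sketch. It is an illustration under stated assumptions, not the engine's code: the names `allowed_typos`, `prefix_levenshtein` and `matches` are hypothetical, and the one-typo limit used for words longer than 8 characters is an assumption, since the rules above do not cover that case.

```python
def allowed_typos(query_word: str) -> int:
    """Typo budget for one query word, following the length rules above."""
    if len(query_word) <= 4:
        return 0
    # 5 to 8 characters: one typo. Longer words are not covered by the rules
    # above; reusing the same limit for them is an assumption of this sketch.
    return 1


def prefix_levenshtein(query_word: str, candidate: str) -> int:
    """Smallest Levenshtein distance between query_word and any prefix of candidate."""
    # row[j] is the distance between the current prefix of candidate and query_word[:j].
    row = list(range(len(query_word) + 1))
    best = row[-1]  # distance to the empty prefix of candidate
    for i, cc in enumerate(candidate, start=1):
        new_row = [i]
        for j, cq in enumerate(query_word, start=1):
            new_row.append(min(
                row[j] + 1,               # drop cc from the candidate prefix
                new_row[j - 1] + 1,       # add cq to the candidate prefix
                row[j - 1] + (cc != cq),  # substitute (free if equal)
            ))
        row = new_row
        best = min(best, row[-1])  # compare the whole query word with this prefix
    return best


def matches(query_word: str, candidate: str) -> bool:
    """True if candidate is accepted for query_word within its typo budget."""
    return prefix_levenshtein(query_word, candidate) <= allowed_typos(query_word)


for word in ["saturday", "sat", "satuday"]:
    print(word, matches("saturday", word))
# saturday True
# sat False
# satuday True
```

    Taking the minimum over every prefix of the candidate is what makes a longer word such as “saturdays” match the query word “saturday” with zero typos, while “sat” is rejected because no prefix of it comes within one edit of the query word.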