Did-you-mean

    For example, if a user types “fliud,” OpenSearch suggests a corrected search term like “fluid.” You can then suggest the corrected term to the user or even automatically correct the search term.

    You can implement the did-you-mean suggester using one of the following methods:

    • Use a term suggester to suggest corrections for individual words.
    • Use a phrase suggester to suggest corrections for phrases.

    Use the term suggester to suggest corrected spellings for individual words. The term suggester uses an edit distance to compute suggestions.

    The edit distance is the number of single-character insertions, deletions, or substitutions that need to be performed for a term to match another term. For example, to change the word “cat” to “hats”, you need to substitute “h” for “c” and insert an “s”, so the edit distance in this case is 2.

    To use the term suggester, you don’t need any special field mappings for your index. By default, string field types are mapped as text. A text field is analyzed, so the title in the following example is tokenized into individual words. Indexing the following documents creates a books index where title is a text field:
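    The exact documents are not reproduced here; the following pair is one plausible example, consistent with the analyzer output and suggester responses shown later in this section:

    PUT books/_doc/1
    {
      "title": "Design Patterns (Object-Oriented Software)"
    }

    PUT books/_doc/2
    {
      "title": "Software Architecture Patterns Explained"
    }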

    To check how a string is split into tokens, you can use the _analyze endpoint. To apply the same analyzer that the field uses, you can specify the field’s name in the field parameter:

    GET books/_analyze
    {
      "text": "Design Patterns (Object-Oriented Software)",
      "field": "title"
    }

    The default analyzer (standard) splits a string at word boundaries, removes punctuation, and lowercases the tokens:

    {
      "tokens" : [
        {
          "token" : "design",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "patterns",
          "start_offset" : 7,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "object",
          "start_offset" : 17,
          "end_offset" : 23,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "oriented",
          "start_offset" : 24,
          "end_offset" : 32,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "software",
          "start_offset" : 33,
          "end_offset" : 41,
          "type" : "<ALPHANUM>",
          "position" : 4
        }
      ]
    }

    To get suggestions for a misspelled search term, use the term suggester. Specify the input text that needs suggestions in the text parameter, and specify the field from which to get suggestions in the field parameter:

    GET books/_search
    {
      "suggest": {
        "spell-check": {
          "text": "patern",
          "term": {
            "field": "title"
          }
        }
      }
    }

    The term suggester returns a list of corrections for the input text in the options array:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "suggest" : {
        "spell-check" : [
          {
            "text" : "patern",
            "offset" : 0,
            "length" : 6,
            "options" : [
              {
                "text" : "patterns",
                "score" : 0.6666666,
                "freq" : 2
              }
            ]
          }
        ]
      }
    }

    The score value is calculated based on the edit distance: the higher the score, the better the suggestion. The freq value represents the number of times the term appears in the documents of the specified index.

    To receive suggestions for the same input text in multiple fields, you can define the text globally to avoid duplication:

    {
      "suggest": {
        "text" : "patern",
        "spell-check1" : {
          "term" : {
            "field" : "title"
          }
        },
        "spell-check2" : {
          "term" : {
            "field" : "subject"
          }
        }
      }
    }

    If text is specified both at the global and individual suggestion levels, the suggestion-level value overrides the global value.
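    For example, in the following request (illustrative), spell-check2 overrides the global text with its own:

    GET books/_search
    {
      "suggest": {
        "text" : "patern",
        "spell-check1" : {
          "term" : {
            "field" : "title"
          }
        },
        "spell-check2" : {
          "text" : "paterns",
          "term" : {
            "field" : "title"
          }
        }
      }
    }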

    You can specify additional options for the term suggester, such as size, sort, suggest_mode, max_edits, prefix_length, and min_word_length.
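    For example, the following request is a sketch (parameter values are illustrative) that returns up to three suggestions per term, sorted by document frequency, and generates suggestions even for terms that already exist in the index:

    GET books/_search
    {
      "suggest": {
        "spell-check": {
          "text": "patern",
          "term": {
            "field": "title",
            "size": 3,
            "sort": "frequency",
            "suggest_mode": "always",
            "max_edits": 2
          }
        }
      }
    }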

    Phrase suggester

    To implement did-you-mean, use a phrase suggester. The phrase suggester is similar to the term suggester, except it uses n-gram language models to suggest whole phrases instead of individual words.

    To set up a phrase suggester, create a custom analyzer called trigram that uses a shingle filter and lowercases tokens. This filter is similar to the edge_ngram filter, but it applies to words instead of letters. Then configure the field from which you’ll be sourcing suggestions with the custom analyzer you created:

    PUT books2
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "trigram": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                  "lowercase",
                  "shingle"
                ]
              }
            },
            "filter": {
              "shingle": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 3
              }
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "trigram": {
                "type": "text",
                "analyzer": "trigram"
              }
            }
          }
        }
      }
    }

    Index the documents into the new index:

    PUT books2/_doc/1
    {
      "title": "Design Patterns"
    }

    PUT books2/_doc/2
    {
      "title": "Software Architecture Patterns Explained"
    }

    Suppose the user searches for an incorrect phrase:

    GET books2/_search
    {
      "suggest": {
        "phrase-check": {
          "text": "design paterns",
          "phrase": {
            "field": "title.trigram"
          }
        }
      }
    }

    The phrase suggester returns the corrected phrase in the options array of the response.
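    The suggest section of the response looks similar to the following (the score value is illustrative):

    "suggest" : {
      "phrase-check" : [
        {
          "text" : "design paterns",
          "offset" : 0,
          "length" : 14,
          "options" : [
            {
              "text" : "design patterns",
              "score" : 0.31666178
            }
          ]
        }
      ]
    }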

    To highlight suggestions, set up the highlight field for the phrase suggester:

    GET books2/_search
    {
      "suggest": {
        "phrase-check": {
          "text": "design paterns",
          "phrase": {
            "field": "title.trigram",
            "gram_size": 3,
            "highlight": {
              "pre_tag": "<em>",
              "post_tag": "</em>"
            }
          }
        }
      }
    }

    The results contain the highlighted text:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "suggest" : {
        "phrase-check" : [
          {
            "text" : "design paterns",
            "offset" : 0,
            "length" : 14,
            "options" : [
              {
                "text" : "design patterns",
                "highlighted" : "design <em>patterns</em>",
                "score" : 0.31666178
              }
            ]
          }
        ]
      }
    }
    To filter out spellchecked suggestions that will not return any results, you can use the collate field. This field contains a templated query that is run for each returned suggestion; for more information about constructing such queries, see the OpenSearch search templates documentation. You can insert the current suggestion using the {{suggestion}} variable, or you can pass your own template parameters in the params field (the suggestion value is added to the variables you specify).

    The collate query for a suggestion is run only on the shard from which the suggestion was sourced. The query parameter is required.

    Additionally, if the prune parameter is set to true, a collate_match field is added to each suggestion. If a query returns no results, the collate_match value is false. You can then filter out suggestions based on the collate_match field. The prune parameter’s default value is false.

    For example, the following query configures the collate field to run a match_phrase query matching the title field to the current suggestion:

    GET books2/_search
    {
      "suggest": {
        "phrase-check": {
          "text": "design paterns",
          "phrase": {
            "field": "title.trigram",
            "collate" : {
              "query" : {
                "source": {
                  "match_phrase" : {
                    "title": "{{suggestion}}"
                  }
                }
              },
              "prune": true
            }
          }
        }
      }
    }

    The resulting suggestion contains the collate_match field set to true, which means the match_phrase query will return matching documents for the suggestion:

    {
      "took" : 7,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "suggest" : {
        "phrase-check" : [
          {
            "text" : "design paterns",
            "offset" : 0,
            "length" : 14,
            "options" : [
              {
                "text" : "design patterns",
                "score" : 0.56759655,
                "collate_match" : true
              }
            ]
          }
        ]
      }
    }

    For most use cases, when calculating a suggestion’s score, you want to take into account not only the frequency of a shingle but also the shingle’s size. Smoothing models are used to calculate scores for shingles of different sizes, balancing the weight of frequent and infrequent shingles.

    The following smoothing models are supported: stupid_backoff (the default), laplace, and linear_interpolation.

    By default, OpenSearch uses the Stupid Backoff model, a simple algorithm that starts with the highest-order shingles and backs off to lower-order shingles when higher-order shingles are not found. For example, if you set up the phrase suggester with 3-grams, 2-grams, and 1-grams, the Stupid Backoff model first inspects the 3-grams. If no 3-grams are found, it inspects 2-grams but multiplies the score by a discount factor. If no 2-grams are found, it inspects 1-grams, again multiplying the score by the discount factor. The Stupid Backoff model works well in most cases. If you want to use a different smoothing model, such as Laplace, specify it in the smoothing parameter.
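    For example, the following request selects the Laplace smoothing model (the alpha value, the constant added to all counts, is illustrative):

    GET books2/_search
    {
      "suggest": {
        "phrase-check": {
          "text": "design paterns",
          "phrase": {
            "field": "title.trigram",
            "smoothing": {
              "laplace": {
                "alpha": 0.7
              }
            }
          }
        }
      }
    }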

    Candidate generators provide possible suggestion terms based on the terms in the input text. Currently, one candidate generator is available: direct_generator. A direct generator functions similarly to a term suggester and is called for each term in the input text. The phrase suggester supports multiple candidate generators, each of which is called for each term in the input text. It also lets you specify a pre-filter (an analyzer applied to the input terms before they enter the spellcheck phase) and a post-filter (an analyzer applied to the generated suggestions before they are returned).

    Set up a direct generator for a phrase suggester:

    GET books2/_search
    {
      "suggest": {
        "text": "design paterns",
        "phrase-check": {
          "phrase": {
            "field": "title.trigram",
            "size": 1,
            "direct_generator": [
              {
                "field": "title.trigram",
                "suggest_mode": "always",
                "min_word_length": 3
              }
            ]
          }
        }
      }
    }
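
    As a sketch, you can also add pre_filter and post_filter analyzers to a direct generator; here both reference the built-in standard analyzer, but any analyzer defined for the index can be used:

    GET books2/_search
    {
      "suggest": {
        "text": "design paterns",
        "phrase-check": {
          "phrase": {
            "field": "title.trigram",
            "direct_generator": [
              {
                "field": "title.trigram",
                "suggest_mode": "always",
                "pre_filter": "standard",
                "post_filter": "standard"
              }
            ]
          }
        }
      }
    }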