Inverted Index

    For example, let’s say we have two documents, each with a content field
    containing:

    1. "The quick brown fox jumped over the lazy dog"
    2. "Quick brown foxes leap over lazy dogs in summer"

    To create an inverted index, we first split the content field of each
    document into separate words (which we call terms or tokens), create a
    sorted list of all the unique terms, then list in which document each term
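    appears. The result looks something like this:

    Term      Doc_1  Doc_2
    -------------------------
    Quick   |       |  X
    The     |   X   |
    brown   |   X   |  X
    dog     |   X   |
    dogs    |       |  X
    fox     |   X   |
    foxes   |       |  X
    in      |       |  X
    jumped  |   X   |
    lazy    |   X   |  X
    leap    |       |  X
    over    |   X   |  X
    quick   |   X   |
    summer  |       |  X
    the     |   X   |
    -------------------------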
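
    The procedure just described can be sketched in a few lines of Python.
    This is only an illustration, not how any real search engine stores its
    index: the names docs and build_index are hypothetical, and splitting on
    whitespace is an assumed stand-in for a real tokenizer.

```python
from collections import defaultdict

# The two example documents from the text, keyed by a document id.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index


index = build_index(docs)

# Print the sorted list of unique terms, marking the documents each appears in.
print(f"{'Term':<8} | Doc_1 | Doc_2")
for term in sorted(index):
    marks = ["X" if doc_id in index[term] else " " for doc_id in (1, 2)]
    print(f"{term:<8} |   {marks[0]}   |   {marks[1]}")
```

    Note that Python's default string sort places capitalized terms such as
    "Quick" and "The" ahead of the lowercase ones, and that "Quick" and
    "quick" end up as two separate entries.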

    Now, if we want to search for "quick brown" we just need to find the
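    documents in which each term appears:

    Term      Doc_1  Doc_2
    -------------------------
    brown   |   X   |  X
    quick   |   X   |
    -------------------------
    Total   |   2   |  1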

    Both documents match, but the first document has more matches than the second.
    If we apply a naive similarity algorithm which just counts the number of
    matching terms, then we can say that the first document is a better match —
    is more relevant to our query — than the second document.
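
    A minimal sketch of that naive scoring, assuming the same whitespace
    tokenization for both documents and query. The search function is a
    hypothetical helper that simply counts how many query terms each document
    contains; real relevance scoring is considerably more involved.

```python
from collections import Counter

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Build the raw (un-normalized) inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def search(query):
    """Score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return scores.most_common()  # (doc_id, score) pairs, best match first


print(search("quick brown"))  # → [(1, 2), (2, 1)]
```

    Document 1 scores 2 because it contains both "quick" and "brown";
    document 2 scores 1 because it contains "brown" but only the capitalized
    "Quick".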

    But there are a few problems with our current inverted index:

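    1. "Quick" and "quick" appear as separate terms, while the user probably
      thinks of them as the same word.

    2. "fox" and "foxes" are pretty similar, as are "dog" and "dogs":
      they share the same root word.

    3. "jumped" and "leap", while not from the same root word, are similar
      in meaning: they are synonyms.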

    With the above index, a search for "+Quick +fox" wouldn’t match any
    documents. (Remember, a preceding + means that the word must be present.)
    Both the term "Quick" and the term "fox" have to be in the same document
    in order to satisfy the query, but the first doc contains "quick fox" and
    the second doc contains "Quick foxes".
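
    The failure can be reproduced with a small sketch. The search_all helper
    is hypothetical and handles only the + prefix, treating every query term
    as required; it is not the query syntax of any particular search engine.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Raw inverted index with exact, case-sensitive terms.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def search_all(query):
    """Return the ids of documents that contain every term in the query."""
    required = [term.lstrip("+") for term in query.split()]
    postings = [index.get(term, set()) for term in required]
    return set.intersection(*postings) if postings else set()


print(index["Quick"], index["fox"])  # → {2} {1}
print(search_all("+Quick +fox"))     # → set()
```

    "Quick" appears only in document 2 and "fox" only in document 1, so the
    intersection of the two posting lists is empty.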

    Our user could reasonably expect both documents to match the query. We can do
    better.

    If we normalize the terms into a standard format, then we can find documents
    that contain terms that are not exactly the same as the user requested, but
    are similar enough to still be relevant. For instance:

    1. "Quick" can be lowercased to become "quick".

    2. "foxes" can be stemmed (reduced to its root form) to
      become "fox". Similarly, "dogs" could be stemmed to "dog".

    3. "jumped" and "leap" are synonyms and can be indexed as just the
      single term "jump".

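    Now the index looks like this:

    Term      Doc_1  Doc_2
    -------------------------
    brown   |   X   |  X
    dog     |   X   |  X
    fox     |   X   |  X
    in      |       |  X
    jump    |   X   |  X
    lazy    |   X   |  X
    over    |   X   |  X
    quick   |   X   |  X
    summer  |       |  X
    the     |   X   |
    -------------------------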
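
    A minimal sketch of these normalization rules. The STEMS and SYNONYMS
    lookup tables are hand-written assumptions covering only the example
    words; they stand in for a real stemmer and synonym filter. Lowercasing
    is included as well, since the discussion that follows relies on the
    exact term "Quick" no longer being in the index.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Toy lookup tables (assumptions): just enough for the example sentences.
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}


def normalize(term):
    """Lowercase a term, reduce it to its root, then map synonyms together."""
    term = term.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)


def build_index(documents):
    """Map each normalized term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.split():
            index.setdefault(normalize(term), set()).add(doc_id)
    return index


index = build_index(docs)
for term in sorted(index):
    print(f"{term:<8} {sorted(index[term])}")
```

    With these rules the fifteen raw terms collapse to ten, and every term
    except "in", "summer", and "the" now appears in both documents.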

    But we’re not there yet. Our search for "+Quick +fox" would still fail,
    because we no longer have the exact term "Quick" in our index. However, if
    we apply the same normalization rules that we used on the content field to
    our query string, it would become a query for "+quick +fox", which would
    match both documents!
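
    The effect of normalizing the query can be sketched as follows. As
    before, the lookup tables and helper names are hypothetical and cover
    only the example words; the normalize_query flag exists purely to
    contrast the two behaviors.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Toy normalization tables (assumptions), standing in for real analysis.
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}


def normalize(term):
    """Lowercase, stem, and map synonyms to a single term."""
    term = STEMS.get(term.lower(), term.lower())
    return SYNONYMS.get(term, term)


# Index built from normalized terms.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(normalize(term), set()).add(doc_id)


def search_all(query, normalize_query):
    """Return ids of documents containing every term in the query."""
    terms = [term.lstrip("+") for term in query.split()]
    if normalize_query:
        terms = [normalize(term) for term in terms]
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


print(sorted(search_all("+Quick +fox", normalize_query=False)))  # → []
print(sorted(search_all("+Quick +fox", normalize_query=True)))   # → [1, 2]
```

    Passing the query through the same normalize function used at index time
    is what turns "+Quick +fox" into "+quick +fox" and lets both documents
    match.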

    IMPORTANT: You can only find terms that actually exist in your index, so
    both the indexed text and the query string must be normalized into the
    same form.