Fulltext indices

Table of contents

Index definition
Plain index (default)
Creating a custom analyzer

In CrateDB, every column’s data is indexed using the index method by default. Currently there are three choices related to index definition:

Warning

Creating an index after a table was already created is currently not supported, so think carefully while designing your table definition.

Indexing can be turned off by using the INDEX OFF column definition. Consider that a column without an index can only be used as a result column and will never produce a hit when queried.

cr> insert into table_a (first_column) values ('hello');
INSERT OK, 1 row affected (... sec)

cr> select * from table_a where first_column = 'hello';
SQLParseException[Cannot search on field [first_column] since it is not indexed.]

An index of type plain is indexing the input data as-is without analyzing. Using the plain index method is the default behaviour but can also be declared explicitly:

cr> create table table_b1 (
...   first_column text INDEX using plain
... );
CREATE OK, 1 row affected (... sec)

This results in the same behaviour than without any index declaration:

cr> create table table_b2 (
...   first_column text
CREATE OK, 1 row affected (... sec)

By defining an index on a column, it’s analyzed data is indexed instead of the raw data. Thus, depending on the used analyzer, querying for the exact data may not work anymore. See Built-in analyzers for details about available builtin analyzer or .

If no analyzer is specified when using a fulltext index, the plain analyzer is used:

Defining the usage of a concrete analyzer is straight forward by defining the analyzer as a parameter using the WITH statement:

cr> create table table_d (
...   first_column text INDEX using fulltext with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)

It’s also possible to define an index column which treat the data of a given column as input. This is especially useful if you want to search for both, the exact and analyzed data:

cr> create table table_e (
...   first_column text,
...   INDEX first_column_ft using fulltext (first_column)
... );
CREATE OK, 1 row affected (... sec)

Of course defining a custom analyzer is possible here too:

cr> create table table_f (
...   INDEX first_column_ft
...     using fulltext(first_column) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)

Defining a composite (or combined) index is done using the same syntax as above despite multiple columns are given to the fulltext index method:

cr> create table documents_a (
...   title text,
...   body text,
...   INDEX title_body_ft
...     using fulltext(title, body) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)

Composite indices can include nested columns within object columns as well:

An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.

When a field-content is analyzed to become a stream of tokens, the char-filter is applied at first. It is used to filter some special chars from the stream of characters that make up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them to finally produce the desired stream of tokens.

With these elements in place, analyzers provide finegrained control over building a token stream used for fulltext search. For example you can use language specific analyzers, tokenizers and token-filters to get proper search results for data provided in a certain language.

Here is a simple Example:

cr> CREATE ANALYZER myanalyzer (
...   TOKENIZER whitespace,
...   TOKEN_FILTERS (
...     lowercase,
...     kstem
...   ),
...     html_strip
...   )
... );
CREATE OK, 1 row affected (... sec)

This creates a custom analyzer called myanalyzer. It uses the built-in tokenizer and two built-in token filters. lowercase and , as well as a mapping char-filter. : It is possible to further customize the built-in token filters, char-filters or tokenizers:

cr> create ANALYZER myanalyzer_customized (
...   TOKENIZER whitespace,
...     lowercase,
...     kstem
...   ),
...   CHAR_FILTERS (
...     mymapping WITH (
...       type='mapping',
...       mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
...     )
...   )
... );
CREATE OK, 1 row affected (... sec)

The provided type property is required as it specifies which built-in char-filter should be customized. The other option mappings is specific to the used type/char-filter.

Tokenizer and token-filters can be customized in the same way.

Note

Altering analyzers is not supported yet.