Fulltext indices
Table of contents
In CrateDB, every column’s data is indexed using the index method by default. Currently there are three choices related to index definition:
Warning
Creating an index after a table was already created is currently not supported, so think carefully while designing your table definition.
Indexing can be turned off by using the INDEX OFF
column definition. Consider that a column without an index can only be used as a result column and will never produce a hit when queried.
cr> insert into table_a (first_column) values ('hello');
INSERT OK, 1 row affected (... sec)
cr> select * from table_a where first_column = 'hello';
SQLParseException[Cannot search on field [first_column] since it is not indexed.]
An index of type plain
is indexing the input data as-is without analyzing. Using the plain
index method is the default behaviour but can also be declared explicitly:
cr> create table table_b1 (
... first_column text INDEX using plain
... );
CREATE OK, 1 row affected (... sec)
This results in the same behaviour than without any index declaration:
cr> create table table_b2 (
... first_column text
CREATE OK, 1 row affected (... sec)
By defining an index on a column, it’s analyzed data is indexed instead of the raw data. Thus, depending on the used analyzer, querying for the exact data may not work anymore. See Built-in analyzers for details about available builtin analyzer or .
If no analyzer is specified when using a fulltext index, the plain analyzer is used:
Defining the usage of a concrete analyzer is straight forward by defining the analyzer as a parameter using the WITH
statement:
cr> create table table_d (
... first_column text INDEX using fulltext with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
It’s also possible to define an index column which treat the data of a given column as input. This is especially useful if you want to search for both, the exact and analyzed data:
cr> create table table_e (
... first_column text,
... INDEX first_column_ft using fulltext (first_column)
... );
CREATE OK, 1 row affected (... sec)
Of course defining a custom analyzer is possible here too:
cr> create table table_f (
... INDEX first_column_ft
... using fulltext(first_column) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
Defining a composite (or combined) index is done using the same syntax as above despite multiple columns are given to the fulltext
index method:
cr> create table documents_a (
... title text,
... body text,
... INDEX title_body_ft
... using fulltext(title, body) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
Composite indices can include nested columns within object columns as well:
An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.
When a field-content is analyzed to become a stream of tokens, the char-filter is applied at first. It is used to filter some special chars from the stream of characters that make up the content.
Tokenizers split the possibly filtered stream of characters into tokens.
Token-filters can add tokens, delete tokens or transform them to finally produce the desired stream of tokens.
With these elements in place, analyzers provide finegrained control over building a token stream used for fulltext search. For example you can use language specific analyzers, tokenizers and token-filters to get proper search results for data provided in a certain language.
Here is a simple Example:
cr> CREATE ANALYZER myanalyzer (
... TOKENIZER whitespace,
... TOKEN_FILTERS (
... lowercase,
... kstem
... ),
... html_strip
... )
... );
CREATE OK, 1 row affected (... sec)
This creates a custom analyzer called myanalyzer
. It uses the built-in tokenizer and two built-in token filters. lowercase and , as well as a mapping char-filter. : It is possible to further customize the built-in token filters, char-filters or tokenizers:
cr> create ANALYZER myanalyzer_customized (
... TOKENIZER whitespace,
... lowercase,
... kstem
... ),
... CHAR_FILTERS (
... mymapping WITH (
... type='mapping',
... mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
... )
... )
... );
CREATE OK, 1 row affected (... sec)
The provided type
property is required as it specifies which built-in char-filter should be customized. The other option mappings
is specific to the used type/char-filter.
Tokenizer and token-filters can be customized in the same way.
Note
Altering analyzers is not supported yet.
See also
for the syntax reference.
Built-in tokenizers for a list of built-in tokenizer.
for a list of built-in token-filter.
Built-in char filter for a list of built-in char-filter.
Existing Analyzers can be used to create custom Analyzers by means of extending them.
You can extend and parameterize like this:
cr> create ANALYZER "german_snowball" extends snowball WITH (
... language = 'german'
... );
CREATE OK, 1 row affected (... sec)
If you extend Built-in analyzers, tokenizer, char-filter or token-filter cannot be defined. In this case use the parameters available for the extended .
If you extend custom-analyzers, every part of the analyzer that is ommitted will be taken from the extended one. Example:
cr> create ANALYZER e2 EXTENDS myanalyzer (
... TOKENIZER mypattern WITH (
... type = 'pattern',
... pattern = '.*'
... )
... );
CREATE OK, 1 row affected (... sec)
This analyzer will use the char-filters and token-filters from and will override the tokenizer with mypattern
.
See also
See the reference documentation of the Built-in analyzers to get detailed information on the available analyzers.