12.5. Parsers
The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Table 12.1.
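The token types can also be listed directly on a running server. The following is a minimal sketch, assuming the built-in parser can be addressed by the unqualified name default and that the ts_token_type function is available on the installation:

```sql
-- List the token types recognized by the built-in parser.
-- Assumption: 'default' resolves to the built-in parser via the search path.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;
```

Each returned row gives the numeric token id, the alias used in configuration mappings, and a short description.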
The parser’s notion of a “letter” is determined by the database’s locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.
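One way to see this distinction, and to treat the two token types alike, is sketched below. The configuration name my_config is hypothetical, the choice of the english_stem dictionary is only an assumption for illustration, and the token types actually reported for the sample text depend on the database's encoding and locale:

```sql
-- Check the character classification locale of the current database.
SELECT datctype FROM pg_database WHERE datname = current_database();

-- Compare how the parser labels an ASCII-only word and a word
-- containing a non-ASCII letter (result depends on encoding and locale).
SELECT alias, token
FROM ts_debug('simple', 'cafe café')
WHERE alias <> 'blank';

-- Hypothetical configuration: my_config is an illustrative name only.
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- Route both token types to the same dictionary so they are treated alike.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH english_stem;
```

Mapping both token types in a single ALTER MAPPING statement keeps the two dictionary lists from drifting apart later.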
It is possible for the parser to produce overlapping tokens from the same piece of text. For example, a hyphenated word is reported both as the entire word and as each component. A URL is handled in a similar way, with its parts reported as separate tokens:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 host     | Host          | example.com