Applying a CatBoost Model in ClickHouse
This guide shows how to apply pre-trained CatBoost models in ClickHouse by running model inference from SQL.
To apply a CatBoost model in ClickHouse:
- Create a table.
- Insert the data into the table.
- Integrate CatBoost into ClickHouse (optional step).
- Run the model inference from SQL.
For more information about training CatBoost models, see Training and applying models.
If the configuration has been updated, you can reload CatBoost models without restarting the server by using the RELOAD MODEL and RELOAD MODELS system queries.
If you do not have Docker yet, install it.
Note
Docker is a software platform that allows you to create containers that isolate a CatBoost and ClickHouse installation from the rest of the system.
Before applying a CatBoost model:
1. Pull the Docker image from the registry:
$ docker pull yandex/tutorial-catboost-clickhouse
This Docker image contains everything you need to run CatBoost and ClickHouse: code, runtime, libraries, environment variables, and configuration files.
2. Make sure the Docker image has been successfully pulled:
$ docker image ls
REPOSITORY                            TAG      IMAGE ID       CREATED        SIZE
yandex/tutorial-catboost-clickhouse   latest   622e4d17945b   22 hours ago   1.37GB
3. Start a Docker container based on this image:
$ docker run -it -p 8888:8888 yandex/tutorial-catboost-clickhouse
To create a ClickHouse table for the training sample:
1. Start the ClickHouse console client in interactive mode:
$ clickhouse client
The ClickHouse server is already running inside the Docker container.
2. Create the amazon_train table with a CREATE TABLE statement.
3. Exit the ClickHouse console client:
:) exit
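For step 2, a minimal sketch of the table definition is shown below. The column names come from the queries later in this guide. The column types, the column order (assumed to match the header of train.csv), the MergeTree engine, and the empty sorting key are assumptions, so adjust them to your data before use.

```sql
-- Sketch only: types, column order, and engine settings are assumptions.
CREATE TABLE amazon_train
(
    ACTION UInt8,             -- target label (0 or 1), used later as the LogLoss target
    RESOURCE UInt32,
    MGR_ID UInt32,
    ROLE_ROLLUP_1 UInt32,
    ROLE_ROLLUP_2 UInt32,
    ROLE_DEPTNAME UInt32,
    ROLE_TITLE UInt32,
    ROLE_FAMILY_DESC UInt32,
    ROLE_FAMILY UInt32,
    ROLE_CODE UInt32
)
ENGINE = MergeTree
ORDER BY tuple();             -- no sorting key; fine for a small tutorial table
```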
To insert the data:
1. Run the following command:
$ clickhouse client --host 127.0.0.1 --query 'INSERT INTO amazon_train FORMAT CSVWithNames' < ~/amazon/train.csv
2. Start the ClickHouse console client in interactive mode:
$ clickhouse client
3. Make sure the data has been uploaded:
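One simple check, given here as a sketch, is to count the rows in amazon_train and compare the result with the number of data rows in train.csv:

```sql
-- Returns the number of rows loaded into the table.
SELECT count() FROM amazon_train;
```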
Note
Optional step. The Docker image contains everything you need to run CatBoost and ClickHouse.
To integrate CatBoost into ClickHouse:
1. Build the evaluation library.
The fastest way to evaluate a CatBoost model is to compile the libcatboostmodel.<so|dll|dylib> library. For more information about how to build the library, see the CatBoost documentation.
2. Create a new directory anywhere and with any name, for example, data, and put the created library in it. The Docker image already contains the library data/libcatboostmodel.so.
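3. Create a new directory for the model configuration, anywhere and with any name, for example, models.
4. Create a model configuration file in this directory, with a name ending in _model.xml so that it matches the models_config pattern used in step 6.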
5. Describe the model configuration:
<models>
    <model>
        <!-- Model type. Currently only catboost is supported. -->
        <type>catboost</type>
        <!-- Model name. -->
        <name>amazon</name>
        <!-- Path to trained model: add a <path> element pointing to the model file here. -->
        <!-- Update interval. -->
        <lifetime>0</lifetime>
    </model>
</models>
6. Add the path to CatBoost and the model configuration to the ClickHouse configuration:
<!-- File etc/clickhouse-server/config.d/models_config.xml. -->
<catboost_dynamic_library_path>/home/catboost/data/libcatboostmodel.so</catboost_dynamic_library_path>
<models_config>/home/catboost/models/*_model.xml</models_config>
Note
You can change the path to the CatBoost model configuration later without restarting the server.
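As a sketch of how a configuration change could be picked up without a restart, the RELOAD MODELS system query mentioned at the start of this guide can be run from the ClickHouse client. Whether it is needed for a given change depends on your server version, so treat this as an illustration, not a required step:

```sql
-- Reload all configured CatBoost models without restarting the server.
SYSTEM RELOAD MODELS;
```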
To test the model, run the ClickHouse client:
$ clickhouse client
Let’s make sure that the model is working:
:) SELECT
    modelEvaluate('amazon',
                RESOURCE,
                MGR_ID,
                ROLE_ROLLUP_1,
                ROLE_ROLLUP_2,
                ROLE_DEPTNAME,
                ROLE_TITLE,
                ROLE_FAMILY_DESC,
                ROLE_FAMILY,
                ROLE_CODE) > 0 AS prediction,
    ACTION AS target
FROM amazon_train
LIMIT 10
Note
For multiclass models, the modelEvaluate function returns a tuple with per-class raw predictions.
Let’s predict the probability:
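A minimal sketch of such a query follows. It reuses the sigmoid expression from the LogLoss query in this guide to turn the raw model score into a probability. The aliases probability and target are illustrative names chosen for this example.

```sql
SELECT
    modelEvaluate('amazon',
                RESOURCE,
                MGR_ID,
                ROLE_ROLLUP_1,
                ROLE_ROLLUP_2,
                ROLE_DEPTNAME,
                ROLE_TITLE,
                ROLE_FAMILY_DESC,
                ROLE_FAMILY,
                ROLE_CODE) AS prediction,
    -- Sigmoid: maps the raw score to a value between 0 and 1.
    1. / (1. + exp(-prediction)) AS probability,
    ACTION AS target
FROM amazon_train
LIMIT 10;
```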
Note
More info about the exp() function.
Let’s calculate LogLoss on the sample:
:) SELECT -avg(tg * log(prob) + (1 - tg) * log(1 - prob)) AS logloss
FROM
(
    SELECT
        modelEvaluate('amazon',
                    RESOURCE,
                    MGR_ID,
                    ROLE_ROLLUP_1,
                    ROLE_ROLLUP_2,
                    ROLE_DEPTNAME,
                    ROLE_TITLE,
                    ROLE_FAMILY_DESC,
                    ROLE_FAMILY,
                    ROLE_CODE) AS prediction,
        1. / (1. + exp(-prediction)) AS prob,
        ACTION AS tg
    FROM amazon_train
)
Note
More info about the avg() and log() functions.