Applying a CatBoost Model in ClickHouse
This guide shows how to apply pre-trained models in ClickHouse by running model inference from SQL.
To apply a CatBoost model in ClickHouse:
- Create a table.
- Insert the data into the table.
- Integrate CatBoost into ClickHouse (optional step).
- Run the model inference from SQL.
For more information about training CatBoost models, see Training and applying models.
If you don’t have Docker yet, install it.
Note
Docker is a software platform that allows you to create containers that isolate a CatBoost and ClickHouse installation from the rest of the system.
Before applying a CatBoost model:
1. Pull the Docker image from the registry:
$ docker pull yandex/tutorial-catboost-clickhouse
This Docker image contains everything you need to run CatBoost and ClickHouse: code, runtime, libraries, environment variables, and configuration files.
2. Make sure the Docker image has been successfully pulled:
$ docker image ls
REPOSITORY                            TAG      IMAGE ID       CREATED        SIZE
yandex/tutorial-catboost-clickhouse   latest   622e4d17945b   22 hours ago   1.37GB
3. Start a Docker container based on this image:
$ docker run -it -p 8888:8888 yandex/tutorial-catboost-clickhouse
To create a ClickHouse table for the training sample:
1. Start the ClickHouse console client in interactive mode:
$ clickhouse client
The ClickHouse server is already running inside the Docker container.
2. Create the table using the command:
3. Exit the ClickHouse console client:
:) exit
To insert the data:
1. Run the following command:
$ clickhouse client --host 127.0.0.1 --query 'INSERT INTO amazon_train FORMAT CSVWithNames' < ~/amazon/train.csv
2. Start the ClickHouse console client in interactive mode:
$ clickhouse client
3. Make sure the data has been uploaded, for example by counting the rows in the table:
:) SELECT count() FROM amazon_train
Note
Optional step. The Docker image contains everything you need to run CatBoost and ClickHouse.
To integrate CatBoost into ClickHouse:
1. Build the evaluation library.
The fastest way to evaluate a CatBoost model is to compile the libcatboostmodel.<so|dll|dylib> library. For more information about how to build the library, see the CatBoost documentation.
2. Create a new directory anywhere and with any name, for example, data, and put the created library in it. The Docker image already contains the library data/libcatboostmodel.so.
3. Create a new directory for the model configuration anywhere and with any name, for example, models.
4. Create a model configuration file with any name, for example, models/amazon_model.xml.
5. Describe the model configuration:
<models>
    <model>
        <!-- Model type. Currently only catboost is supported. -->
        <type>catboost</type>
        <!-- Model name. -->
        <name>amazon</name>
        <!-- Path to trained model. -->
        <!-- Update interval. -->
        <lifetime>0</lifetime>
    </model>
</models>
6. Add the path to CatBoost and the model configuration to the ClickHouse configuration:
<!-- File etc/clickhouse-server/config.d/models_config.xml. -->
<catboost_dynamic_library_path>/home/catboost/data/libcatboostmodel.so</catboost_dynamic_library_path>
<models_config>/home/catboost/models/*_model.xml</models_config>
To test the model, run the ClickHouse client:
$ clickhouse client
Let’s make sure that the model is working:
:) SELECT
    modelEvaluate('amazon',
                RESOURCE,
                MGR_ID,
                ROLE_ROLLUP_1,
                ROLE_ROLLUP_2,
                ROLE_DEPTNAME,
                ROLE_TITLE,
                ROLE_FAMILY_DESC,
                ROLE_FAMILY,
                ROLE_CODE) > 0 AS prediction,
    ACTION AS target
FROM amazon_train
LIMIT 10
Note
For multiclass models, the modelEvaluate function returns a tuple with per-class raw predictions.
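To illustrate what a tuple of per-class raw predictions means, the following Python sketch converts such a tuple into class probabilities with softmax. It is not part of the tutorial: it assumes a model trained with a softmax-based objective (such as CatBoost's MultiClass loss), and the function name and sample values are hypothetical. Other multiclass objectives may need a different transformation.

```python
import math

def softmax(raw_scores):
    """Convert per-class raw scores to probabilities that sum to 1."""
    shift = max(raw_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(r - shift) for r in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tuple of raw scores for a three-class model.
raw = (1.0, 2.0, 0.5)
probs = softmax(raw)

# The predicted class is the index of the largest probability.
predicted_class = probs.index(max(probs))
```

The class with the largest raw score is also the class with the largest probability, so the ranking can be read directly from the tuple without the conversion.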
Let’s predict probability:
Note
More information about the exp() function.
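For a binary classifier, the raw value returned by modelEvaluate can be mapped to a probability with the logistic (sigmoid) function, which is the expression 1. / (1. + exp(-prediction)) used in the SQL on this page. The Python sketch below only illustrates the arithmetic; the function name and the raw scores are made up for the example and do not come from the amazon model.

```python
import math

def raw_to_probability(raw: float) -> float:
    """Map a raw binary-classification score to a probability (sigmoid)."""
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical raw scores, not taken from the amazon model.
raw_scores = [-2.0, 0.0, 3.5]
probabilities = [raw_to_probability(r) for r in raw_scores]

# A raw score above 0 corresponds to a probability above 0.5,
# so thresholding the raw value at 0 gives the same labels.
labels = [p > 0.5 for p in probabilities]
```

A raw score of exactly 0 maps to a probability of 0.5, which is why comparing the raw prediction with 0 is equivalent to comparing the probability with 0.5.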
Let’s calculate LogLoss on the sample:
:) SELECT -avg(tg * log(prob) + (1 - tg) * log(1 - prob)) AS logloss
FROM
(
    SELECT
        modelEvaluate('amazon',
                    RESOURCE,
                    MGR_ID,
                    ROLE_ROLLUP_1,
                    ROLE_ROLLUP_2,
                    ROLE_DEPTNAME,
                    ROLE_TITLE,
                    ROLE_FAMILY_DESC,
                    ROLE_FAMILY,
                    ROLE_CODE) AS prediction,
        1. / (1. + exp(-prediction)) AS prob,
        ACTION AS tg
    FROM amazon_train
)
Note
More information about the avg() and log() functions.
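The LogLoss expression in the query, -avg(tg * log(prob) + (1 - tg) * log(1 - prob)), can be checked by hand on a tiny sample. The Python sketch below mirrors that formula term by term; the function name and the two-row sample are hypothetical and only serve to show the calculation.

```python
import math

def logloss(raw_scores, targets):
    """Average binary log loss from raw scores and 0/1 targets."""
    total = 0.0
    for raw, tg in zip(raw_scores, targets):
        prob = 1.0 / (1.0 + math.exp(-raw))  # same sigmoid as in the SQL
        total += tg * math.log(prob) + (1 - tg) * math.log(1 - prob)
    return -total / len(targets)

# Hypothetical sample: two rows with raw score 0 (probability 0.5),
# one with target 1 and one with target 0.
value = logloss([0.0, 0.0], [1, 0])
```

With every probability equal to 0.5 the result is log(2), about 0.693, which is a useful baseline: a model that carries information about the target should score below it.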