Applying a CatBoost Model in ClickHouse

This guide shows how to apply pre-trained CatBoost models in ClickHouse by running model inference from SQL.

To apply a CatBoost model in ClickHouse:

1. Create a table.
2. Insert the data into the table.
3. Integrate CatBoost into ClickHouse (optional step).
4. Run the model inference from SQL.

For more information about training CatBoost models, see Training and applying models.

If you don’t have Docker yet, install it.

Note

Docker is a software platform that allows you to create containers that isolate a CatBoost and ClickHouse installation from the rest of the system.

Before applying a CatBoost model:

1. Pull the Docker image from the registry.

This Docker image contains everything you need to run CatBoost and ClickHouse: code, runtime, libraries, environment variables, and configuration files.

2. Make sure the Docker image has been successfully pulled. The list of local images should contain an entry like this:


REPOSITORY                            TAG                 IMAGE ID            CREATED             SIZE
yandex/tutorial-catboost-clickhouse   latest              622e4d17945b        22 hours ago        1.37GB

3. Start a Docker container based on this image:

$ docker run -it -p 8888:8888 yandex/tutorial-catboost-clickhouse
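The three steps above can be sketched as a small shell helper. Only the image name and the `docker run` options come from this guide; the function names and the `DOCKER` override variable are illustrative assumptions.

```shell
# Image name taken from the `docker run` command above.
IMAGE="yandex/tutorial-catboost-clickhouse"
# Assumption: DOCKER can be overridden, for example with a dry-run stub.
DOCKER="${DOCKER:-docker}"

# Steps 1 and 2: pull the image, then list it to confirm the pull succeeded.
pull_and_check() {
    "$DOCKER" pull "$IMAGE" || return 1
    "$DOCKER" image ls "$IMAGE"
}

# Step 3: start an interactive container and publish port 8888.
start_container() {
    "$DOCKER" run -it -p 8888:8888 "$IMAGE"
}
```

Calling `pull_and_check` followed by `start_container` reproduces the manual sequence. Setting `DOCKER=echo` first prints the commands instead of running them.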

To create a ClickHouse table for the training sample:

1. Start the ClickHouse console client in interactive mode:

$ clickhouse client

The ClickHouse server is already running inside the Docker container.

2. Create the amazon_train table with a CREATE TABLE statement.

3. Exit the ClickHouse console client:

:) exit
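The statement for step 2 could look like the sketch below. The column names are the ones used by the queries later in this guide. The column types, the materialized `date` column, and the `MergeTree` ordering key are assumptions to adapt to the actual data. The `CLICKHOUSE` variable and the `create_amazon_train` function are illustrative names.

```shell
# Assumption: the `clickhouse` binary is on PATH inside the container.
CLICKHOUSE="${CLICKHOUSE:-clickhouse}"

# Table definition. ACTION is the target column used by the LogLoss query;
# all column types here are assumed, not taken from the dataset.
DDL=$(cat <<'SQL'
CREATE TABLE IF NOT EXISTS amazon_train
(
    date Date MATERIALIZED today(),
    ACTION UInt8,
    RESOURCE UInt32,
    MGR_ID UInt32,
    ROLE_ROLLUP_1 UInt32,
    ROLE_ROLLUP_2 UInt32,
    ROLE_DEPTNAME UInt32,
    ROLE_TITLE UInt32,
    ROLE_FAMILY_DESC UInt32,
    ROLE_FAMILY UInt32,
    ROLE_CODE UInt32
)
ENGINE = MergeTree ORDER BY date
SQL
)

# Run the statement non-interactively instead of typing it at the :) prompt.
create_amazon_train() {
    "$CLICKHOUSE" client --query "$DDL"
}
```

Running `create_amazon_train` inside the container is equivalent to pasting the SQL into the interactive client.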

To insert the data:

1. Run the following command:

$ clickhouse client --host 127.0.0.1 --query 'INSERT INTO amazon_train FORMAT CSVWithNames' < ~/amazon/train.csv

2. Start the ClickHouse console client in interactive mode:

$ clickhouse client

3. Make sure the data has been uploaded, for example by counting the rows in the amazon_train table.
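One way to carry out this check is to compare the row count in the table with the number of data lines in the CSV file. The sketch below assumes the file has a single header line (as `CSVWithNames` implies) and no line breaks inside quoted fields. The helper names are illustrative.

```shell
# Assumption: the `clickhouse` binary is on PATH inside the container.
CLICKHOUSE="${CLICKHOUSE:-clickhouse}"

# Number of data rows in a CSV file: total lines minus the header line.
csv_data_rows() {
    awk 'END { print NR - 1 }' "$1"
}

# Number of rows currently stored in the table.
table_rows() {
    "$CLICKHOUSE" client --query 'SELECT count() FROM amazon_train'
}

# Compare both counts and report the result.
check_upload() {
    expected=$(csv_data_rows "$1")
    actual=$(table_rows)
    if [ "$expected" = "$actual" ]; then
        echo "OK: $actual rows loaded"
    else
        echo "Mismatch: CSV has $expected rows, table has $actual" >&2
        return 1
    fi
}
```

For the file used above, the call would be `check_upload ~/amazon/train.csv`.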

Note

This step is optional. The Docker image already contains everything you need to run CatBoost and ClickHouse.

To integrate CatBoost into ClickHouse:

1. Build the evaluation library.

The fastest way to evaluate a CatBoost model is to compile the libcatboostmodel.<so|dll|dylib> library. For more information about how to build the library, see the CatBoost documentation.

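2. Create a new directory anywhere and with any name, for example, data, and put the built library in it. The Docker image already contains the library data/libcatboostmodel.so.

3. Create a new directory for the model configuration anywhere and with any name, for example, models.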

4. Create a model configuration file with any name, for example, models/amazon_model.xml.

5. Describe the model configuration:

<models>
    <model>
        <!-- Model type. Currently only catboost is supported. -->
        <type>catboost</type>
        <!-- Model name. -->
        <name>amazon</name>
        <!-- Path to trained model. -->
        <!-- Update interval. -->
        <lifetime>0</lifetime>
    </model>
</models>
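The configuration file can also be generated from a script. In the sketch below, `write_model_config` is a hypothetical helper. The model file path is passed as an argument because this guide does not fix it. The `<type>` and `<path>` element names follow the ClickHouse model configuration format that the comments in the listing describe.

```shell
# Usage: write_model_config MODEL_NAME MODEL_PATH OUTPUT_FILE
# Values are written as-is, so they must not contain XML special characters.
write_model_config() {
    cat > "$3" <<EOF
<models>
    <model>
        <!-- Model type. -->
        <type>catboost</type>
        <!-- Model name, used as the first argument of modelEvaluate(). -->
        <name>$1</name>
        <!-- Path to the trained model file. -->
        <path>$2</path>
        <!-- Update interval. -->
        <lifetime>0</lifetime>
    </model>
</models>
EOF
}
```

A call such as `write_model_config amazon "$MODEL_FILE" models/amazon_model.xml` writes the file, where `MODEL_FILE` is a placeholder variable holding the location of your trained model.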

6. Add the path to CatBoost and the model configuration to the ClickHouse configuration:

<!-- File etc/clickhouse-server/config.d/models_config.xml. -->
<catboost_dynamic_library_path>/home/catboost/data/libcatboostmodel.so</catboost_dynamic_library_path>
<models_config>/home/catboost/models/*_model.xml</models_config>

To test the model, run the ClickHouse client:

$ clickhouse client

Let’s make sure that the model is working:

:) SELECT
    modelEvaluate('amazon',
                RESOURCE,
                MGR_ID,
                ROLE_ROLLUP_1,
                ROLE_ROLLUP_2,
                ROLE_DEPTNAME,
                ROLE_TITLE,
                ROLE_FAMILY_DESC,
                ROLE_FAMILY,
                ROLE_CODE) > 0 AS prediction,
    ACTION AS target
FROM amazon_train
LIMIT 10

Note

The modelEvaluate function returns a tuple with per-class raw predictions for multiclass models.

Let’s predict the probability by applying the sigmoid function to the raw prediction.
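A query for this step could look like the sketch below, which reuses the sigmoid expression from the LogLoss query later in this guide. The column aliases `probability` and `target` and the shell helper names are assumptions. The `sigmoid` helper only illustrates the same formula outside ClickHouse.

```shell
# Assumption: the `clickhouse` binary is on PATH inside the container.
CLICKHOUSE="${CLICKHOUSE:-clickhouse}"

# Sigmoid maps a raw prediction x to a probability: 1 / (1 + exp(-x)).
sigmoid() {
    awk -v x="$1" 'BEGIN { printf "%.6f\n", 1 / (1 + exp(-x)) }'
}

# The same transformation applied inside ClickHouse to the first 10 rows.
PROBABILITY_QUERY=$(cat <<'SQL'
SELECT
    modelEvaluate('amazon',
                RESOURCE,
                MGR_ID,
                ROLE_ROLLUP_1,
                ROLE_ROLLUP_2,
                ROLE_DEPTNAME,
                ROLE_TITLE,
                ROLE_FAMILY_DESC,
                ROLE_FAMILY,
                ROLE_CODE) AS prediction,
    1. / (1. + exp(-prediction)) AS probability,
    ACTION AS target
FROM amazon_train
LIMIT 10
SQL
)

# Run the query non-interactively.
predict_probability() {
    "$CLICKHOUSE" client --query "$PROBABILITY_QUERY"
}
```

A raw prediction of 0 corresponds to a probability of 0.5 (`sigmoid 0` prints `0.500000`), which is why the earlier query thresholds the raw value at 0.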

Note

More info about the exp() function.

Let’s calculate LogLoss on the sample:

:) SELECT -avg(tg * log(prob) + (1 - tg) * log(1 - prob)) AS logloss
FROM 
(
    SELECT 
        modelEvaluate('amazon', 
                    RESOURCE,
                    MGR_ID,
                    ROLE_ROLLUP_1,
                    ROLE_ROLLUP_2,
                    ROLE_DEPTNAME,
                    ROLE_TITLE,
                    ROLE_FAMILY_DESC,
                    ROLE_FAMILY,
                    ROLE_CODE) AS prediction,
        1. / (1. + exp(-prediction)) AS prob, 
        ACTION AS tg
    FROM amazon_train
)

Note

More info about the avg() and log() functions.
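To see what the query computes, the same LogLoss formula can be evaluated on a few hand-written rows. The sketch below is only an illustration in awk. The function name `logloss` and the two-column input format (target, then probability) are assumptions.

```shell
# Reads lines of "target probability" from standard input and prints
# -avg(tg * log(prob) + (1 - tg) * log(1 - prob)), using natural logarithms.
# Probabilities must be strictly between 0 and 1.
logloss() {
    awk '{ sum += $1 * log($2) + (1 - $1) * log(1 - $2); n++ }
         END { if (n > 0) printf "%.6f\n", -sum / n }'
}
```

For example, `printf '1 0.5\n0 0.5\n' | logloss` prints `0.693147`, which is ln 2: a model that always answers 0.5 is penalized by ln 2 per row regardless of the target.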
