Applying a Catboost Model in ClickHouse

    With this instruction, you will learn to apply pre-trained models in ClickHouse by running model inference from SQL.

    To apply a CatBoost model in ClickHouse:

    1. Create a Table.
    2. .
    3. Integrate CatBoost into ClickHouse (Optional step).
    4. .

    For more information about training CatBoost models, see Training and applying models.

    If you don’t have the yet, install it.

    Note

    Docker is a software platform that allows you to create containers that isolate a CatBoost and ClickHouse installation from the rest of the system.

    Before applying a CatBoost model:

    1. Pull the from the registry:

    This Docker image contains everything you need to run CatBoost and ClickHouse: code, runtime, libraries, environment variables, and configuration files.

    2. Make sure the Docker image has been successfully pulled:

    1. REPOSITORY TAG IMAGE ID CREATED SIZE
    2. yandex/tutorial-catboost-clickhouse latest 622e4d17945b 22 hours ago 1.37GB

    3. Start a Docker container based on this image:

    1. $ docker run -it -p 8888:8888 yandex/tutorial-catboost-clickhouse

    To create a ClickHouse table for the train sample:

    1. Start ClickHouse console client in interactive mode:

    1. $ clickhouse client

    The ClickHouse server is already running inside the Docker container.

    2. Create the table using the command:

    3. Exit from ClickHouse console client:

    1. :) exit

    To insert the data:

    1. Run the following command:

    1. $ clickhouse client --host 127.0.0.1 --query 'INSERT INTO amazon_train FORMAT CSVWithNames' < ~/amazon/train.csv

    2. Start ClickHouse console client in interactive mode:

    1. $ clickhouse client

    3. Make sure the data has been uploaded:

    Note

    Optional step. The Docker image contains everything you need to run CatBoost and ClickHouse.

    To integrate CatBoost into ClickHouse:

    1. Build the evaluation library.

    The fastest way to evaluate a CatBoost model is compile libcatboostmodel.<so|dll|dylib> library. For more information about how to build the library, see CatBoost documentation.

    2. Create a new directory anywhere and with any name, for example, data and put the created library in it. The Docker image already contains the library data/libcatboostmodel.so.

    4. Create a model configuration file with any name, for example, models/amazon_model.xml.

    5. Describe the model configuration:

    1. <models>
    2. <model>
    3. <!-- Model type. Now catboost only. -->
    4. <!-- Model name. -->
    5. <name>amazon</name>
    6. <!-- Path to trained model. -->
    7. <!-- Update interval. -->
    8. <lifetime>0</lifetime>
    9. </model>
    10. </models>

    6. Add the path to CatBoost and the model configuration to the ClickHouse configuration:

    1. <!-- File etc/clickhouse-server/config.d/models_config.xml. -->
    2. <catboost_dynamic_library_path>/home/catboost/data/libcatboostmodel.so</catboost_dynamic_library_path>
    3. <models_config>/home/catboost/models/*_model.xml</models_config>

    For test model run the ClickHouse client $ clickhouse client.

    Let’s make sure that the model is working:

    1. :) SELECT
    2. modelEvaluate('amazon',
    3. RESOURCE,
    4. MGR_ID,
    5. ROLE_ROLLUP_1,
    6. ROLE_ROLLUP_2,
    7. ROLE_DEPTNAME,
    8. ROLE_TITLE,
    9. ROLE_FAMILY_DESC,
    10. ROLE_CODE) > 0 AS prediction,
    11. FROM amazon_train
    12. LIMIT 10

    Note

    Function returns tuple with per-class raw predictions for multiclass models.

    Let’s predict probability:

    Note

    More info about exp() function.

    Let’s calculate LogLoss on the sample:

    1. :) SELECT -avg(tg * log(prob) + (1 - tg) * log(1 - prob)) AS logloss
    2. FROM
    3. (
    4. SELECT
    5. modelEvaluate('amazon',
    6. RESOURCE,
    7. MGR_ID,
    8. ROLE_ROLLUP_1,
    9. ROLE_ROLLUP_2,
    10. ROLE_DEPTNAME,
    11. ROLE_TITLE,
    12. ROLE_FAMILY_DESC,
    13. ROLE_FAMILY,
    14. ROLE_CODE) AS prediction,
    15. 1. / (1. + exp(-prediction)) AS prob,
    16. ACTION AS tg

    Note

    More info about and log() functions.