Pulsar SQL configuration and deployment

You can configure the Presto Pulsar connector in the ${project.root}/conf/presto/catalog/pulsar.properties properties file.
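For reference, a minimal pulsar.properties for a local standalone cluster could look like the following sketch; the localhost endpoints are placeholders for your own broker and ZooKeeper addresses.

  # ${project.root}/conf/presto/catalog/pulsar.properties
  # name of the connector registered with Presto/Trino
  connector.name=pulsar
  # HTTP service URL of the Pulsar broker(s)
  pulsar.web-service-url=http://localhost:8080
  # URI of the ZooKeeper cluster used by Pulsar
  pulsar.zookeeper-uri=localhost:2181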

You can connect Presto to a Pulsar cluster with multiple hosts. To configure multiple hosts for brokers, add multiple URLs to pulsar.web-service-url. To configure multiple hosts for ZooKeeper, add multiple URIs to pulsar.zookeeper-uri. The following is an example.

  pulsar.web-service-url=http://localhost:8080,localhost:8081,localhost:8082
  pulsar.zookeeper-uri=localhost1,localhost2:2181

Note: by default, Pulsar SQL does not return the last message in a topic. This is by design and is controlled by the settings below. By default, the BookKeeper LAC (last add confirmed) only advances when a subsequent entry is added; if no subsequent entry is added, the last written entry is not visible to readers until the ledger is closed. This is not a problem for the Pulsar broker, which reads through the managed ledger, but Pulsar SQL reads directly from the BookKeeper ledger.

If you want to get the last message in a topic, set the following configurations (see the example snippet after this list):

  1. For the broker configuration, set bookkeeperExplicitLacIntervalInMills > 0 in broker.conf or standalone.conf.
  2. For the Presto configuration, set pulsar.bookkeeper-explicit-interval > 0 and pulsar.bookkeeper-use-v2-protocol=false.

However, using the BookKeeper V3 protocol introduces additional GC overhead to BookKeeper because it uses Protobuf.
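For example, the two sides could be wired up as follows; the value 100 is purely illustrative, not a recommended default.

  # broker.conf (or standalone.conf): have the broker write an explicit LAC periodically
  bookkeeperExplicitLacIntervalInMills=100

  # conf/presto/catalog/pulsar.properties: make Pulsar SQL read the explicit LAC
  # (set to a value greater than 0)
  pulsar.bookkeeper-explicit-interval=100
  pulsar.bookkeeper-use-v2-protocol=false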

Download the Pulsar binary distribution:

  $ wget https://archive.apache.org/dist/pulsar/pulsar-2.10.0/apache-pulsar-2.10.0-bin.tar.gz

Since Pulsar SQL is powered by Trino (formerly Presto SQL), deploying a Pulsar SQL worker uses the same configuration as deploying a Presto worker.

Note: for how to set up a standalone single-node environment, refer to Query data.

You can use the same CLI args as the Presto launcher.

  $ ./bin/pulsar sql-worker --help
  Usage: launcher [options] command

  Commands: run, start, stop, restart, kill, status

  Options:
    -h, --help            show this help message and exit
    -v, --verbose         Run verbosely
    --etc-dir=DIR         Defaults to INSTALL_PATH/etc
    --launcher-config=FILE
                          Defaults to INSTALL_PATH/bin/launcher.properties
    --node-config=FILE    Defaults to ETC_DIR/node.properties
    --jvm-config=FILE     Defaults to ETC_DIR/jvm.config
    --config=FILE         Defaults to ETC_DIR/config.properties
    --log-levels-file=FILE
                          Defaults to ETC_DIR/log.properties
    --data-dir=DIR        Defaults to INSTALL_PATH
    --pid-file=FILE       Defaults to DATA_DIR/var/run/launcher.pid
    --launcher-log-file=FILE
                          Defaults to DATA_DIR/var/log/launcher.log (only in
                          daemon mode)
    --server-log-file=FILE
                          Defaults to DATA_DIR/var/log/server.log (only in
                          daemon mode)
    -D NAME=VALUE         Set a Java system property

The default configuration for the cluster is located in ${project.root}/conf/presto. You can customize your deployment by modifying the default configuration.
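Based on the defaults shown in the launcher help above, that directory typically contains files along these lines; the exact set may differ between releases.

  conf/presto/
    node.properties       # per-node settings
    jvm.config            # JVM options for the SQL worker
    config.properties     # coordinator/worker configuration
    log.properties        # log levels
    catalog/
      pulsar.properties   # Presto Pulsar connector configuration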

You can set the worker to read from a different configuration directory, or to write data to a different directory, by using the --etc-dir and --data-dir options listed above.
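For example (the paths below are placeholders, not defaults):

  $ ./bin/pulsar sql-worker run --etc-dir /path/to/conf/presto --data-dir /path/to/presto-data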

To start the worker as a daemon process, run:

  $ ./bin/pulsar sql-worker start

You can deploy a Pulsar SQL cluster or Presto cluster on multiple nodes. The following example shows how to deploy a cluster across three nodes.

  1. Copy the Pulsar binary distribution to three nodes.

The first node runs as the Presto coordinator. The minimal required configuration in the ${project.root}/conf/presto/config.properties file is as follows.

  coordinator=true
  node-scheduler.include-coordinator=true
  query.max-memory=50GB
  discovery-server.enabled=true
  discovery.uri=<coordinator-url>

The other two nodes serve as worker nodes. You can use the following configuration for the worker nodes.

  coordinator=false
  http-server.http.port=8080
  query.max-memory=50GB
  query.max-memory-per-node=1GB
  discovery.uri=<coordinator-url>
  2. Modify the pulsar.web-service-url and pulsar.zookeeper-uri configuration in the ${project.root}/conf/presto/catalog/pulsar.properties file accordingly on each of the three nodes.

  3. Start the coordinator node.

     $ ./bin/pulsar sql-worker run

  4. Start the worker nodes.

     $ ./bin/pulsar sql-worker run

  5. Start the SQL CLI and check the status of your cluster.

     $ ./bin/pulsar sql --server <coordinator_url>

  6. Check the status of your nodes.

     presto> SELECT * FROM system.runtime.nodes;
      node_id |        http_uri         | node_version | coordinator | state
     ---------+-------------------------+--------------+-------------+--------
      1       | http://192.168.2.1:8081 | testversion  | true        | active
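To also confirm that the Pulsar connector is loaded, you can list its schemas from the same session; this assumes the catalog is named pulsar, after the pulsar.properties catalog file.

  presto> SHOW SCHEMAS FROM pulsar;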
Note: the broker does not advance the LAC, so when Pulsar SQL bypasses the broker to query data, it can only read entries up to the LAC that all the bookies have learned. You can make the broker write the LAC at a regular interval by setting bookkeeperExplicitLacIntervalInMills in broker.conf.