Apache Beam Spark Pipeline Engine

    The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

    • Batch and streaming (and combined) pipelines.

    • The same security features Spark provides.

    • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.

    • Native support for Beam side-inputs via Spark’s Broadcast variables.

    Check the Apache Beam Spark runner docs for more information.

    Since execution of a pipeline on Spark is only possible from the Spark master, you can start a Hop server on the master and then execute remotely, from anywhere, against the Spark master of your choice. Make sure that any referenceable artifacts, like the fat jar you want to use, are available to the Hop server.
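    As a minimal sketch of starting such a Hop server on the master node (the hostname and port below are placeholders, not defaults):

    1. # Start a Hop server on the Spark master; hostname and port are placeholders.
    2. sh hop-server.sh sparkmaster.example.com 8080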

    You can also execute using the ‘spark-submit’ tool. The Beam plugin provides a main class, org.apache.hop.beam.run.MainBeam, for this purpose.

    It accepts 3 arguments:

    Argument   Description
    1          The filename of the pipeline to execute.
    2          The filename of the metadata to load (JSON). You can export metadata in the Hop GUI under the Tools menu (part of the Beam plugin).
    3          The name of the pipeline run configuration to use.

    Spark-submit also needs a fat jar. This can be generated in the Hop GUI under the Tools menu or with the following command:

    1. sh hop-conf.sh -fj /path/to/fat.jar

    Important: project configurations, environments and similar concepts are not available in the context of the Spark runtime. How best to support this is an open TODO for the Hop community; your input is welcome. In the meantime, pass the variables you need to the driver JVM with the spark-submit option --driver-java-options.
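    A sketch of that option (DATA_HOME is a hypothetical second variable, added purely for illustration):

    1. # Fragment of a spark-submit command; each -D entry becomes a Java system
    2. # property in the driver JVM, which Hop can then resolve as a variable.
    3. --driver-java-options '-DPROJECT_HOME=/my/project/home -DDATA_HOME=/my/data'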

    In general, it is better not to use relative paths like ${Internal.Entry.Current.Folder} when specifying filenames for pipelines that are executed remotely. It’s usually better to pick a few root folders and reference them through variables; PROJECT_HOME is as good a variable as any to use.

    An example spark-submit command might look like this:

    1. spark-submit \
    2. --class org.apache.hop.beam.run.MainBeam \
    3. --driver-java-options '-DPROJECT_HOME=/my/project/home' \
    4. hop-0.70-fat.jar \
    5. /my/project/home/pipeline.hpl \
    6. /my/project/home/metadata.json \
    7. SparkRunConfig

    You can specify a master of local[4] to run using an embedded Spark engine, which is primarily useful for local testing. The number 4 in the example is the number of threads to use when executing; you can also specify local[*] to let Spark figure that out for your system.
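    A sketch of such a local test run, assuming the master is passed on the spark-submit command line and using the same hypothetical file locations as the example above:

    1. # Run the pipeline on the embedded Spark engine with 4 worker threads.
    2. spark-submit \
    3. --master 'local[4]' \
    4. --class org.apache.hop.beam.run.MainBeam \
    5. --driver-java-options '-DPROJECT_HOME=/my/project/home' \
    6. hop-0.70-fat.jar \
    7. /my/project/home/pipeline.hpl \
    8. /my/project/home/metadata.json \
    9. SparkRunConfig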

    Please note that when running locally you can get an error because Spark fails to resolve a usable local IP address. In that case, set the SPARK_LOCAL_IP environment variable before launching:

    1. export SPARK_LOCAL_IP="127.0.0.1"