Apache Beam Spark Pipeline Engine
The Spark Runner executes Beam pipelines on top of Apache Spark, providing:
- Batch and streaming (and combined) pipelines.
- The same security features Spark provides.
- Built-in metrics reporting using Spark's metrics system, which reports Beam Aggregators as well.
- Native support for Beam side inputs via Spark's broadcast variables.
Check the Apache Beam Spark runner docs for more information.
Since execution of a pipeline on Spark is only possible from the Spark master, one option is to start a Hop server on the master. You can then execute remotely, from anywhere, on the Spark master of your choice. Make sure that any referenced artifacts, like the fat jar you want to use, are available to the Hop server.
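For example, starting a Hop server on the master node might look like the sketch below; the hostname and port are placeholders for your environment, and a standard Hop installation on that node is assumed:

```bash
# Sketch: start a Hop server on the Spark master node.
# spark-master.example.com and 8080 are placeholders; adjust to your setup.
sh hop-server.sh spark-master.example.com 8080
```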
You can also execute using the `spark-submit` tool. The main class to use is `org.apache.hop.beam.run.MainBeam`.
It accepts 3 arguments:
| Argument | Description |
|---|---|
| 1 | The filename of the pipeline to execute. |
| 2 | The filename of the metadata to load (JSON). You can export metadata in the Hop GUI under the Tools menu (part of the Beam plugin). |
| 3 | The name of the pipeline run configuration to use. |
Spark-submit also needs a fat jar. This can be generated in the Hop GUI under the Tools menu, or with the following command:

```bash
sh hop-config.sh -fj /path/to/fat.jar
```
Important: project configurations, environments, and similar concepts are not valid in the context of the Spark runtime. How best to support them there is an open TODO for the Hop community, and your input is welcome. In the meantime, you can pass variables to the JVM as system properties with the `--driver-java-options` option, as shown below.
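For instance, multiple variables can be passed in a single `--driver-java-options` value; `ENVIRONMENT_NAME` below is a hypothetical variable name used purely for illustration:

```bash
# Fragment of a spark-submit command: pass two variables to the driver JVM
# as Java system properties. ENVIRONMENT_NAME is a hypothetical name;
# PROJECT_HOME matches the full example further down.
--driver-java-options '-DPROJECT_HOME=/my/project/home -DENVIRONMENT_NAME=prod'
```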
In general, it is better not to use relative paths like `${Internal.Entry.Current.Folder}` in filenames when executing pipelines remotely. It's usually better to pick a few root folders and reference them as variables; `PROJECT_HOME` is as good a variable name as any.
An example spark-submit command might look like this (note that the three arguments from the table above are passed in order; the metadata JSON path is just an example):

```bash
spark-submit \
  --class org.apache.hop.beam.run.MainBeam \
  --driver-java-options '-DPROJECT_HOME=/my/project/home' \
  hop-0.70-fat.jar \
  /my/project/home/pipeline.hpl \
  /my/project/home/metadata.json \
  SparkRunConfig
```
You can specify a master of `local[4]` to run using an embedded Spark engine; this is primarily useful for testing locally. The number 4 in the example is the desired number of threads to use when executing. You can also specify `local[*]` to let Spark figure that out automatically for your system.
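For example, a local test run might look like the following sketch (the file paths mirror the example above and are assumptions for your own project layout):

```bash
# Run the pipeline on an embedded Spark engine with 4 worker threads.
spark-submit \
  --class org.apache.hop.beam.run.MainBeam \
  --master 'local[4]' \
  hop-0.70-fat.jar \
  /my/project/home/pipeline.hpl \
  /my/project/home/metadata.json \
  SparkRunConfig
```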
Please note that you can run into errors when Spark fails to determine or bind a usable local IP address. A common workaround is to set the following environment variable before running:

```bash
export SPARK_LOCAL_IP="127.0.0.1"
```