Apache Beam Spark Pipeline Engine

    The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

    • Batch and streaming (and combined) pipelines.

    • The same security features Spark provides.

    • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.

    • Native support for Beam side-inputs via Spark’s Broadcast variables.

    Check the Apache Beam Spark runner docs for more information.

    Since execution of a pipeline on Spark is only possible from the Spark master, you can start a Hop server on the master and then execute remotely, from anywhere, against the Spark master of your choice. Make sure that any referenceable artifacts, like the fat jar you want to use, are available to the Hop server.
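    As a minimal sketch of starting such a Hop server on the master node (the hostname and port below are placeholders, not defaults):

    1. # Start a Hop server on the Spark master; hostname and port are placeholders.
    2. sh hop-server.sh sparkmaster.example.com 8080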

    You can also execute using the ‘spark-submit’ tool. The Beam plugin provides a main class, org.apache.hop.beam.run.MainBeam, for this purpose.

    It accepts 3 arguments:

    Argument   Description
    1          The filename of the pipeline to execute.
    2          The filename of the metadata to load (JSON). You can export metadata in the Hop GUI under the Tools menu (part of the Beam plugin).
    3          The name of the pipeline run configuration to use.

    Spark-submit also needs a fat jar. This can be generated in the Hop GUI under the Tools menu or with the following command:

    1. sh hop-conf.sh -fj /path/to/fat.jar

    Important: project configurations, environments and similar concepts are not available in the context of the Spark runtime. How best to support this is an open TODO for the Hop community; your input is welcome. In the meantime, pass the variables you need to the driver JVM with the spark-submit option --driver-java-options.
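    A sketch of that option (DATA_HOME is a hypothetical second variable, added purely for illustration):

    1. # Fragment of a spark-submit command; each -D entry becomes a Java system
    2. # property in the driver JVM, which Hop can then resolve as a variable.
    3. --driver-java-options '-DPROJECT_HOME=/my/project/home -DDATA_HOME=/my/data'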

    In general, it is better not to use relative paths like ${Internal.Entry.Current.Folder} when specifying filenames for pipelines that are executed remotely. It’s usually better to pick a few root folders and reference them through variables; PROJECT_HOME is as good a variable as any to use.

    An example spark-submit command might look like this:

    1. spark-submit \
    2. --class org.apache.hop.beam.run.MainBeam \
    3. --driver-java-options '-DPROJECT_HOME=/my/project/home' \
    4. hop-0.70-fat.jar \
    5. /my/project/home/pipeline.hpl \
    6. /my/project/home/metadata.json \
    7. SparkRunConfig

    You can specify a master of local[4] to run using an embedded Spark engine, which is primarily useful for local testing. The number 4 in the example is the number of threads to use when executing; you can also specify local[*] to let Spark figure that out for your system.
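    A sketch of such a local test run, assuming the master is passed on the spark-submit command line and using the same hypothetical file locations as the example above:

    1. # Run the pipeline on the embedded Spark engine with 4 worker threads.
    2. spark-submit \
    3. --master 'local[4]' \
    4. --class org.apache.hop.beam.run.MainBeam \
    5. --driver-java-options '-DPROJECT_HOME=/my/project/home' \
    6. hop-0.70-fat.jar \
    7. /my/project/home/pipeline.hpl \
    8. /my/project/home/metadata.json \
    9. SparkRunConfig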

    Please note that when running locally you can get an error because Spark fails to resolve a usable local IP address. In that case, set the SPARK_LOCAL_IP environment variable before launching:

    1. export SPARK_LOCAL_IP="127.0.0.1"