Apache Spark

    For Scala 2.11:

    1. <dependency>
    2. <groupId>com.yugabyte.spark</groupId>
    3. <artifactId>spark-cassandra-connector_2.11</artifactId>
    4. <version>2.0.5-yb-2</version>
    5. </dependency>

    You can run our Spark-based sample app with:

    1. $ java -jar yb-sample-apps.jar --workload CassandraSparkWordCount --nodes 127.0.0.1:9042

    It reads data from a table with sentences — by default, it generates an input table ybdemo_keyspace.lines, computes the frequencies of the words, and writes the result to the output table ybdemo_keyspace.wordcounts.

    Examining the source code

    • the source file in our GitHub source repo
    • untar the jar java/yb-sample-apps-sources.jar in the download bundle

    Most of the logic is in the method of the CassandraSparkWordCount class (in the file src/main/java/com/yugabyte/sample/apps/CassandraSparkWordCount.java). Some of the key portions of the sample program are explained in the sections below.

    The SparkConf object is configured as follows:

    Setting the input source

    To set the input data for Spark, you can do one of the following.

    • Reading from a table with a column line as the input
    1. // Read rows from table and convert them to an RDD.
    2. JavaRDD<String> rows = javaFunctions(sc).cassandraTable(keyspace, inputTable)
    3. .map(row -> row.getString("line"));
    • Reading from a file as the input:
    1. // Read the input file and convert it to an RDD.
    2. JavaRDD<String> rows = sc.textFile(inputFile);

    Setting the output table

    The output is written to the table.

    1. // Create the output table.
    2. session.execute("CREATE TABLE IF NOT EXISTS " + outTable +
    3. " (word VARCHAR PRIMARY KEY, count INT);");
    4. // Save the output to the CQL table.
    5. javaFunctions(counts).writerBuilder(keyspace, outputTable, mapTupleToRow(String.class, Integer.class))
    6. .withColumnSelector(someColumns("word", "count"))
    7. .saveToCassandra();

    Start PySpark with for Scala 2.10:

      For Scala 2.11: