Glue

    The Glue API in LocalStack Pro allows you to run ETL (Extract-Transform-Load) jobs locally, maintain table metadata in the local Glue data catalog, and use the Spark ecosystem (PySpark/Scala) to run data processing workflows.

    Note: In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of approx. 1.5GB which includes Spark, Presto, Hive and other tools. These dependencies are automatically fetched when you start up the service, so please make sure you’re on a decent internet connection when pulling the dependencies for the first time.

    The commands below illustrate the creation of some very basic entries (databases, tables) in the Glue data catalog.
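
    For instance, a database and a table can be registered as follows (demo_db and demo_table are placeholder names):

    $ awslocal glue create-database --database-input '{"Name": "demo_db"}'
    $ awslocal glue create-table --database-name demo_db --table-input '{"Name": "demo_table"}'
    $ awslocal glue get-tables --database-name demo_db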

    Running Scripts with Scala and PySpark

    Assuming we would like to deploy a simple PySpark script job.py from a local folder, we can first copy the script to an S3 bucket:

    $ awslocal s3 mb s3://glue-test
    $ awslocal s3 cp job.py s3://glue-test/job.py

    Next, we can create a job definition:

    $ awslocal glue create-job --name job1 --role r1 \
        --command '{"Name": "pythonshell", "ScriptLocation": "s3://glue-test/job.py"}'

    … and finally start the job:

    $ awslocal glue start-job-run --job-name job1
    {
        "JobRunId": "733b76d0"
    }
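
    The returned JobRunId can be used to query the status of the job run, for example (the run ID shown here is just a sample value):

    $ awslocal glue get-job-run --job-name job1 --run-id 733b76d0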

    For a more detailed example illustrating how to run a local Glue PySpark job, please refer to the sample in our localstack-pro-samples repository on GitHub.

    Importing Tables from Athena

    The Glue data catalog is integrated with Athena, and the database/table definitions can be imported via the import-catalog-to-glue API.

    Assume you are running the following Athena queries to create databases and table definitions:

    CREATE DATABASE db2
    CREATE EXTERNAL TABLE db2.table1 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://test/table1'
    CREATE EXTERNAL TABLE db2.table2 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://test/table2'
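
    These statements can be submitted to the local Athena API, for example via start-query-execution (the result output location s3://athena-results/ is an arbitrary example bucket):

    $ awslocal athena start-query-execution --query-string "CREATE DATABASE db2" \
        --result-configuration "OutputLocation=s3://athena-results/"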

    Then this command will import these DB/table definitions into the Glue data catalog:

    $ awslocal glue import-catalog-to-glue

    … and finally they will be available in Glue:

    $ awslocal glue get-databases
    {
        "DatabaseList": [
            ...
            {
                "Name": "db2",
                "Description": "Database db2 imported from Athena",
                "TargetDatabase": {
                    "CatalogId": "000000000000",
                    "DatabaseName": "db2"
                }
            }
        ]
    }
    $ awslocal glue get-tables --database-name db2
    {
        "TableList": [
            {
                "Name": "table1",
                "DatabaseName": "db2",
                "Description": "Table db2.table1 imported from Athena",
                "CreateTime": ...
            },
            {
                "Name": "table2",
                "DatabaseName": "db2",
                "Description": "Table db2.table2 imported from Athena",
                "CreateTime": ...
            }
        ]
    }

    Crawlers

    Glue crawlers allow extracting metadata from structured data sources. The example below illustrates crawling tables and partition metadata from S3 buckets.
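
    The crawler needs some data to crawl. For example, we can create an S3 bucket and upload a few files in a partitioned folder structure (the file contents and the year/month/day partitioning are just an illustrative assumption; the s3://test/table1 path matches the crawler target configured below):

    $ awslocal s3 mb s3://test
    $ printf "1, 2, 3\n4, 5, 6" > /tmp/file.csv
    $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=1/file.csv
    $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=2/file.csv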

    Then we can create and trigger the crawler:

    $ awslocal glue create-database --database-input '{"Name":"db1"}'
    $ awslocal glue create-crawler --name c1 --database-name db1 --role r1 --targets '{"S3Targets": [{"Path": "s3://test/table1"}]}'
    $ awslocal glue start-crawler --name c1
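
    Crawler runs are asynchronous, so it may take a moment until the results show up in the catalog; the crawler state can be checked, for example, via:

    $ awslocal glue get-crawler --name c1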

    Finally, we can query the table and partition metadata that has been created by the crawler:

    $ awslocal glue get-tables --database-name db1
    {
        "TableList": [{
            "Name": "table1",
            "DatabaseName": "db1",
            "PartitionKeys": [ ... ]
            ...
    $ awslocal glue get-partitions --database-name db1 --table-name table1
    {
        "Partitions": [{
            "DatabaseName": "db1",
            "TableName": "table1",
            ...

    Schema Registry

    The Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With the Schema Registry, you can manage and enforce schemas and schema compatibilities in your streaming applications. It integrates nicely with Amazon Managed Streaming for Apache Kafka (MSK).

    Note: Currently, LocalStack supports the AVRO data format for the Glue Schema Registry. Support for other data formats will be added in the future.

    $ awslocal glue create-registry --registry-name demo-registry
    {
        "RegistryArn": "arn:aws:glue:us-east-1:000000000000:file-registry/demo-registry",
        "RegistryName": "demo-registry"
    }
    $ awslocal glue create-schema --schema-name demo-schema --registry-id RegistryName=demo-registry --data-format AVRO --compatibility FORWARD \
        --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}]}'
    {
        "RegistryName": "demo-registry",
        "RegistryArn": "arn:aws:glue:us-east-1:000000000000:file-registry/demo-registry",
        "SchemaName": "demo-schema",
        "SchemaArn": "arn:aws:glue:us-east-1:000000000000:schema/demo-registry/demo-schema",
        "DataFormat": "AVRO",
        "Compatibility": "FORWARD",
        "SchemaCheckpoint": 1,
        "LatestSchemaVersion": 1,
        "NextSchemaVersion": 2,
        "SchemaStatus": "AVAILABLE",
        "SchemaVersionId": "546d3220-6ab8-452c-bb28-0f1f075f90dd",
        "SchemaVersionStatus": "AVAILABLE"
    }
    $ awslocal glue register-schema-version --schema-id SchemaName=demo-schema,RegistryName=demo-registry \
        --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}, {"name":"Address","type":"string"}]}'
    {
        "SchemaVersionId": "ee38732b-b299-430d-a88b-4c429d9e1208",
        "VersionNumber": 2
    }
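
    The registered schema versions can then be retrieved again, for example by version number:

    $ awslocal glue get-schema-version --schema-id SchemaName=demo-schema,RegistryName=demo-registry --schema-version-number VersionNumber=2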

    You can find a more advanced sample in our localstack-pro-samples repository on GitHub, which showcases the integration with AWS MSK and automatic schema registrations (including schema version rejections based on the configured compatibility).

    Further Reading

    AWS Glue is a fairly comprehensive service - more details can be found in the official AWS Glue Developer Guide.