To create an external table definition, you specify the format of your input files and the location of your external data sources. For information about input file formats, see .
Use one of the following protocols to access external table data sources. You cannot mix protocols in statements:
gpfdist://
points to a directory on the file host and serves external data files to all Greenplum Database segments in parallel. See gpfdist:// Protocol.gpfdists://
is the secure version ofgpfdist
. See .- The
pxf://
protocol accesses object store systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases using the Greenplum Platform Extension Framework (PXF). See pxf:// Protocol. s3://
accesses files in an Amazon S3 bucket. See .
The pxf://
and s3://
protocols are custom data access protocols, where the file://
, gpfdist://
, and gpfdists://
protocols are implemented internally in Greenplum Database. The custom and internal protocols differ in these ways:
pxf://
ands3://
are custom protocols that must be registered using theCREATE EXTENSION
command (pxf
) or theCREATE PROTOCOL
command (s3
). Registering the PXF extension in a database creates thepxf
protocol. (See Accessing External Data with PXF.) To use the protocol, you must configure the database and register thes3
protocol. (See .) Internal protocols are always present and cannot be unregistered.- When a custom protocol is registered, a row is added to the
pg_extprotocol
catalog table to specify the handler functions that implement the protocol. The protocol’s shared libraries must have been installed on all Greenplum Database hosts. The internal protocols are not represented in thepg_extprotocol
table and have no additional libraries to install. - To grant users permissions on custom protocols, you use
GRANT [SELECT | INSERT | ALL] ON PROTOCOL
. To allow (or deny) users permissions on the internal protocols, you useCREATE ROLE
orALTER ROLE
to add theCREATEEXTTABLE
(orNOCREATEEXTTABLE
) attribute to each user’s role.
External tables access external files from within the database as if they are regular database tables. External tables defined with the gpfdist
/gpfdists
, pxf
, and s3
protocols utilize Greenplum parallelism by using the resources of all Greenplum Database segments to load or unload data. The pxf
protocol leverages the parallel architecture of the Hadoop Distributed File System to access files on that system. The s3
protocol utilizes the Amazon Web Services (AWS) capabilities.
The steps for using external tables are:
Define the external table.
To use the
pxf
ors3
protocol, you must also configure Greenplum Database and enable the protocol. See pxf:// Protocol or .Do one of the following:
- Start the Greenplum Database file server(s) when using the or
gpdists
protocols. - Verify the Greenplum Database configuration for the
s3
protocol.
- Start the Greenplum Database file server(s) when using the or
- Place the data files in the correct locations.
- Query the external table with SQL commands.
Readable external tables for data loading. Readable external tables support:
- Basic extraction, transformation, and loading (ETL) tasks common in data warehousing
- Reading external table data in parallel from multiple Greenplum database segment instances, to optimize large load operations
- Filter pushdown. If a query contains a
WHERE
clause, it may be passed to the external data source. Refer to the gp_external_enable_filter_pushdown server configuration parameter discussion for more information. Note that this feature is currently supported only with thepxf
protocol (see ). Readable external tables allow onlySELECT
operations.
Writable external tables for data unloading. Writable external tables support:
- Selecting data from database tables to insert into the writable external table
- Sending data to an application as a stream of data. For example, unload data from Greenplum Database and send it to an application that connects to another database or ETL tool to load the data elsewhere
External tables can be file-based or web-based. External tables using the file://
protocol are read-only tables.
- Regular (file-based) external tables access static flat files. Regular external tables are rescannable: the data is static while the query runs.
- Web (web-based) external tables access dynamic data sources, either on a web server with the
http://
protocol or by running OS commands or scripts. External web tables are not rescannable: the data can change while the query runs.
Greenplum Database backup and restore operations back up and restore only external and external web table definitions, not the data source data.
- file:// Protocol
Thefile://
protocol is used in a URI that specifies the location of an operating system file.
Thegpfdist://
protocol is used in a URI to reference a runninggpfdist
instance.- gpfdists:// Protocol
Thegpfdists://
protocol is a secure version of thegpfdist:// protocol
.
You can use the Greenplum Platform Extension Framework (PXF)pxf://
protocol to access data residing in object store systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.- s3:// Protocol
Thes3
protocol is used in a URL that specifies the location of an Amazon S3 bucket and a prefix to use for reading or writing files in the bucket.
A custom protocol allows you to connect Greenplum Database to a data source that cannot be accessed with thefile://
,gpfdist://
, orpxf://
protocols.- Handling Errors in External Table Data
By default, if external table data contains an error, the command fails and no data loads into the target database table.
External web tables allow Greenplum Database to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, the data is not rescannable.