4.9. Presto Verifier
During each Presto release, Verifier is run to ensure that there is no correctness regression.
In a MySQL database, create the following table and load it with the queries you would like to run:
Next, create a file:
Verifier Procedures
The following steps summarize the workflow of Verifier.
- Importing Source Queries
- Reads the list of source queries (query pairs with configuration) from the MySQL table.
- Query Pre-processing and Filtering
- Applies overrides to the catalog, schema, username, and password of each query.
- Filters queries according to whitelist and blacklist. Whitelist is applied before blacklist.
- Filters out queries with invalid syntax.
- Filters out queries not supported for validation. Select, Insert, and CreateTableAsSelect are supported.
- Query Rewriting
- Rewrites queries before execution to ensure that production data is not modified.
- Rewrites Select queries to CreateTableAsSelect.
- Column names are determined by running the Select query with LIMIT 0.
- Artificial names are used for unnamed columns.
- Rewrites Insert and CreateTableAsSelect queries to have their table names replaced.
- Constructs a setup query to create the table necessary for an Insert query.
- Query Execution
- For each source query, executes the following queries in order.
- Control setup queries
- Control query
- Test setup queries
- Test query
- Control and test teardown queries
- Queries are subject to timeouts and retries.
- Cluster connection failures and transient Presto failures are retried.
- Query retries may conceal reliability issues, so Verifier records all Presto query failures that occurred, including those that were retried.
- Certain query failures are automatically submitted for re-validation, such as a partition or table being dropped while the query is running.
- See the discussion of automatically resolved query failures later in this section.
- Results Comparison
- For Select, Insert, and CreateTableAsSelect queries, results are written into temporary tables.
- Constructs and runs the checksum queries for both control and test.
- Verifies that the table schema and row count are the same for the control and the test result tables.
- Verifies that the checksums match for each column. See Column Checksums for special handling of different column types.
- See Determinism for handling of non-deterministic queries.
- Emitting Results
- Verification results can be exported as JSON or human-readable text.
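The Select rewrite described under Query Rewriting can be illustrated with a minimal Python sketch. It is not Verifier's implementation: the function names, the `_col` prefix for artificial column names, the subquery form of the LIMIT 0 probe, and the target table name are all assumptions made for illustration.

```python
def column_probe_query(select_sql):
    """Build a query that returns no rows but exposes the column names.

    Wrapping the query in a subquery is an assumption of this sketch.
    """
    query = select_sql.strip().rstrip(";")
    return f"SELECT * FROM ({query}) LIMIT 0"


def fill_column_names(names, prefix="_col"):
    """Replace missing (None or empty) column names with artificial ones.

    The naming scheme (prefix plus position) is hypothetical.
    """
    return [name if name else f"{prefix}{i}" for i, name in enumerate(names)]


def rewrite_select_to_ctas(select_sql, column_names, target_table):
    """Wrap a Select query in a CreateTableAsSelect writing to target_table."""
    columns = ", ".join(f'"{c}"' for c in fill_column_names(column_names))
    query = select_sql.strip().rstrip(";")
    return f"CREATE TABLE {target_table} ({columns}) AS {query}"
```

For example, calling `rewrite_select_to_ctas` with the column names `["a", None]` names the second output column `_col1`, so that the result can be stored in a table even though the original Select left that column unnamed.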
Column Checksums
For each column in the control/test query, one or more columns are generated in the checksum queries.
- Floating Point Columns
- For DOUBLE and REAL columns, 4 columns are generated for verification:
- Sum of the finite values of the column
- NAN count of the column
- Positive infinity count of the column
- Negative infinity count of the column
- Checks if the NAN count and the positive and negative infinity counts match.
- Checks the nullity of control sum and test sum.
- Checks the relative error between control sum and test sum.
- Array Columns
- 2 columns are generated for verification:
- Sum of the cardinality
- Array checksum
- For an array column arr of type array(E):
- If E is not orderable, the array checksum is checksum(arr).
- If E is orderable, the array checksum is coalesce(checksum(try(array_sort(arr))), checksum(arr)).
- Map Columns
- 4 columns are generated for verification:
- Sum of the cardinality
- Checksum of the map
- Array checksum of the key set
- Array checksum of the value set
- Row Columns
- Checksums row fields recursively according to the type of the fields.
- For all other column types, generates a simple checksum using the checksum() function.
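The floating point checks listed above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not Verifier's code: the `FloatColumnChecksum` type, the `relative_error_margin` parameter and its default, and the exact relative error formula (absolute difference divided by the larger magnitude) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FloatColumnChecksum:
    """The four checksum columns generated for a DOUBLE or REAL column."""
    finite_sum: Optional[float]  # None models a SQL NULL sum
    nan_count: int
    pos_inf_count: int
    neg_inf_count: int


def float_columns_match(control, test, relative_error_margin=1e-4):
    """Compare control and test checksums for one floating point column."""
    # NAN and infinity counts must match exactly.
    control_counts = (control.nan_count, control.pos_inf_count, control.neg_inf_count)
    test_counts = (test.nan_count, test.pos_inf_count, test.neg_inf_count)
    if control_counts != test_counts:
        return False
    # Nullity check: either both sums are NULL or neither is.
    if control.finite_sum is None or test.finite_sum is None:
        return control.finite_sum is None and test.finite_sum is None
    # Relative error check (formula assumed for this sketch).
    scale = max(abs(control.finite_sum), abs(test.finite_sum))
    if scale == 0:
        return True  # both sums are exactly zero
    return abs(control.finite_sum - test.finite_sum) / scale <= relative_error_margin
```

Comparing sums with a tolerance rather than for equality reflects that floating point addition is not associative, so two clusters can legitimately produce slightly different sums for the same data.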
Determinism
A result mismatch, either a row count mismatch or a column mismatch, can be caused by non-deterministic query features. To avoid false alerts, we perform determinism analysis for the control query. If a query is found to be non-deterministic, we skip the verification, as it does not provide insights.
- Non-deterministic catalogs can be specified with determinism.non-deterministic-catalog. If a query references any table from those catalogs, the query is considered non-deterministic.
- Runs the control query again and compares the results with the initial control query run.
- If a query has a LIMIT n clause but no ORDER BY clause at the top level:
- Runs a query to count the number of rows produced by the control query without the LIMIT clause.
- If the resulting row count is greater than n, treats the control query as non-deterministic.
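Two of the determinism rules above reduce to simple predicates, sketched here in Python. The function and parameter names are hypothetical, and the sketch assumes that the referenced catalogs, the top-level LIMIT and ORDER BY information, and the row count without LIMIT have already been obtained elsewhere.

```python
def references_non_deterministic_catalog(referenced_catalogs, non_deterministic_catalogs):
    """True if the query touches any catalog configured as non-deterministic."""
    return not set(referenced_catalogs).isdisjoint(non_deterministic_catalogs)


def limit_without_order_by_is_non_deterministic(limit_n, has_top_level_order_by,
                                                row_count_without_limit):
    """Apply the LIMIT n rule.

    limit_n is None when the query has no top-level LIMIT clause.
    row_count_without_limit is the row count of the control query
    with its LIMIT clause removed.
    """
    if limit_n is None or has_top_level_order_by:
        return False  # the rule does not apply
    # More rows than the limit means an arbitrary subset is returned.
    return row_count_without_limit > limit_n
```

The second check captures why LIMIT without ORDER BY is risky: when more than n rows qualify, the engine is free to return any n of them, so two runs can differ without either being wrong.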
Differences in configuration, including cluster size, can cause a query to succeed on the control cluster but fail on the test cluster. A checksum query can also fail, which may be due to a limitation of Presto or Presto Verifier. Thus, we allow Verifier to automatically resolve certain query failures.
- EXCEEDED_GLOBAL_MEMORY_LIMIT: Resolves if the control query uses more memory than the test query.
- EXCEEDED_TIME_LIMIT: Resolves unconditionally.
- : Resolves if the test cluster does not have enough workers to make sure the number of partitions assigned to each worker stays within the limit.
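The first two resolution rules above can be sketched as a small Python function. This is an illustration only: the function name and the use of peak memory in bytes as the memory measure are assumptions, and the third rule is omitted because it depends on cluster and partition details not covered here.

```python
def is_auto_resolved(error_code, control_memory_bytes=None, test_memory_bytes=None):
    """Decide whether a test query failure is resolved automatically.

    control_memory_bytes and test_memory_bytes are hypothetical inputs
    describing the memory used by the control and the test query.
    """
    if error_code == "EXCEEDED_TIME_LIMIT":
        return True  # resolves unconditionally
    if error_code == "EXCEEDED_GLOBAL_MEMORY_LIMIT":
        if control_memory_bytes is None or test_memory_bytes is None:
            return False  # cannot compare, so leave the failure unresolved
        return control_memory_bytes > test_memory_bytes
    return False  # other failures are not handled by this sketch
```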
Extending Verifier
Verifier can be extended for further behavioral changes in addition to configuration properties.
shows the components that can be extended. Implement the abstract class and create a command line wrapper similar to PrestoVerifier.
The following configurations control the behavior of query execution on the control cluster. Counterparts are also available for test clusters, with the prefix control replaced by test.