13. Berkeley DB Replication - Transactional guarantees - 《Oracle Berkeley DB Programmer's Reference Guide （Version 18.1.32）》

If a database environment does not require the log be flushed to stable storage on transaction commit (using the flag to increase performance at the cost of sacrificing transactional durability), Berkeley DB recovery will only be able to restore the system to the state of the last commit found on stable storage. In this case, information may have been lost (for example, the changes made by some committed transactions may not appear in the databases after recovery).

Further, if there is database or log file loss or corruption (for example, if a disk drive fails), then catastrophic recovery is necessary, and Berkeley DB recovery will only be able to restore the system to the state of the last archived log file. In this case, information may also have been lost.

Replicating the database environment extends this model, by adding a new component to “stable storage”: the client’s replicated information. If a database environment is replicated, there is no lost information in the case of database or log file loss, because the replicated system can be configured to contain a complete set of databases and log records up to the point of failure. A database environment that loses a disk drive can have the drive replaced, and it can then rejoin the replication group.

Because of this new component of stable storage, specifying DB_TXN_NOSYNC in a replicated environment no longer sacrifices durability, as long as one or more clients have acknowledged receipt of the messages sent by the master. Since network connections are often faster than local synchronous disk writes, replication becomes a way for applications to significantly improve their performance as well as their reliability.

Applications using Replication Manager are free to use at the master and/or clients as they see fit. The behavior of the send function that Replication Manager provides on the application’s behalf is determined by an “acknowledgement policy”, which is configured by the DB_ENV->repmgr_set_ack_policy() method. Clients always send acknowledgements for messages (unless the acknowledgement policy in effect indicates that the master doesn’t care about them). For a DB_REP_PERMANENT message, the master blocks the sending thread until either it receives the proper number of acknowledgements, or the expires. In the case of timeout, Replication Manager returns an error code from the send function, causing Berkeley DB to flush the transaction log before returning to the application, as previously described. The default acknowledgement policy is DB_REPMGR_ACKS_QUORUM, which ensures that the effect of a permanent record remains durable following an election.

The rest of this section discusses transactional guarantee considerations in Base API applications.

The return status from the application’s send function must be set by the application to ensure the transactional guarantees the application wants to provide. Whenever the send function returns failure, the local database environment’s log is flushed as necessary to ensure that any information critical to database integrity is not lost. Because this flush is an expensive operation in terms of database performance, applications should avoid returning an error from the send function, if at all possible.

The only interesting message type for replication transactional guarantees is when the application’s send function was called with the flag specified. There is no reason for the send function to ever return failure unless the DB_REP_PERMANENT flag was specified — messages without the flag do not make visible changes to databases, and the send function can return success to Berkeley DB as soon as the message has been sent to the client(s) or even just copied to local application memory in preparation for being sent.

An application relying on a client’s ability to become a master and guarantee that no data has been lost will need to write the send function to return an error whenever it cannot guarantee the site that will win the next election has the record. Applications not requiring this level of transactional guarantees need not have the send function return failure (unless the master’s database environment has been configured with DB_TXN_NOSYNC), as any information critical to database integrity has already been flushed to the local log before send was called.

To sum up, the only reason for the send function to return failure is when the master database environment has been configured to not synchronously flush the log on transaction commit (that is, was configured on the master), the DB_REP_PERMANENT flag is specified for the message, and the send function was unable to determine that some number of clients have received the current message (and all messages preceding the current message). How many clients need to receive the message before the send function can return success is an application choice (and may not depend as much on a specific number of clients reporting success as one or more geographically distributed clients).

If, however, the application does require on-disk durability on the master, the master should be configured to synchronously flush the log on commit. If clients are not configured to synchronously flush the log, that is, if a client is running with configured, then it is up to the application to reconfigure that client appropriately when it becomes a master. That is, the application must explicitly call DB_ENV->set_flags() to disable asynchronous log flushing as part of re-configuring the client as the new master.

Of course, it is important to ensure that the replicated master and client environments are truly independent of each other. For example, it does not help matters that a client has acknowledged receipt of a message if both master and clients are on the same power supply, as the failure of the power supply will still potentially lose information.

Configuring a Base API application to achieve the proper mix of performance and transactional guarantees can be complex. In brief, there are a few controls an application can set to configure the guarantees it makes: specification of for the master environment, specification of DB_TXN_NOSYNC for the client environment, the priorities of different sites participating in an election, and the behavior of the application’s send function.

First, it is rarely useful to write and synchronously flush the log when a transaction commits on a replication client. It may be useful where systems share resources and multiple systems commonly fail at the same time. By default, all Berkeley DB database environments, whether master or client, synchronously flush the log on transaction commit or prepare. Generally, replication masters and clients turn log flush off for transaction commit using the flag.

Consider two systems connected by a network interface. One acts as the master, the other as a read-only client. The client takes over as master if the master crashes and the master rejoins the replication group after such a failure. Both master and client are configured to not synchronously flush the log on transaction commit (that is, DB_TXN_NOSYNC was configured on both systems). The application’s send function never returns failure to the Berkeley DB library, simply forwarding messages to the client (perhaps over a broadcast mechanism), and always returning success. On the client, any returns from the client’s DB_ENV->rep_process_message() method are ignored, as well. This system configuration has excellent performance, but may lose data in some failure modes.

If both the master and the client crash at once, it is possible to lose committed transactions, that is, transactional durability is not being maintained. Reliability can be increased by providing separate power supplies for the systems and placing them in separate physical locations.

Use a reliable network protocol (for example, TCP/IP instead of UDP).

Further, systems may want to guarantee message delivery to the client(s) (for example, to prevent a network connection from simply discarding messages). Some systems may want to ensure clients never return out-of-date information, that is, once a transaction commit returns success on the master, no client will return old information to a read-only query. Some of the following changes to a Base API application may be used to address these issues:

Write the application’s send function to not return to Berkeley DB until one or more clients have acknowledged receipt of the message. The number of clients chosen will be dependent on the application: you will want to consider likely network partitions (ensure that a client at each physical site receives the message) and geographical diversity (ensure that a client on each coast receives the message).

There is one final pair of failure scenarios to consider. First, it is not possible to abort transactions after the application’s send function has been called, as the master may have already written the commit log records to disk, and so abort is no longer an option. Second, a related problem is that even though the master will attempt to flush the local log if the send function returns failure, that flush may fail (for example, when the local disk is full). Again, the transaction cannot be aborted as one or more clients may have committed the transaction even if send returns failure. Rare applications may not be able to tolerate these unlikely failure modes. In that case the application may want to: