
Distributed transactions

Basic theory 

CAP

The CAP theorem states that any distributed data store can provide only two of the following three guarantees:

Consistency

Every read receives the most recent write or an error.

Availability

Every request receives a (non-error) response, without the guarantee that it contains the most recent write.

Partition tolerance

The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

Typical architecture of distributed systems

When a network partition failure happens, the system must choose one of the following (see the sketch below):

CP: cancel the operation, thus decreasing availability but ensuring consistency.
AP: proceed with the operation, thus providing availability but risking inconsistency.
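
To make the trade-off concrete, here is a minimal sketch (hypothetical names such as ReplicaRead and Mode, not any real database API) of how a single replica might answer a read during a partition, depending on whether the system favours CP or AP.

```java
// Hypothetical sketch of the CP-vs-AP choice a replica makes when it
// cannot reach a quorum during a network partition.
enum Mode { CP, AP }

class ReplicaRead {
    private final Mode mode;
    private final String locallyCachedValue; // may be stale during a partition

    ReplicaRead(Mode mode, String locallyCachedValue) {
        this.mode = mode;
        this.locallyCachedValue = locallyCachedValue;
    }

    String read(boolean quorumReachable) {
        if (quorumReachable) {
            return locallyCachedValue; // normal path: value is up to date
        }
        if (mode == Mode.CP) {
            // Consistency over availability: return an error rather than risk a stale read.
            throw new IllegalStateException("partition: cannot guarantee latest write");
        }
        // Availability over consistency: answer with possibly stale data.
        return locallyCachedValue;
    }
}
```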

BASE

Basically-available, soft-state, eventual consistency.
BASE is the practical application of the CAP theorem: given that partitions and replicas exist, the system design gives up strong consistency in exchange for basic availability and eventual consistency. This is the choice made by most distributed systems.

Basically available

Reading and writing operations are available as much as possible (using all nodes of a database cluster), but might not be consistent (the write might not persist after conflicts are reconciled, and the read might not get the latest write).

Soft-state

Without consistency guarantees, after some amount of time we only have some probability of knowing the state, since it might not yet have converged.

Eventually Consistent

If we execute some writes and the system then runs long enough without further updates, the replicas converge: any further reads of that data item will return the same (latest) value.

Distributed transaction model

Strong consistency model

DTP model



The X/Open Distributed Transaction Processing (DTP) model includes a number of interrelated components that control how distributed transactions are processed.

Application program (AP):
The application program (AP) defines transaction boundaries and specifies the application-specific actions that make up the transaction.

Resource managers (RM):
An RM is a resource such as a database (for example MySQL or Oracle, accessed through its driver), a file system, or a print server; it provides the interface through which the transaction accesses the resource.

Transaction manager (TM):
The TM is the coordinator of distributed transactions. It assigns transaction IDs to distributed transactions, monitors their execution, and handles transaction completion and failure recovery.
A distributed transaction managed by the TM can span multiple RMs. The TM also drives the 2PC protocol to coordinate the commit/rollback decision of the distributed transaction.
 

X/Open XA



The goal of XA is to guarantee atomicity in "global transactions" that are executed across heterogeneous components. A transaction is a unit of work such as transferring money from one person to another. Distributed transactions update multiple data stores (such as databases, application servers, message queues, transactional caches, etc.). To guarantee integrity, XA uses a two-phase commit (2PC) to ensure that all of a transaction's changes either take effect (commit) or do not (rollback), i.e., atomically.
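
As a rough illustration, the sketch below drives two resource branches through the standard javax.transaction.xa.XAResource interface. It assumes the XAResource and Xid instances are obtained elsewhere (for example from an XADataSource), omits the actual SQL work, and skips most error handling; it is not a complete transaction manager.

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

class XaTwoPhaseSketch {
    // Transfer money across two XA-capable resources, one branch per resource.
    void transfer(XAResource from, Xid fromXid, XAResource to, Xid toXid) throws XAException {
        // Associate each branch with its branch transaction id and do the work.
        from.start(fromXid, XAResource.TMNOFLAGS);
        // ... debit on the first resource through its normal API (SQL, JMS, ...)
        from.end(fromXid, XAResource.TMSUCCESS);

        to.start(toXid, XAResource.TMNOFLAGS);
        // ... credit on the second resource
        to.end(toXid, XAResource.TMSUCCESS);

        try {
            // Phase 1: each resource votes by returning XA_OK / XA_RDONLY or throwing XAException.
            int vote1 = from.prepare(fromXid);
            int vote2 = to.prepare(toXid);
            // Phase 2: commit the branches that did work (XA_RDONLY branches need no phase 2).
            if (vote1 == XAResource.XA_OK) from.commit(fromXid, false); // false = not one-phase
            if (vote2 == XAResource.XA_OK) to.commit(toXid, false);
        } catch (XAException noVote) {
            // A "no" vote or failure during prepare: roll back both branches.
            from.rollback(fromXid);
            to.rollback(toXid);
            throw noVote;
        }
    }
}
```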

2PC(Two-phase commit protocol)



Phase 1:
The TM (Transaction Manager) notifies each RM (Resource Manager) to prepare to commit its transaction branch. If an RM determines that its work can be committed, it persists that work and gives the TM a positive reply; otherwise it gives the TM a negative reply.

Phase 2:
The TM decides whether to commit or roll back the transaction based on the results of each RM's prepare in phase 1. If all RMs prepared successfully, the TM notifies all RMs to commit; if any RM failed to prepare, the TM notifies all RMs to roll back their own transaction branches.
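
A minimal sketch of the TM-side decision logic, assuming a hypothetical ResourceManager interface (prepare/commit/rollback) rather than any real driver API: commit only if every branch votes yes in phase 1, otherwise roll everything back.

```java
import java.util.List;

// Hypothetical branch interface; in XA this role is played by XAResource.
interface ResourceManager {
    boolean prepare(String xid);  // phase 1: persist the branch's work and vote yes/no
    void commit(String xid);      // phase 2a
    void rollback(String xid);    // phase 2b
}

class TransactionManager {
    boolean twoPhaseCommit(String xid, List<ResourceManager> branches) {
        // Phase 1: ask every RM to prepare its branch.
        for (ResourceManager rm : branches) {
            if (!rm.prepare(xid)) {
                // One negative vote is enough to abort the whole global transaction.
                branches.forEach(b -> b.rollback(xid));
                return false;
            }
        }
        // Phase 2: every branch is prepared, so the decision is commit.
        branches.forEach(b -> b.commit(xid));
        return true;
    }
}
```

In practice the TM also records its decision in a durable log, so that phase 2 can be completed for prepared branches even after the TM crashes and restarts.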

3PC(Three-phase commit protocol)



Three-Phase Commit is an extension of the Two-Phase Commit (2PC) protocol, with an additional phase to improve fault tolerance. It is a non-blocking protocol with three main steps:

Can-Commit: The coordinator sends a Can-Commit message, asking each participant if they're ready to commit the transaction. Participants reply with either Yes or No.

Pre-Commit: If all participants send a Yes, the coordinator broadcasts a Pre-Commit message. Participants acknowledge the message and prepare to commit the transaction.

Do-Commit: Once the coordinator receives all acknowledgements, it sends a Do-Commit message, instructing participants to finalize the transaction.

Three-Phase Commit helps avoid blocking in case of a coordinator failure by allowing participants to reach a decision independently, reducing the chances of a global deadlock.
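
A sketch of the coordinator's three rounds, again with a hypothetical participant interface (canCommit/preCommit/doCommit). Timeouts, which are what actually let participants decide independently when the coordinator fails, are omitted for brevity.

```java
import java.util.List;

// Hypothetical 3PC participant interface, not a standard API.
interface ThreePcParticipant {
    boolean canCommit(String txId);   // round 1: "could you commit this?"
    void preCommit(String txId);      // round 2: prepare and acknowledge
    void doCommit(String txId);       // round 3: make the changes durable
    void abort(String txId);
}

class ThreePcCoordinator {
    boolean run(String txId, List<ThreePcParticipant> participants) {
        // Can-Commit: collect Yes/No votes.
        for (ThreePcParticipant p : participants) {
            if (!p.canCommit(txId)) {
                participants.forEach(x -> x.abort(txId));
                return false;
            }
        }
        // Pre-Commit: everyone voted Yes; participants now know the decision
        // and can commit on their own (after a timeout) if the coordinator fails.
        participants.forEach(p -> p.preCommit(txId));
        // Do-Commit: finalize the transaction.
        participants.forEach(p -> p.doCommit(txId));
        return true;
    }
}
```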

Eventual consistency model

TCC(Try-Confirm-Cancel)



Try-Confirm-Cancel (TCC) mode is the most classic distributed transaction solution; it executes a distributed transaction in two phases. In the try phase, resources are reserved for each branch transaction. If every branch succeeds in reserving its resources, the global transaction is committed in the confirm phase; if any branch fails to reserve resources, the global transaction is rolled back in the cancel phase.
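
A minimal sketch of the pattern, assuming a hypothetical TccBranch interface; real TCC frameworks add transaction logs, retries, and idempotency handling on top of this shape.

```java
import java.util.List;

// Each branch exposes business-level try / confirm / cancel operations.
interface TccBranch {
    boolean tryReserve(String txId);  // phase 1: reserve resources (e.g. freeze funds)
    void confirm(String txId);        // phase 2a: make the reservation final
    void cancel(String txId);         // phase 2b: release the reservation
}

class TccCoordinator {
    boolean execute(String txId, List<TccBranch> branches) {
        // Try phase: reserve on every branch.
        for (TccBranch branch : branches) {
            if (!branch.tryReserve(txId)) {
                // Any failure: cancel all branches (cancel must be idempotent,
                // since some branches may never have reserved anything).
                branches.forEach(b -> b.cancel(txId));
                return false;
            }
        }
        // Confirm phase: all reservations succeeded, commit the global transaction.
        branches.forEach(b -> b.confirm(txId));
        return true;
    }
}
```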

TCC vs XA

XA is a resource-level distributed transaction with strong consistency; resource locks are held throughout the two-phase commit. TCC is a business-level distributed transaction with eventual consistency; it does not hold resource locks for the whole duration.

Reliable message eventual consistency

Local message table

The local message table is a design that is coupled to the business. The message producer builds an additional transaction message table and records the sending status of each message in it. The message consumer processes the message and completes its own business logic. In addition, an asynchronous job periodically scans outstanding messages to ensure eventual consistency.
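
A sketch of the pattern over plain JDBC; the table name outbox_message, its columns and status values, and the MessageBroker interface are illustrative assumptions, not a standard schema or API.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class OrderService {

    // Step 1: write the business row and the pending message in ONE local
    // transaction, so they either both persist or both roll back.
    void placeOrder(Connection conn, String orderId, String payload) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                     "INSERT INTO orders(id, payload) VALUES (?, ?)");
             PreparedStatement outbox = conn.prepareStatement(
                     "INSERT INTO outbox_message(order_id, body, status) VALUES (?, ?, 'PENDING')")) {
            order.setString(1, orderId);
            order.setString(2, payload);
            order.executeUpdate();
            outbox.setString(1, orderId);
            outbox.setString(2, payload);
            outbox.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // Step 2: a periodic job scans PENDING rows, publishes them to the broker,
    // and marks them SENT; it retries until consumers eventually catch up.
    void relayPending(Connection conn, MessageBroker broker) throws Exception {
        try (PreparedStatement scan = conn.prepareStatement(
                     "SELECT id, body FROM outbox_message WHERE status = 'PENDING'");
             ResultSet rs = scan.executeQuery()) {
            while (rs.next()) {
                broker.publish(rs.getString("body"));
                try (PreparedStatement done = conn.prepareStatement(
                        "UPDATE outbox_message SET status = 'SENT' WHERE id = ?")) {
                    done.setLong(1, rs.getLong("id"));
                    done.executeUpdate();
                }
            }
        }
    }

    interface MessageBroker { void publish(String body); } // placeholder broker API
}
```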

Transactional messages(Rocket MQ)



A normal message and an order transaction cannot be made consistent, because a normal message cannot be committed, rolled back, or coordinated the way a stand-alone database transaction can.

The distributed transactional message feature of Message Queue for Apache RocketMQ supports a two-phase commit built on top of normal messages: the message's two-phase commit is bound to a local transaction so that the global commit result is consistent.

The solution of transactional messages in Message Queue for Apache RocketMQ provides the advantages of high performance, scalability, and simple business development.
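
A sketch of the producer side, based on the Apache RocketMQ 4.x Java client as commonly documented (TransactionMQProducer, TransactionListener); the name-server address, producer group, and topic are placeholders, and exact class names may differ between client versions.

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

// A half message is sent first; the local transaction runs in executeLocalTransaction;
// the broker later calls checkLocalTransaction if the earlier result was UNKNOW.
public class OrderTxProducer {
    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order-tx-group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // assumed local name server
        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // run the local DB transaction here (e.g. insert the order row)
                    return LocalTransactionState.COMMIT_MESSAGE;   // deliver the half message
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // discard the half message
                }
            }
            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // broker-initiated back-check: look up the local transaction's outcome
                return LocalTransactionState.COMMIT_MESSAGE;
            }
        });
        producer.start();
        producer.sendMessageInTransaction(
                new Message("order-topic", "order created".getBytes()), null);
        producer.shutdown();
    }
}
```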

Data consistency algorithm

Paxos



Paxos is an algorithm that is used to achieve consensus among a distributed set of computers that communicate via an asynchronous network. One or more clients propose a value to Paxos and we have a consensus when a majority of systems running Paxos agree on one of the proposed values. Paxos is widely used and is legendary in computer science since it is the first consensus algorithm that has been rigorously proved to be correct.

Basic Paxos

In most deployments of Paxos, each participating process acts in three roles: Proposer, Acceptor, and Learner. This reduces the message complexity significantly, without sacrificing correctness.

Paxos is a two-phase protocol, meaning that the proposers interact with the acceptors twice. At a high level:

Phase 1:
A proposer asks all the working acceptors whether any of them has already accepted a proposal. If the answer is no, the proposer proposes a value.

Phase 2:
If a majority of acceptors agree to this value then that is our consensus.
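
The acceptor side of these two phases can be sketched as follows (hypothetical message and return types); proposer retry logic and learner notification are omitted.

```java
// Acceptor-side sketch of Basic Paxos.
// prepare(n): promise not to accept proposals numbered below n, and report
//             the highest-numbered proposal (if any) already accepted.
// accept(n, v): accept the proposal unless a higher-numbered prepare was promised.
class PaxosAcceptor {
    private long promisedN = -1;   // highest proposal number promised
    private long acceptedN = -1;   // highest proposal number accepted
    private Object acceptedValue;  // value of that accepted proposal

    // Phase 1b: respond to a proposer's prepare(n).
    synchronized Promise prepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return new Promise(true, acceptedN, acceptedValue);
        }
        return new Promise(false, acceptedN, acceptedValue);
    }

    // Phase 2b: respond to a proposer's accept(n, value).
    synchronized boolean accept(long n, Object value) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;          // accepted; learners can be notified
        }
        return false;             // rejected: a higher-numbered prepare was promised
    }

    record Promise(boolean ok, long prevAcceptedN, Object prevAcceptedValue) {}
}
```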

Multi-Paxos



A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of the Basic Paxos protocol, a significant amount of overhead would result.

If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.

To achieve this, a round number I, which is incremented in each round by the same leader, is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.

ZAB(ZooKeeper Atomic Broadcast)



The ZAB protocol ensures that Zookeeper replication is done in order, and it is also responsible for electing the leader node and restoring any failed nodes. In a Zookeeper ensemble, the leader node is the heart of everything: every cluster has one leader node and the rest of the nodes are followers. All state-changing (write) requests go through the leader, which is responsible for replicating them across all its followers (and itself). Read requests, by contrast, can be served directly by whichever node (leader or follower) the client is connected to.

RAFT



Raft is based on Lamport's Multi-Paxos idea, with some simplifications and restrictions: for example, the log must be continuous, and a node can only be in one of three states, leader, follower, or candidate. The algorithm is relatively easy to understand and implement, and Raft is now the consensus algorithm of choice for distributed system development.
Raft essentially achieves consensus on a series of values, and consistency of each node's log, through a leader-based approach.
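
A skeleton of the three states and the election-driven transitions between them (illustrative only; log replication, voting rules, and persistence are omitted).

```java
// Minimal, illustrative state machine for a Raft server's role transitions.
class RaftNode {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;
    private long currentTerm = 0;
    private int votesGranted = 0;

    // Election timeout fired without hearing from a leader: become a candidate.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;       // start a new term
        votesGranted = 1;    // vote for self, then request votes from peers
    }

    // A peer granted its vote for the current term.
    void onVoteGranted(int clusterSize) {
        if (role != Role.CANDIDATE) return;
        votesGranted++;
        if (votesGranted > clusterSize / 2) {
            role = Role.LEADER;  // majority reached: start sending heartbeats / log entries
        }
    }

    // AppendEntries (heartbeat) received from a leader with an equal or newer term.
    void onAppendEntries(long leaderTerm) {
        if (leaderTerm >= currentTerm) {
            currentTerm = leaderTerm;
            role = Role.FOLLOWER;
        }
    }
}
```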

GOSSIP



A gossip protocol or epidemic protocol is a procedure or process of computer peer-to-peer communication that is based on the way epidemics spread. Some distributed systems use peer-to-peer gossip to ensure that data is disseminated to all members of a group. Some ad-hoc networks have no central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors.
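
A minimal sketch of one push-pull gossip round, using a hypothetical version-number map as the state being disseminated: each node periodically picks a random peer and both sides keep the newer entry for every key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class GossipNode {
    final Map<String, Long> versions = new HashMap<>(); // key -> version ("newer wins")

    // One gossip round: exchange state with a single randomly chosen peer.
    void gossipOnce(List<GossipNode> peers) {
        if (peers.isEmpty()) return;
        GossipNode peer = peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
        merge(peer.versions);       // pull the peer's view
        peer.merge(this.versions);  // push our view (push-pull gossip)
    }

    void merge(Map<String, Long> remote) {
        remote.forEach((key, version) ->
                versions.merge(key, version, Math::max)); // keep the newest version seen
    }
}
```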


Table of contents [ hide ] Virtual memory Separation of user logical memory from physical memory. To run an extremely large process. Logical address space can be much larger than physical address space. To increase CPU/resource utilization. A higher degree of multiprogramming degree. To simplify programming tasks. A free programmer from memory limitation. To run programs faster. Less I/O would be needed to load or swap. Process & virtual memory Demand paging: only bring in the page containing the first instruction. Copy-on-write: the parent and the child process share the same frames initially, and frame-copy. when a page is written. Allow both the parent and the child process to share the same frames in memory. If either process modifies a frame, only then a frame is copied. COW allows efficient process creation(e.g., fork()). Free frames are allocated from a pool of zeroed-out frames(for security reasons). The content of the frame is erased to 0 Memory-Mapped File: map a fi