
Distributed transactions

Basic theory 

CAP

The CAP theorem states that any distributed data store can provide only two of the following three guarantees:

Consistency

Every read receives the most recent write or an error.

Availability

Every request receives a (non-error) response, without the guarantee that it contains the most recent write.

Partition tolerance

The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

Typical architecture of distributed systems

When a network partition failure happens, the system must choose one of the following (see the sketch below):

CP: cancel the operation, thus decreasing availability but ensuring consistency.
AP: proceed with the operation, thus providing availability but risking inconsistency.
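
To make the trade-off concrete, here is a minimal sketch (hypothetical names such as ReplicaRead and Mode, not any real database API) of how a single replica might answer a read during a partition, depending on whether the system favours CP or AP.

```java
// Hypothetical sketch of the CP-vs-AP choice a replica makes when it
// cannot reach a quorum during a network partition.
enum Mode { CP, AP }

class ReplicaRead {
    private final Mode mode;
    private final String locallyCachedValue; // may be stale during a partition

    ReplicaRead(Mode mode, String locallyCachedValue) {
        this.mode = mode;
        this.locallyCachedValue = locallyCachedValue;
    }

    String read(boolean quorumReachable) {
        if (quorumReachable) {
            return locallyCachedValue; // normal path: value is up to date
        }
        if (mode == Mode.CP) {
            // Consistency over availability: return an error rather than risk a stale read.
            throw new IllegalStateException("partition: cannot guarantee latest write");
        }
        // Availability over consistency: answer with possibly stale data.
        return locallyCachedValue;
    }
}
```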

BASE

Basically-available, soft-state, eventual consistency.
BASE is the practical application of the CAP theorem: given that partitions and replicas exist, the system design gives up strong consistency in exchange for basic availability and eventual consistency. This is the choice made by most distributed systems.

Basically available

Reading and writing operations are available as much as possible (using all nodes of a database cluster), but might not be consistent (the write might not persist after conflicts are reconciled, and the read might not get the latest write).

Soft-state

Without consistency guarantees, after some amount of time we only have some probability of knowing the state, since it might not yet have converged.

Eventually Consistent

If we execute some writes and the system then runs long enough without further updates, the replicas converge: any further reads of that data item will return the same (latest) value.

Distributed transaction model

Strong consistency model

DTP model



The X/Open Distributed Transaction Processing (DTP) model includes a number of interrelated components that control how distributed transactions are processed.

Application program (AP):
The application program (AP) defines transaction boundaries and specifies the application-specific actions that make up the transaction.

Resource managers (RM):
An RM is a resource such as a database (for example MySQL or Oracle, accessed through its driver), a file system, or a print server; it provides the interface through which the transaction accesses the resource.

Transaction manager (TM):
The TM is the coordinator of distributed transactions. It assigns transaction IDs to distributed transactions, monitors their execution, and handles transaction completion and failure recovery.
A distributed transaction managed by the TM can span multiple RMs. The TM also drives the 2PC protocol to coordinate the commit/rollback decision of the distributed transaction.
 

X/Open XA



The goal of XA is to guarantee atomicity in "global transactions" that are executed across heterogeneous components. A transaction is a unit of work such as transferring money from one person to another. Distributed transactions update multiple data stores (such as databases, application servers, message queues, transactional caches, etc.). To guarantee integrity, XA uses a two-phase commit (2PC) to ensure that all of a transaction's changes either take effect (commit) or do not (rollback), i.e., atomically.
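
As a rough illustration, the sketch below drives two resource branches through the standard javax.transaction.xa.XAResource interface. It assumes the XAResource and Xid instances are obtained elsewhere (for example from an XADataSource), omits the actual SQL work, and skips most error handling; it is not a complete transaction manager.

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

class XaTwoPhaseSketch {
    // Transfer money across two XA-capable resources, one branch per resource.
    void transfer(XAResource from, Xid fromXid, XAResource to, Xid toXid) throws XAException {
        // Associate each branch with its branch transaction id and do the work.
        from.start(fromXid, XAResource.TMNOFLAGS);
        // ... debit on the first resource through its normal API (SQL, JMS, ...)
        from.end(fromXid, XAResource.TMSUCCESS);

        to.start(toXid, XAResource.TMNOFLAGS);
        // ... credit on the second resource
        to.end(toXid, XAResource.TMSUCCESS);

        try {
            // Phase 1: each resource votes by returning XA_OK / XA_RDONLY or throwing XAException.
            int vote1 = from.prepare(fromXid);
            int vote2 = to.prepare(toXid);
            // Phase 2: commit the branches that did work (XA_RDONLY branches need no phase 2).
            if (vote1 == XAResource.XA_OK) from.commit(fromXid, false); // false = not one-phase
            if (vote2 == XAResource.XA_OK) to.commit(toXid, false);
        } catch (XAException noVote) {
            // A "no" vote or failure during prepare: roll back both branches.
            from.rollback(fromXid);
            to.rollback(toXid);
            throw noVote;
        }
    }
}
```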

2PC(Two-phase commit protocol)



Phase 1:
The TM (Transaction Manager) notifies each RM (Resource Manager) to prepare to commit its transaction branch. If an RM determines that its work can be committed, it persists that work and gives the TM a positive reply; otherwise it gives the TM a negative reply.

Phase 2:
The TM decides whether to commit or roll back the transaction based on the results of each RM's prepare in phase 1. If all RMs prepared successfully, the TM notifies all RMs to commit; if any RM failed to prepare, the TM notifies all RMs to roll back their own transaction branches.
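
A minimal sketch of the TM-side decision logic, assuming a hypothetical ResourceManager interface (prepare/commit/rollback) rather than any real driver API: commit only if every branch votes yes in phase 1, otherwise roll everything back.

```java
import java.util.List;

// Hypothetical branch interface; in XA this role is played by XAResource.
interface ResourceManager {
    boolean prepare(String xid);  // phase 1: persist the branch's work and vote yes/no
    void commit(String xid);      // phase 2a
    void rollback(String xid);    // phase 2b
}

class TransactionManager {
    boolean twoPhaseCommit(String xid, List<ResourceManager> branches) {
        // Phase 1: ask every RM to prepare its branch.
        for (ResourceManager rm : branches) {
            if (!rm.prepare(xid)) {
                // One negative vote is enough to abort the whole global transaction.
                branches.forEach(b -> b.rollback(xid));
                return false;
            }
        }
        // Phase 2: every branch is prepared, so the decision is commit.
        branches.forEach(b -> b.commit(xid));
        return true;
    }
}
```

In practice the TM also records its decision in a durable log, so that phase 2 can be completed for prepared branches even after the TM crashes and restarts.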

3PC(Three-phase commit protocol)



Three-Phase Commit is an extension of the Two-Phase Commit (2PC) protocol, with an additional phase to improve fault tolerance. It is a non-blocking protocol with three main steps:

Can-Commit: The coordinator sends a Can-Commit message, asking each participant if they're ready to commit the transaction. Participants reply with either Yes or No.

Pre-Commit: If all participants send a Yes, the coordinator broadcasts a Pre-Commit message. Participants acknowledge the message and prepare to commit the transaction.

Do-Commit: Once the coordinator receives all acknowledgements, it sends a Do-Commit message, instructing participants to finalize the transaction.

Three-Phase Commit helps avoid blocking in case of a coordinator failure by allowing participants to reach a decision independently, reducing the chances of a global deadlock.
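
A sketch of the coordinator's three rounds, again with a hypothetical participant interface (canCommit/preCommit/doCommit). Timeouts, which are what actually let participants decide independently when the coordinator fails, are omitted for brevity.

```java
import java.util.List;

// Hypothetical 3PC participant interface, not a standard API.
interface ThreePcParticipant {
    boolean canCommit(String txId);   // round 1: "could you commit this?"
    void preCommit(String txId);      // round 2: prepare and acknowledge
    void doCommit(String txId);       // round 3: make the changes durable
    void abort(String txId);
}

class ThreePcCoordinator {
    boolean run(String txId, List<ThreePcParticipant> participants) {
        // Can-Commit: collect Yes/No votes.
        for (ThreePcParticipant p : participants) {
            if (!p.canCommit(txId)) {
                participants.forEach(x -> x.abort(txId));
                return false;
            }
        }
        // Pre-Commit: everyone voted Yes; participants now know the decision
        // and can commit on their own (after a timeout) if the coordinator fails.
        participants.forEach(p -> p.preCommit(txId));
        // Do-Commit: finalize the transaction.
        participants.forEach(p -> p.doCommit(txId));
        return true;
    }
}
```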

Eventual consistency model

TCC(Try-Confirm-Cancel)



Try-Confirm-Cancel (TCC) mode is the most classic distributed transaction solution; it executes a distributed transaction in two phases. In the try phase, resources are reserved for each branch transaction. If every branch succeeds in reserving its resources, the global transaction is committed in the confirm phase; if any branch fails to reserve resources, the global transaction is rolled back in the cancel phase.
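
A minimal sketch of the pattern, assuming a hypothetical TccBranch interface; real TCC frameworks add transaction logs, retries, and idempotency handling on top of this shape.

```java
import java.util.List;

// Each branch exposes business-level try / confirm / cancel operations.
interface TccBranch {
    boolean tryReserve(String txId);  // phase 1: reserve resources (e.g. freeze funds)
    void confirm(String txId);        // phase 2a: make the reservation final
    void cancel(String txId);         // phase 2b: release the reservation
}

class TccCoordinator {
    boolean execute(String txId, List<TccBranch> branches) {
        // Try phase: reserve on every branch.
        for (TccBranch branch : branches) {
            if (!branch.tryReserve(txId)) {
                // Any failure: cancel all branches (cancel must be idempotent,
                // since some branches may never have reserved anything).
                branches.forEach(b -> b.cancel(txId));
                return false;
            }
        }
        // Confirm phase: all reservations succeeded, commit the global transaction.
        branches.forEach(b -> b.confirm(txId));
        return true;
    }
}
```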

TCC vs XA

XA is a resource-level distributed transaction with strong consistency; resource locks are held throughout the two-phase commit. TCC is a business-level distributed transaction with eventual consistency; it does not hold resource locks for the whole duration.

Reliable message eventual consistency

Local message table

The local message table is a design that is coupled to the business. The message producer builds an additional transaction message table and records the sending status of each message in it. The message consumer processes the message and completes its own business logic. In addition, an asynchronous job periodically scans outstanding messages to ensure eventual consistency.
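
A sketch of the pattern over plain JDBC; the table name outbox_message, its columns and status values, and the MessageBroker interface are illustrative assumptions, not a standard schema or API.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class OrderService {

    // Step 1: write the business row and the pending message in ONE local
    // transaction, so they either both persist or both roll back.
    void placeOrder(Connection conn, String orderId, String payload) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                     "INSERT INTO orders(id, payload) VALUES (?, ?)");
             PreparedStatement outbox = conn.prepareStatement(
                     "INSERT INTO outbox_message(order_id, body, status) VALUES (?, ?, 'PENDING')")) {
            order.setString(1, orderId);
            order.setString(2, payload);
            order.executeUpdate();
            outbox.setString(1, orderId);
            outbox.setString(2, payload);
            outbox.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // Step 2: a periodic job scans PENDING rows, publishes them to the broker,
    // and marks them SENT; it retries until consumers eventually catch up.
    void relayPending(Connection conn, MessageBroker broker) throws Exception {
        try (PreparedStatement scan = conn.prepareStatement(
                     "SELECT id, body FROM outbox_message WHERE status = 'PENDING'");
             ResultSet rs = scan.executeQuery()) {
            while (rs.next()) {
                broker.publish(rs.getString("body"));
                try (PreparedStatement done = conn.prepareStatement(
                        "UPDATE outbox_message SET status = 'SENT' WHERE id = ?")) {
                    done.setLong(1, rs.getLong("id"));
                    done.executeUpdate();
                }
            }
        }
    }

    interface MessageBroker { void publish(String body); } // placeholder broker API
}
```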

Transactional messages(Rocket MQ)



A normal message and an order transaction cannot be made consistent, because a normal message cannot be committed, rolled back, or coordinated the way a stand-alone database transaction can.

The distributed transactional message feature of Message Queue for Apache RocketMQ supports a two-phase commit built on top of normal messages: the message's two-phase commit is bound to a local transaction so that the global commit result is consistent.

The solution of transactional messages in Message Queue for Apache RocketMQ provides the advantages of high performance, scalability, and simple business development.
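
A sketch of the producer side, based on the Apache RocketMQ 4.x Java client as commonly documented (TransactionMQProducer, TransactionListener); the name-server address, producer group, and topic are placeholders, and exact class names may differ between client versions.

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

// A half message is sent first; the local transaction runs in executeLocalTransaction;
// the broker later calls checkLocalTransaction if the earlier result was UNKNOW.
public class OrderTxProducer {
    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order-tx-group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // assumed local name server
        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // run the local DB transaction here (e.g. insert the order row)
                    return LocalTransactionState.COMMIT_MESSAGE;   // deliver the half message
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // discard the half message
                }
            }
            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // broker-initiated back-check: look up the local transaction's outcome
                return LocalTransactionState.COMMIT_MESSAGE;
            }
        });
        producer.start();
        producer.sendMessageInTransaction(
                new Message("order-topic", "order created".getBytes()), null);
        producer.shutdown();
    }
}
```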

Data consistency algorithm

Paxos



Paxos is an algorithm that is used to achieve consensus among a distributed set of computers that communicate via an asynchronous network. One or more clients propose a value to Paxos and we have a consensus when a majority of systems running Paxos agree on one of the proposed values. Paxos is widely used and is legendary in computer science since it is the first consensus algorithm that has been rigorously proved to be correct.

Basic Paxos

In most deployments of Paxos, each participating process acts in three roles: Proposer, Acceptor, and Learner. This reduces the message complexity significantly, without sacrificing correctness.

Paxos is a two-phase protocol, meaning that the proposers interact with the acceptors twice. At a high level:

Phase 1:
A proposer asks all the working acceptors whether any of them has already accepted a proposal. If the answer is no, the proposer proposes a value.

Phase 2:
If a majority of acceptors agree to this value then that is our consensus.
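
The acceptor side of these two phases can be sketched as follows (hypothetical message and return types); proposer retry logic and learner notification are omitted.

```java
// Acceptor-side sketch of Basic Paxos.
// prepare(n): promise not to accept proposals numbered below n, and report
//             the highest-numbered proposal (if any) already accepted.
// accept(n, v): accept the proposal unless a higher-numbered prepare was promised.
class PaxosAcceptor {
    private long promisedN = -1;   // highest proposal number promised
    private long acceptedN = -1;   // highest proposal number accepted
    private Object acceptedValue;  // value of that accepted proposal

    // Phase 1b: respond to a proposer's prepare(n).
    synchronized Promise prepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return new Promise(true, acceptedN, acceptedValue);
        }
        return new Promise(false, acceptedN, acceptedValue);
    }

    // Phase 2b: respond to a proposer's accept(n, value).
    synchronized boolean accept(long n, Object value) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;          // accepted; learners can be notified
        }
        return false;             // rejected: a higher-numbered prepare was promised
    }

    record Promise(boolean ok, long prevAcceptedN, Object prevAcceptedValue) {}
}
```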

Multi-Paxos



A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of the Basic Paxos protocol, a significant amount of overhead would result.

If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.

To achieve this, a round number I, which is incremented in each round by the same leader, is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.

ZAB(ZooKeeper Atomic Broadcast)



The ZAB protocol ensures that Zookeeper replication is done in order, and it is also responsible for electing the leader node and restoring any failed nodes. In a Zookeeper ensemble, the leader node is the heart of everything: every cluster has one leader node and the rest of the nodes are followers. All state-changing (write) requests go through the leader, which is responsible for replicating them across all its followers (and itself). Read requests, by contrast, can be served directly by whichever node (leader or follower) the client is connected to.

RAFT



Raft is based on Lamport's Multi-Paxos idea, with some simplifications and restrictions: for example, the log must be continuous, and a node can only be in one of three states, leader, follower, or candidate. The algorithm is relatively easy to understand and implement, and Raft is now the consensus algorithm of choice for distributed system development.
Raft essentially achieves consensus on a series of values, and consistency of each node's log, through a leader-based approach.
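
A skeleton of the three states and the election-driven transitions between them (illustrative only; log replication, voting rules, and persistence are omitted).

```java
// Minimal, illustrative state machine for a Raft server's role transitions.
class RaftNode {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;
    private long currentTerm = 0;
    private int votesGranted = 0;

    // Election timeout fired without hearing from a leader: become a candidate.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;       // start a new term
        votesGranted = 1;    // vote for self, then request votes from peers
    }

    // A peer granted its vote for the current term.
    void onVoteGranted(int clusterSize) {
        if (role != Role.CANDIDATE) return;
        votesGranted++;
        if (votesGranted > clusterSize / 2) {
            role = Role.LEADER;  // majority reached: start sending heartbeats / log entries
        }
    }

    // AppendEntries (heartbeat) received from a leader with an equal or newer term.
    void onAppendEntries(long leaderTerm) {
        if (leaderTerm >= currentTerm) {
            currentTerm = leaderTerm;
            role = Role.FOLLOWER;
        }
    }
}
```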

GOSSIP



A gossip protocol or epidemic protocol is a procedure or process of computer peer-to-peer communication that is based on the way epidemics spread. Some distributed systems use peer-to-peer gossip to ensure that data is disseminated to all members of a group. Some ad-hoc networks have no central registry and the only way to spread common data is to rely on each member to pass it along to their neighbors.
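
A minimal sketch of one push-pull gossip round, using a hypothetical version-number map as the state being disseminated: each node periodically picks a random peer and both sides keep the newer entry for every key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class GossipNode {
    final Map<String, Long> versions = new HashMap<>(); // key -> version ("newer wins")

    // One gossip round: exchange state with a single randomly chosen peer.
    void gossipOnce(List<GossipNode> peers) {
        if (peers.isEmpty()) return;
        GossipNode peer = peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
        merge(peer.versions);       // pull the peer's view
        peer.merge(this.versions);  // push our view (push-pull gossip)
    }

    void merge(Map<String, Long> remote) {
        remote.forEach((key, version) ->
                versions.merge(key, version, Math::max)); // keep the newest version seen
    }
}
```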


Table of contents [ hide ] Virtual memory Separation of user logical memory from physical memory. To run an extremely large process. Logical address space can be much larger than physical address space. To increase CPU/resource utilization. A higher degree of multiprogramming degree. To simplify programming tasks. A free programmer from memory limitation. To run programs faster. Less I/O would be needed to load or swap. Process & virtual memory Demand paging: only bring in the page containing the first instruction. Copy-on-write: the parent and the child process share the same frames initially, and frame-copy. when a page is written. Allow both the parent and the child process to share the same frames in memory. If either process modifies a frame, only then a frame is copied. COW allows efficient process creation(e.g., fork()). Free frames are allocated from a pool of zeroed-out frames(for security reasons). The content of the frame is erased to 0 Memory-Mapped File: map a fi