跳到主要內容

Distributed transactions

Table of contents [ hide ] Basic theory  CAP States that any distributed data store can provide only two of the following three guarantees. Consistency Every read receives the most recent write or an error. Availability Every request receives a (non-error) response, without the guarantee that it contains the most recent write. Partition tolerance The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. Typical architecture of distributed systems When a network partition failure happens, it must be decided  whether to do one of the following: CP: cancel the operation and thus decrease the availability but ensure consistency AP: proceed with the operation and thus provide availability but risk inconsistency. BASE Basically-available, soft-state, eventual consistency. Base theory is the practical application of CAP theory, that is, under the premise of the existence of partitions and copies, through certain syste

Kafka

Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Why use a Message Queue like Kafka

1. Asynchronous

The program can continue execution without waiting for I/O to complete, increasing throughput.

2. Decoupling

It refers to reducing the dependencies between different parts of the system so that each component of the system can be developed, maintained, and evolved relatively independently. The main goal of decoupling is to reduce tight coupling between components to improve system flexibility, maintainability, and scalability.

3. Peak clipping

Peak clipping is essential to delay user requests more, filter user access needs layer by layer, and follow the principle of "the number of requests that ultimately land on the database is as small as possible".

Basic Concept



1. Client

Including producer and consumer.

2. Consumer Group

Each consumer can specify the consumer group to which he belongs. Each message will be consumed by multiple interested consumer groups.
But within each consumer group, a message means that it will be consumed by the consumer once.

3. Server-side Broker

A Kafka Server is a broker.

4. Topic

In a logical concept, a topic can be viewed as a set of business information that clients can produce or consume by binding specific topics of interest.

Producer   Workflow



1. Serialize

The serialization mechanism is a very important optimization mechanism in high-concurrency scenarios. Efficient serialization implementation can greatly improve distributed systems
Systematic network transmission and data disk capabilities.

2. Partition

According to the specified key, the message is allocated to a specific partition through a specific algorithm.

3. Compression

In this step, the producer record is compressed before it’s written to the record accumulator.

4. Record accumulator

The message to be sent by KafkaProducer will be in
Cached in ReocrdAccumulator and then sent to Kafka Broker in batches

5. Sender 

The sender is an independent thread in KafkaProducer used to send messages. As you can see here, each KafkaProducer object corresponds to a sender thread. He will be responsible for sending messages from the RecordAccumulator to Kafka.

The Sender does not send all the messages cached in the RecordAccumulator at once, but only takes out a part of the messages at a time. He only obtains the ProducerBatch messages whose cache content in the RecordAccumulator reaches the BATCH_SIZE_CONFIG size.

Of course, if there are relatively few messages, the message size in ProducerBatch cannot reach BATCH_SIZE_CONFIG for a long time, and the Sender will not wait forever. The maximum waiting time is LINGER_MS_CONFIG. Then the messages in ProducerBatch will be read.

6. ACK

Determine whether the message is successfully sent to the Broker.

Acks=0

The producer does not care whether the Broker writes the message to the Partition, it just sends the message and then forgets it. Highest throughput, but lowest data security.

Acks=all or -1

The producer needs to wait for all Partitions (Leader Partition and its corresponding FollowerPartition) on the Broker side to be written before getting the return result. This method is the most secure for data, but it will take longer to send messages each time. swallow
Throughput is the lowest.

Acks = 1

Is a relatively neutral strategy. After the Leader Partition writes its own message, it returns the result to the producer.

7. Producer message idempotence

Kafka ensures that no matter what the Producer sends to the Broker, No matter how many times the data is repeated, the Broker only retains one message.


8. Producer message transaction

This transaction mechanism ensures that this batch of messages can successfully maintain idempotence at the same time. Or this batch of messages fails at the same time so that the producer can start retrying as a whole and the messages will not be repeated.

Cluster MetaData

The main metadata of the Kafka cluster is stored in Zookeeper. There are two main states.

Controller

Among multiple Brokers, one Broker needs to be elected to serve as
Controller role. The Controller role manages the partition and replica status of the entire cluster.

Leader

In multiple Partitions under the same Topic, a Leader role needs to be elected. The Partition of the Leader role is responsible for data interaction with the client.


留言

這個網誌中的熱門文章

ShardingSphere

Table of contents [ hide ]  ShardingSphere The distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database. ShardingJDBC ShardingSphere-JDBC is a lightweight Java framework that provides additional services at Java’s JDBC layer. ShardingProxy ShardingSphere-Proxy is a transparent database proxy, providing a database server that encapsulates database binary protocol to support heterogeneous languages. Core Concept 1. Virtual Database Provides a virtual database with sharding capabilities, allowing applications to be easily used as a single database 2. Real Database The database that stores real data in the shardingShereDatasource instance for use by ShardingSphere 3. Logic Table Tables used by the application 4. Real Table In the table that stores real data, the data structure is the same as the logical table. The application maintains the mapping between the logical table and the real table. All real tables map to ShardingSpher

Program design template

Table of contents [ hide ] Designing programs is a common job for every engineer, so templates help simplify the job of designing each program. Update History Let everyone know the latest version and update information. background Write down why we need this program and what is the background for building this program. Target Target functionality The goal of the program, what function to achieve, the main module and submodule, and these modules' relationship. Target performance Specific benchmarks such as QPS or milliseconds to evaluate programs. Target architecture Stability. Readability. Maintainability. Extendability. ... Others Target Overall design Design principles and thinking Explain how and why the program was designed. Overall architecture An overall architectural picture. Dependency Module dependencies on other modules or programs. Detail design Program flow design Program flow design diagram. API design The details of the API, and how to interact with the frontend

Virtual memory

Table of contents [ hide ] Virtual memory Separation of user logical memory from physical memory. To run an extremely large process. Logical address space can be much larger than physical address space. To increase CPU/resource utilization. A higher degree of multiprogramming degree. To simplify programming tasks. A free programmer from memory limitation. To run programs faster. Less I/O would be needed to load or swap. Process & virtual memory Demand paging: only bring in the page containing the first instruction. Copy-on-write: the parent and the child process share the same frames initially, and frame-copy. when a page is written. Allow both the parent and the child process to share the same frames in memory. If either process modifies a frame, only then a frame is copied. COW allows efficient process creation(e.g., fork()). Free frames are allocated from a pool of zeroed-out frames(for security reasons). The content of the frame is erased to 0 Memory-Mapped File: map a fi