Cassandra database storage structure: data write, read, and delete

**Foreword**

This article focuses on the data storage format in Cassandra, covering both in-memory and on-disk storage. Cassandra is known for its excellent write performance, but this is not due to its data structures alone; it is primarily the result of an efficient write path. We will also analyze the factors that affect read performance and discuss the improvements Cassandra has made.

Cassandra's data storage structure has three main components:

1. **CommitLog**: Records all client-submitted data and operations. It ensures that data can be recovered if it has not yet been persisted to disk.
2. **Memtable**: Stores user-written data in memory. Its object structure is explained in detail later. There is also a BinaryMemtable format, which is currently not used in Cassandra and is not discussed here.
3. **SSTable**: Represents data that has been written to disk, consisting of Data, Index, and Filter files.

**CommitLog Data Format**

The CommitLog stores data as byte arrays laid out in a fixed format. These are written to an I/O buffer and periodically flushed to disk. As mentioned in a previous article, there are two persistence modes, Periodic and Batch. Their on-disk format is identical; the former flushes asynchronously and the latter synchronously, so they differ only in how often data actually reaches disk. The class structure related to the CommitLog is shown in Figure 1.

Figure 1. Related class structure diagram for CommitLog

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102130U2552.png)

The persistence strategy is simple: the RowMutation object submitted by the user is first serialized into a byte array. The object and the byte array are then passed to the LogRecordAdder object, which calls the write method of the CommitLogSegment to complete the write operation. The code for this method is as follows:

Listing 1. CommitLogSegment.write

```java
public CommitLogSegment.CommitLogContext write(RowMutation rowMutation, Object serializedRow) {
    long currentPosition = -1L;
    ...
    // Each log entry is written as: length, serialized RowMutation bytes, CRC32 checksum.
    Checksum checksum = new CRC32();
    if (serializedRow instanceof DataOutputBuffer) {
        DataOutputBuffer buffer = (DataOutputBuffer) serializedRow;
        logWriter.writeLong(buffer.getLength());
        logWriter.write(buffer.getData(), 0, buffer.getLength());
        checksum.update(buffer.getData(), 0, buffer.getLength());
    } else {
        assert serializedRow instanceof byte[];
        byte[] bytes = (byte[]) serializedRow;
        logWriter.writeLong(bytes.length);
        logWriter.write(bytes);
        checksum.update(bytes, 0, bytes.length);
    }
    logWriter.writeLong(checksum.getValue());
    ...
}
```

The main job of this code is as follows: if the columnFamily ID has not yet been recorded in the header, the current position in the CommitLog file is recorded for it and the header is serialized again, overwriting the previous header; the serialized RowMutation is then written into the CommitLog file buffer, followed by a CRC32 checksum. If the columnFamily ID is already recorded, the serialized RowMutation and its checksum are written directly. The byte array format is as follows:

Figure 2. CommitLog file array structure

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131002550.png)

Each distinct columnFamily ID is kept in the header together with a position, which makes it possible to determine which data has not yet been flushed to disk.
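To make the header's role concrete, below is a minimal sketch of such a per-ColumnFamily position table. The class and method names (CommitLogHeaderSketch, turnOn, turnOff) are illustrative assumptions for this article, not Cassandra's actual CommitLogHeader API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: for every ColumnFamily with unflushed data, remember the
// log position of its first unflushed entry.
class CommitLogHeaderSketch {
    private final Map<Integer, Integer> cfIdToPosition = new HashMap<>();

    // Called the first time a ColumnFamily is written to this log segment.
    void turnOn(int cfId, int position) {
        cfIdToPosition.putIfAbsent(cfId, position);
    }

    // Called once the ColumnFamily's Memtable has been flushed to an SSTable.
    void turnOff(int cfId) {
        cfIdToPosition.remove(cfId);
    }

    // Recovery only needs to replay the log from the lowest recorded position;
    // 0 means nothing in this segment needs to be replayed.
    int getLowestPosition() {
        return cfIdToPosition.values().stream().min(Integer::compare).orElse(0);
    }
}
```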
To recover from the CommitLog after a restart, the recover method is used:

Listing 2. CommitLog.recover

```java
public static void recover(File[] clogs) throws IOException {
    ...
    final CommitLogHeader clHeader = CommitLogHeader.readCommitLogHeader(reader);
    int lowPos = CommitLogHeader.getLowestPosition(clHeader);
    if (lowPos == 0)
        break;
    reader.seek(lowPos);
    while (!reader.isEOF()) {
        try {
            // Read one entry back: length, serialized RowMutation bytes, claimed CRC32.
            bytes = new byte[(int) reader.readLong()];
            reader.readFully(bytes);
            claimedCRC32 = reader.readLong();
        }
        ...
        ByteArrayInputStream bufIn = new ByteArrayInputStream(bytes);
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        // Skip entries whose checksum does not match (e.g. partially written entries).
        if (claimedCRC32 != checksum.getValue())
            continue;
        final RowMutation rm = RowMutation.serializer().deserialize(new DataInputStream(bufIn));
    }
    ...
}
```

The idea behind this code is to deserialize the CommitLog header into a CommitLogHeader object, find the lowest position of any RowMutation that has not yet been written back, read the serialized RowMutation data from that point onward, and deserialize it back into RowMutation objects. These objects are then replayed into the Memtable rather than written directly to disk. The evolution of the CommitLog data is illustrated in the following figure:

Figure 3. Change process of the CommitLog data format

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131022639.png)

**Memtable In-Memory Data Structure**

The Memtable data structure is relatively simple. Each ColumnFamily corresponds to a unique Memtable object. A Memtable mainly maintains a ConcurrentSkipListMap whose keys are DecoratedKey objects and whose values are ColumnFamily objects. When a new RowMutation object arrives, the Memtable checks whether the key already has a ColumnFamily entry. If not, the pair is added; if so, the existing ColumnFamily is retrieved and the new columns are merged into it. The class structure related to the Memtable is shown in Figure 4:

Figure 4. Memtable related class structure diagram

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131041194.png)

The data in the Memtable is flushed to disk according to parameters in the configuration file, which were detailed in a previous article. It has been mentioned many times that Cassandra has excellent write performance. The reason is that Cassandra writes data to the Memtable first, an in-memory structure, so a Cassandra write is essentially a memory write. The following figure illustrates how a key/value pair is written into the Memtable data structure:

Figure 5. Data is written to Memtable

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P10213105G46.png)
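As a minimal sketch of the Memtable behaviour just described, the put/merge logic could look like the following. The decorated key is represented as a plain String and the ColumnFamily as a simple column map; these are simplified stand-ins, not Cassandra's real classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Simplified stand-in for a ColumnFamily: column name -> value.
class SimpleColumnFamily {
    final Map<String, byte[]> columns = new ConcurrentHashMap<>();

    void addAll(SimpleColumnFamily other) {
        // Merge columns; a real merge would also reconcile timestamps.
        columns.putAll(other.columns);
    }
}

// Sketch of the Memtable: a sorted, concurrent map keyed by the (decorated) row key.
class SimpleMemtable {
    private final ConcurrentNavigableMap<String, SimpleColumnFamily> rows =
            new ConcurrentSkipListMap<>();

    void apply(String decoratedKey, SimpleColumnFamily update) {
        SimpleColumnFamily existing = rows.putIfAbsent(decoratedKey, update);
        if (existing != null) {
            // Key already present: merge the new columns into the existing ColumnFamily.
            existing.addAll(update);
        }
    }
}
```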
**SSTable Data Format**

Each time data is added to the Memtable, the program checks whether the Memtable meets the conditions for being written to disk. If so, the Memtable is flushed. Let's look at the classes involved in this process. The related class diagram is shown in Figure 6:

Figure 6. SSTable persistence class structure diagram

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P10213111J01.png)

Once the flush condition is satisfied, an SSTableWriter object is created. All (DecoratedKey, ColumnFamily) entries in the Memtable are taken out and serialized into a DataOutputBuffer. The SSTableWriter then writes the Data, Index, and Filter files based on the DecoratedKey and the DataOutputBuffer. The Data file format is as follows:

Figure 7. Data file structure of SSTable

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131134625.png)

The Data file is organized according to the byte layout above. While data is written to the Data file, it is also written to the Index file. What does the Index file store? It records every Key together with the offset of that Key's data in the Data file, as shown in Figure 8:

Figure 8. Index file structure

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131155Z3.png)

The Index file is an index on the Key only. There is currently no index on super columns or columns, so looking up a column is slower than looking up a Key. After the Index file is written, the Filter file is written. Its contents are the serialization result of a BloomFilter object, and its file structure is shown in Figure 9:

Figure 9. Filter file structure

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P1021312131D.png)

The BloomFilter uses hash functions to determine quickly that a given Key is definitely not in the current SSTable. The BloomFilter object for each SSTable is kept in memory, and the Filter file is its persistent copy. The data formats of the three files are shown together in the following figure:

Figure 10. SSTable data format conversion

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131232463.png)

After the three files are written, the CommitLog header needs to be updated to record that this ColumnFamily has been persisted up to the corresponding position, so that its data is no longer replayed on recovery. While a Memtable is being written to disk, it is placed in the memtablesPendingFlush container so that reads issued during the flush still see its data. This will be discussed further in the data reading section.
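To illustrate the purpose of the Filter file on the read path, below is a minimal Bloom filter sketch. The sizing and hashing are simplified assumptions and do not reproduce Cassandra's actual BloomFilter implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter: numHashes hash functions over a fixed-size bit set.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Called for every key written to the SSTable.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(indexFor(key, i));
        }
    }

    // false means the key is definitely absent; true means it may be present.
    boolean isPresent(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(indexFor(key, i))) {
                return false;
            }
        }
        return true;
    }

    private int indexFor(String key, int seed) {
        int h = seed * 0x9E3779B1;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = h * 31 + b;
        }
        return Math.floorMod(h, numBits);
    }
}
```

A read would consult this in-memory filter first; only when isPresent returns true does it pay the cost of searching the Index file for the Key's offset and then seeking into the Data file.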
**Data Writing**

Writing data to Cassandra involves two steps:

1. Find the nodes that should store the data.
2. Write the data to those nodes.

When a client writes data, it must specify the Keyspace, ColumnFamily, Key, Column Name, and Value; it can also specify a Timestamp and the consistency level for the operation. The main classes involved in data writing are shown in the following figure:

Figure 11. Insert related class diagram

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131250b3.png)

The overall write logic is as follows. When CassandraServer receives the data to be written, it creates a RowMutation object and a QueryPath object that holds the ColumnFamily, Column Name, or Super Column Name, and stores all the user-submitted data in the Map structure of the RowMutation object. Next, it computes which nodes in the cluster should store the data based on the submitted Key: the Key is converted into a Token, and the node closest to that Token is found by binary search over the Token ring of the entire cluster (a simplified sketch of this lookup follows Figure 12). If multiple replicas are configured, the corresponding number of successive nodes on the Token ring is returned. This forms a basic list of nodes; Cassandra then checks whether these nodes are working properly and finds replacements for any that are not. Nodes that are bootstrapping are also taken into account, eventually forming the final list of target nodes, and the data is sent to them. Each target node then saves the data to its Memtable and CommitLog. The result is returned synchronously or asynchronously depending on the consistency level specified by the user, and if a node reports a failure the data is resent.

The following figure shows the timing diagram for writing data to the Memtable when Cassandra receives a piece of data:

Figure 12. Timing diagram of the Insert operation

![Cassandra database storage structure](http://i.bosscdn.com/blog/23/62/48/6-1P102131309216.png)
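As a minimal sketch of the replica selection just described, the snippet below maps a key to a token and walks a sorted ring to collect the required number of nodes. Using the key's hash code as the token and a TreeMap as the ring are simplifications for illustration; Cassandra derives tokens with a partitioner and searches a sorted token list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of replica selection on a token ring: token -> node.
class TokenRingSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int token, String node) {
        ring.put(token, node);
    }

    // Return up to replicationFactor distinct nodes, starting from the first node
    // whose token is >= the key's token and walking clockwise around the ring.
    List<String> getReplicas(String key, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        int token = key.hashCode();
        Integer start = ring.ceilingKey(token);
        if (start == null) {
            start = ring.firstKey(); // wrap around the end of the ring
        }
        for (Integer t : ringOrderFrom(start)) {
            if (replicas.size() == replicationFactor) {
                break;
            }
            String node = ring.get(t);
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }

    // Tokens in ring order, beginning at the given token and wrapping once.
    private List<Integer> ringOrderFrom(int start) {
        List<Integer> order = new ArrayList<>(ring.tailMap(start).keySet());
        order.addAll(ring.headMap(start).keySet());
        return order;
    }
}
```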
