"Dark Horse Redis Principles" 1. Data Structure

Redis Low-Level C Language

Data Structures#

Dynamic String SDS#

Redis does not use the C language's String, as it is actually a character array.

Example:
char* s="hello" is {'h', 'e', 'l', 'l', 'o', '\0'}

Thus, Redis has built a new string structure called Simple Dynamic String, abbreviated as SDS.

image.png|500

Here, reading does not depend on '\0' just to conform to C language; the actual usage is len.

image.png|500

When less than 1M, memory allocation increases quickly; after that, it applies for memory in intervals of 1M, noting that memory allocation is very costly.

Binary safety means reading based on len, and it will not suddenly stop due to the appearance of '\0' in between.

IntSet#

IntSet is an implementation of a set collection in Redis, based on an integer array, with variable length and ordered characteristics.

image.png|500

|500

Encoding determines the size occupied by each element; here, 16 bits mean 2 bytes, and length determines the number of elements stored in the array.

If the encoding is 16 but an element exceeding the encoding is added, elements are copied in reverse order to the correct position (reverse to avoid overwriting), then the new element is added, and finally, the encoding and length in the Header are modified.

image.png|500

IntSet can be seen as a special integer array with some characteristics:

Redis ensures that the elements in IntSet are unique and ordered.
It has a type upgrade mechanism that can save memory space.
It uses binary search for querying at the bottom layer.

Dict#

Dict is a key-value pair, and the size of the hash table is always $2^n$. The used count may exceed the size, in which case a bucket will degenerate into a linked list.

Union refers to any one of them.

image.png|500

sizemask = size - 1, which is n bits of 1.

image.png|500

Actual composition.

image.png|500

Expansion

The HashTable in Dict is an array combined with a linked list, so when considering too much data to solve data conflicts, it expands to the next power of $2^n$.

image.png|500

Shrinkage

Every time an element is deleted, it checks if loadFactor < 0.1, and it will shrink to the next power of $2^n$, with a minimum of 4.

image.png|500

Rebuilding

Whether expanding or shrinking, it involves rebuilding, changing size and sizemask, so rehashing is required. This is why dict has ht[2], one in use and one empty.

image.png|500

Example of expansion and rebuilding.

image.png|500

|500

When the data volume is very large, it cannot be rebuilt all at once, so it is actually an incremental hash, migrating slowly. Previous data may be in ht[0]/ht[1], so new additions are placed in ht[1], while ht[0] only decreases.

image.png|500

ZipList#

ZipList is a special "double-ended linked list" composed of a series of specially encoded contiguous memory blocks. It allows push/pop operations from either end, with a time complexity of O(1).

image.png|500

Size.

image.png|500

You can obtain the starting address of the previous entry for reverse traversal by the length occupied by the previous entry, and you can obtain the starting address of the next entry for forward traversal by calculating the length in the structure.

image.png|500

Note that Redis uses little-endian byte order here.
String type.

image.png|500

Integer types from 0-12 can omit content and directly save the value in the last 4 bits of encoding.

image.png|500

ZipList's Chain Update Problem#

The main issue is that the prevlen field, which records the length of the previous entry, will change from 1 byte to 5 bytes when the length is greater than or equal to 254 bytes, causing a chain update in the next entry's prevlen.

When inserting or modifying an entry1 in ZipList to be greater than or equal to 254 bytes, the prevlen field in entry2 following entry1 needs to change from 1 byte to 5 bytes to record the new length of entry1.

This change causes entry3 following entry2 to also need to modify the prevlen3 field to record the new length after entry2 changes. In extreme cases, if every entry changes from 1 byte to 5 bytes when they are all greater than or equal to 254 bytes, there will be a series of chain updates, causing insertion O(1) to degrade to O(n).

image.png|500

Why is it greater than or equal to 254 and not 255 bytes when prevlen uses 5 bytes?
prevlen has 8 bits, which can represent 0-255, but 255, i.e., 0xff, is occupied by the special marker zlend, which marks the end of the compressed list ZipList, while prevlen happens to be the start of an entry. Using 255 as a boundary would conflict with the zlend end marker.

ZipList Characteristics

The compressed list can be seen as a "doubly linked list" of contiguous memory space.
The nodes in the list are not connected by pointers but by recording the lengths of the previous and current nodes for addressing, resulting in lower memory usage.
If the list data is too much, leading to a long linked list, it may affect query performance.
Adding or deleting large data may cause chain update issues.

QuickList#

Problems with ZipList:

Saving the front and back linked list pointers requires 16 bytes, while ZipList saves a lot of memory but requires continuous memory space, making memory allocation efficiency relatively low; it limits ZipList length and entry size.
To store a large amount of data exceeding ZipList's optimal storage limit; multiple ZipList shards are created for storage.
After data splitting, it becomes too scattered, making management and searching inconvenient; QuickList double-ended linked list is introduced, with nodes recording ZipList.

image.png|500

Configuration item list-max-ziplist-size=-2 defaults to a ZipList memory usage not exceeding 8kb.

image.png|500

Configuration item for whether to compress nodes, i.e., the number of uncompressed nodes at both ends, explained as frequently used head and tail nodes, while middle nodes are not often used.

image.png|500

Here, one node at each end is uncompressed, so the two middle ZipLists are re-encoded.

image.png|500

QuickList Characteristics:

It is a double-ended linked list with nodes as ZipList; it has more memory fragmentation and less continuous occupation.
Nodes use ZipList, solving the memory usage problem of traditional linked lists; ZipList saves memory.
It controls the size of ZipList, solving the efficiency problem of continuous memory space allocation; it shards ZipList, avoiding the need for large continuous memory.
Intermediate nodes can be compressed, further saving memory.

SkipList#

SkipList is primarily a linked list.

Elements are arranged in ascending order.
Nodes may contain multiple pointers with different spans; this is at least faster than traversing one by one, allowing up to 32.

image.png|500

Backward pointers point back, while forward pointers use the level[] array, with at least one span pointer, allowing for quick forward jumps and then step-by-step exploration to the correct node.

image.png|500

During searching, it retrieves from higher levels downwards, similar to binary search, trading space for time, as it stores nodes after certain spans in advance.

image.png|500

SkipList Characteristics:

The SkipList is a doubly linked list, with each node containing a score and an element.
Nodes are sorted by score value; if scores are the same, they are sorted by element dictionary order.
Each node can contain multiple layers of pointers, with the number of layers being a random number between 1 and 32.
The spans to the next node differ for different layer pointers; the higher the layer, the larger the span.
The efficiency of adding, deleting, modifying, and querying is roughly the same as that of a red-black tree, but the implementation is simpler.

RedisObject#

Any data type's key and value in Redis will be encapsulated as RedisObject, also known as Redis object.

image.png|500

Type, encoding, and lru together occupy 32 bits, 4 bytes; refcount 4 bytes; *ptr 8 bytes, making the RedisObject header occupy 16 bytes.

For example, for String type, one key-value consumes one RedisObject header, which indeed takes up space.

Encoding method encoding.

image.png|500

Five Data Types#

image.png|500

What about BitMap and HyperLogLog?
These derived data structures are actually implemented based on the String data structure, not new types.
Also, GEO is based on SortedSet.

String#

String is the most common data storage type in Redis.

String RAW Encoding#

The basic encoding method for String is RAW, implemented based on Simple Dynamic String SDS, with a storage limit of 512mb.

image.png|500

String EmbStr Encoding#

If the length of the stored SDS is less than or equal to 44 bytes, EMBSTR encoding will be used; at this time, the RedisObject Head and SDS are in a contiguous space; only one memory allocation function call is needed, making it more efficient; this is related to Redis's memory allocation algorithm. In short, if the SDS length is 44 bytes, Head + SDS is exactly 64 bytes, avoiding memory fragmentation.

image.png|500

String INT Encoding#

If the stored string is an integer value and its size is within the LONG_MAX range, INT encoding will be used, directly saving the data at the ptr pointer location of RedisObject, which is exactly 8 bytes, no longer requiring SDS.

String raw, embstr, int encoding#

image.png|500

List#

The List data type can operate on elements at both ends, using a QuickList data structure that combines LinkedList + ZipList.

image.png|500

List Structure.

image.png|500

Set#

The Set data type is a single-column collection that does not guarantee order but ensures element uniqueness, allowing for intersection, union, and difference; for querying, SISMEMBER is very frequent, so HashTable is used.

image.png|500

Using HT encoding Dict, with key storing elements and value uniformly set to null.
When all stored elements are integers and the number of elements does not exceed set-max-intset-entries, Set uses IntSet encoding to save memory.

image.png|500

If using IntSet exceeds the maximum number or includes non-integers, it converts to HT encoding Dict.

image.png|500

IntSet encoded Set.

image.png|500

HashTable encoded Set.

image.png|500

ZSET#

ZSet, or SortedSet, has each element with a score value and a member value, which can be sorted by score, ensuring member uniqueness and allowing for member score queries; the relationship between member and score is key-value, meaning ZSet requires key-value storage, unique keys, and sortable, so it can choose between SkipList and HashTable.

SkipList sorts by score, satisfying key-value storage and sortable, but when querying score by member, it degenerates into a linked list, and member uniqueness is also inconvenient.

HashTable sets key-value as member-score, but it cannot satisfy the function of sorting by score.

Thus, both data structures are used, with Dict ensuring member uniqueness and ZSet ensuring sortable and range queries.

image.png|500

ZSet uses a combination of Dict and SkipList, with encoding set to SKIPLIST.

image.png|500

However, when there are not many elements, using Dict + ZipList consumes a lot of memory and is not very obvious, so ZSet will use ZipList structure to save memory when the following conditions are met:

The number of elements is less than zset_max_ziplist_entries, with a default value of 128.
Each element is less than zset_max_ziplist_value bytes, with a default value of 64.

When initializing, if both initial conditions are met, then ZipList is used.

image.png|500

When adding elements, if it is ZipList, it will check whether to convert to SkipList + HashTable structure.

image.png|500

However, ZipList does not meet the three requirements of ZSet; why can it use ZipList for small data and small data volume?
It can simulate business logic to support sorting, key-value pair sorting, and continuous memory space, with element/member in front and value/score behind.

image.png|500

Hash#

The Hash structure is similar to the ZSet structure, requiring key-value storage, unique keys, and allowing values to be obtained based on keys; it must be a hashTable.

Differences from ZSet:

In ZSet, the key is member, and the value is score; in Hash, both key and value can be any value.
ZSet needs to be sorted by score; Hash does not require sorting.

The Hash structure defaults to using ZipList encoding to save memory, with two adjacent entries in ZipList storing field and value respectively.
When the data volume is large, the Hash structure will convert to HT encoding, i.e., Dict, triggered by two conditions:

The number of elements in the zipList exceeds hash-max-ziplist-entries, defaulting to 512.
Any entry size in the ZipList exceeds hash-max-ziplist-value, defaulting to 64 bytes.

image.png|500

Breaking any one condition will convert to HT encoding.

image.png|500

Hash source code shows that it is initialized as ZipList by default.

image.png|500

Then, when trying to add, it checks whether to convert to HT encoding based on element size.

image.png|500

Finally, during addition, it checks for equality; if it is ZipList, it checks the length after insertion, and if it exceeds, it converts to HT encoding.

image.png|500

This article is updated synchronously to xLog by Mix Space. The original link is https://blog.0xling.cyou/posts/redis/redis-3