在RocksDB中,每个ColumnFamily都有自己的Memtable,互不影响.而在RocksDB中Memtable有多种实现(SkipList/HashSkipList/HashLinkList/Vector),具体的区别可以看这里,我们这次主要来分析默认的实现skiplist(只有skiplist是可以并发插入的).

    实现

    首先从创建Memtable开始,Memtable的创建(ColumnFamilyData::CreateNewMemtable)是在创建ColumnFamily(VersionSet::CreateColumnFamily)的时候创建的.这里就是创建memtable,然后设置到ColumnFamilyData的mem_域中.

    上面所提及的,RocksDB有多种MemTable的实现,那么它是如何来做的呢,RocksDB通过memtable_factory来根据用户的设置来创建不同的memtable.这里要注意的是核心的memtable实现是在MemTable这个类的table_域中.

    1. table_(ioptions.memtable_factory->CreateMemTableRep(
    2. comparator_, &arena_, ioptions.prefix_extractor, ioptions.info_log,
    3. column_family_id)),
    4. class MemTableRepFactory {
    5. public:
    6. virtual ~MemTableRepFactory() {}
    7. virtual MemTableRep* CreateMemTableRep(const MemTableRep::KeyComparator&,
    8. Allocator*, const SliceTransform*,
    9. Logger* logger) = 0;
    10. virtual MemTableRep* CreateMemTableRep(
    11. const MemTableRep::KeyComparator& key_cmp, Allocator* allocator,
    12. const SliceTransform* slice_transform, Logger* logger,
    13. uint32_t /* column_family_id */) {
    14. return CreateMemTableRep(key_cmp, allocator, slice_transform, logger);
    15. }
    16. ........................
    1. MemTableRep* SkipListFactory::CreateMemTableRep(
    2. const MemTableRep::KeyComparator& compare, Allocator* allocator,
    3. }

    最终就是创建SkipListRep对象,在这个对象里面会创建SkipList(class InlineSkipList).

    这里SkipList就不分析实现了,如果对这个数据结构不了解的,可以去看一下.这里我们只需要知道最终所有的memtable数据都是保存在SkipList中就可以了.

    在之前的分析中我们知道Memtable的插入是通过WriteBatch然后遍历ColumnFamily来插入的,而最终则是会调用MemTable::Add这个函数.

    1. bool MemTable::Add(SequenceNumber s, ValueType type,
    2. const Slice& key, /* user key */
    3. const Slice& value, bool allow_concurrent,
    4. MemTablePostProcessInfo* post_process_info) {
    5. bool res = table->InsertKeyConcurrently(handle);
    6. if (UNLIKELY(!res)) {
    7. return res;
    8. }
    9. ..............................
    10. }
    1. template <class Comparator>
    2. bool InlineSkipList<Comparator>::InsertConcurrently(const char* key) {
    3. Node* prev[kMaxPossibleHeight];
    4. Node* next[kMaxPossibleHeight];
    5. Splice splice;
    6. splice.prev_ = prev;
    7. splice.next_ = next;
    8. return Insert<true>(key, &splice, false);
    9. }

    看到这里或许会有疑问了,那就是skiplist里面只有key,而RocksDB是一个KV存储,那么这个KV是如何存储的呢,这里是这样的,RocksDB会将KV打包成一个key传递给SkipList, 对应的KEY的结构是这样的.

    而数据的格式化就在之前的MemTable::Add中实现的.

    1. uint32_t key_size = static_cast<uint32_t>(key.size());
    2. uint32_t val_size = static_cast<uint32_t>(value.size());
    3. const uint32_t encoded_len = VarintLength(internal_key_size) +
    4. internal_key_size + VarintLength(val_size) +
    5. val_size;
    6. char* buf = nullptr;
    7. std::unique_ptr<MemTableRep>& table =
    8. type == kTypeRangeDeletion ? range_del_table_ : table_;
    9. KeyHandle handle = table->Allocate(encoded_len, &buf);
    10. char* p = EncodeVarint32(buf, internal_key_size);
    11. memcpy(p, key.data(), key_size);
    12. Slice key_slice(p, key_size);
    13. p += key_size;
    14. uint64_t packed = PackSequenceAndType(s, type);
    15. EncodeFixed64(p, packed);
    16. p += 8;
    17. p = EncodeVarint32(p, val_size);
    18. memcpy(p, value.data(), val_size);

    而对于真正的KEY的解析是在SkipList的Comparator中实现的(compare_).下面的代码片段可以看到会解析出来真正的key,然后再进行查找以及插入.

    1. bool InlineSkipList<Comparator>::Insert(const char* key, Splice* splice,
    2. bool allow_partial_splice_fix) {
    3. Node* x = reinterpret_cast<Node*>(const_cast<char*>(key)) - 1;
    4. const DecodedKey key_decoded = compare_.decode_key(key);
    5. ...............................