Skip Lists

2021-05-18

0. 前言

Reference: 9.3

Dictionary 有时候也被称作 Map，存的是 key-value pairs。

key 必须 unique，用 key 可以找到对应的 value
所有的 dictionary 都必须提供 key-value pair 的增改查删功能以及根据 key 找 value 的功能

比如下图中给到了key 93，我们就需要输出这个 key 对应的 value ->be

An Example Map

The abilities to put a key-value pair into the map and then look up a data value associated with a given key are the fundamental operations that all maps must provide.

1. Map (ADT)

Map 是一个抽象数据类型，它的定义是：

a collection of key-value pairs where values can be accessed via their associated key

Operation 包括：

Constructor: Map()
增加新的键值对：put(key, value) 当 map 中不存在同样的 key 时，把这个新的键值对加进去；
查找：get(key) 在 map 中查找这个 key，返回对应的 value
……

2. Implementing a Dictionary in Python

我们可以用 hash table 来实现 Map，我们把 key 用 hash function 来 hash，然后把这些 key 放在一个 collection 里面，方便我们 search 和 retrieve value。

这种查找方法的时间复杂度大概是 O(1)，但是根据我们的 table size，collision 以及 collision resolution strategy 可能会稍有不同。

Our analysis showed that this technique could potentially yield a 𝑂(1)O(1) search. However, performance degraded due to issues such as table size, collisions, and collision resolution strategy.

我们在讲树结构的时候也说过可以用二叉查找树来实现 Map，key-value pair 就是一个 node，当树平衡的时候查找时间复杂度大概是 $O(log^n)$，但是显然，当树不平衡的时候，我们的查找效率就会降低。

如何避免以上实现方式的缺点呢？

隆重推出一个新的数据结构—— skip list（跳表）。

3. Skip List 跳表

Skip List 是什么？

它是一种 Probabilistic data structures。

概率数据结构（Probabilistic data structures）是一组数据结构，它们对于大数据和流式应用来说非常有用。一般而言，这类数据结构使用哈希函数（Hash function）来随机化并紧凑地表示一个项的集合。忽略掉碰撞（Collision）的情况，但错误可以在一定的阈值下得到很好的控制。

先来张图感受一下。

An Example Skip List

跳表就是一个 2d linked list，link 可以向右或者向下走；

list 的 head（头）在左上角，这是唯一的跳表入口。

A skip list is basically a two-dimensional linked list where the links all go forward (to the right) or down. The head of the list can be seen in the upper-left corner. Note that this is the only entry point into the skip list structure.

##组成元素：Data Node

holds key and associated value
每个data node 有两个reference，一个向下，一个向右

The Body of the Skip List Is Made Up of Data Nodes

A Single Data Node

组成元素：Header Nodes

这幅图的左边和中间各有一个框框，左边的框里面包含的是 a linked list of header nodes.

Header nodes 只有两个 reference - down and next.

next 指向 a linked list of data nodes.

down 指向 the next lower header node.

Header Nodes and Towers

Each Header Node Holds Two References

Towers

上图的中间框框包含一系列 data nodes，这个 column 就被称为 tower（想象一下 excel 的表格，是不是很像）。

tower 是被 data node 的 down reference 联系在一起的；
tower 的高度可能是不同的；

Towers are linked together by the down reference in the data node.

Level

每个横向的 collection 其实就是一个 ordered linked list of data nodes。

排序是用 key 来排序的；
我们把这种 collection 叫做 level，level 从 0 开始，0 是底层；
level 0 是一个包含所有 node 的 collection；
当我们往 level 更高一点走的时候，我们的 node 数量会减少（注意！！！），这也是 skip list 的一个重要特性，能帮助我们更有效地进行查找；
每个 level 的 node 数量是直接跟 tower 的高度相关的。

Levels are named starting with 0 at the lowest row. Level 0 consists of the entire collection of nodes. Every key-value pair must be present in the level-0 linked list. However, as we move up to higher levels, we see that the number of nodes decreases. This is one of the important characteristics of a skip list and will lead to our efficient search. Again, it can be seen that the number of nodes at each level is directly related to the heights of the towers.

Each Horizontal Group of Data Nodes Is a Level

4. Skip List - Code

两种 node （HeaderNode 和 DataNode）的实现方式和之前学的 simple linked list 相似，reference 一开始都是 None。

SkipList 的 constructor 只需要设置一个 head，初始化为 None。

When a skip list is created there are no data and therefore no header nodes. The head of the skip list is set to None. As key-value pairs are added to the structure, the list head refers to the first header node which in turn provides access to a linked list of data nodes as well as access to lower levels.

class HeaderNode:
    def __init__(self):
        self._next = None
        self._down = None

    @property
    def next(self):
        return self._next

    @property
    def down(self):
        return self._down

    @next.setter
    def next(self, value):
        self._next = value

    @down.setter
    def down(self, value):
        self._down = value
        
class DataNode:
    def __init__(self, key, value):
        self._key = key
        self._data = value
        self._next = None
        self._down = None

    @property
    def key(self):
        return self._key

    @property
    def data(self):
        return self._data

    @property
    def next(self):
        return self._next

    @property
    def down(self):
        return self._down

    @data.setter
    def data(self, value):
        self._data = value

    @next.setter
    def next(self, value):
        self._next = value

    @down.setter
    def down(self, value):
        self._down = value

class SkipList:
    def __init__(self):
        self._head = None

5. Searching a Skip List

下图展示了找 key=77 的值的过程（带星星的是经过的 node）。

start with head node
go horizontal - key=31, 31<77 => move forward
there’s no node from 31 at level 3, drop down to level 2
look to the right, node with key=77 => search successfully found node
return “of”

这个过程中，31 < 77 这个对比，让我们可以跳过 17 和 26 这两个 node，我们也不需要经过 lv 1 的 54，直接在 lv 2 就能找到想要的 node，通过分层的方法，我们的查找效率大大提高了！！

Note: 只有 level 0 才包含所有 node。

Since each level is an ordered linked list, a key mismatch provides us with very useful information.

Searching for the Key 77

Code

假如 current next 是 None，向下；

假如 current next 是 search key，return data；

假如 current next key > search key，向下；

假如 current next key <= search key，向前；

def search(self, key):
    current = self._head

    while current:
        if current.next is None:  
            current = current.down
        else:
            if current.next.key == key:
                return current.next.data
            else:
                if key < current.next.key:
                    current = current.down
                else:
                    current = current.next
    return None

6. Adding Key-Value Pairs to a Skip List

那么问题来了，怎么构造这样一个 Skip List 呢？

如何把一个新的 key-value pair 加入 Skip List 中呢？

假设这个 key 不存在，我们要加入 key = 65，需要两步：

search skip list，找到这个新 key 应该所在的位置（就是做一次 search）；

我们最终会到达 level 0
然后 create a new data node，加入 level 0 的 linked list 里面
同时我们还要记录整个查找的路径（利用 stack），方便我们在 insert 的时候把经过的 level 都加上新的 data node

The height of the tower for the new entry will not be predetermined but instead will be completely probabilistic. In essence, we will “flip a coin” to decide whether to add another level to the tower. Each time the coin comes up heads, we will add one more level to the current tower.

Searching for the Key 65

Adding the Data Node and Tower for 65

Tower 高度 - 模拟抛硬币

return 1 => heads

return 0 => tail

每次加一个 level，我们都需要加一个 header node。

因为这个随机性的存在，即使 key 一样，我们每次生成的跳表也可能是不同高度的。

Each time a new level is added to the tower, a new data node and a new header node are created. Depending on the random nature of the coin flip, the height of the towers for any particular key is bound to change each time we build the skip list.

from random import randrange
def flip():
    return randrange(2)

Code for inserting new data node

def insert(self, key, value):
    if self._head is None:
        self._head = HeaderNode()
        temp = DataNode(key, value)
        self._head.next = temp
        top = temp
        while flip() == 1:
            if tower.is_empty():
                newhead = HeaderNode()
                temp = DataNode(key, value)
                temp.down = top
                newhead.next = temp
                newhead.down = self._head
                self._head = newhead
                top = temp
            else:
                next_level = tower.pop()
                temp = DataNode(key, value)
                temp.down = top
                temp.next = next_level.next
                next_level.next = temp
                top = temp
    else:

Code (TBC)

This time we pop the insert stack to get the next higher insertion point as the tower grows. Only after the stack becomes empty will we need to return to creating new header nodes.

# 这段代码感觉有点奇怪...
tower = Stack()
current = self._head
while current:
    if current.next is None:
        tower.push(current)
        current = current.down
    else:
        if current.next.key > key:
            tower.push(current)
            current = current.down
        else:
            current = current.next

lowest_level = tower.pop()
temp = DataNode(key, value)
temp.next = lowest_level.next
lowest_level.next = temp
top = temp

7. Building the Map

之前提到 Map 的两个核心 function 就是 insert + search（也就是 put 和 get），我们已经用 Skip List 实现了这两个功能，现在我们可以来实现 Map 了。

class Map:
    def __init__(self):
        self.collection = SkipList()

    def put(self, key, value):
        self.collection.insert(key, value)

    def get(self, key):
        return self.collection.search(key)

8. Analysis of a Skip List

用 ordered linked list 来存 key-value pair 的时候，Search 的时间复杂度是 O(n)，那么 Skip List 的时间复杂度是否更低一些呢？

因为 Skip list 是个概率数据结构，所以在分析复杂度的时候我们也只是做一个大概的分析。

因为 flip = 1 的概率是 1/2，每个 node 的高度至少是 1，我们总共有 n 个 node，那么 node 的 tower 高度为 1 的可能性是 1，高度为 2 的可能性是 n/2，高度为 3 的可能性是 n/4，以此类推。

Tower 的最大高度 = $log2^n + 1$，用 big O 来表示就是 $O(logn)$.

Search method

我们在 search 的时候会向下走和向右走，worst case 就是走到底层 $O(logn)$才能找到 key，而每个 level 横向查找的 node 是一个常数（详细解释见英文），所以 search 的效率是$O(logn)$。

We drop down a level when one of two events occurs. Either we find a data node with a key that is greater than the key we are looking for or we find the end of a level.

If we are currently looking at some data node, the probability that one of those two events will happen in the next link is 1/2. This means that after looking at 2 links, we would expect to drop to the next lower level (we expect to get heads after two coin flips).

Insert method

因为 insert 和 search 的操作基本一样（需要先找到位置），所以它的时间复杂度也是$O(logn)$。

9. 结语

Maps (dictionaries) are associative memory organization structures.