Binary Search Tree（二叉查找树）

2021-03-09

0. 前言

Collections 可以用来实现 Associative Array（即 Map）这个 ADT，从 collection 里面获取 key-value pair 有两种方式：

binary search on a list and hash tables
binary search trees（BST）

这篇文章主要关注如何用 BST 来做高效率的查找。

1. Search Tree Operations

Map()
put(key, val) - replace if key exist
get(key) - return value or None
del
size()
in - returns True or False for key existence

2. Search Tree Implementation

BST property: 小于 parent 的 key 在左边，大于 parent 的 key 在右边;

需要两个 class，一个是 BinarySearchTree，一个是 TreeNode.

Insertion

加入新 node 的时候，假如 Tree 是空的，就把这个 node 作为 root。假如 Tree 已经有 root 了，就要进行以下的操作（_put function 的逻辑）：

从 root 开始，把 new key 和 current node key 进行对比，假如 new key < current key, 继续查找左边的 subtree，假如 new key > current key，继续查找右边的 subtree，假如 new key = current key，用新的 value 去替换 current key 的 value。
当没有左/右 child 可以 search 时，我们就发现了新 node 可以被放置的节点；
加入新节点 = 创建一个新的 TreeNode object 并且把这个 object 插入到它所应该在的位置；

Retrival (search)

只需要去 recursively 搜索树，找到一个 non-matching leaf node 或者 matching key 就可以了。

_get 这个 helper method 可以被用于 in operator, check existence.

if "Northfield" in my_zip_tree:
    print("oom ya ya")

Deletion

首先要找到想要删除的 Node，假如 tree 有 > 1 个 Node，我们就要用 _get 来找 Node；假如 tree 只有一个 node 且和想要删除的 Node 一致，就 remove root。

假如找不到 Node，就报错。

当我们找到了 Node，可能会有以下三种情况：

Node 没有 children；
Node 只有一个 child；
Node 有两个 child；

Node 没有 children

直接删掉 node，并且删掉 reference。

delete the node and remove the reference to this node in the parent

Node 有一个 child

我们需要把这个 child 放到被删的 parent 的位置。

child - current - parent of current

child 指向 parent of current; parent of current 指向 child;

如果 current node 没有 parent，那么说明它是 root node，这种情况我们用 replace_value 来进行 root node 的更新。

Node 有两个 child

我们需要找到一个 node 来代替这个被删掉的 node，这个继承者就被称为 successor。

successor 需要满足什么条件呢？最重要的一点就是：successor 换上去之后仍然满足二叉平衡树的定义（左 < 右）

假如被删 node 有 right child，那么 successor 就是右 subtree 里面最小的那个 key；
假如被删 node 没有 right child，并且是 parent node 的 left child，那么 parent 就是 successor；
假如被删 node 是 parent node 的 right child 并且它本身没有 right child，那么它的 successor 就是它 parent 的 successor（不包括这个 node）

只有第一条与我们在二叉查找树里面删 node 有关（其他两条我也没看懂 = =）。

找最小值

左小右大，一直向左就能找到最小的，当找到一个没有 left child 的 node 时，就找到了最小的。

Iterate（遍历）

yield 类似 return，会给 caller return value，但是还多了一步，就是 freezing the state of the function，这样做的好处是，下次 call 这个 function 的时候，它会从前面那个被冻结的点开始执行。

Functions that create objects that can be iterated are called generator functions.


class BinarySearchTree:
    def __init__(self):
        self.root = None
        self.size = 0

    def __len__(self):
        return self.size

    def size(self):
        return self.size

    def __iter__(self):
        return self.root.__iter__()

    def put(self, key, value):
        if self.root:
            self._put(key, value, self.root)
        else:
            self.root = TreeNode(key, value)
        self.size = self.size + 1

    # recursive
    # when a new child is inserted into the tree, the current_node is passed to the new tree as the parent.
    # Todo: handle the insertion of a duplicate key is for the value associated with the new key to replace the old value.
    # check if key exist
    def _put(self, key, value, current_node):
        if key < current_node.key:
            if current_node.left_child:
                self._put(key, value, current_node.left_child)
            else:
                current_node.left_child = TreeNode(key, value, parent=current_node)
        else:
            if current_node.right_child:
                self._put(key, value, current_node.right_child)
            else:
                current_node.right_child = TreeNode(key, value, parent=current_node)
    
    # overload [] operator
    # my_zip_tree['Plymouth'] = 55446
    def __setitem__(self, key, value):
        self.put(key, value)

    def get(self, key):
        if self.root:
            result = self._get(key, self.root)
            if result:
                return result.value
        return None

    # return TreeNode, can be used for other retrieval methods
    def _get(self, key, current_node):
        if not current_node:
            return None
        if current_node.key == key:
            return current_node
        elif key < current_node.key:
            return self._get(key, current_node.left_child)
        else:
            return self._get(key, current_node.right_child)

    # can use this syntax for getting item
    # z = my_zip_tree["Fargo"]
    def __getitem__(self, key):
        return self.get(key)
    
    # implement the `in` operator
    def __contains__(self, key):
        return bool(self._get(key, self.root))

    def delete(self, key):
        if self.size > 1:
            node_to_remove = self._get(key, self.root)
            if node_to_remove:
                self._delete(node_to_remove)
                self.size = self.size - 1
            else:
                raise KeyError("Error, key not in tree")
        elif self.size == 1 and self.root.key == key:
            self.root = None
            self.size = self.size - 1
        else:
            raise KeyError("Error, key not in tree")

    def __delitem__(self, key):
        self.delete(key)

    def find_successor(self):
        successor = None
        if self.right_child:
            successor = self.right_child.find_min()
        else:
            if self.parent:
                if self.is_left_child():
                    successor = self.parent
                else:
                    self.parent.right_child = None
                    successor = self.parent.find_successor()
                    self.parent.right_child = self
        return successor

    def find_min(self):
        current = self
        while current.left_child:
            current = current.left_child
        return current

    def splice_out(self):
        if self.is_leaf():
            if self.is_left_child():
                self.parent.left_child = None
            else:
                self.parent.right_child = None
        elif self.has_a_child():
            if self.left_child:
                if self.is_left_child():
                    self.parent.left_child = self.left_child
                else:
                    self.parent.right_child = self.left_child
                self.left_child.parent = self.parent
            else:
                if self.is_left_child():
                    self.parent.left_child = self.right_child
                else:
                    self.parent.right_child = self.right_child
                self.right_child.parent = self.parent


# has helper functions help to classify a node according to its own position as a child, (left or right) and the kind of children the node has.
# also keep track of the parent as an attribute of each node, important for del operator
class TreeNode:
    def __init__(self, key, value, left=None, right=None, parent=None):
        self.key = key
        self.value = value
        self.left_child = left
        self.right_child = right
        self.parent = parent

    def is_left_child(self):
        return self.parent and self.parent.left_child is self

    def is_right_child(self):
        return self.parent and self.parent.right_child is self

    def is_root(self):
        return not self.parent

    def is_leaf(self):
        return not (self.right_child or self.left_child)

    def has_a_child(self):
        return self.right_child or self.left_child

    def has_children(self):
        return self.right_child and self.left_child

    def replace_value(self, key, value, left, right):
        self.key = key
        self.value = value
        self.left_child = left
        self.right_child = right
        if self.left_child:
            self.left_child.parent = self
        if self.right_child:
            self.right_child.parent = self

    # overrides the `for x in` operation for iteration, so it's recursive
    # Because it is recursive over TreeNode instances the __iter__ method is defined in the TreeNode class.
    def __iter__(self):
        if self:
            if self.left_child:
                for elem in self.left_child:
                    yield elem
            yield self.key
            if self.right_child:
                for elem in self.right_child:
                    yield elem

3. Search Tree Analysis

put: performance 和树的高度相关，因为我们在树的每个 level 都要做一次比较。

The height is the limiting factor because when we are searching for the appropriate place to insert a node into the tree, we will need to do at most one comparison at each level of the tree.

其他的 operation, get, in, del，也是跟树的高度相关的，比如 get 就是遍历树的每一层，worst case 是到最后还没找到。

del 的 worst case 就是 delete + 找 successor（树的高度），也就是乘以常数 2。

树的高度是多少呢？

这取决于 insert key 的顺序，假如是乱序加入树，那么高度大约是 Log2 n，n = number of nodes.

因为是乱序，所以大约有一半 key < node，一半 key > node。

二叉树的 total number of node = 2^(h+1) - 1，h 代表树的高度。

理想情况是左右平衡，在平衡二叉树里面，put 的 worst-case 是 O(log2 n)，也就是O(树的高度)。

不平衡的情况

假如所有 node 都在一边就不平衡了，可能 key 之前已经排序过了，树的高度 = node 数量，put 的时间复杂度 = O(n)

结语

本文讲了二叉查找树的特征，operation 和时间复杂度。