Access Path Selection in Relation Database Management System

本文是介绍数据库查询计划的鼻祖论文，基于 IBM 早起的 System R 试验性关系数据库，在现在看来，应该是关系数据库的“旧石器时代”。

1. Introduction

4 phases:

Relation: Set of Tuple
Page(4K): Set of Tuple( each tuple tagged with relation-id, so can contain different relation's tuple)
- no tuple spans a page
Segment: Set of Page
Index: B-trees
- clustered index
SCAN: open, next, close
- Segment scan
- Index scan
- scan filter

COST = Page_Fetches + W * RSI_Calls

2-way join
- selection 1: outer(probe side) and inner(build side)
- selection 2: join method
  - nested loops
  - merging scan: 两个表在 join column 上已排序
  - 这篇文章没有提到 hash-join，这是现在主流数据库 join 的主要方式。
N-way join is a recur 2-way join
- paths: 最多 N! 种可能性
optimization
- 只考虑：inner side 与 outer side 有 join 连接的情况。
- 构建一颗 access path tree，对每一条path，评估其 cost
- N: C-outer的基数： N = (product of the cardinalities of all relations T of the join so far) * (product of the selectivity factors of al 1 applicable predicates)
- C-nested-loop-join(path1, path2) = C-outer(path1) + N * C-inner(path2)
- C-merge(path1, path2) = C-outer(path1) + N * C-inner(path2) -- 不太理解这个 cost 函数，merge 时，C-inner(path2) 会显著小于 nested loop。
思考：对最新的关系数据库而言，优化应考虑：
- filter 尽量下压到 scan 阶段
- 如果不能下压到 scan 阶段，则尽量提前到 join 之前。
- 使用 hash-join 时，尽可能选择技术小的作为 build side，技术大的作为 input-side
- 如果左右两侧已按 join column 排序，则可选择 merge-sort join