Understanding How Databases Use Indexes
This article discusses B+ tree indexes, the most common type of index.
Beyond B+ Trees
An index is like a book’s table of contents, helping you find what you need. A well-designed index improves search efficiency, while a poorly designed one hinders it. Therefore, index design is crucial. The key to designing efficient indexes is understanding how databases use them to find data.
When I was learning about database indexes, every book and article started with B+ trees, making me want to give up before even starting. But in reality, most of the time we don’t need to understand them deeply.
Here’s a simple B+ tree:
+----+----+
| 10 | 20 |
+----+----+
/ | \
+---+ +----+----+ +----+
| 5 | | 15 | 18 | | 25 |
+---+ +----+----+ +----+
/ | \ / | \ / | \
1 6 9 11 16 19 21 26 30
It looks complex, but except for leaf nodes, other nodes exist only to quickly locate leaf nodes. If we focus only on leaf nodes and ignore the rest, it becomes a simple ordered doubly-linked list:
+---+---+---+----+----+----+----+----+----+
| 1 | 6 | 9 | 11 | 16 | 19 | 21 | 26 | 30 |
+---+---+---+----+----+----+----+----+----+
Just remember these three properties of B+ trees:
- Leaf nodes are ordered
- You can quickly locate leaf nodes
- Leaf nodes are connected, allowing fast forward/backward traversal
From now on, I’ll use linked lists to represent indexes. If you’re interested in how B+ trees maintain order, insert, or delete, feel free to explore on your own.
How Databases Use Indexes
Let’s assume the database has already chosen what it considers an appropriate index. The simplified process of using an index to find data is:
- Using the index data structure’s properties, quickly locate the first index node that might match the condition. This node’s corresponding row may not match the query condition, because the index columns may not include all columns needed by the query condition.
- Starting from the node located in step 1, traverse in one direction, using index node values for preliminary filtering. Because B+ tree leaf nodes are ordered and connected, one-directional traversal is fast. For nodes passing preliminary filtering, read the corresponding row data from disk.
- For rows from step 2, perform further filtering. This step isn’t always necessary—if the index columns include all columns needed by the query condition, no further filtering is needed, which is the ideal case.
Below are some common query scenarios and how databases use indexes to find data.
Quick Positioning
Querying by primary key id is the simplest case. When we query SELECT * FROM table WHERE id = 2, the database can quickly find the matching row through the index. Very efficient with no extra steps.
index on (id)
|
v
+------++-----+-----+-----+-----+-----+
| id || 1 | 2 | 3 | 4 | 5 |
+------++-----+-----+-----+-----+-----+
✅
One-directional Traversal
When we want to find the 3 youngest people aged 20 or above with SELECT * FROM table WHERE age >= 20 LIMIT 3, the database can quickly find the first matching node through the index, then traverse in one direction to find the first 3 matching rows.
index on (age)
|
v ------------>
+-------++------+------+------+------+------+
| age || 18 | 20 | 21 | 22 | 23 |
+-------++------+------+------+------+------+
✅ ✅ ✅
Multi-column Indexes (Composite Indexes)
Multi-column indexes aren’t much different from single-column indexes—they just contain multiple column values in leaf nodes, still ordered. Nodes in multi-column indexes are sorted similarly to strings: comparing column by column from left to right, using the first differing column’s comparison result. For example, with a multi-column index (a, b, c), node (1, 2, 3) is less than (1, 2, 4) and greater than (1, 1, 4). When we query SELECT * FROM table WHERE a = 1 AND b = 2 AND c = 3, like with single-column indexes, the database can quickly find matching nodes through the multi-column index.
index on (a, b, c)
|
v
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| b || 1 | 1 | 2 | 2 | 5 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 3 | 2 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
✅
These queries can fully utilize the (a, b, c) index:
SELECT * FROM table WHERE a = 1 AND b = 2 AND c = 3SELECT * FROM table WHERE a = 1 AND b = 2SELECT * FROM table WHERE a = 1
In other words, having an (a, b, c) index is equivalent to also having (a, b) and (a) indexes.
The rule for databases using multi-column indexes to quickly locate matching nodes is: following index column order, match from left to right—if a column doesn’t match or isn’t in the query condition, matching stops.
For example, when querying SELECT * FROM table WHERE a = 1 AND c = 3 using the (a, b, c) index, from left to right, first a = 1 matches, then since b isn’t in the query condition, matching stops. In this query, the (a, b, c) index is equivalent to an (a) index for locating the first matching node, because different b values might have rows matching c = 3. So the database must traverse all rows where a = 1 for filtering.
index on (a, b, c)
|
v ---------------->
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| b || 1 | 1 | 2 | 2 | 5 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 3 | 2 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
❌ ✅ ❌ ✅
Note However, during one-directional traversal filtering, the
(a, b, c)index is faster than the(a)index because the(a, b, c)index node containsc’s value, eliminating the need to readc’s value from disk. In-memory reads are much faster than random disk reads.
If we design a new multi-column index (a, c) for this query, we can find matching rows more efficiently.
index on (a, c)
|
v ---->
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 2 | 3 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
✅ ✅
Multi-column Index Order
The order of columns in multi-column indexes matters. For example, consider a grades table with columns id, class, score, and a multi-column index (score, class). If we query SELECT * FROM table WHERE class = 2 AND score >= 60, the (score, class) index seems helpful. But actually, its help is limited. After the database quickly locates the first matching node, it needs to traverse all nodes with scores above 60 to find all matching nodes. This may include many non-matching nodes.
index on (score, class)
|
v ------------------->
+---------++------+------+------+------+------+------+
| score || 45 | 52 | 60 | 64 | 75 | 95 |
+---------++------+------+------+------+------+------+
| class || 1 | 2 | 2 | 1 | 1 | 2 |
+---------++------+------+------+------+------+------+
✅ ❌ ❌ ✅
But with a (class, score) index, we can find all matching rows more efficiently. During one-directional traversal, there are no non-matching nodes.
index on (class, score)
|
v ----->
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅
Not only should you design indexes that the database will use, but you should also design indexes that can be used efficiently based on actual query needs.
Leveraging Index Order
Indexes are inherently ordered, so you can design indexes that don’t require additional sorting. For the grades table example above, using the (class, score) index for SELECT * FROM table WHERE class = 2 ORDER BY score ASC, nodes found through one-directional traversal are already ordered, so no additional sorting is needed. If you can’t leverage index order, even if you only need one row of data, the database must find all matching data first, then sort—very inefficient.
index on (class, score)
|
v ------------>
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅ ✅
Because index leaf nodes form a doubly-linked list, querying SELECT * FROM table WHERE class = 2 ORDER BY score DESC is also efficient.
index on (class, score)
|
<----------- v
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅ ✅
Summary
- When designing indexes, base them on actual query needs. Try to make query result index leaf nodes adjacent to reduce non-matching nodes during traversal.
- Include filter condition columns in the index as much as possible to reduce disk reads during filtering.
- If results need sorting, leverage index order to avoid additional sorting operations.
Future topics I plan to cover: how the query optimizer chooses indexes, how join/group by/subqueries use indexes, and common index pitfalls.
Comments