2023-04-07

Index in Database

What is an Index in Database

In the world of databases, data retrieval and query execution are critical operations. When dealing with massive datasets, searching through all the data to find the desired information can be time-consuming. This is where indexes come in. Just as the index in a book helps you quickly find specific information without reading the entire book, a database index helps the database engine to locate the desired data without scanning every row in a table.

Purpose of Indexes

Indexes serve several essential purposes in database management systems.

Speeding up Queries
One of the primary purposes of an index is to improve the speed of data retrieval operations. Without an index, the database management system would have to perform a full table scan for each query, which means it would have to go through every record in the table. Indexes store data in a way that allows the system to locate the required data much faster, often reducing the number of disk I/O operations drastically. This can make a significant difference in query performance, especially with large datasets.
Enforcing Uniqueness
Indexes can be used to ensure that no two rows in a table have the same value for specific columns. This is known as a unique index. For example, in a table of users, you might want to ensure that no two users have the same email address. By creating a unique index on the email column, the database will automatically reject any insert or update that would result in duplicate email addresses.
Facilitating Sorting and Grouping
Apart from speeding up data retrieval, indexes can also facilitate faster sorting and grouping of data. When data is indexed, it's often stored in a sorted structure. Consequently, the database can sometimes use this sorted data directly rather than having to sort the data again when executing a query that requires sorted output.
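
To make the first two purposes concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users table, its columns, and the index name idx_users_email are made up for illustration, and the exact wording of the query plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def query_plan(sql, params=()):
    """Return the human-readable steps SQLite plans to take for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

lookup = "SELECT name FROM users WHERE email = ?"
plan_before = query_plan(lookup, ("user500@example.com",))

# A unique index both speeds up lookups by email and rejects duplicates.
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, ("user500@example.com",))

print("before:", plan_before)  # a scan of the whole table
print("after: ", plan_after)   # a search that names idx_users_email

try:
    conn.execute(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        ("user500@example.com", "Duplicate"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print("duplicate rejected:", duplicate_rejected)
```

Before the index exists, the plan reports a scan of the table. Afterwards it reports a search through the index, and the duplicate insert fails with an IntegrityError.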

How Indexes Work

Indexes are implemented as data structures that store a subset of the data in a database. The most common form of an index contains a copy of the indexed column values for each row in a table, together with pointers to the location of the corresponding row in the data file. The index is structured in a way that allows it to be searched very efficiently.

At its core, an index consists of two components:

Keys
These are the values from the indexed columns in the table. They are typically stored in a sorted manner.
Pointers
These are references to the location of each key in the data file.

The combination of keys and pointers enables the database engine to use the index to quickly find the location of the data in the data file without having to scan the entire table.
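
The idea of keys plus pointers can be sketched in a few lines of Python. This is only a toy model: the rows list stands in for the data file, list positions stand in for row addresses, and the sample rows are invented for the example.

```python
import bisect

# The "data file": rows stored in arrival order, addressed by list position.
rows = [
    {"id": 42, "name": "Dave"},
    {"id": 7, "name": "Alice"},
    {"id": 19, "name": "Carol"},
    {"id": 3, "name": "Bob"},
]

# The index: (key, pointer) pairs kept sorted by key.
index = sorted((row["id"], position) for position, row in enumerate(rows))
keys = [key for key, _ in index]

def find_by_id(target):
    """Binary-search the sorted keys, then follow the pointer to the row."""
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        _, position = index[i]
        return rows[position]
    return None

print(find_by_id(19))  # {'id': 19, 'name': 'Carol'}
print(find_by_id(5))   # None
```

The lookup never walks the rows list itself. It searches the sorted keys and then jumps straight to the row through the stored position.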

Types of Indexes

I will focus on the three most common types of index: the B-tree index, the bitmap index, and the hash index.

B-tree Index

One of the most popular types of indexes used in databases is the B-tree index. The name is commonly read as "balanced tree", although its inventors never said what the B stands for. A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in logarithmic time.

Structure of B-tree Index

The B-tree index organizes keys in a hierarchical and balanced structure known as a B-tree. The tree consists of a series of nodes, with the topmost node being called the root. Each node in the tree contains a certain number of keys and child pointers, sorted in a specific order.

Nodes are categorized as either internal nodes or leaf nodes. Internal nodes have keys and child pointers, while leaf nodes contain the keys and pointers to the actual data records. (Keeping record pointers only in the leaves is, strictly speaking, the B+ tree variant that many database engines use.) The B-tree is designed so that the leaf nodes are always at the same depth, ensuring balance and efficient access.

Advantages of B-tree Index

Efficient in handling both range queries and equality queries.
Supports ordering, which is beneficial for sorting and retrieval in a particular order.
Automatically rebalances itself as entries are added or removed.
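
SQLite stores its ordinary indexes as B-trees, so it makes a convenient way to observe the first two advantages. The sketch below is only an illustration: the orders table, its data, and the index name idx_orders_amount are assumptions, and the plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [((i * 37) % 1000,) for i in range(1000)],
)
conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")

def query_plan(sql):
    """Join the plan steps SQLite reports for a query into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# A range query can walk a contiguous slice of the sorted index.
range_plan = query_plan(
    "SELECT amount FROM orders WHERE amount BETWEEN 100 AND 200"
)

# Sorted output can be read straight from the index, with no separate sort.
order_plan = query_plan("SELECT amount FROM orders ORDER BY amount")

print(range_plan)
print(order_plan)
```

Both plans should mention idx_orders_amount, and the ORDER BY plan should not need a temporary structure for sorting.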

Bitmap Index

Bitmap Indexes are used when a column has a low cardinality, which means that the column has a very limited number of distinct values.

Structure of Bitmap Index

In a bitmap index, each distinct value in the column has a corresponding bitmap (an array of bits). Each bit in the bitmap represents a single row in the table. The bit is set to 1 if the column value for that row matches the associated value, and 0 if it does not.

Advantages of Bitmap Index

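Bitmap indexes are very space-efficient for low-cardinality columns, because each row costs only one bit per distinct value and the bitmaps compress well.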
They can be used efficiently in queries that have multiple conditions (AND, OR operations).

Hash Index

Hash indexes suit queries that look for an exact match on the indexed key.

Structure of Hash Index

In Hash Indexes, a hash function is used to map keys (index key values) to addresses (location of data records in the data file). The output of the hash function, known as the hash value, indicates where the data can be found. This structure is known as a hash table, and it contains an array of buckets.

Advantages of Hash Index

Extremely efficient for equality searches where you are looking for an exact match.
Can be faster than B-tree indexes for certain lookup patterns, especially when the key distribution is uniform.

Examples

Let's illustrate the B-tree, Bitmap, and Hash indexes with simple examples. Imagine we have a simple table named Employees with the following data:

EmployeeID	Name	Department
1	Alice	HR
2	Bob	Sales
3	Carol	HR
4	Dave	IT
5	Eve	Sales
6	Frank	IT

B-tree Index Example

If we create a B-tree index on the EmployeeID column, the index structure might look something like this:

    [3]
   /    \
 [1,2]   [4,5,6]

Here, the numbers represent Employee IDs. The B-tree index organizes these IDs in a tree-like structure which makes searching for a specific ID efficient. For instance, if you are looking for an EmployeeID of 5, the database would first compare 5 to 3, determine that it's larger, and then look at the right child node [4,5,6] to find the record.
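
The search just described can be written as a short Python sketch. It hard-codes the three-node tree from the diagram and implements lookup only; the Node class and search function are names invented for this example, and insertion and rebalancing are left out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    keys: List[int]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf

def search(node, target, path=None):
    """Return (found, path), where path lists the keys of each node visited."""
    path = (path or []) + [node.keys]
    # Skip past every key that is smaller than the target.
    i = 0
    while i < len(node.keys) and target > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == target:
        return True, path
    if not node.children:  # reached a leaf without finding the key
        return False, path
    # Descend into the child that covers the target's key range.
    return search(node.children[i], target, path)

root = Node(keys=[3], children=[Node(keys=[1, 2]), Node(keys=[4, 5, 6])])

print(search(root, 5))  # (True, [[3], [4, 5, 6]])
print(search(root, 3))  # (True, [[3]])
print(search(root, 7))  # (False, [[3], [4, 5, 6]])
```

Each lookup touches one node per level, which is why the cost grows with the height of the tree rather than with the number of rows.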

Bitmap Index Example

If we create a Bitmap index on the Department column, the index might look something like this:

Department: HR    -> 1 0 1 0 0 0
            Sales -> 0 1 0 0 1 0
            IT    -> 0 0 0 1 0 1

Each bit corresponds to a row in the table. For example, in the HR bitmap, the first and third bits are set to 1, indicating that the first and third employees are in the HR department. Bitmap indexes are efficient when the cardinality is low.
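
As a rough sketch, the same bitmaps can be built in Python from the Department column of the Employees table. The helper names are invented for this example, and a real engine would store compressed bit vectors rather than Python lists.

```python
# Department values for rows 1 to 6 of the Employees table.
departments = ["HR", "Sales", "HR", "IT", "Sales", "IT"]

def build_bitmaps(values):
    """Map each distinct value to a list of bits, one bit per row."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[row_number] = 1
    return bitmaps

def bitmap_or(a, b):
    """Combine two bitmaps: a row matches if either bit is set."""
    return [x | y for x, y in zip(a, b)]

def matching_rows(bitmap):
    """Translate set bits back into 1-based row numbers."""
    return [i + 1 for i, bit in enumerate(bitmap) if bit]

bitmaps = build_bitmaps(departments)
hr_or_it = bitmap_or(bitmaps["HR"], bitmaps["IT"])

print(bitmaps["HR"])            # [1, 0, 1, 0, 0, 0]
print(hr_or_it)                 # [1, 0, 1, 1, 0, 1]
print(matching_rows(hr_or_it))  # [1, 3, 4, 6]
```

A condition such as Department = 'HR' OR Department = 'IT' becomes a bitwise OR of two bitmaps, which is why bitmap indexes handle combined conditions cheaply.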

Hash Index Example

If we create a Hash index on the EmployeeID column, it might look like this:

Hashed Value of EmployeeID: 1 -> [Address of Record with EmployeeID 1]
                            2 -> [Address of Record with EmployeeID 2]
                            3 -> [Address of Record with EmployeeID 3]
                            4 -> [Address of Record with EmployeeID 4]
                            5 -> [Address of Record with EmployeeID 5]
                            6 -> [Address of Record with EmployeeID 6]

The hash index uses a hash function to directly map the Employee IDs to the address or location of the records in the data file. This allows for very fast access in case of equality searches but is not useful for range queries.
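
A toy version of this structure in Python might look like the following. The bucket count of 4 and the modulo hash function are assumptions chosen to keep the example small, and list positions again stand in for record addresses.

```python
NUM_BUCKETS = 4  # assumed size; real systems choose and resize this themselves

employees = [
    (1, "Alice", "HR"),
    (2, "Bob", "Sales"),
    (3, "Carol", "HR"),
    (4, "Dave", "IT"),
    (5, "Eve", "Sales"),
    (6, "Frank", "IT"),
]

def hash_key(employee_id):
    """A deliberately simple hash function: the remainder modulo the bucket count."""
    return employee_id % NUM_BUCKETS

# Each bucket holds (key, pointer) pairs whose keys hash to that bucket.
buckets = [[] for _ in range(NUM_BUCKETS)]
for position, (employee_id, _, _) in enumerate(employees):
    buckets[hash_key(employee_id)].append((employee_id, position))

def find(employee_id):
    """Hash the key, then check only the entries in that one bucket."""
    for key, position in buckets[hash_key(employee_id)]:
        if key == employee_id:
            return employees[position]
    return None

print(buckets[1])  # [(1, 0), (5, 4)]
print(find(5))     # (5, 'Eve', 'Sales')
print(find(9))     # None
```

IDs 1 and 5 land in the same bucket even though they are not neighbours in key order. Because the buckets carry no ordering, a query such as EmployeeID between 2 and 4 cannot be answered from a single bucket.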

Choosing the Right Type of Index

When managing a database, it’s critical to choose the right type of index to optimize performance. The choice of an index can significantly affect how quickly and efficiently queries are executed. Here I will introduce the factors that influence the choice of index and how to compare B-tree, Bitmap, and Hash indexes to make an informed decision.

Query Patterns
Understanding the types of queries that will be run against the database is crucial. For instance, if your application mostly performs exact match lookups, a Hash index could be ideal. However, if your application frequently runs range queries, a B-tree index is more suitable.
Cardinality
Bitmap indexes are particularly useful for columns with low cardinality, whereas B-tree indexes are more efficient for high-cardinality data.
Read and Write Ratio
Consider the ratio of reads to writes on your database. If your database is read-heavy, optimizing for faster reads with additional indexes may be beneficial. However, if your application involves frequent write operations (inserts, updates, and deletes), be cautious with indexing, because every index must be updated on each write and this slows write performance.
Disk Space
Indexes consume disk space. It’s important to consider the amount of disk space that is available and how much an index will consume.
Maintenance
Indexes need maintenance, especially in write-heavy environments. B-tree indexes, for example, may need to be rebuilt or reorganized periodically to maintain performance.

Disadvantages of Indexes

While indexes are powerful tools for optimizing database performance, they are not without drawbacks. Understanding these drawbacks is essential for effective database management.

Increased Disk Space Usage
Every index created consumes disk space. Depending on the size of the database and the number of indexes, this can quickly add up, requiring significant amounts of storage.
Slower Write Performance
Indexes can slow down the performance of insert, update, and delete operations. Each time data is modified, corresponding indexes also need to be updated. This additional work can cause write operations to take longer.
Maintenance Overhead
Indexes require maintenance. As data changes, indexes can become fragmented or bloated, requiring periodic rebuilding or reorganizing to maintain performance. This maintenance can be resource-intensive and needs to be planned carefully to avoid impacting production workloads.
Complexity
Using multiple indexes or complex index structures can introduce additional complexity to the database design. This complexity can make it harder to predict how queries will perform and can complicate troubleshooting performance issues.
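
The disk space cost is easy to observe. The sketch below uses Python's sqlite3 module and SQLite's page_count pragma to compare the database size before and after adding an index. The events table, its contents, and the index name are assumptions for the example, and the actual page counts depend on the SQLite build and page size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"payload-{i:06d}",) for i in range(10000)],
)
conn.commit()

def page_count():
    """Number of pages the database currently occupies."""
    return conn.execute("PRAGMA page_count").fetchone()[0]

pages_before = page_count()

# The index stores its own copy of every payload value plus a row reference.
conn.execute("CREATE INDEX idx_events_payload ON events (payload)")
conn.commit()

pages_after = page_count()

print("pages before index:", pages_before)
print("pages after index: ", pages_after)
```

The second number is larger because the index is a separate structure holding a copy of the indexed values. The same copy also has to be updated on every insert, update, and delete, which is where the write overhead comes from.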

While indexes are indispensable for optimizing query performance, they should be used judiciously.

Ryusei Kakujo
