2023-04-08

Column-Oriented Database

What is Column-oriented Databases

Column-oriented databases, as the name implies, store data by columns rather than by rows. Unlike traditional relational databases such as Oracle, PostgreSQL, and MySQL that handle data by rows, these databases focus on handling data by columns.

Data Handling in Column-oriented Databases

Row-oriented databases store and manipulate data in rows. This approach to data storage treats each row as a record, handling all the data within the row as a single unit.

Consider the following sales data from an bookstore:

Date Store Employee Book Category Quantity Sold Total Sales
2023/6/1 NYC John Fiction 10 $200
2023/6/1 NYC Sarah Non-fiction 5 $150
2023/6/1 NYC Paul Fiction 8 $80
2023/6/2 NYC John Fiction 7 $140
2023/6/2 NYC John Non-fiction 4 $80

In a row-oriented database, each sale is treated as a single row, with all related data elements (Date, Store, Employee, Book Category, Quantity Sold, Total Sales) grouped together.

Column-oriented databases, on the other hand, handle data by columns. For each data element, such as Date or Total Sales, the database stores all instances of that element in a single column.

If we consider the same sales data from the bookstore, the data handling in a column-oriented database would look something like this:

Date Store Employee Book Category Quantity Sold Total Sales
2023/6/1 * 3 NYC * 5 John * 3 Fiction * 3 10 * 1 $200 * 1
2023/6/2 * 2 Sarah * 1 Non-fiction * 2 5 * 1 $150 * 1
Paul * 1 8 * 1 $80 * 1
7 * 1 $140 * 1
4 * 1 $80 * 1

For example, if the task is to find the total sales for the NYC store, a row-oriented database would need to read the entire data row by row, checking the Store field in each row and then adding up the Total Sales field for each NYC store entry. This process might involve reading unnecessary data if all we need is the Store and Total Sales data.

In contrast, a column-oriented database would read the Store column and Total Sales column only, reducing the amount of data read from the disk, thereby potentially speeding up the operation.

Disadvantages of Column-oriented Databases

While column-oriented databases offer many advantages, especially for data analysis tasks, they also have their drawbacks. One of the key disadvantages is their lack of optimization for managing online transactions.

Column-oriented databases face substantial overhead and inefficiencies when dealing with new data additions, updates, and deletions, making them less suited for transaction processing.

Suppose you want to add a new row of data to the table. In a row-oriented database, this is a straightforward task—you simply append the new row to the table. However, in a column-oriented database, the process is more complex. The compressed data must first be decompressed. Then the values for each column need to be loaded and added to the appropriate place. Finally, the column must be compressed again. This process not only adds computational overhead and complexity, but it can also lead to inefficiencies when many small changes to the database need to be made frequently, such as in a transaction processing scenario.

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!