What is Column-oriented Databases
Column-oriented databases, as the name implies, store data by columns rather than by rows. Unlike traditional relational databases such as Oracle, PostgreSQL, and MySQL that handle data by rows, these databases focus on handling data by columns.
Data Handling in Column-oriented Databases
Row-oriented databases store and manipulate data in rows. This approach to data storage treats each row as a record, handling all the data within the row as a single unit.
Consider the following sales data from an bookstore:
Date | Store | Employee | Book Category | Quantity Sold | Total Sales |
---|---|---|---|---|---|
2023/6/1 | NYC | John | Fiction | 10 | $200 |
2023/6/1 | NYC | Sarah | Non-fiction | 5 | $150 |
2023/6/1 | NYC | Paul | Fiction | 8 | $80 |
2023/6/2 | NYC | John | Fiction | 7 | $140 |
2023/6/2 | NYC | John | Non-fiction | 4 | $80 |
In a row-oriented database, each sale is treated as a single row, with all related data elements (Date
, Store
, Employee
, Book Category
, Quantity Sold
, Total Sales
) grouped together.
Column-oriented databases, on the other hand, handle data by columns. For each data element, such as Date
or Total Sales
, the database stores all instances of that element in a single column.
If we consider the same sales data from the bookstore, the data handling in a column-oriented database would look something like this:
Date | Store | Employee | Book Category | Quantity Sold | Total Sales |
---|---|---|---|---|---|
2023/6/1 * 3 | NYC * 5 | John * 3 | Fiction * 3 | 10 * 1 | $200 * 1 |
2023/6/2 * 2 | Sarah * 1 | Non-fiction * 2 | 5 * 1 | $150 * 1 | |
Paul * 1 | 8 * 1 | $80 * 1 | |||
7 * 1 | $140 * 1 | ||||
4 * 1 | $80 * 1 |
For example, if the task is to find the total sales for the NYC store, a row-oriented database would need to read the entire data row by row, checking the Store
field in each row and then adding up the Total Sales
field for each NYC store entry. This process might involve reading unnecessary data if all we need is the Store
and Total Sales
data.
In contrast, a column-oriented database would read the Store
column and Total Sales
column only, reducing the amount of data read from the disk, thereby potentially speeding up the operation.
Disadvantages of Column-oriented Databases
While column-oriented databases offer many advantages, especially for data analysis tasks, they also have their drawbacks. One of the key disadvantages is their lack of optimization for managing online transactions.
Column-oriented databases face substantial overhead and inefficiencies when dealing with new data additions, updates, and deletions, making them less suited for transaction processing.
Suppose you want to add a new row of data to the table. In a row-oriented database, this is a straightforward task—you simply append the new row to the table. However, in a column-oriented database, the process is more complex. The compressed data must first be decompressed. Then the values for each column need to be loaded and added to the appropriate place. Finally, the column must be compressed again. This process not only adds computational overhead and complexity, but it can also lead to inefficiencies when many small changes to the database need to be made frequently, such as in a transaction processing scenario.