This is the third article of a series to summarize the key concepts of Chapter 3. Storage and Retrieval in the Designing Data Intensive Application book. The series consists of 3 articles, including log-structure storage engine (SSTables and LSM Tree), page-oriented storage engine (B Tree) and column based storage engine.
Landscape of database storages
In short, database storages are divided into two categories, row-based and column based. In this article we will focus on the core concept of column-based storage engine, and compare the pros and cons between row-based storage engine.
Column-based storage engine
Before introducing column-based storage engine, we need to know 2 terms first:
OLTP v.s. OLAP
|OLTP (OnLine Transactional Processing)||OLAP (OnLine Analytical Processing)|
|Primarily used by||End user/customer, via web application||Internal analyst, for decision support performance reporting|
|Main read pattern||Small number of records per query, fetched by key||Aggregate over large number of records|
|Main write pattern||Random-access, low-latency writes from user input||Bulk import (ETL) or event stream|
|What data represents||Latest state of data (current point in time)||History of events that happened over time|
|Dataset size||Gigabytes to terabytes||Terabytes to petabytes|
Since the analytical requirements in large companies arise, and this two different scenarios actually have quite different access pattern to database, new database was evolved to enhance its performance. And one of the most important one is column-based storage engine.
While business analysts or data analysts are analyzing business data, it’s very common that there is a fact table (e.g. orders, transactions, events), and several metadata tables that fact tables will reference to them (e.g. product, customer table). The shape of schema and their foreign keys will be like a “star”. So we call it start schema.
The fact table is usually very very large (> trillion data for large company), this makes analytical operation difficult for traditional row-based database.
Column-based storage engine
The core concept of column-based storage engine is quite intuitive. We store data by column instead of by row. Imaging that we have millions of customer data, if we store data by row, then while doing aggregation (e.g. summation, average), it will be very time consuming because we may need to load each rows (Unless we have index).
On the other hand, if we store data by column, then calculating all customers’ average age will only require to load the file for age column. It will tremendously enhance the performance.
- reduce query latency: when we want to access few columns but large portions of records, we don’t need to load every record from disk into memory and select what we want, this can extremely increase speed
- column compression → less disk space and faster execution: some of column have limited outcomes, we can compress it by encoding like bitmap / running length. Then we require less disk space, and have faster execution because data can be fitted in L1 cache & AND/OR operation
- worse write performance: insert one record need to update every column, so it’s better to do this in batch (or use memory for temporally storing data, then writing to DB in batch)
Real world DB
- DBs: Clickhouse / Vertica
Comparison matrix of Row-based & Column based storage engine
|Pros & Cons||Real world DB|
|Row-based||- Better write performance - Easier to handle transactional operation||Cassandra / LevelDB / HBase / Lucence MySQL / PostgreSQL / MongoDB|
|Column-based||- Largely better aggregation performance of few columns but on large portion of data Worse - write performance, usually we will write data in batch||Clickhoise / Vertica|
If also read the previous 2 articles about storage and retrieval, then congratulation! Now we should be able to have some thoughts in our mind to determine what databases to use for our application.
Let’s recall the landscape of database storage engines. Now, how will you determine the selection of databases in your application ?
Mine is pretty simple, for analytical requirements like internal analysis or performance reporting, we should firstly consider column-based storage engine. Then, for write intensive application, we can consider using log-structured storage engine, and on the other hands we can consider page-oriented storage engine if number of reads are usually larger than writes.
Yes, this is just the first step of how to design data intensive application. But everything comes from the first step. Let’s proceed to the future chapters of this book.