Storage and Retrieval (3) - Column-based storage engine

Preface

This is the third article of a series to summarize the key concepts of Chapter 3. Storage and Retrieval in the Designing Data Intensive Application book. The series consists of 3 articles, including log-structure storage engine (SSTables and LSM Tree), page-oriented storage engine (B Tree) and column based storage engine.

Landscape of database storages

In short, database storages are divided into two categories, row-based and column based. In this article we will focus on the core concept of column-based storage engine, and compare the pros and cons between row-based storage engine.

Column-based storage engine

Before introducing column-based storage engine, we need to know 2 terms first:

OLTP v.s. OLAP

	OLTP (OnLine Transactional Processing)	OLAP (OnLine Analytical Processing)
Primarily used by	End user/customer, via web application	Internal analyst, for decision support performance reporting
Main read pattern	Small number of records per query, fetched by key	Aggregate over large number of records
Main write pattern	Random-access, low-latency writes from user input	Bulk import (ETL) or event stream
What data represents	Latest state of data (current point in time)	History of events that happened over time
Dataset size	Gigabytes to terabytes	Terabytes to petabytes

Since the analytical requirements in large companies arise, and this two different scenarios actually have quite different access pattern to database, new database was evolved to enhance its performance. And one of the most important one is column-based storage engine.

Star Schema

While business analysts or data analysts are analyzing business data, it’s very common that there is a fact table (e.g. orders, transactions, events), and several metadata tables that fact tables will reference to them (e.g. product, customer table). The shape of schema and their foreign keys will be like a “star”. So we call it start schema.

The fact table is usually very very large (> trillion data for large company), this makes analytical operation difficult for traditional row-based database.

Column-based storage engine

The core concept of column-based storage engine is quite intuitive. We store data by column instead of by row. Imaging that we have millions of customer data, if we store data by row, then while doing aggregation (e.g. summation, average), it will be very time consuming because we may need to load each rows (Unless we have index).

On the other hand, if we store data by column, then calculating all customers’ average age will only require to load the file for age column. It will tremendously enhance the performance.

The benefits

reduce query latency: when we want to access few columns but large portions of records, we don’t need to load every record from disk into memory and select what we want, this can extremely increase speed
column compression → less disk space and faster execution: some of column have limited outcomes, we can compress it by encoding like bitmap / running length. Then we require less disk space, and have faster execution because data can be fitted in L1 cache & AND/OR operation

The drawbacks

worse write performance: insert one record need to update every column, so it’s better to do this in batch (or use memory for temporally storing data, then writing to DB in batch)

Real world DB

DBs: Clickhouse / Vertica

Comparison matrix of Row-based & Column based storage engine

	Pros & Cons	Real world DB
Row-based	- Better write performance - Easier to handle transactional operation	Cassandra / LevelDB / HBase / Lucence MySQL / PostgreSQL / MongoDB
Column-based	- Largely better aggregation performance of few columns but on large portion of data Worse - write performance, usually we will write data in batch	Clickhoise / Vertica

Summary

If also read the previous 2 articles about storage and retrieval, then congratulation! Now we should be able to have some thoughts in our mind to determine what databases to use for our application.

Let’s recall the landscape of database storage engines. Now, how will you determine the selection of databases in your application ?

Mine is pretty simple, for analytical requirements like internal analysis or performance reporting, we should firstly consider column-based storage engine. Then, for write intensive application, we can consider using log-structured storage engine, and on the other hands we can consider page-oriented storage engine if number of reads are usually larger than writes.

Yes, this is just the first step of how to design data intensive application. But everything comes from the first step. Let’s proceed to the future chapters of this book.

Published Dec 19, 2021

Be a human-centric engineer making world class productWei-Ming on LinkedIn