Modern Data Engineering Concepts: Multidimensional Cubes

Modern Data Engineering Concepts: Multidimensional Cubes

2021, Dec 23    

EDITORS NOTE, I FOUND THIS BLOGPOST HALF WRITTEN FROM 2021 IN 2026 AND DECIDED TO PUBLISH IT ANYWAYS.

OLAP Cubes / Multidimensional Cubes

What the heck are they and why are they useful? This blog post is inspired by the fact that it is really difficult to find articles that intuitively describe this concept.

History of OLAP

OLAP stands for “On Line Analytical Processing”. The term arose during the periods where engineers would do Business Intelligence (abreviated BI) for the use of user tracking to gain intuition on what users were doing on a platform. Back in the 1990s, we relied heavily on relational databases. Everything was a model and we cared a lot about individual components being logically correct all the time. If someone wanted to drill down to the specifics of a customer, they would be able to string everything together through the tight relationships that the keys provided. The flipside of this is that doing aggregate analysis is incredibly expensive. Back in the 1990s 512MB of RAM was considered substantial. Aggregate queries were expensive and even sometimes unfeasible. Aggregate queries could take hours or even days. This is where the OG data engineers came into play. They would try to create generalized summarized views on top of the tightly relational databases. In order to reduce future work they would develop batch process that routinely dumped the current database state into a queriable format. The format they chose was to aggregate as much information as they could into a single view that could be held in memory. If the business need was to get a generalized view of customers and the things they bought, a relationship model might look something like this:

users

id country age
1 USA 23
2 Canada 56
3 USA 59

sessions

time session_id user_id duration is_mobile
2020-01-01 S1 1 100 true
2020-01-02 S2 2 1500 false
2020-01-02 S3 2 500 true
2020-01-03 S4 3 5900 false

item_sales

session_id user_id item_id quantity
S1 1 10 4
S2 2 30 9
S2 2 10 9
S4 3 20 8
S4 3 10 4

Building a summarized view or table to analyze general behavior

A summarized table or view might look something like so:

user_behavior

user_id country age mainly_mobile_app session_duration_avg item_sales_sum
1 USA 23 true 100 4
2 Canada 56 false 1000 18
3 USA 59 false 5900 12

The query to generate this table using SQL would look something like:

SELECT 
     users.id as user_id,
     users.country as country,
     users.age as age,
     desktop_duration.duration < mobile_duration.duration as mainly_mobile_app,
     session_duration_avg,
     sales.total_sold as item_sales_sum,
FROM users
LEFT JOIN (
    SELECT user_id, sum(duration) as duration FROM sessions GROUP BY user_id 
      WHERE is_mobile=true
    ) AS mobile_duration
    ON users.id = mobile_duration.user_id
LEFT JOIN (
    SELECT user_id, sum(duration) AS duration FROM sesssions GROUP BY user_id WHERE is_mobile=false
    ) AS desktop_duration
    ON users.id = desktop_duration.user_id
LEFT JOIN (
    SELECT user_id, avg(duration) AS avg_duration FROM sessions GROUP BY user_id
    ) as durations
    ON users.id = durations.user_id
LEFT JOIN (
    SELECT user_id, sum(quantity) AS total_sold FROM item_sales GROUP BY user_id
    ) AS sales
    ON users.id = sales.user_id

This is a fairly simple table, but it cost a LOT of memory and time to generate it. It helps answer general questions about our platform like “what age group purchaces the most/spends the most time on the website” but fails to answer any questions it was not designed for. Specifically, since this table is user oriented, it fails to answer questions with specific timestamp constraints. So how do we build up systems that are data-analysis-friendly?

Introducing the OLAP Cube and Current Multidimensional Analysis Model

In a modern system we will aim to create a table that contains enough information that captures all the information above, in a recordized format. We will duplicate some information, but since disk space is much cheaper than RAM we can get away with this. Each time a sale occurs, we can quickly and computationally cheaply generate a summary of what happened. An example table below shows what things might look like.

extended_user_behavior

time user_id country age session_is_mobile session_duration item_sales_sum
2020-01-01 1 USA 23 true 100 4
2020-01-02 2 Canada 56 false 1500 18
2020-01-02 2 Canada 56 false 500 12
2020-01-03 3 USA 59 false 5900 12

Using this table we can immediately make use of this. We can immediately answer the previous questions straight out of the box.

  • “What age group purchaces the most”/”what age group spends the most time?”
SELECT age/10 as agegroup, sum(item_sales_sum) FROM extended_user_behavior GROUP BY agegroup

We can even try to answer this question for certain periods of time, unlike before.

SELECT age/10 as agegroup, sum(item_sales_sum) FROM extended_user_behavior 
WHERE time BETWEEN '2020-01-01' AND '2020-01-02'
GROUP BY agegroup

Same table. Same columns. Different question entirely. That is the whole point.

So Where Does the “Cube” Come In?

Look at extended_user_behavior again. Each column that you can filter or group by is a dimension. Time is a dimension. Country is a dimension. Age is a dimension. session_is_mobile is a dimension. The numeric columns like session_duration and item_sales_sum are your measures; the things you actually want to measure.

Now imagine each dimension as an axis. Time on one axis, country on another, maybe age on a third. You can slice this space along any combination of those axes. Filter to a single country and group by time? That is a slice of the cube. Group by country and age while ignoring time? Now you have squashed the cube along the time axis and created another flat “slice”. Every combination of dimensions gives you a different cut of the same underlying data.

Why This Matters

Referencing back to the user_behavior table, it was user-oriented and flat. It answered the questions it was designed for and nothing else. The cube approach attempts to multiply and expand upon this idea. Instead of pre-answering specific questions, you store enough dimensional context that analysts can ask their own questions on the fly.

This is the foundation that modern tools like Hadoop, Apache Druid, ClickHouse, and Google BigQuery are built on. They store data in columnar formats optimized for exactly these kinds of slice-and-dice operations. The expensive joins and aggregations that took hours now take seconds. The underlying concept is the same as before; flatten, denormalize, and precompute. But modern tooling expands upon this idea and made it faster.

The Takeaway

A multidimensional cube is a denormalized dataset where each record carries its own dimensional context. You trade disk space for analytical flexibility. You duplicate some data so that you never have to join your way back to an answer. The result is a single table you can slice along any axis, at any granularity, for any time range.