A practical example of denormalization in a SQL database?


Solution 1

Yes, you're showing one type of denormalization.

There are three types of denormalization:

  • Join rows from different tables, so you don't have to use queries with JOIN.
  • Perform aggregate calculations like SUM() or COUNT() or MAX() or others, so you don't have to use queries with GROUP BY.
  • Pre-calculate expensive calculations, so you don't have to use queries with complex expressions in the select-list. (A brief sketch of these last two types follows this list.)
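
For illustration, a minimal sketch of the second and third types, assuming a hypothetical Orders table with order_id, customer_id, amount, and tax_rate columns (names invented here):

-- Second type: a summary table, so readers skip GROUP BY at query time.
CREATE TABLE Customer_Order_Totals AS
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM Orders
GROUP BY customer_id;

-- Third type: store a pre-calculated expression as an ordinary column.
CREATE TABLE Orders_With_Tax AS
SELECT order_id, customer_id, amount,
       amount * (1 + tax_rate) AS amount_with_tax
FROM Orders;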

You're showing an example of the first type. At least you can avoid one of the two joins you intend to do.

Why not make the denormalized table store the result of joining all three tables?
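
For instance, a rough sketch (BigQuery-style DDL, adapt to your engine; table and column names taken from the question):

CREATE OR REPLACE TABLE Denormalized_Data AS
SELECT t1.customer_id,
       t1.country,
       t1.city,
       t1.street,
       t1.house_number,
       t2.product_id,
       t2.product_storage_building,
       t3.product_name,
       t3.product_color,
       t3.product_origin
FROM Table_1 AS t1
LEFT JOIN Table_2 AS t2 ON t1.customer_id = t2.customer_id
LEFT JOIN Table_3 AS t3 ON t2.product_id = t3.product_id;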

What's the downside of using denormalization? You are now storing data redundantly: once in the normalized tables, and a copy in the denormalized table. Suppose you get into work tomorrow and find that the data in these different tables doesn't exactly match up. What happened?

  • Maybe someone inserted a row to the normalized tables without adding the corresponding data to the denormalized table.
  • Maybe someone deleted a row from the normalized tables, without deleting the corresponding row from the denormalized table.
  • Maybe someone inserted or deleted a row in the denormalized table, without the corresponding change in the normalized table.

How can you tell what happened? Which table is "correct"? This is the risk of denormalization.
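
One hedged way to spot such drift (a sketch only; matching row counts don't guarantee matching contents) is to compare the denormalized copy against a fresh join of the normalized tables:

SELECT
  (SELECT COUNT(*) FROM Denormalized_Data) AS denormalized_rows,
  (SELECT COUNT(*)
   FROM Table_1 AS t1
   LEFT JOIN Table_2 AS t2 ON t1.customer_id = t2.customer_id
   LEFT JOIN Table_3 AS t3 ON t2.product_id = t3.product_id) AS normalized_rows;
-- If the two counts differ, one of the scenarios above has happened.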

Solution 2

Consider the image below. The top contains several distinct tables that encapsulate logically separate bits of info. The bottom shows the results of those tables joined together. This is denormalization.

Normalized to Denormalized

In the case of BigQuery, and especially using BQ as a backend for a BI platform, denormalized data provides for a quicker user experience because it doesn't have to do the joins when a user hits 'run'.

If you leave the tables as is and a user needs several of the fields, you might end up doing up to 7 joins and then doing aggregations (sums, counts, etc.). However, if you do all 7 joins and store the result in 1 table, then the user would be querying only 1 table and doing only aggregations. This is the power of BigQuery: it is scalable, so grouping and aggregating over huge columns of data is relatively 'easy' compared to joins, making the end user experience much faster.

People/companies that go in this direction typically do so in ETL processes (commonly overnight), so the joins only have to happen once (when users typically aren't using the database); then during the day, users and BI tools are just aggregating and slicing data without the joins! This does result in 'redundant' data and does incur extra storage costs, but is often worth it for the downstream user experience.
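
A rough sketch of that pattern, with hypothetical project, dataset, and table names invented for illustration:

-- Overnight ETL step: do the joins once and materialize a wide table.
CREATE OR REPLACE TABLE `project.reporting.sales_wide` AS
SELECT o.order_id, o.order_date, c.country, p.product_name, o.quantity, o.amount
FROM `project.dataset.orders` o
LEFT JOIN `project.dataset.customers` c USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id);

-- Daytime BI query: no joins, just slicing and aggregating the wide table.
SELECT country, product_name, SUM(amount) AS revenue
FROM `project.reporting.sales_wide`
GROUP BY country, product_name;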

Solution 3

Below is a BigQuery-specific answer!

BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, you can improve performance by denormalizing your data and taking advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.

The storage savings from normalized data are less of a concern in modern systems. Increases in storage costs are worth the performance gains from denormalizing data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots so execution can be done in parallel.

If you need to maintain relationships while denormalizing your data, use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.

For example, denormalizing an orders schema without using nested and repeated fields may require you to group by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less performant than denormalizing the data using nested and repeated fields.

Note: In some circumstances, denormalizing your data and using nested and repeated fields may not result in increased performance.

You can see more in the Denormalize data whenever possible section of the BigQuery docs.

Finally: BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.

Below is an example of producing a denormalized table out of the initial three normalized tables in your question:

#standardSQL
SELECT ANY_VALUE(c).*,   -- one row per customer, expanded back into its columns
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products   -- nest that customer's products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)   -- group on the whole customer row rendered as text

This will produce a table where each customer row (customer_id, country, city, street, house_number) carries a repeated products field containing product_id, product_name, product_color, product_origin, and product_storage_building.

Obviously, this is a more customer-focused schema. Depending on your needs you can similarly create a product-centric one, or actually both, and use the appropriate one based on the use case.
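
For example, a rough sketch of querying the nested result (assuming you saved it as a hypothetical `project.dataset.customers_denorm` table), so per-customer aggregates need no joins:

#standardSQL
-- `project.dataset.customers_denorm` is a hypothetical name for the table produced above
SELECT customer_id, country,
  ARRAY_LENGTH(products) AS product_count,
  (SELECT COUNT(DISTINCT p.product_storage_building) FROM UNNEST(products) p) AS buildings_used
FROM `project.dataset.customers_denorm`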


You can test and play with the above using dummy data, as in the example below:

#standardSQL
WITH `project.dataset.customers` AS (
  SELECT 1 customer_id, 'country 1' country, 'city 1' city, 'street 1' street, 1 house_number UNION ALL
  SELECT 2, 'country 1', 'city 2', 'street 2', 2 UNION ALL
  SELECT 3, 'country 1', 'city 3', 'street 3', 3 UNION ALL
  SELECT 4, 'country 2', 'city 4', 'street 4', 4 UNION ALL
  SELECT 5, 'country 2', 'city 5', 'street 5', 5 
), `project.dataset.products` AS (
  SELECT 1 product_id, 'product 1' product_name, 'color 1' product_color, 'origin 1' product_origin UNION ALL
  SELECT 2, 'product 2', 'color 2', 'origin 2' UNION ALL
  SELECT 3, 'product 3', 'color 3', 'origin 3' UNION ALL
  SELECT 4, 'product 4', 'color 4', 'origin 4' 
), `project.dataset.storage` AS (
  SELECT 1 product_id, 1 customer_id, 'building 1' product_storage_building UNION ALL
  SELECT 2, 1, 'building 1' UNION ALL
  SELECT 3, 1, 'building 1' UNION ALL
  SELECT 2, 2, 'building 2' UNION ALL
  SELECT 3, 2, 'building 3' UNION ALL
  SELECT 4, 2, 'building 3' UNION ALL
  SELECT 1, 3, 'building 1' UNION ALL
  SELECT 3, 3, 'building 1' 
)
SELECT ANY_VALUE(c).*,
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)    

with output showing each customer row together with its nested array of products.


Comments

  • cget
    cget over 1 year

    I've been reading about denormalization for the last 20 minutes but can't get a concise example with code.

    Is this what denormalization is?


    1. We have a normalized database:

    Table_1:
    customer_id (Primary key)
    country
    city
    street
    house_number

    Table_2:
    product_id (Primary Key)
    customer_id (Foreign key)
    product_storage_building

    Table_3:
    product_id (Foreign Key)
    product_name
    product_color
    product_origin

    2. However, joining all three tables is taking far too long to run, let's say

          SELECT a.*, b.*, c.*
          FROM 
          TABLE_1 AS a
          LEFT JOIN TABLE_2 AS b
          ON a.customer_id = b.customer_id
          LEFT JOIN TABLE_3 AS c
          ON b.product_id = c.product_id
      

    So I create a new table out of Table_1 and Table_2

        CREATE OR REPLACE TABLE Denormalized_Data AS
        (
         SELECT customer_id, 
                country, 
                city,
                street, 
                house_number,
                product_id,
                product_storage_building
         FROM Table_1
              LEFT JOIN Table_2
              ON Table_1.customer_id = Table_2.customer_id
        )
    
    3. Then join to Table_3 as follows

       SELECT customer_id, 
              country, 
              city,
              street, 
              house_number,
              product_storage_building,
              Denormalized_Data.product_id,
              product_name,
              product_color
           FROM Denormalized_Data
           LEFT JOIN Table_3
           ON Denormalized_Data.product_id = Table_3.product_id
      

    Now this will make my query run faster - can the whole process above be described as denormalization?

    Thanks

    • Mikhail Berlyant
      Mikhail Berlyant over 4 years
      BTW, most likely you should swap the key labeling for product_id in Tables 2 and 3
    • Mikhail Berlyant
      Mikhail Berlyant over 4 years
      @pentium10 - while the presence of key labeling could indicate that the question is not BigQuery related (as it does not have indexes), on the other hand denormalization is quite a BigQuery-ish topic. I would add this tag back!
    • cget
      cget over 4 years
      Hi there, basically I'm trying to get the concept of denormalization in my head. The tables are just an illustration of the table schema. Is denormalization essentially just creating a new table of data to lessen the workload on a complex join?
    • Pentium10
      Pentium10 over 4 years
      @cget what database engine do you run?
    • cget
      cget over 4 years
      I use BigQuery, standard SQL
    • Pentium10
      Pentium10 over 4 years
      What's the "too long" time you're mentioning? What is taking too long? Show an example query.
    • cget
      cget over 4 years
      Have added an example in step 2 - basically joining all 3 normalized tables
  • Bill Karwin
    Bill Karwin over 4 years
    FWIW, using a view doesn't make the query any faster, which sounds like it's the goal of the OP.
  • Gary Kephart
    Gary Kephart over 4 years
    I'm not used to seeing CREATE TABLE AS ... SELECT and thought maybe it was an error. Or is that a BigQuery thing? And by "better" I meant it's programmatically better to let the database do the joins instead of your code using the raw tables and specifying the joins. Sure, using views won't make your query run faster, unless the original tables had some associations that caused more fetches.
  • Bill Karwin
    Bill Karwin over 4 years
    CREATE TABLE AS ... SELECT is supported by all popular SQL implementations. I checked MySQL, PostgreSQL, SQLite, Oracle, Microsoft, and IBM DB2, and they all support it. I also find this syntax in a copy of the ISO SQL 2011 standard.
  • rtenha
    rtenha over 4 years
    CREATE VIEW probably helps more in an ad-hoc scenario (think analyst or data scientist) writing queries. A view would prevent them from having to write out the joins every time they write a query. CREATE TABLE helps more in a scenario like ETL where you need to materialize a table for repeated querying, such as a backend for Looker, Tableau, etc. that has many users all querying the same dataset. The pre-joined table will perform much faster. Using a view for this purpose would still do all of the joins every time a user queries the data.