A practical example of denormalization in a SQL database?


Solution 1

Yes, you're showing one type of denormalization.

There are three types of denormalization:

  • Join rows from different tables, so you don't have to use queries with JOIN.
  • Perform aggregate calculations like SUM() or COUNT() or MAX() or others, so you don't have to use queries with GROUP BY.
  • Pre-calculate expensive calculations, so you don't have to use queries with complex expressions in the select-list. (A brief sketch of these last two types follows this list.)
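
For illustration, a minimal sketch of the second and third types, assuming a hypothetical Orders table with order_id, customer_id, amount, and tax_rate columns (names invented here):

-- Second type: a summary table, so readers skip GROUP BY at query time.
CREATE TABLE Customer_Order_Totals AS
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM Orders
GROUP BY customer_id;

-- Third type: store a pre-calculated expression as an ordinary column.
CREATE TABLE Orders_With_Tax AS
SELECT order_id, customer_id, amount,
       amount * (1 + tax_rate) AS amount_with_tax
FROM Orders;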

You're showing an example of the first type. At least you can avoid one of the two joins you intend to do.

Why not make the denormalized table store the result of joining all three tables?
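
For instance, a rough sketch (BigQuery-style DDL, adapt to your engine; table and column names taken from the question):

CREATE OR REPLACE TABLE Denormalized_Data AS
SELECT t1.customer_id,
       t1.country,
       t1.city,
       t1.street,
       t1.house_number,
       t2.product_id,
       t2.product_storage_building,
       t3.product_name,
       t3.product_color,
       t3.product_origin
FROM Table_1 AS t1
LEFT JOIN Table_2 AS t2 ON t1.customer_id = t2.customer_id
LEFT JOIN Table_3 AS t3 ON t2.product_id = t3.product_id;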

What's the downside of using denormalization? You are now storing data redundantly: once in the normalized tables, and a copy in the denormalized table. Suppose you get into work tomorrow and find that the data in these different tables doesn't exactly match up. What happened?

  • Maybe someone inserted a row to the normalized tables without adding the corresponding data to the denormalized table.
  • Maybe someone deleted a row from the normalized tables, without deleting the corresponding row from the denormalized table.
  • Maybe someone inserted or deleted a row in the denormalized table, without the corresponding change in the normalized table.

How can you tell what happened? Which table is "correct"? This is the risk of denormalization.
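
One hedged way to spot such drift (a sketch only; matching row counts don't guarantee matching contents) is to compare the denormalized copy against a fresh join of the normalized tables:

SELECT
  (SELECT COUNT(*) FROM Denormalized_Data) AS denormalized_rows,
  (SELECT COUNT(*)
   FROM Table_1 AS t1
   LEFT JOIN Table_2 AS t2 ON t1.customer_id = t2.customer_id
   LEFT JOIN Table_3 AS t3 ON t2.product_id = t3.product_id) AS normalized_rows;
-- If the two counts differ, one of the scenarios above has happened.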

Solution 2

Consider the image below. The top contains several distinct tables that encapsulate logically separate bits of info. The bottom shows the results of those tables joined together. This is denormalization.

Normalized to Denormalized

In the case of BigQuery, and especially using BQ as a backend for a BI platform, denormalized data provides for a quicker user experience because it doesn't have to do the joins when a user hits 'run'.

If you leave the tables as is and a user needs several of the fields, you might end up doing up to 7 joins and then doing aggregations (sums, counts, etc.). However, if you do all 7 joins and store the result in 1 table, then the user would be querying only 1 table and doing only aggregations. This is the power of BigQuery: it is scalable, so grouping and aggregating over huge columns of data is relatively 'easy' compared to joins, making the end user experience much faster.

People/companies that go in this direction typically do so in ETL processes (commonly overnight), so the joins only have to happen once (when users typically aren't using the database); then during the day, users and BI tools are just aggregating and slicing data without the joins! This does result in 'redundant' data and does incur extra storage costs, but is often worth it for the downstream user experience.
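
A rough sketch of that pattern, with hypothetical project, dataset, and table names invented for illustration:

-- Overnight ETL step: do the joins once and materialize a wide table.
CREATE OR REPLACE TABLE `project.reporting.sales_wide` AS
SELECT o.order_id, o.order_date, c.country, p.product_name, o.quantity, o.amount
FROM `project.dataset.orders` o
LEFT JOIN `project.dataset.customers` c USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id);

-- Daytime BI query: no joins, just slicing and aggregating the wide table.
SELECT country, product_name, SUM(amount) AS revenue
FROM `project.reporting.sales_wide`
GROUP BY country, product_name;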

Solution 3

Below is a BigQuery-specific answer!

BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, you can improve performance by denormalizing your data and taking advantage of nested and repeated fields. Nested and repeated fields can maintain relationships without the performance impact of preserving a relational (normalized) schema.

The storage savings from normalized data are less of a concern in modern systems. Increases in storage costs are worth the performance gains from denormalizing data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots so execution can be done in parallel.

If you need to maintain relationships while denormalizing your data, use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.

For example, denormalizing an orders schema without using nested and repeated fields may require you to group by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less performant than denormalizing the data using nested and repeated fields.

Note: In some circumstances, denormalizing your data and using nested and repeated fields may not result in increased performance.

You can see more in the Denormalize data whenever possible section of the BigQuery docs.

Finally: BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.

Below is an example of producing a denormalized table out of the initial three normalized tables in your question:

#standardSQL
SELECT ANY_VALUE(c).*,   -- one row per customer, expanded back into its columns
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products   -- nest that customer's products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)   -- group on the whole customer row rendered as text

This will produce a table where each customer row (customer_id, country, city, street, house_number) carries a repeated products field containing product_id, product_name, product_color, product_origin, and product_storage_building.

Obviously, this is a more customer-focused schema. Depending on your needs you can similarly create a product-centric one, or actually both, and use the appropriate one based on the use case.
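
For example, a rough sketch of querying the nested result (assuming you saved it as a hypothetical `project.dataset.customers_denorm` table), so per-customer aggregates need no joins:

#standardSQL
-- `project.dataset.customers_denorm` is a hypothetical name for the table produced above
SELECT customer_id, country,
  ARRAY_LENGTH(products) AS product_count,
  (SELECT COUNT(DISTINCT p.product_storage_building) FROM UNNEST(products) p) AS buildings_used
FROM `project.dataset.customers_denorm`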


You can test and play with the above using dummy data, as in the example below:

#standardSQL
WITH `project.dataset.customers` AS (
  SELECT 1 customer_id, 'country 1' country, 'city 1' city, 'street 1' street, 1 house_number UNION ALL
  SELECT 2, 'country 1', 'city 2', 'street 2', 2 UNION ALL
  SELECT 3, 'country 1', 'city 3', 'street 3', 3 UNION ALL
  SELECT 4, 'country 2', 'city 4', 'street 4', 4 UNION ALL
  SELECT 5, 'country 2', 'city 5', 'street 5', 5 
), `project.dataset.products` AS (
  SELECT 1 product_id, 'product 1' product_name, 'color 1' product_color, 'origin 1' product_origin UNION ALL
  SELECT 2, 'product 2', 'color 2', 'origin 2' UNION ALL
  SELECT 3, 'product 3', 'color 3', 'origin 3' UNION ALL
  SELECT 4, 'product 4', 'color 4', 'origin 4' 
), `project.dataset.storage` AS (
  SELECT 1 product_id, 1 customer_id, 'building 1' product_storage_building UNION ALL
  SELECT 2, 1, 'building 1' UNION ALL
  SELECT 3, 1, 'building 1' UNION ALL
  SELECT 2, 2, 'building 2' UNION ALL
  SELECT 3, 2, 'building 3' UNION ALL
  SELECT 4, 2, 'building 3' UNION ALL
  SELECT 1, 3, 'building 1' UNION ALL
  SELECT 3, 3, 'building 1' 
)
SELECT ANY_VALUE(c).*,
  ARRAY_AGG((SELECT AS STRUCT p.*, s.product_storage_building)) products
FROM `project.dataset.customers` c
LEFT JOIN `project.dataset.storage` s USING (customer_id)
LEFT JOIN `project.dataset.products` p USING (product_id)
GROUP BY FORMAT('%t', c)    

with output showing each customer row together with its nested array of products.


Comments

  • cget
    cget over 1 year

    I've been reading about denormalization for the last 20 minutes but can't get a concise example with code.

    Is this what denormalization is?


    1. We have a normalized database:

    Table_1:
    customer_id (Primary key)
    country
    city
    street
    house_number

    Table_2:
    product_id (Primary Key)
    customer_id (Foreign key)
    product_storage_building

    Table_3:
    product_id (Foreign Key)
    product_name
    product_color
    product_origin

    2. However, joining all three tables is taking far too long to run, let's say

          SELECT a.*, b.*, c.*
          FROM 
          TABLE_1 AS a
          LEFT JOIN TABLE_2 AS b
          ON a.customer_id = b.customer_id
          LEFT JOIN TABLE_3 AS c
          ON b.product_id = c.product_id
      

    So I create a new table out of Table_1 and Table_2

        CREATE OR REPLACE TABLE Denormalized_Data AS
        (
         SELECT customer_id, 
                country, 
                city,
                street, 
                house_number,
                product_id,
                product_storage_building
         FROM Table_1
              LEFT JOIN Table_2
              ON Table_1.customer_id = Table_2.customer_id
        )
    
    3. Then join to Table_3 as follows

       SELECT customer_id, 
              country, 
              city,
              street, 
              house_number,
              product_storage_building,
              Denormalized_Data.product_id,
              product_name,
              product_color
           FROM Denormalized_Data
           LEFT JOIN Table_3
           ON Denormalized_Data.product_id = Table_3.product_id
      

    Now this will make my query run faster - can the whole process above be described as denormalization?

    Thanks

    • Mikhail Berlyant
      Mikhail Berlyant over 4 years
      BTW, most likely you should swap the key labeling for product_id in Tables 2 and 3
    • Mikhail Berlyant
      Mikhail Berlyant over 4 years
      @pentium10 - while the presence of key labeling could indicate that the question is not BigQuery related (as it does not have indexes), on the other hand denormalization is quite a BigQuery-ish topic. I would add this tag back!
    • cget
      cget over 4 years
      Hi there, basically I'm trying to get the concept of denormalization in my head. The tables are just an illustration of the table schema. Is denormalization essentially just creating a new table of data to lessen the workload on a complex join?
    • Pentium10
      Pentium10 over 4 years
      @cget what database engine do you run?
    • cget
      cget over 4 years
      I use BigQuery, standard SQL
    • Pentium10
      Pentium10 over 4 years
      What's the "too long" time you're mentioning? What is taking too long? Show an example query.
    • cget
      cget over 4 years
      Have added an example in step 2 - basically joining all 3 normalized tables
  • Bill Karwin
    Bill Karwin over 4 years
    FWIW, using a view doesn't make the query any faster, which sounds like it's the goal of the OP.
  • Gary Kephart
    Gary Kephart over 4 years
    I'm not used to seeing CREATE TABLE AS ... SELECT and thought maybe it was an error. Or is that a BigQuery thing? And by "better" I meant it's programmatically better to let the database do the joins instead of your code using the raw tables and specifying the joins. Sure, using views won't make your query run faster, unless the original tables had some associations that caused more fetches.
  • Bill Karwin
    Bill Karwin over 4 years
    CREATE TABLE AS ... SELECT is supported by all popular SQL implementations. I checked MySQL, PostgreSQL, SQLite, Oracle, Microsoft, and IBM DB2, and they all support it. I also find this syntax in a copy of the ISO SQL 2011 standard.
  • rtenha
    rtenha over 4 years
    CREATE VIEW probably helps more in an ad-hoc scenario (think analyst or data scientist) writing queries. A view would prevent them from having to write out the joins every time they write a query. CREATE TABLE helps more in a scenario like ETL where you need to materialize a table for repeated querying, such as a backend for Looker, Tableau, etc. that has many users all querying the same dataset. The pre-joined table will perform much faster. Using a view for this purpose would still do all of the joins every time a user queries the data.