In terms of databases, is "Normalize for correctness, denormalize for performance" the right mantra?


Solution 1

The two most common reasons to denormalize are:

  1. Performance
  2. Ignorance

The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)

I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary"

Solution 2

To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.

Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently.

That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.
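To make the trivial example concrete, here is a minimal sketch in generic SQL (the table and column names are invented for illustration): the "one address or several" decision gets baked into the schema itself, so the rule is enforced rather than merely written down.

    -- Hypothetical normalized core: one customer, any number of addresses,
    -- but at most one address of each type per customer.
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        full_name    VARCHAR(200) NOT NULL
    );

    CREATE TABLE customer_address (
        address_id   INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer (customer_id),
        address_type VARCHAR(10) NOT NULL
                     CHECK (address_type IN ('billing', 'shipping')),
        street       VARCHAR(200) NOT NULL,
        city         VARCHAR(100) NOT NULL,
        postal_code  VARCHAR(20)  NOT NULL,
        UNIQUE (customer_id, address_type)   -- the rule, enforced by the DB
    );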

Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note that I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers).

These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.

So one reason people have a strong feeling about normalization is that the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation that the parties may be bringing to the debate.

Having said that, here's a more direct answer to the original question :)

It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.

You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.

Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.
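Often the "huge query" just needs an index, not a schema change. A hypothetical example, assuming an orders table along the lines of customer_order(order_id, customer_id, order_date, total_amount):

    -- Try the cheap knobs first: a composite index covering the hot lookup.
    CREATE INDEX idx_order_customer_date
        ON customer_order (customer_id, order_date);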

If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMSs that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.
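As a hedged sketch of the materialized-view approach (syntax is roughly PostgreSQL-style and varies by DBMS; the customer and customer_order tables are the hypothetical ones from above):

    -- Precompute the expensive per-customer summary outside the core model.
    CREATE MATERIALIZED VIEW customer_sales_summary AS
    SELECT c.customer_id,
           c.full_name,
           COUNT(o.order_id)   AS order_count,
           SUM(o.total_amount) AS lifetime_total
    FROM   customer c
    LEFT JOIN customer_order o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.full_name;

    -- Refresh on whatever schedule keeps the denormalized copy acceptably fresh.
    REFRESH MATERIALIZED VIEW customer_sales_summary;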

There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.

When I work with low-to-medium experienced database people and developers, I insist they produce an absolutely normalized design ... then later may involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.

In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.

Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.

Solution 3

Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.

I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.

Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".

Of course, this assumes that you are confident of your abilities and value to the company.

Solution 4

Mantras almost always oversimplify their subject matter. This is a case in point.

The advantages of normalizing are more than merely theoretical or aesthetic. For every departure from a normal form for 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do follow the normal form. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.

These update anomalies generally fall into inserting new data, updating existing data, and deleting rows. You can generally work your way around these anomalies by clever, tricky programming. The question then is whether the benefit of using clever, tricky programming was worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is actually, believe it or not, bad performance.

If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.
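As a concrete (and entirely hypothetical) illustration of an update anomaly: if a customer's name is repeated on every order row instead of living in a customer table, a simple rename must touch many rows, and any row it misses is silently wrong.

    -- Denormalized: customer_name depends only on customer_id, so repeating it
    -- per order violates 2NF/3NF and invites inconsistency.
    CREATE TABLE denorm_order (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL,
        customer_name VARCHAR(200) NOT NULL,
        total_amount  NUMERIC(10, 2) NOT NULL
    );

    -- The update anomaly: renaming one customer means editing every order row;
    -- a missed row or a concurrent insert with the old name corrupts the data.
    UPDATE denorm_order
    SET    customer_name = 'Acme Holdings Ltd'
    WHERE  customer_id   = 42;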

The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.

A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.

Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.
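To make the contrast concrete, here is a minimal, hypothetical star schema sketch: a fact table at a single, deliberately chosen grain, surrounded by flat (denormalized) dimension tables.

    -- Dimensions are intentionally flattened; the fact table holds the measures.
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,   -- e.g. 20240131
        calendar_date DATE NOT NULL,
        month_name    VARCHAR(20) NOT NULL,
        year_number   INTEGER NOT NULL
    );

    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(200) NOT NULL,
        category      VARCHAR(100) NOT NULL  -- flattened product/category hierarchy
    );

    CREATE TABLE fact_sales (
        date_key      INTEGER NOT NULL REFERENCES dim_date (date_key),
        product_key   INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity      INTEGER NOT NULL,
        sales_amount  NUMERIC(12, 2) NOT NULL
    );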

In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so-called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well-understood case.

Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.
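Continuing the hypothetical star schema above, the queries analysts need reduce to one join per dimension and a GROUP BY, which is what makes the OLAP tooling feel so effortless:

    -- Monthly sales by product category.
    SELECT d.year_number,
           d.month_name,
           p.category,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date    d ON d.date_key    = f.date_key
    JOIN   dim_product p ON p.product_key = f.product_key
    GROUP BY d.year_number, d.month_name, p.category;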

There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.

Solution 5

Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as the risk of bugs in code increases, making the whole system less and less sustainable.

Good luck!

Comments

  • Chris
    Chris over 15 years
    One of the most recurring problems I have with normalized data for reports is the request to provide flat file exports where the normalized rows are all in one column. It never ends up being simple to do.
  • Sunil Shakya
    Sunil Shakya over 15 years
    Steven reads an e-newspaper, so it should not matter. Aren't there tools that could help with reporting without influencing the design?
  • Turnkey
    Turnkey over 15 years
    These problems can also be solved by creating read-only views for reporting.
  • Booji Boy
    Booji Boy over 15 years
    I often do both. Have a normalized database for the application and a denormalized report database.
  • Jonathan Leffler
    Jonathan Leffler over 15 years
    It isn't clear whether you need swatting because you cave in to the requests of the read-only guys, or whether the read-only guys need it because they haven't learned, or both. Probably both. Views can help the read-only guys - that is a good use for them, in fact.
  • Steven A. Lowe
    Steven A. Lowe over 15 years
    i would swat you both with the Sunday Times-Free-Press, including the advertisements: the report guys for not knowing how to do a join and make a view, and you for risking data corruption and incurring possible update overhead instead of providing reporting views/functions in the first place ;-)
  • Steven A. Lowe
    Steven A. Lowe over 15 years
    +1 for pointing out the political component; modern databases employ caching strategies that mean that normalized data often outperforms unnormalized data for most queries. Profile first, denormalize later ;-)
  • JasonTrue
    JasonTrue over 15 years
    There's nothing wrong with creating a denormalized copy of a database, but it's often better to use data warehousing techniques for this, rather than baking it into the schema. SSIS and equivalent Oracle tools can do this kind of thing.
  • aSkywalker
    aSkywalker over 15 years
    I am gonna look a little thick, and probably a little stubborn too, but if the primary reason is performance (that is ok right), can't a secondary benefit still be that it is a little simpler? When those joins and views are taking up too many cycles?
  • Steven A. Lowe
    Steven A. Lowe over 15 years
    performance is an acceptable reason to denormalize, but ya gotta prove it! modern db engines employ caching strategies that often make normalized databases more efficient than their denormalized equivalents
  • carveone
    carveone over 15 years
    No excuse for justifying denormalisation: -1
  • Steven A. Lowe
    Steven A. Lowe over 15 years
    @[Philippe Grondier]: "in programming, there are no absolutes, not even this one" --anonymous
  • Edi
    Edi over 15 years
    @JasonTrue: In the SQL Server world, SSRS actually provides a way to create the denormalized view of the database through its 'Reporting Model', saving you the trouble of using SSIS to transport the data.
  • Steven A. Lowe
    Steven A. Lowe over 15 years
    +1 for pointing out that the journey from normalized to denormalized is far easier than the converse
  • olive
    olive about 13 years
    Steven: caching strategies that work better for normalized databases? I'd like to know these...
  • sbkrogers
    sbkrogers almost 9 years
    Definitely worth reading Chris' answer, a LOT of good general advice. Should be higher in terms of upvotes.
  • Walter Mitty
    Walter Mitty almost 9 years
    That last paragraph is worth remembering. Some denormalized designs are disciplined and functional. Others are simply sloppy and haphazard.
  • stk
    stk over 7 years
    Got my +1 for that! But how do you feel about NoSQL db then? Would you only use them if there is a lack of performance in the good old Relational databases?
  • Chris Johnson
    Chris Johnson over 7 years
    @stk That's a classic trade-off. NoSQL is great for some applications, but it puts a lot more burden on the app tier to manage data sanity. This is a gross simplification, but NoSQL is a good choice when the data isn't ultra-mission-critical. I like the pattern of relational for core data, plus other data stores to support the RDBMS, like REDIS for hot data (but: round-tripped to the RDBMS), other NoSQL for transient data etc. That's a different kind of denormalization and needs to be done intentionally. You should have a clear reason why a particular data set lives in each tier.
  • stk
    stk over 7 years
    Exactly the answer that I was hoping for, thanks! Matches perfectly with my own experiences. :)
  • Anshul
    Anshul over 7 years
    Great answer, thanks for probing the question deeply.
  • Rami Far
    Rami Far over 6 years
    note that normalization has nothing to do with data correctness. Data correctness is a quality of the data itself, not how it is organized. Normalization is better organization of data to reduce redundancy.
  • Steven A. Lowe
    Steven A. Lowe over 6 years
    @RamiFar: apologies, yes; "correctness" in the sense of "eliminating update anomalies"
  • Titulum
    Titulum almost 4 years
    Is this answer still relevant in 2020? I am following the documentation for the firebase firestore and it literally tells me to denormalize data in order to limit relations. I'm quite new to NoSQL databases but the concept of deliberately introducing redundancy in your database just doesn't 'click' for me. Where can I find some good reading on how to store domain objects with many-to-many relations in a NoSQL database?
  • Steven A. Lowe
    Steven A. Lowe almost 4 years
    @Titulum: good question. the short answer is yes the advice is still relevant but that's not what you want :) NoSQL databases scale horizontally => spreading data across multiple tables makes everything slower and more complex. You want a document or aggregate model instead - Vaughn Vernon's articles on Aggregate design are a great place to start dddcommunity.org/library/vernon_2011