Denormalizing Databases

Dare Obasanjo on Amazon SimpleDB:

It is interesting to imagine how this system evolved. From experience, it is clear that everyone who has had to build a massive relational database that database joins kill performance. The longer you’ve dealt with massive data sets, the more you begin to fall in love with denormalizing your data so you can scale. Taking to its logical extreme, there’s nothing more denormalized than a single table. Even better, Amazon goes a step further by introducing multivalued columns which means that SimpleDB isn’t even in First Normal Form whereas we all learned in school that the minimum we should aspire to is Third Normal Form.

Uh-huh! As I see more and more databases, I reach the WTF point quicker and quicker. And I haven’t seen enough yet. But so far I couldn’t agree more with this observation.
Every project I’ve worked on, the common theme of performance problems is database joins, and obviously the bigger the database the bigger the performance problem. Not only does it cause performance problems, but inevitably the solution architect comes into save the day and starts replicating data in tables to denormalize data and improve performance, which then obviously leads to application bugs.

The database I currently work with has about 1000 tables. And I’m dumbfounded as to why.

Just put all the shit (okay, most of it) in one table.

Now most DBAs, in my observation, would cringe and argue against this ’til they’re blue in the face, but they need to stay employed so they have a decent incentive, I guess.

I think some of my most wasted class time in college (after 4 semesters of spanish) was learning how to normalize tables for 2 reasons, 1. I’ve never really used the skill, and 2. The solution to the problem is often to denormalize, (after checking the explain plan, modifying the query and adding indexes)

  • tmarthal
    I think you want a hash of hashes, not a relational database. =(
  • i am not sure where you are going with that line of thought and what do you mean by distributed (ie: distributed app servers sharing a managed db connection, or distributed clients that each have a different db connections or ?). Managing conflicts, i think, is the job of the transaction manager. Obviously you would still need to ensure ACID, but that is the job of the dbms, and I am not sure much changes when going from a normalized schema to a denormalized one.

    I'll have to think about it some more and do some reading. I am no ORM or DB expert.

    But the original point was denormalizing improves performance, specifically query performance not update performance. So why not design that way from the start?

    I've found that most people and customers don't like waiting for transactions, and the max upper limit on any request/response sequence is about 4-5 seconds before the user begins to get agitated and/or gives up and/or starts cursing the app.

    Along with this, most customers, again in my experience, do more searching for data rather than updating data.

    So, to date, query performance has been of prime importance for the apps I have worked on.

    With that being the goal, it seems one should focus on designing a db that will allow you to achieve that goal, rather than designing a normalized schema, just because your college prof or oracle instructor told you the best way to manage a database was normalizing the schema. Yes that may be the best way to manage a database, but ultimately the customer doesn't care how the database is managed, they just want a fast responsive app.
  • Matthew
    Because now all your data must be added, updated, and deleted together. If you have a distributed application and one client updates one object then you must perform conflict detection and management across the entire row, even if data isn't closely related, and you'll probably end up impacting every client on the system.
  • why not?
  • Matthew
    I don't think it would be very clean mapping the data into objects if it all came from one table.
blog comments powered by Disqus