Database normalization vs denormalization


Normalized databases

Definition

  • The “purest” form of database design, where each table is scoped to be as small as possible and each fact is stored in exactly one place

  • Tables contain foreign keys (FKs) to allow joining to other tables

  • e.g. a Reports table, which has columns for: report id (PK), student id (FK), subject id (FK), and the grade for that report

  • e.g. a Students table, which has columns for student id (PK), first name, last name, DOB, etc.
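To make that concrete, here’s a minimal sketch of the schema above in SQLite (via Python’s sqlite3). The table and column names just mirror the examples; everything else is an illustrative assumption, not any particular system’s schema:

```python
import sqlite3

# Normalized schema: each fact lives in exactly one table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        dob        TEXT
    );

    CREATE TABLE subjects (
        subject_id   INTEGER PRIMARY KEY,
        subject_name TEXT NOT NULL
    );

    -- Reports hold only their own data (the grade) plus foreign keys;
    -- student and subject details are looked up via joins.
    CREATE TABLE reports (
        report_id  INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        subject_id INTEGER NOT NULL REFERENCES subjects(subject_id),
        grade      TEXT NOT NULL
    );
""")
```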

Pros:

  • If you don’t need anything outside a table’s scope, queries are lightning fast

  • Also, there is only one source of truth: each piece of data lives in exactly one place, so updates happen in exactly one place

Cons:

  • If you need something outside of that scope, you’ll be doing lots of joins

  • This can get confusing if you don’t know which tables to join on
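Continuing the sketch above, this is the kind of three-way join that even a plain “print the reports” request needs - a hypothetical query, but representative:

```python
# Every human-readable field has to be joined in from another table.
rows = conn.execute("""
    SELECT r.report_id,
           s.first_name || ' ' || s.last_name AS student_name,
           sub.subject_name,
           r.grade
    FROM reports r
    JOIN students s   ON s.student_id   = r.student_id
    JOIN subjects sub ON sub.subject_id = r.subject_id
""").fetchall()
```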

Denormalized databases

Definition

  • Tables contain information that is commonly used together by the end user (e.g. someone on the BI team)

  • e.g. whenever the end user wants a printout of the reports, they almost always want to know the student’s name and the name of the subject the report relates to

  • So we would end up copying those columns into the reports table (see the sketch after this list):

    • e.g. report id (PK), student id (FK), subject id (FK), grade for that report

    • PLUS replicated data from other tables: student name, subject name, teacher name, etc.
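Here’s a rough sketch of what that denormalized reports table could look like, continuing the SQLite example (the teacher_name column assumes a teachers table that isn’t shown here):

```python
# Denormalized: keys and grade PLUS copies of the commonly wanted columns,
# so the end user never has to write a join.
conn.execute("""
    CREATE TABLE reports_denormalized (
        report_id    INTEGER PRIMARY KEY,
        student_id   INTEGER,
        subject_id   INTEGER,
        grade        TEXT,
        student_name TEXT,  -- copied from students
        subject_name TEXT,  -- copied from subjects
        teacher_name TEXT   -- copied from a (hypothetical) teachers table
    )
""")
```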

Pros

  • Much easier for the end user to work with - no joins required

  • Less likely to get incorrect information, e.g. from joining on a table you think is the right one but isn’t (with no obvious way to tell), or from writing the join itself incorrectly

Cons

  • Multiple sources of truth - the same fact now lives in several places and can drift out of sync

  • You have to be aware that the copied data takes time to propagate, and implement a refresh process that runs at an appropriate cadence (as these copy jobs are expensive, I imagine)
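One simple way to implement that refresh, sticking with the sketch above, is a periodic full rebuild from the normalized sources. This is just one approach - incremental updates would be cheaper at scale, but more complex:

```python
def refresh_denormalized(conn):
    """Rebuild the denormalized table from the normalized tables.
    Run on a schedule (e.g. nightly); until it runs, the copies are stale."""
    conn.executescript("""
        DELETE FROM reports_denormalized;

        INSERT INTO reports_denormalized
            (report_id, student_id, subject_id, grade,
             student_name, subject_name, teacher_name)
        SELECT r.report_id, r.student_id, r.subject_id, r.grade,
               s.first_name || ' ' || s.last_name,
               sub.subject_name,
               NULL  -- no teachers table in this sketch
        FROM reports r
        JOIN students s   ON s.student_id   = r.student_id
        JOIN subjects sub ON sub.subject_id = r.subject_id;
    """)
    conn.commit()
```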