Database normalization vs denormalization
Normalized databases:
Definition
The “purest” form of a database design, where every table is scoped to be as small as possible (one kind of entity per table, no duplicated data)
Tables contain foreign keys (FKs) to allow joining to other tables
e.g. a Reports table, with columns for report id (PK), student id (FK), subject id (FK) and the grade for that report
e.g. a Students table, with columns for student id (PK), student first name, last name, DOB, etc. (schema sketched below)
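A minimal sketch of that normalized schema in Python (sqlite3); the table and column names are just illustrations of the notes above, not from any real system:

```python
# Minimal sketch of the normalized schema: one narrow table per entity,
# with foreign keys linking reports back to students and subjects.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON;")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    dob        TEXT
);
CREATE TABLE subjects (
    subject_id   INTEGER PRIMARY KEY,
    subject_name TEXT NOT NULL
);
CREATE TABLE reports (
    report_id  INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL REFERENCES students(student_id),
    subject_id INTEGER NOT NULL REFERENCES subjects(subject_id),
    grade      TEXT NOT NULL
);
""")
```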
Pros:
If a query only needs data within a single table’s scope, it is lightning fast: no joins, and less data to scan
There is also only one source of truth: each fact is stored in exactly one place, so an update only has to touch one row
Cons:
If you need data outside of a table’s scope, you’ll be doing lots of joins (see the query sketch below)
This can get confusing if you don’t know which tables to join, or on which keys
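Continuing from the schema sketch above, this is the kind of query the normalized design forces on the reader: printing a single report already takes two joins.

```python
# Continuing from the normalized-schema sketch above: getting one report
# with the student's name and the subject's name means joining three tables.
row = conn.execute("""
    SELECT r.report_id,
           s.first_name || ' ' || s.last_name AS student_name,
           subj.subject_name,
           r.grade
    FROM reports r
    JOIN students s    ON s.student_id   = r.student_id
    JOIN subjects subj ON subj.subject_id = r.subject_id
    WHERE r.report_id = ?;
""", (1,)).fetchone()
```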
Denormalized databases:
Definition
Tables group together the information that the end user (e.g. someone on the BI team) commonly needs at the same time
e.g. whenever the end user wants a printout of the reports, they almost always also want the student’s name and the name of the subject the report relates to
So we would end up copying those columns into the reports table:
e.g. report id (PK), student id (FK), subject id (FK), grade for that report
PLUS data replicated from other tables: student name, subject name, teacher name, etc. (see the sketch below)
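A sketch of that denormalized reporting table (again, names are illustrative); the copied-in name columns mean the end user’s query needs no joins at all:

```python
# Sketch of a denormalized reporting table: the student and subject names
# are copied in alongside the keys, so reads need no joins.
conn.executescript("""
CREATE TABLE reports_denormalized (
    report_id    INTEGER PRIMARY KEY,
    student_id   INTEGER NOT NULL,
    subject_id   INTEGER NOT NULL,
    grade        TEXT NOT NULL,
    student_name TEXT NOT NULL,   -- copied from students
    subject_name TEXT NOT NULL    -- copied from subjects
);
""")

# The end user's "printout" query is now a plain single-table SELECT.
rows = conn.execute(
    "SELECT report_id, student_name, subject_name, grade FROM reports_denormalized;"
).fetchall()
```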
Pros:
Much easier for the end user to work with: no joins required
Less chance of incorrect results: with joins you might join to a table you think is the right one but actually isn’t (with no obvious way to tell), or simply write the join incorrectly
Cons:
Multiple sources of truth: the same fact lives in more than one place and can drift out of sync
The copied data goes stale: you need a process that re-copies it from the source tables at an appropriate cadence (and these refresh jobs are likely expensive), as sketched below
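One possible shape for that refresh process, continuing from the sketches above (a periodic full rebuild is just the simplest option, not the only one):

```python
# Periodically rebuild the denormalized table from the normalized tables.
# Between runs it is stale, which is the trade-off the notes above describe.
def refresh_reports_denormalized(conn):
    with conn:  # run the rebuild as a single transaction
        conn.execute("DELETE FROM reports_denormalized;")
        conn.execute("""
            INSERT INTO reports_denormalized
                (report_id, student_id, subject_id, grade, student_name, subject_name)
            SELECT r.report_id, r.student_id, r.subject_id, r.grade,
                   s.first_name || ' ' || s.last_name,
                   subj.subject_name
            FROM reports r
            JOIN students s    ON s.student_id   = r.student_id
            JOIN subjects subj ON subj.subject_id = r.subject_id;
        """)

# e.g. call this from a nightly scheduled job
refresh_reports_denormalized(conn)
```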