Have you ever believed any of the following?
My goals in this talk:
Normalization is a part of relational theory, meaning that there will be math and there will be discussions of formal logic.
I will do my best to minimize the learning curve, but we're ultimately solving math problems using logical analysis.
Also, some of these normal forms are not well-understood outside of academic circles and so your interpretation of the forms may not match what I'm going to describe.
The source of most of today's talk is C.J. Date's Database Design and Relational Theory, 2nd Edition (Apress, 2019).
Our scenario is a music studio, responsible for managing artists, cutting albums, marketing, and distribution.
This studio is working on a centralized database system but their application developers don't have much experience and would like our advice on how to do things the right way.
Normalization is a set of logical rules and processes to follow.
Normalization works off of a logical data model. From there, we can (theoretically) adapt to a physical data model for implementation.
We're used to thinking in terms of tables, rows, and columns. Let's take a step back and talk about the data modeling terms.
First normal form (1NF) is all about key rules and the shape of tuples.
Given relvar R with heading H containing attributes A1...An of types T1...Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.
Artist (CHAR) | Album (CHAR) | ReleaseYear (INT) |
---|---|---|
Tsunami Bomb | The Ultimate Escape | 2002 |
Tsunami Bomb | The Definitive Act | 2004 |
Tsunami Bomb | Trust No One | 2016 |
Adding a new tuple t(Tsunami Bomb, The Spine that Binds, 2019)
requires that the new tuple's heading, with attributes and types, matches the existing relvar's heading.
First normal form (1NF) has a few rules:
Answer: YES!
Repeated sets are NOT a violation of 1NF!
Boyce-Codd Normal Form is one of the two most important normal forms. It prevents a variety of data anomalies:
Date | Artist | Album | Genre | Studio | Manager |
---|---|---|---|---|---|
2023-07-10 | Kidneythieves | A1 | Industrial | A | Marie |
2023-07-10 | Metric | A3 | Indie | A | Marie |
2023-07-10 | Land of Talk | A2 | Indie | B | Jacques |
2023-07-11 | Kidneythieves | A1 | Industrial | A | Marie |
Lindsay takes over Studio A. But we miss it in one entry.
Date | Artist | Album | Genre | Studio | Manager |
---|---|---|---|---|---|
2023-07-10 | Kidneythieves | A1 | Industrial | A | Lindsay |
2023-07-10 | Metric | A3 | Indie | A | Lindsay |
2023-07-10 | Land of Talk | A2 | Indie | B | Jacques |
2023-07-11 | Kidneythieves | A1 | Industrial | A | Marie |
Kidneythieves had to cancel so we booked Metric for the entire day. Let's update the database.
Date | Artist | Album | Genre | Studio | Manager |
---|---|---|---|---|---|
2023-07-10 | Metric | A3 | Indie | A | Lindsay |
2023-07-10 | Metric | A3 | Indie | A | Lindsay |
2023-07-10 | Land of Talk | A2 | Indie | B | Jacques |
2023-07-11 | Kidneythieves | A1 | Industrial | A | Lindsay |
Nobody's booked in Studio C. How do we indicate that it exists?
Date | Artist | Album | Genre | Studio | Manager |
---|---|---|---|---|---|
2023-07-10 | Metric | A3 | Indie | A | Lindsay |
2023-07-10 | Metric | A3 | Indie | A | Lindsay |
2023-07-10 | Land of Talk | A2 | Indie | B | Jacques |
2023-07-11 | Kidneythieves | A1 | Industrial | A | Lindsay |
2023-07-10 | NOBODY | NONE | N/A | C | Charles |
We can define BCNF as:
All functional dependencies have superkey determinants.
Superkey: a unique combination of attributes, but not necessarily the most parsimonious unique combination of attributes.
If (RecordingDate, Artist)
is a key, (RecordingDate)
is a subkey and (RecordingDate, Artist, Studio)
is a superkey.
Suppose we have some attribute SoundEngineer.
The value of SoundEngineer depends on inputs Producer and Studio:
Studio | Producer | SoundEngineer |
---|---|---|
A | Janice | Sam |
B | Randi | Edgar |
A | Janice | Sam |
C | Tamika | Sam |
When we see Studio=A
and Producer=Janice
, we know SoundEngineer=Sam
.
In other words, Studio and Producer determine SoundEngineer, or (Studio, Producer)
is the determinant for SoundEngineer.
In mathematical terms, we can write this as (Studio, Producer) -> SoundEngineer
.
Note that functional dependencies only go one way.
(Studio, Producer) -> SoundEngineer
.
Studio=A && Producer=Janice -> SoundEngineer=Sam
.
But Sam can be the sound engineer for other combinations of Studio + Producer. The only thing we guarantee is that (A, Janice) -> Sam
(for all combinations of Studio*Producer and their relevant SoundEngineers).
NOTE: figuring out functional dependencies in practice may be difficult!
There are two primary techniques for figuring them out in practice.
All functional dependencies have superkey determinants.
Relvars in BCNF are guaranteed to suffer from none of the anomalies we just described!
Start with a relvar whose heading has all attributes. Figure out the candidate key(s) and functional dependencies for this relvar.
CampaignID | CName, COName, AgencyID, MaxBudget, MinBid |
AgencyID | AName, TakeRate |
AudienceTargetID | TAgeRng, TGenPref |
CampaignID, AudienceTargetID | AudienceBidModifier |
Next, ask this question: are all determinants candidate keys? The answer is obviously no. Only the final determinant is.
Determinant | Dependency |
---|---|
CampaignID | CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid |
AgencyID | AgencyName, TakeRate |
AudienceTargetID | TargetAgeRange, TargetGenrePreference |
CampaignID, AudienceTargetID | AudienceBidModifier |
Let's introduce Heath's theorem here:
Given a relvar R with components { X, Y, Z }, as well as functional dependency X --> Y, we can break R into two relvars, containing { X, Y } and { X, Z }, without loss.
If the answer to Step 2 was no, take a non-candidate key determinant and break it out into its own relvar, following Heath's theorem.
Repeat steps 2 and 3 as long as there are more functional dependencies which are not candidate keys. Let's take CampaignID next.
Our remaining functional dependencies are:
Determinant | Dependency |
---|---|
AudienceTargetID | TargetAgeRange, TargetGenrePreference |
CampaignID, AudienceTargetID | AudienceBidModifier |
This is not good enough! AudienceTargetID
is a subkey of the candidate key, not a superkey! Therefore, break it out.
Here is our final data model, in BCNF. Also, I took the opportunity to rename the DM Campaign relvar to CampaignTarget, which better represents its meaning.
Fifth Normal Form is the second most-important normal form. What BCNF is to functional dependencies, 5NF is to join dependencies
Relvar R is in 5NF if and only if every join dependency in R is implied in the keys of R.
Join Dependency: if X1, X2, ..., Xn are the subsets of heading H on relvar R, a join dependency holds in R if and only if R can be non-loss decomposed into its projections on X1, X2, ..., Xn.
Suppose we have a relvar called Warehouse
with FDs { WarehouseID -> GeneralManager, City } && { City -> PropertyTaxRate }
.
WarehouseID | GeneralManager | City | PropertyTaxRate |
---|---|---|---|
1 | Steve | Chicago | 4.8% |
2 | Lewis | Lafayette | 2.1% |
3 | Suzanne | Chicago | 4.8% |
4 | Alexa | Lexington | 3.3% |
5 | Lee | Lafyette | 2.1% |
A join dependency says we can break this relvar into at two components and reconstitute it without loss: W = { WarehouseID, GeneralManager, City }; C = { City, PropertyTaxRate }
WarehouseID | GeneralManager | City |
---|---|---|
1 | Steve | Chicago |
2 | Lewis | Lafayette |
3 | Suzanne | Chicago |
4 | Alexa | Lexington |
5 | Lee | Lafyette |
City | Status |
---|---|
Chicago | 4.8% |
Lafayette | 2.1% |
Lexington | 3.3% |
Functional dependencies are join dependencies.
But not all join dependencies are functional dependencies!
A relvar R is in 5th Normal Form if it is in Boyce-Codd Normal Form and the relvar has no composite keys.
A relvar may be in BCNF but not 5NF if there is a composite key and there are non-functional join dependencies which are not key-based.
Artist | Album | SalesCountry |
---|---|---|
Horrorpops | Bring It On! | United States |
Horrorpops | Bring It On! | Denmark |
Band 1 | Unnamed Future Album | Canada |
Band 2 | Unnamed Future Album | Canada |
Band 1 | Unnamed Future Album | United States |
Band 2 | Unnamed Future Album | United States |
This relvar AlbumSales
has a key of { Artist, Album, SalesCountry }
. This means it is in BCNF, but if it feels weird to you, it should!
There are two join dependencies here! { Artist, Album }
and { Album, SalesCountry }
can be cleanly split out from the ternary relationship into two separate relvars.
Artist | Album |
---|---|
Horrorpops | Bring It On! |
Band 1 | Unnamed Future Album |
Band 2 | Unnamed Future Album |
Album | SalesCountry |
---|---|
Bring It On! | United States |
Bring It On! | Denmark |
Unnamed Future Album | Canada |
Unnamed Future Album | United States |
Suppose we have the following rules:
This is also known as a symmetric constraint. If this does not apply, then we fall into the connection trap.
In this setup, if all three attributes are necessary to determine which combinations are valid and which are not, then we are in 5NF.
Store | Genre | Artist |
---|---|---|
Music Everywhere | Punk Rock | Bad Religion |
Music Everywhere | Prog Rock | Rush |
The Music Factory | Indie | Metric |
Store | Genre | Artist |
---|---|---|
M.E. | Punk | Bad Religion |
M.E. | Prog | Rush |
MFac | Indie | Metric |
Now, we see two stores (Music Everywhere & The Music Factory) who sell albums by various artists in different genres. Bad Religion makes Punk Rock albums, Rush makes Prog Rock albums, and Metric makes Indie albums.
If we have a rule in which stores sell all artists in a genre, then we can decompose this entity further.
We then have a JD: *{ {S, G}, {G, A}, {S, A} }
Store | Genre |
---|---|
M.E. | Punk |
M.E. | Prog |
MFac | Indie |
Genre | Artist |
---|---|
Punk | Bad Religion |
Prog | Rush |
Indie | Metric |
Store | Artist |
---|---|
M.E. | Bad Religion |
M.E. | Rush |
MFac | Metric |
As long as our symmetric constraint holds, everything is fine. But what happens if Bad Religion sells a Prog Rock album?
Store | Genre |
---|---|
M.E. | Punk |
M.E. | Prog |
MFac | Indie |
Store | Artist |
---|---|
M.E. | Bad Religion |
M.E. | Rush |
MFac | Indie |
Genre | Artist |
---|---|
Indie | Metric |
Prog | Rush |
Punk | Bad Religion |
Prog | Bad Religion |
Now this means that Music Everywhere must sell Bad Religion's Prog Rock album.
Now let's say that there is no such symmetric constraint. The only results we should see are the following:
Store | Genre | Artist |
---|---|---|
Music Everywhere | Punk Rock | Bad Religion |
Music Everywhere | Prog Rock | Rush |
The Music Factory | Indie | Metric |
Store | Genre |
---|---|
M.E. | Punk |
M.E. | Prog |
MFac | Indie |
Store | Artist |
---|---|
M.E. | Bad Religion |
M.E. | Rush |
MFac | Indie |
Genre | Artist |
---|---|
Indie | Metric |
Prog | Rush |
Punk | Bad Religion |
Prog | Bad Religion |
We've broken this out. But is it correct?
Nope! This says that Music Everywhere will sell Bad Religion's Prog Rock album, but that's not true!
If 5NF seems like overkill, here are a few reasons why it can be important.
Fourth Normal Form is an anachronism--it's really just a subset of 5NF.
Relvar R is in 4NF if and only if every Multi-Valued Dependency of R is implied in the keys of R.
Multi-Valued Dependency: a join dependency with exactly two components.
We've already seen an example of this: relvar AlbumSales
has a key of { Artist, Album, SalesCountry }
.
R
is on Album A
, sold in SalesCountry S
Artist | Album | SalesCountry |
---|---|---|
Horrorpops | Bring It On! | United States |
Horrorpops | Bring It On! | Denmark |
Band 1 | Unnamed Future Album | Canada |
Band 2 | Unnamed Future Album | Canada |
Band 1 | Unnamed Future Album | United States |
Band 2 | Unnamed Future Album | United States |
*{ {Album, Artist}, {Album, SalesCountry} }
*{ {Album, Artist}, {Album, SalesCountry} }
The "4NF" diagram for this looks like:
Album ->-> Artist && Album ->-> SalesCountry
, or Album ->-> Artist|SalesCountry
Relvar R can be nonloss decomposed into its componentsXY
andXZ
if and only ifX->->Y|Z
holds in R.
In other words, if there are multiple, independent attributes, split them out into separate relvars.
Relvar R is in Sixth Normal Form (6NF) if and only if all join dependencies which hold in R are trivial.
In other words, there may be one non-key attribute in addition to the key.
Suppose we have a relvar { AlbumID, AlbumName, Runtime, ReleaseYear, ArtistID, ArtistID2, GenreID, GenreID2, GenreID3 }
. Furthermore, our natural key is AlbumID
. This relvar is in 5NF (BCNF + non-composite key) and can be made 6NF in the following way:
{ AlbumID, AlbumName }
{ AlbumID, Runtime }
{ AlbumID, ReleaseYear }
...
{ AlbumID, GenreID2 }
{ AlbumID, GenreID3 }
Maybe!
Suppose we might not always have runtime and we want to unpivot artists and genres (assuming there is no 'order' to artists or genres). Then we'd have:
{ AlbumID, AlbumName, ReleaseYear }
{ AlbumID, Runtime }
{ AlbumID, ArtistID }
{ AlbumID, GenreID }
6NF also helps us work with intervals of data. Usually this is temporal data, like this example of a Tour
relation:
HeadlinerArtist | City | Interval |
---|---|---|
Bad Religion | London | D4:D8 |
Horrorpops | Paris | D3:D3 |
Horrorpops | Athens | D5:D10 |
Suppose we have an SCD2 style table:
HeadlinerArtist | WarmupArtist | City | StartDate | EndDate |
---|---|---|---|---|
Bad Religion | Rad Beligion | London | D1 | D8 |
Bad Religion | The Deviates | London | D9 | D14 |
Bad Religion | The Deviates | Paris | D15 | D17 |
Horrorpops | The Chop Tops | Athens | D1 | D17 |
Horrorpops | Nekromantix | Athens | D18 | D18 |
Break it into 2 non-redundant relvars:
Headliner | City | Interval |
---|---|---|
BadReligion | London | D1:D14 |
BadReligion | Paris | D15:D17 |
Horrorpops | Athens | D1:D18 |
Headliner | Warmup | Interval |
---|---|---|
B.R. | Rad Beligion | D1:D8 |
B.R. | The Deviates | D9:D17 |
H. | The Chop Tops | D1:D17 |
H. | Nekromantix | D18:D18 |
To bring these back together, unpack on Interval
, join together elements, and re-pack on Interval
.
This gets us back to our type-2 SCD, but without having to store denormalized data!
Relational theorists like Date tend not to like the Kimball model very much. As a general-purpose relational model, the star schema is terrible.
It is, however, a great model for data warehousing if you follow certain rules:
Overnormalization isn't really a thing. What people tend to see as overnormalization tend to be one of the following:
Normalization has very specific rules: you can tell where you are and what you'd need to do to get to the next level. With denormalization, how can we answer the following questions?
General philosophy:
Normalization is not the end-all of relational design.
This has been a look at normalization from an academic perspective. This can be a complicated topic and I've tried to straddle the line between formal correctness and ease of explanation.
To learn more, go here:
https://CSmore.info/on/dbdesign
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact