Have you ever believed any of the following?
My goals in this talk:
Normalization is a part of relational theory, meaning that there will be math and there will be discussions of formal logic.
I will do my best to minimize the learning curve, but we're ultimately solving math problems using logical analysis.
Also, some of these normal forms are not well-understood outside of academic circles and so your interpretation of the forms may not match what I'm going to describe.
The source of most of today's talk is C.J. Date's Database Design and Relational Theory, 2nd Edition (Apress, 2019).
Normalization is a set of logical rules and processes to follow.
Normalization works off of a logical data model. From there, we can (theoretically) adapt to a physical data model for implementation.
We're used to thinking in terms of tables, rows, and columns. Let's take a step back and talk about the data modeling terms.
From a talk I used to deliver:
First normal form (1NF) is all about key rules and the shape of tuples.
Given relvar R with heading H containing attributes A1...An of types T1...Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.
UserName (CHAR) | Password (BINARY) | IsActive (BIT) |
---|---|---|
Bob | 0xABC123 | 1 |
Jane | 0xCDA331 | 1 |
Marcus | 0xFFF048 | 0 |
Adding a new tuple t(Jim, 0xBBA492, 1)
requires that the new tuple's heading, with attributes and types, matches the existing relvar's heading.
First normal form (1NF) has a few rules:
Quite commonly, depictions of 1NF will discuss the concept of "repeating groups."
An example of a relvar with a repeating group is:
R = (Name, Phone1, Phone2, Phone3)
Is this in 1NF?
From a talk I used to deliver:
R = (Name, Phone1, Phone2, Phone3)
The answer is YES! This relvar is in 1NF!
"Repeating groups" is a term Date came up with and has subsequently recanted. The term itself was confusing and the above interpretation is something he did not intend. His latest advice is to ignore any reference to repeating groups.
SQL tables can be in 1NF.
Also, SQL tables are trivially not in 1NF because they can't otherwise be.
All tuples must have values for all attributes. NULL violates 1NF!
No duplicate tuples are allowed
Attribute order does not matter for tuples
All attributes are regular: they have a name, a type, are not hidden, etc.
Boyce-Codd Normal Form is one of the two most important normal forms. It prevents a variety of data anomalies:
Specialist | Project | Task | Login |
---|---|---|---|
Julia | ETL | Write scripts | JuliaH |
Lindsay | Analysis | Build spark cluster | LindsayM |
Lindsay | ETL | Write scripts | LindsayM |
Lindsay gets married and changes her last name. But we miss it in one project.
Specialist | Project | Task | Login |
---|---|---|---|
Julia | ETL | Write scripts | JuliaH |
Lindsay | Analysis | Build spark cluster | LindsayJ |
Lindsay | ETL | Write scripts | LindsayM |
Ann comes off the bench to do all the ETL work. Let's update the database.
Specialist | Project | Task | Login |
---|---|---|---|
Ann | ETL | Write scripts | AnnB |
Lindsay | Analysis | Build spark cluster | LindsayJ |
Ann | ETL | Write scripts | AnnB |
Where was Ann? How do we make her visible when she's on the bench?
Specialist | Project | Task | Login |
---|---|---|---|
Julia | ETL | Write scripts | JuliaH |
Lindsay | Analysis | Build spark cluster | LindsayJ |
Lindsay | ETL | Write scripts | LindsayJ |
Ann | NONE | Placeholder | AnnB |
We can define BCNF as:
All functional dependencies have superkey determinants.
Superkey: a unique combination of attributes, but not necessarily the most parsimonious unique combination of attributes. If (A1, A2)
is a key, (A1)
is a subkey and (A1, A2, A3)
is a superkey.
Suppose we have some attribute Z.
The value of Z depends on inputs X and Y:
X | Y | Z |
---|---|---|
1 | 4 | 7 |
3 | 5 | 2 |
1 | 4 | 7 |
6 | 8 | 7 |
When we see X=1
and Y=4
, we know Z=7
.
In other words, X and Y determine Z, or (X, Y)
is the determinant for Z.
In mathematical terms, we can write this as (X, Y) -> Z
.
All functional dependencies have superkey determinants.
Relvars in BCNF are guaranteed to suffer from none of the anomalies we just described!
"Deductive" techniques:
"Inductive" techniques:
Start with a relvar whose heading has all attributes. Figure out the candidate key(s) and functional dependencies for this relvar.
CampaignID | CName, COName, AgencyID, MaxBudget, MinBid |
AgencyID | AName, TakeRate |
AudienceTargetID | TAgeRng, TGenPref |
CampaignID, AudienceTargetID | AudienceBidModifier |
Next, ask this question: are all determinants candidate keys? The answer is obviously no. Only the final determinant is.
Determinant | Dependency |
---|---|
CampaignID | CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid |
AgencyID | AgencyName, TakeRate |
AudienceTargetID | TargetAgeRange, TargetGenrePreference |
CampaignID, AudienceTargetID | AudienceBidModifier |
Let's introduce Heath's theorem here:
Given a relvar R with components { X, Y, Z }, as well as functional dependency X --> Y, we can break R into two relvars, containing { X, Y } and { X, Z }, without loss.
If the answer to Step 2 was no, take a non-candidate key determinant and break it out into its own relvar, following Heath's theorem.
Repeat steps 2 and 3 as long as there are more functional dependencies which are not candidate keys. Let's take CampaignID next.
Our remaining functional dependencies are:
Determinant | Dependency |
---|---|
AudienceTargetID | TargetAgeRange, TargetGenrePreference |
CampaignID, AudienceTargetID | AudienceBidModifier |
This is not good enough! AudienceTargetID
is a subkey of the candidate key, not a superkey! Therefore, break it out.
Here is our final data model, in BCNF. Also, I took the opportunity to rename the DM Campaign relvar to CampaignTarget, which better represents its meaning.
Boyce-Codd Normal Form is great for most scenarios, but it does fall apart in one particular case: when there are cycles in functional dependencies.
Suppose we have a relvar with heading { S, J, T }
. These attributes represent Students, Subjects, and Teachers, respectively.
Given a functional dependency T -> J
, we can break this out into { S, T }
and { T, J }
Relvar with heading { S, J, T }
Functional dependency T -> J
Now if we include a functional dependency S, J -> T
, it all falls apart.
{ S, T } + { T, J }
doesn't work--T also depends on { S, J }
!
But T
is not a superkey of { S, J, T }
so all keeping all three won't work either.
Fifth Normal Form is the second most-important normal form. What BCNF is to functional dependencies, 5NF is to join dependencies
Relvar R is in 5NF if and only if every join dependency in R is implied in the keys of R.
Join Dependency: if X1, X2, ..., Xn are the subsets of heading H on relvar R, a join dependency holds in R if and only if R can be non-loss decomposed into its projections on X1, X2, ..., Xn.
Suppose we have a relvar called Suppliers
with FDs { SupplierID -> SupplierName, City } && { City -> Status }
.
SupplierID | SupplierName | City | Status |
---|---|---|---|
1 | S1 | Chicago | 10 |
2 | S2 | Lafayette | 20 |
3 | S3 | Chicago | 10 |
4 | S4 | Lexington | 10 |
5 | S5 | Lafyette | 20 |
A join dependency says we can break this relvar into at two components and reconstitute it without loss: S = { SupplierID, SupplierName, City }; C = { City, Status }
SupplierID | SupplierName | City |
---|---|---|
1 | S1 | Chicago |
2 | S2 | Lafayette |
3 | S3 | Chicago |
4 | S4 | Lexington |
5 | S5 | Lafyette |
City | Status |
---|---|
Chicago | 10 |
Lafayette | 20 |
Lexington | 10 |
Functional dependencies are join dependencies.
But not all join dependencies are functional dependencies!
A relvar R is in 5th Normal Form if it is in Boyce-Codd Normal Form and the relvar has no composite keys.
A relvar may be in BCNF but not 5NF if there is a composite key and there are non-functional join dependencies which are not key-based.
Suppose we have the following rules:
This is also known as a symmetric constraint. If this does not apply, then we fall into the connection trap.
In this setup, if all three attributes are necessary to determine which combinations are valid and which are not, then we are in 5NF.
Agent | Company | Product |
---|---|---|
Smith | Ford | Car |
Smith | GM | Truck |
Agent | Company | Product |
---|---|---|
Smith | Ford | Car |
Smith | Ford | Truck |
Smith | GM | Car |
Jones | Ford | Car |
Now, we see two agents (Smith and Jones) who sell various products for Ford and GM. Ford manufactures cars and trucks, whereas GM only manufactures cars.
If we have a rule in which company salesmen sell all company products, then we can decompose this entity further.
We then have a JD: *{ {A, C}, {C, P}, {A, P} }
Agent | Company |
---|---|
Smith | Ford |
Smith | GM |
Jones | Ford |
Agent | Product |
---|---|
Smith | Car |
Smith | Truck |
Jones | Car |
Company | Product |
---|---|
Ford | Car |
Ford | Truck |
GM | Car |
As long as our symmetric constraint holds, everything is fine. But what happens if GM starts to sell trucks?
Agent | Company |
---|---|
Smith | Ford |
Smith | GM |
Jones | Ford |
Agent | Product |
---|---|
Smith | Car |
Smith | Truck |
Jones | Car |
Company | Product |
---|---|
Ford | Car |
Ford | Truck |
GM | Car |
GM | Truck |
Now this means that Smith must begin to sell GM trucks.
Now let's say that there is no such symmetric constraint. We start with similar data, though note that Jones sells GM trucks instead of cars.
Agent | Company | Product |
---|---|---|
Smith | Ford | Car |
Smith | Ford | Truck |
Smith | GM | Car |
Jones | GM | Truck |
Agent | Company |
---|---|
Smith | Ford |
Smith | GM |
Jones | GM |
Agent | Product |
---|---|
Smith | Car |
Smith | Truck |
Jones | Truck |
Company | Product |
---|---|
Ford | Car |
Ford | Truck |
GM | Car |
GM | Truck |
We've broken this out. But is it correct?
Nope! This says that Smith sells GM trucks, but that's not true!
If 5NF seems like overkill, here are a few reasons why it can be important.
Fourth Normal Form is an anachronism--it's really just a subset of 5NF.
Relvar R is in 4NF if and only if every Multi-Valued Dependency of R is implied in the keys of R.
Multi-Valued Dependency: a join dependency with exactly two components.
Suppose the following:
CNO
can be taught by teacher TNO
using textbook XNOCNO | TNO | XNO |
---|---|---|
C1 | T1 | X1 |
C1 | T1 | X2 |
C1 | T2 | X1 |
C1 | T2 | X2 |
*{ {CNO, TNO}, {CNO, XNO} }
CNO ->-> TNO && CNO ->-> XNO
, or CNO ->-> TNO|XNO
Relvar R can be nonloss decomposed into its componentsXY
andXZ
if and only ifX->->Y|Z
holds in R.
In other words, if there are multiple, independent attributes, split them out into separate relvars.
Suppose Mr. Smith has two skills: he can cook and he can type.
Furthermore, he is fluent in German, French, and Greek.
How do we represent this information?
Employee | Skill | Language |
---|---|---|
Smith | Cook | |
Smith | Type | |
Smith | French | |
Smith | German | |
Smith | Greek |
Problems:
Employee | Skill | Language |
---|---|---|
Smith | Cook | French |
Smith | Type | German |
Smith | Greek |
Problems:
Employee | Skill | Language |
---|---|---|
Smith | Cook | French |
Smith | Type | German |
Smith | Type | Greek |
Problems:
Employee | Skill | Language |
---|---|---|
Smith | Cook | French |
Smith | Type | |
Smith | German | |
Smith | Type | Greek |
Problems:
Employee | Skill |
---|---|
Smith | Cook |
Smith | Type |
Employee | Language |
---|---|
Smith | French |
Smith | German |
Smith | Greek |
{ E, S, L }
was good enough for BCNF ({ E, S, L }
was the candidate key) but not good enough for us.*{ {E, S}, {E, L} }
needed to be broken out.Relvar R is in Sixth Normal Form (6NF) if and only if all join dependencies which hold in R are trivial.
In other words, there may be one non-key attribute in addition to the key.
Suppose we have a relvar { ProductKey, ProductName, ProductColor, ProductWeight, StorageCity }
. Furthermore, our natural key is ProductKey
. This relvar is in 5NF (BCNF + non-composite key) and can be made 6NF in the following way:
{ ProductKey, ProductName }
{ ProductKey, ProductColor }
{ ProductKey, ProductWeight }
{ ProductKey, StorageCity }
Note that each of these new relvars has a simple predicate (natural-language depiction of a relvar's utility) with no "AND":
Relvar | Predicate |
---|---|
{PK,ProductName} |
Product PK has name ProductName . |
{PK,ProductColor} |
Product PK has color ProductColor . |
{PK,ProductWeight} |
Product PK has weight ProductWeight . |
{PK,StorageCity} |
Product PK is stored in city StorageCity . |
Maybe!
Suppose we might not always have product weight. Then we'd have:
{ ProductKey, ProductName, ProductColor, StorageCity }
{ ProductKey, ProductWeight }
6NF works best with intervals of data. Usually this is temporal data. Example:
SupplierKey | City | Interval |
---|---|---|
Supplier1 | London | D4:D8 |
Supplier2 | Paris | D3:D3 |
Supplier2 | Athens | D5:D10 |
Suppose we have an SCD2 style table:
SupplierKey | City | Assistants | Interval |
---|---|---|---|
Supplier1 | London | 4 | D1:D8 |
Supplier1 | London | 5 | D9:D14 |
Supplier1 | Paris | 5 | D15:D17 |
Supplier2 | Athens | 3 | D1:D17 |
Supplier2 | Athens | 4 | D18:D18 |
Break it into 2 non-redundant relvars:
SupplierKey | City | Interval |
---|---|---|
Supplier1 | London | D1:D14 |
Supplier1 | Paris | D15:D17 |
Supplier2 | Athens | D1:D18 |
SupplierKey | Assistants | Interval |
---|---|---|
Supplier1 | 4 | D1:D8 |
Supplier1 | 5 | D9:D17 |
Supplier2 | 3 | D1:D17 |
Supplier2 | 4 | D18:D18 |
To bring these back together, unpack on Interval
, join together elements, and re-pack on Interval
. This is a gaps & islands problem!
Relational theorists like Date tend not to like the Kimball model very much. As a general-purpose relational model, the star schema is terrible.
It is, however, a great model for data warehousing if you follow certain rules:
Overnormalization isn't really a thing. What people tend to see as overnormalization tend to be one of the following:
Normalization has very specific rules: you can tell where you are and what you'd need to do to get to the next level. With denormalization, how can we answer the following questions?
General philosophy:
Normalization is not the end-all of relational design.
This has been a look at normalization from an academic perspective. This can be a complicated topic and I've tried to straddle the line between formal correctness and ease of explanation.
To learn more, go here:
https://CSmore.info/on/dbdesign
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact