CS145 Lecture Notes (6) -- Relational Database Design



For all but the very simplest database applications, there are many, many different relational database "designs" (schemas) that can be used to store the relevant data. Relational database design is the topic of identifying and selecting a good relational schema for an application.

Motivation

Student application information: (Assume students played the same sports at every high school attended.)

Proposed relation to capture this information:

   Apply(ID, name, campus, sport, HScode, HScity)

Consider "Mary" (ID 123) who is applying to Berkeley and Santa Cruz. She played tennis, basketball, and volleyball at both of her high schools, Gunn (code 26) and Menlo-Atherton (code 28).

Question: How many tuples in the relation?











Design "Anomalies"

This Apply relation exhibits three types of anomalies:
  1. Redundancy
    
    
    
  2. Update Anomaly
    
    
    
  3. Deletion Anomaly
    
    
    
    
Question: What's the "best design" for this information?






Good designs:
  1. No anomalies
  2. Can reconstruct all original information
Question: What's the best design if different sports can be played at different high schools?






Design by Decomposition

Overview of Automatic Decomposition

  1. Designer specifies mega-relations plus properties of the data
  2. System decomposes based on properties
  3. Final set of relations satisfies normal form -- guarantees no anomalies, no lost information

Properties and Normal Forms

FDs and BCNF: Example

Apply(ID,name,campus)
Students can apply to multiple campuses

Question: Is this a good design or should we decompose?



Can have many duplicates of each (ID,name) Functional dependency "A -> B" says a given A value always has the same B value.

Boyce-Codd Normal Form says if A,B are both in relation R, then A must be a key => A-B connection stored only once.


BCNF decomposition says make a separate relation (A,B) and take B out of the original.



MVDs and 4NF: Example

Consider just Apply(ID,campus,sport)

Question: Is this a good design or should we decompose?



Multiplicative effect: C campuses and S sports yields CxS tuples

Question: captured by FDs and BCNF?



Multivalued dependency "A ->> B" says a given A value always has every combination of B and C values (C is remaining attributes).


Fourth Normal Form says if A,B are both in relation R, then A must be a key => don't get multiplicative effect of B and C.


4NF decomposition says make a separate relation (A,B) and take B out of the original.







Details and Theory

Now we will formalize the intuition above.
  1. Functional dependencies
  2. Boyce-Codd Normal Form
  3. BCNF decomposition algorithm
  4. Multivalued dependencies
  5. Fourth Normal Form
  6. 4NF decomposition algorithm
Example schema:
   Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)
   Apply(ID, campus, date, major)

Functional Dependencies

Suppose Student.priority is determined by Student.cumGPA.

(show example)








Formally:
   For every pair of tuples t and u in Student:
   if t[cumGPA] = u[cumGPA] then t[priority] = u[priority]
This is a "functional dependency" (FD): a constraint specified with the schema of a relation. Notation for declaring an FD for a relation R:
   A1, A2, ..., Am -> B1, B2, ..., Bn   (commas may be omitted)
states that for every pair of tuples t and u in R: if t[A1,...,Am] = u[A1,...,Am] then t[B1,..,Bn] = u[B1,..,Bn].

We will abbreviate A1, A2, ..., Am as AA (or "A-bar") and B1, B2, ..., Bn as BB (or "B-bar").

(simple abstract example)









Question: What are some functional dependencies for Student besides cumGPA -> priority?









Question: What are some functional dependencies for Apply?









Functional Dependencies and Keys

  • Assume R has no duplicates.
  • Then AA is a key for R if and only if AA -> BB, where BB is all attributes of R

    (show abstract example)

    
    
    
    
    
    
    
    
    
    
    Note:

    Trivial FD

    AA -> BB where BB is a subset of AA

    (example)

    
    
    
    
    
    
    

    Nontrivial FD

    AA -> BB where BB is not a subset of AA

    (example)

    
    
    
    
    
    
    

    Completely nontrivial FD

    AA -> BB with no overlap between AA and BB

    (example)

    
    
    
    
    
    
    
    
    => Most of the time we're interested in completely nontrivial FD's.


    Rules for Functional Dependencies

    Splitting Rule

    If AA -> B1, B2, ..., Bn then AA -> B1, AA -> B2, ..., AA -> Bn

    Question: Can we also split the left-hand side?

    
    
    
    
    
    
    
    
    
    
    

    Combining Rule

    If AA -> B1, AA -> B2, ..., AA -> Bn then AA -> B1, B2, ..., Bn

    Trivial Dependency Rules

    If AA -> BB then AA -> (BB - AA)
    If AA -> BB then AA -> (BB U AA)

    Transitive Rule

    If AA -> BB and BB -> CC then AA -> CC


    Closure of Attributes

    Given a relation R, a set of FD's for R, and a set of attributes {A1, A2, ..., Am} of R:
       Find all attributes B in R such that A1, A2, ..., Am -> B
    
    This set of attributes is the "closure" and is denoted {A1, A2, ..., Am}+

    Algorithm for computing closure:

       start with {A1, A2, ..., Am}
       repeat until no change:
          if current set of attributes includes LHS of a dependency,
          add RHS attributes to the set
    
    (Effectively applies combining and transitive rules until there's no more change.)

    Example: closure of {ID, HScode} in Student given FD's:

       ID -> name, address, cumGPA
       cumGPA -> priority
       HScode -> HSname, HScity
    
    
    
    
    Question: How can we exploit closure to test whether a set of attributes is a key?
    
    
    
    
    
    
    
    Related question: How can we find all keys given a set of FD's?
    
    
    
    
    
    
    
    

    Specifying FD's for a Relation

    A set S2 of FD's "follows" from another set S1 of FD's if all relation instances that satisfy S1 also satisfy S2.

    Example: apply rules above

    
    
    
    
    When specifying FD's for a relation, we would like to specify a minimal set of completely nontrivial FD's such that all FD's that hold on the relation follow from the dependencies in this set. Sounds hard, but in practice this approach comes naturally.

    Question: How can we tell if one FD follows from others?

    Example: Does AA -> BB follow from a set S of FD's?

    
    
    
    
    

    Decomposition and Boyce-Codd Normal Form

    Reminder:
    1. Designer specifies mega-relations plus properties of the data: functional dependencies
    2. System decomposes based on FDs
    3. Final set of relations satisfies normal form (BCNF) -- guarantees no anomalies, no lost information

    Decomposition

    Formally: decompose R(C1, ..., Ck) into R1(A1, ..., Am) and R2(B1, ..., Bn), where {C1, ..., Ck} = {A1, ..., Am} U {B1, ..., Bn}

    (diagram)

    
    
    
    
    
    
    
    Idea of decomposition:
    1. R1 = PROJECT[A1, ..., Am] (R), eliminating duplicates
    2. R2 = PROJECT[B1, ..., Bn] (R), eliminating duplicates
    3. R1 NATURAL-JOIN R2 = R
    Note: R is never actually created, it's just a step in the design process.

    Example decomposition:

       Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)
    
    decomposed to
       Student1(ID, name, address, HScode, cumGPA, priority)
       Student2(HScode, HSname, HScity)
    
    (check criteria for decomposition)
    
    
    
    
    

    Another example decomposition:

       Student1(ID, name, address, HScode, HSname, HScity)
       Student2(name, HSname, cumGPA, priority)
    
    (check criteria for decomposition)
    
    
    
    
    
    

    Boyce-Codd Normal Form (BCNF)

    Defines which decompositions are "good" ones, based on specified functional dependencies

    Given: relation R and set of FD's for R

    Definition: R is in BCNF with respect to its FD's if for every nontrivial FD AA -> BB, AA contains a key.

    Question: Why does violating this requirement produce a "bad" relation?

    
    
    
    
    
    
    
    Example:
       Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)
    
       FD's: ID -> name, address, cumGPA
             cumGPA -> priority
             HScode -> HSname, HScity
    
       Key:  ID, HScode
    

    Question: Is the relation in BCNF?

    
    
    Each violation produces anomalies.


    Example:
       Apply(ID, campus, date, major)
    
    Can apply to campus multiple times for different majors, but can only apply to a campus once per day
       FD's: ID, campus, date -> major
       Key:  ID, campus, date
    

    Question: Is the relation in BCNF?

    
    
    

    BCNF Decomposition Algorithm

       Input: relation schema R and set of FD's for R
    
       (1) compute keys for R based on FD's
       (2) repeat until no more BCNF violations:
             (2a) pick any R' with AA -> BB that violates BCNF
             (2b) decompose R' into R1(AA,BB) and R2(AA,CC)
                  where CC is all attributes in R' except (AA U BB)
             (2c) compute FD's for R1 and R2
             (2d) compute keys for R1 and R2 based on FD's
    
    (diagram)
    
    
    
    
    
    

    Question: How can we compute keys in steps (1) and (2d)?

    
    
    
    

    Question: How do we compute FD's in step (2c)?

    
    
    
    
    
    (run algorithm on Student example)
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    => Final decomposed relations may be different depending on which violating FD is chosen in each iteration (step 2(a)), but all decompositions will be in BCNF.

    Does BCNF guarantee a good decomposition?

    1. Removes anomalies? Yes

    2. Reconstructs original relation?

      • Can we get too many tuples?

        Consider R(A,B,C) with tuples (1,2,3) and (4,2,5)

        Decompose into R1(A,B) with tuples (1,2) and (4,2) R2(B,C) with tuples (2,3) and (2,5)

        "Reassembled" relation has four tuples => wrong

        But BCNF would not decompose this way unless B -> C or B -> A

      • Can we get too few tuples? No
    With BCNF get:

    Multivalued Dependencies

    Example:
       Apply(ID, campus, major)  // removed date from earlier example
    
    Suppose students can apply to several campuses and different majors, but must apply to the same set of majors at every campus.

    Question: What are FD's?

    
    
    

    Question: What is key?

    
    
    

    Question: Is it in BCNF?

    
    
    
    

    Question: Is it a "good" design?

    
    
    
    

    MVD Definition:

       AA ->> BB is an MVD for relation R if:
       For all tuples t,u in R:
          If t[AA] = u[AA] then there exists a v in R such that:
             (1) v[AA] = t[AA]
             (2) v[BB] = t[BB]
             (3) v[CC] = u[CC] where CC is all attributes in R except (AA U BB)
    
    ' (show with picture; show implied fourth tuple)
    
    
    
    
    
    
    
    
    
    
    
    
    MVD's are also called "tuple-generating dependencies."

    Back to example: What are MVD's?

    
    
    
    

    (show example data to verify)

    
    
    
    
    
    
    
    
    Intuition: MVD's uncover situations where independent facts related to a certain object are being squished together in one relation.

    Trivial MVD

    AA ->> BB where BB is a subset of AA or (AA U BB) contains all attributes of R
    
    
    
    
    
    
    
    

    Nontrivial MVD

    AA ->> BB where BB is not a subset of AA and (AA U BB) does not contain all attributes of R


    Rules for Multivalued Dependencies

    FD-is-an-MVD Rule

    If AA -> BB then AA ->> BB

    Prove by showing (1), (2), (3) in MVD definition.

    
    
    
    
    
    
    
    

    Transitive Rule

    If AA ->> BB and BB ->> CC then AA ->> CC

    Complementation Rule

    If AA ->> BB then AA ->> CC where CC is all attributes in R except (AA U BB)

    Question: Are there any rules for FD's that do not apply for MVD's?

    
    
    
    
    
    
    

    Fourth Normal Form (4NF)

    Defines which decompositions are "good" ones, based on specified functional dependencies and multivalued dependencies

    Given: relation R and set of MVD's for R

    Definition: R is in 4NF with respect to its MVD's if for every nontrivial MVD AA ->> BB, AA contains a key.

    Note: Since every FD is also an MVD, 4NF implies BCNF

    Question: What happens in the MVD definition if AA contains a key?

    
    
    
    

    4NF Decomposition Algorithm

    Very similar to BCNF:
       Input: relation schema R and set of FD's and MVD's for R
    
       (1) compute keys for R based on FD's
       (2) repeat until no more 4NF violations:
             (2a) pick any R' with AA ->> BB that violates 4NF
             (2b) decompose R' into R1(AA,BB) and R2(AA,CC)
                  where CC is all attributes in R' except (AA U BB)
             (2c) compute FD's and MVD's for R1 and R2
             (2d) compute keys for R1 and R2 based on FD's
    
    

    Question: How do we compute MVD's in step (2c)?

    
    
    
    

    (run algorithm on Apply example)

    
    
    
    
    
    
    
    
    
    
    

    Shortcomings of BCNF/4NF

    Example 1

    Apply(ID, campus, date, major)
    

  • A student can apply to each campus only once
  • Campuses have specific non-overlapping application dates

    Question: What are FDs and keys?

    
    
    
    
    
    
    Question: What is BCNF decomposition?
    
    
    
    
    
    
    Question: Is this decomposition "good"?
    
    
    
    
    
    
    
    Third Normal Form (3NF) is weaker than BCNF (and therefore weaker than 4NF). It permits Apply as above to remain undecomposed.

    => Material on 3NF (Section 3.6.6) is required reading for the course, not covered in lecture.

    Example 2

    Student(ID, HScode, cumGPA, priority)
    
      ID -> cumGPA
      cumGPA -> priority
      ID -> priority
    
    Note: ID is not a key.

    Apply BCNF decomposition splitting first on ID -> cumGPA

    
    
    
    
    
    
    
    
    
    
    Question: Is the resulting decomposition "good"?
    
    
    
    
    Heuristic: "close" each FD before beginning decomposition
    
    
    
    
    
    
    
    
    
    
    
    => Overall, BCNF/4NF decomposition does not guarantee that all of the original FDs can be enforced on the individual decomposed relations.