0:00 In this sequence of videos, we'll 0:02 learn about designing good schemas for relational databases. 0:06 So let's suppose we're building a 0:07 database for an application or 0:08 set of applications and we 0:09 have to figure out what schema we want to store our data. 0:13 Usually there are many different 0:14 possible schema designs for a 0:16 database, and databases do tend to get quite complicated. 0:20 And some designs are much better than others. 0:22 So how do we choose what design to use? 0:25 Now the reality is that people 0:26 often use higher level tools 0:28 to design relational databases and 0:30 don't design the schemas directly themselves. 0:32 But some designers do go 0:34 straight to relations, and furthermore, 0:36 it's useful to understand why 0:38 the relations that are produced 0:39 by design tools are what 0:41 they are. 0:43 Furthermore, from an academic point of view, 0:44 it turns out there's a very 0:45 nice theory for relational data base design. 0:48 So let's consider the process of 0:49 designing the schema for our 0:51 database about students applying to colleges. 0:54 Specifically, for a given 0:55 student, let's suppose we have 0:56 their social security number and 0:58 their name, the colleges that 1:00 student is applying to, the 1:01 high schools they attended and 1:02 what city those high schools were in, and the student's hobbies. 1:06 So if that's what we want we 1:07 can create a single relation 1:08 called apply, that has one 1:10 attribute for each of those pieces of information. 1:13 Now let's take a look at how that database would be populated. 1:16 Let's suppose that we have 1:17 a student, Anne, with Social Security 1:18 number 123, she went to 2 1:20 different high schools in Palo 1:22 Alto, she plays tennis and 1:23 the trumpet, and she's applying 1:25 to Stanford, Berkeley, and MIT. 1:27 So let's look at some of 1:28 the tuples that we would be 1:30 having in the apply relation 1:31 to represent this information about Anne. 1:33 So we'll have 1,2,3, Anne, 1:36 her name, she's applying to 1:37 Stanford, she went to 1:39 Palo Alto High School, and 1:42 that's in Palo Alto, and one of her hobbies is tennis. 1:47 And then we also have 1 1:48 2 3 and she applied to 1:51 Berkeley and went to 1:53 Palo Alto High School in 1:54 Palo Alto and tennis there as well. 1:58 Of course she also has 2:00 a tuple representing the fact 2:02 that she's applying to Berkeley and 2:04 and we'll stick with Palo 2:05 Alto High School, and she played the trumpet. 2:08 And as you can see we'll 2:09 have more tuples, we'll have 2:11 various Stanford and Berkeleys, we'll 2:12 have some for her other high 2:14 schools called Gunn High School 2:15 also in Palo Alto, and so on. 2:18 So if we think about 2:19 it we will need a total 2:21 of 12 tuples to represent 2:23 this information about Ann. 2:26 Now do we think that's a good design? 2:29 I'm going to argue no, it's not a good design. 2:31 There are several types of anomalies in that design. 2:35 First of all, we capture information 2:37 multiple times in that 2:40 design, and I'll give some examples of that. 2:42 For example how many times 2:43 do we capture the fact that 2:45 1 2 3 the Social Security 2:47 number is associated with a student named Ann? 2:49 We capture that twelve times in our twelve tuples. 2:53 How many times do we 2:54 capture that Anne went to Palo Alto High School? 2:56 We're going to capture that six times. 2:58 And we're going to capture 2:59 the fact that she plays tennis six times. 3:02 And we're going to capture the fact 3:03 that she went to apply to 3:04 MIT four times, so for 3:07 each piece of information, in fact, 3:08 we're capturing it many, many times. 3:10 So that doesn't seem like a good feature of the design. 3:13 The second type is an 3:14 update anomaly, and that's really a direct effect of redundancy. 3:18 What update anomalies say 3:20 is that you can update facts 3:21 in some places but not 3:23 all all or differently in different places. 3:26 So let's take the fact for 3:27 example that Ann plays the trumpet. 3:30 I might decide to call 3:31 that the coronet instead but I 3:33 can go ahead and I 3:35 can modify, say, three of 3:37 the incidences where we captured 3:38 the fact about her playing the 3:39 trumpet and not the fourth 3:41 one and then we end up 3:42 with what's effectively an inconsistent database. 3:45 And the third type of 3:47 anomaly is called a deletion anomaly, and 3:49 there's a case where we could inadvertently 3:51 completely do a complete 3:52 deletion of somebody in the database. 3:54 Let's say for example that we 3:56 decide that surfing is an 3:58 unacceptable hobby for our 4:00 college applicants, and we go 4:01 ahead and we delete the tuples about surfing. 4:04 If we have students who have 4:06 surfing as their only hobby, then those students will be deleted completely. 4:10 Now you may argue that's the right 4:11 thing to do, but probably that isn't what was intended. 4:15 So now let's take a look at a very different design for the same data. 4:18 Here we have five different 4:20 relations, one with the 4:21 information about students and their 4:22 names, one where they've applied 4:24 to colleges, one where they 4:25 went to high school, where their 4:27 high schools are located and what hobbies the students has. 4:30 In this case we have no anomalies. 4:32 If we go back and look at 4:33 the three different types, they don't occur in this design. 4:36 We don't have redundant information, we 4:39 don't have the update anomaly or the deletion anomaly. 4:42 Furthermore, we can reconstruct all 4:45 of the original data from our 4:47 first design, so we haven't 4:49 lost any information by breaking it up this way. 4:52 So in fact this looks like a much better design. 4:54 Now let me mention a couple 4:56 of modifications to this design that might occur. 4:58 Let's suppose, for example, that 5:00 the high school name alone is not a key. 5:03 So when we break up the 5:04 high school name and high school 5:05 city, we no longer can identify the high school. 5:08 In that case, the preferred 5:09 design would be to move 5:11 the high school up here so 5:13 we'll have that together with 5:15 the high school name and then we don't need this relation here. 5:17 And actually that's a fine design. 5:19 It does not introduce any anomalies, 5:21 that's just based on the 5:22 fact that we need the name 5:23 of the high school together with the city to identify it. 5:26 As another example, suppose a 5:28 student doesn't want all of 5:30 their hobbies revealed to all 5:31 of the colleges that they are applying to. 5:33 For example, maybe they don't want Stanford to know about their surfing. 5:37 If that's the case then we 5:38 can modify the design again, 5:40 and in that case we would 5:41 put the hobby up here with where they're applying to college. 5:44 And so that would include the hobbies 5:46 that they want to reveal 5:47 to those particular colleges, and we'll take away this one. 5:50 So it looked like we were 5:51 taking our nice, small relations, 5:53 and moving back to a design that had bigger relations. 5:56 But in this case it was very well motivated. 5:58 We needed these attributes together to 6:00 identify the high school and 6:01 we want it to have our hobbies specific to the colleges. 6:04 So what that shows is 6:06 that the best design, for an 6:08 application for relational databases 6:10 depend not only on constructing 6:12 the relations well, but also 6:14 in what the data is representing in the real world. 6:17 So the basic of idea of 6:18 what we're going to do is 6:20 design by decomposition, specifically, 6:23 we're going to do what we 6:24 did at the very beginning of this 6:25 example, which is start 6:26 by creating mega-relations that just 6:27 contain attributes for everything 6:30 that we want to represent in our 6:31 database, then we're going 6:32 to decompose those mega relations 6:34 into smaller ones that are 6:36 better, but still capture the same information. 6:39 And most importantly we can do this decomposition automatically. 6:43 So how does automatic decomposition work? 6:45 In addition to the mega 6:46 relations, we're going to specify 6:48 formally, properties of the data.The 6:50 system is going to use 6:52 the properties to decompose the 6:54 relations, and then it's 6:55 going to guarantee that the final 6:57 set of relations satisfy what's called a normal form. 7:00 And we'll be formalizing all of this. 7:02 But the basic idea behind normal 7:04 forms is that they don't 7:05 have any of those anomalies that 7:06 I showed and they don't lose any information. 7:09 So specifically for specification of 7:11 properties, we're going to begin 7:13 by looking at something called functional dependencies. 7:15 And once we specify functional dependencies, 7:17 the system will generate 7:19 relations that are in 7:20 what's called Boyse Codd normal form. 7:22 And Boyse and Codd by the 7:23 way were two early pioneers in relational databases in general. 7:27 Then we're going to 7:29 look at another type of 7:30 specification called multi valued 7:32 dependencies which will add to 7:34 functional dependencies and when we 7:35 have both functional and multi 7:37 valued dependencies, then we 7:38 can have what's called fourth 7:40 normal form, and again, that 7:42 would be relations that are generated 7:44 by the system that satisfy the normal form. 7:47 Boyce-Codd normal form is stricter than fourth normal form. 7:50 Specifically if we make 7:51 a big Venn diagram here of 7:53 all the relational designs 7:56 that satisfied Boyce-Codd Normal Form, 7:58 which by the way is very 7:59 often abbreviated BCNF, then that 8:02 contains all of the 8:04 relations that satisfy fourth normal form, 8:06 normally abbreviated 4NF. 8:08 So every relation that's in 8:10 fourth normal form is also 8:11 in Boyce-Codd normal form, but not vice versa. 8:15 You might be wondering what happened 8:16 to first, second and third, normal forms. 8:19 So first normal form is 8:21 pretty much just a specification 8:22 that relations are real 8:24 relations with atomic values in each cell. 8:27 Second normal form is specifying 8:29 something about the way relations are structured with respect to their keys. 8:33 Neither of those is discussed very much anymore. 8:36 Third normal form is a 8:37 slight weakening of Boyce-Codd 8:39 normal form and sometimes people 8:41 do like to talk about third normal form. 8:42 So you can think of third normal form as a little bit of a even bigger circle here. 8:46 We're not going to cover third normal 8:47 form in this video because 8:49 Boyce-Codd normal form is the 8:50 most common normal form used 8:52 if we have functional dependencies only, and 8:54 fourth normal form if we 8:56 have functional and multivalued dependencies. 8:58 So what's going to happen next is 9:00 I'm going to give some examples 9:02 to motivate these four concepts: 9:04 functional dependencies, Boyce-Codd normal form, 9:06 multivalued dependencies normal form, 9:08 and then later videos will 9:09 go into each one in much greater depth. 9:12 So let me just give 9:13 the general idea of functional dependencies 9:15 and Boyce-Codd Normal Form. 9:16 And we'll use a very 9:18 simple for example, an abbreviated version 9:20 of our apply relation that has 9:21 students' social security numbers, the 9:23 student's name and their colleges 9:25 that the student is applying to. 9:26 Even this small relation actually 9:28 has redundancy and update and deletion anomalies. 9:32 Specifically, let's say that our 9:33 student, 123Ann, applies to 7 colleges. 9:36 Then there will be 7 tuples and 9:38 there will be 7 instances where we 9:40 know that a student with the 9:42 social security number 123 is named Ann. 9:45 Specifically, we're going to store 9:46 for every student the name 9:49 and social security number pair once 9:50 for each college that they apply to. 9:53 So now let me explain what 9:54 a functional dependency is and then 9:56 we'll see how functional dependencies are 9:58 used to recognize when we 9:59 have a bad design like this 10:00 one, and to see how we can fix it. 10:03 A functional dependency, in this 10:04 case from social security number 10:06 to name, and we're saying 10:07 social security number functionally determines 10:10 the student name says that 10:12 the same social security number always has the same name. 10:15 In other words, every time we see 123, we're going to see Ann. 10:18 Now it doesn't necessarily go in the other direction. 10:21 It might not be that whenever 10:22 we see Ann, it's 123, 10:23 but whenever we see 123, it is Ann. 10:26 And so what we'd like to 10:27 do is store that relationship just one time. 10:30 One time say that for 123, the name is Ann. 10:34 Now what Boyce Codd Normal Form 10:35 says is that whenever we have 10:37 one of these functional dependencies, then 10:39 the left hand side of that functional dependency must be a key. 10:43 And think about what that's saying. 10:44 Remember a key says 10:46 that we have just one tupple with each value for that attribute. 10:50 So if we have say 10:51 social security number to name 10:53 as a functional dependency and we 10:55 satisfy Boyce-Codd Normal Form, 10:57 then we're going to say that 10:58 social security number has to 11:00 be a key in our relation, 11:01 and we'll only have one 11:02 tupple for each social security number. 11:05 Specifically, we can go back to our original relation. 11:08 We have this functional dependency social 11:11 security number here is not a key, right? 11:14 So then we know that this is not in Boyce-Codd Normal Form. 11:17 So we're going to use functional 11:18 dependencies to help us 11:20 decompose our relation so 11:22 that the decomposed relations are in Boyce-Codd Normal Form. 11:25 And here's what would happen in this example. 11:27 Our functional dependency would tell 11:29 us to pull out the social 11:31 security number and student name 11:32 into its own relation where 11:34 the social security number is a 11:35 key and then we have 11:37 just one time for each 11:38 social security number that students 11:40 name, and then separately we'll 11:42 have the information about the 11:43 students and which colleges they applied to. 11:46 Again, we'll completely formalize this 11:48 whole idea, the definition of 11:49 functional dependencies, their properties, the 11:51 normal form, and how we 11:52 do the decomposition in a later video. 11:55 Now, let's similarly motivate the 11:57 concept of multi-value dependencies, and fourth normal form. 12:00 It is actually a little bit more complicated, but it follows the same rough outline. 12:04 Now let's look at a different 12:05 portion of the information 12:06 about applying and let's suppose, 12:08 for now, that we're just concerned 12:10 about students, what colleges they're 12:12 applying to, and what high schools they went to. 12:15 We still have redundancy and update and deletion anomalies. 12:17 For example, a student who 12:19 applies to Stanford is going 12:21 to have that fact captured once 12:23 for every high school that they went to. 12:24 A student who went to 12:26 Palo Alto high School will have 12:28 that fact captured once for every college they apply to. 12:31 In addition, we get a kind of multiplicative effect here. 12:34 Because let's say a student 12:36 applies to C colleges 12:38 and they went to H 12:39 high schools; I know students don't 12:41 go to a lot of high schools 12:42 but let's suppose that this is one that had moved a lot. 12:44 In that case, we're going to have C times H, tuples. 12:48 What we'd really like to have 12:49 is something more on the 12:50 order of C plus H, 12:52 because then we'd be capturing each piece of information just once. 12:55 Now the interesting thing is that 12:57 the badness of this particular 12:58 design is not addressed by 13:00 Boyce-Codd Normal Form, in fact 13:01 this relation is in Boyce-Codd 13:03 Normal Form, because it has no functional dependencies. 13:06 It's not the case that every 13:08 instance of a social security 13:10 number is associated with a 13:11 single college name or a single high school. 13:14 As we will see later, if there 13:15 are no functional dependencies, then the 13:16 relation is automatically in Boyce-Codd 13:18 Normal Form, but it's not 13:20 in and fourth normal form. 13:22 So fourth normal form is associated with what are called multi-value dependencies. 13:26 When we specify a multi-value 13:28 dependency as we've done here 13:29 with the double arrow, what this 13:31 is saying is that if we 13:33 take a particular social security 13:35 number in the relation, 13:36 we will have every combination 13:39 of college names that are 13:40 associated with that social security 13:42 number with every high 13:43 school that's associated with that social security number. 13:46 We'll actually see that when 13:48 we have this multi-value dependency, 13:50 we automatically have this one, too. 13:53 I know it seems a bit complicated, 13:54 and we will formalize it completely, 13:55 but for now now just think 13:57 about the English statement that multi-valued dependency 13:59 is saying that we are going to 14:00 have every combination of those 14:02 two attributes and values 14:04 in those attributes for a given social security number. 14:06 In other words, those values 14:08 are really independent of each other. 14:10 So if we have that situation, 14:12 then what we should really do 14:13 is store each college name 14:15 and each high school for each 14:16 social security number one time, 14:18 and that's what fourth normal form will do for us. 14:21 Fourth normal form, similarly to 14:24 Boyce-Codd normal form, says if 14:25 we have a dependency, then the left hand side must be a key. 14:29 In this case, it's a multi-value dependency 14:31 we're looking at, so it's really 14:32 saying something different but the 14:34 basic idea is the same 14:35 which is that we want 14:37 only one tuple that has 14:39 each value that's appears on 14:41 the left hand side of a 14:42 multi-value dependency So let's 14:43 see what would happen, in this 14:45 example, if we use 14:46 our multi-value dependencies to decompose 14:49 the relation, based on the idea of fourth Normal Form. 14:52 Well it is the intuitive thing that happens. 14:54 We separate the information about 14:56 the college names that a student 14:57 applies to from the information 14:59 about the high schools themselves, and 15:00 then we'll see that that 15:01 we only store each fact 15:03 once and we do get 15:04 the behavior of having C 15:06 plus H tuples instead of having C times H tuples. 15:10 Like with functional dependencies and Boyce-Codd 15:11 Normal Form we'll be completely 15:13 formalizing all of this 15:14 reasoning and the definitions in later videos. 15:17 To summarize, we're going to do relational design by decomposition. 15:21 We're going to start by specifying mega 15:22 relations that contain all the 15:24 information that we want to capture, 15:26 as well as specifying properties of 15:27 the data usually reflecting the real world in some fashion. 15:31 The system can automatically decompose 15:33 the mega-relations into smaller relations 15:35 based on the properties we specify, 15:37 and guarantee that the final 15:38 set of relations have certain 15:40 good properties captured in a normal form. 15:42 They will have no anomalies, and they'll be guaranteed not to lose information. 15:47 We'll start by specifying properties as 15:48 functional dependencies and from 15:50 there the system will guarantee Boyce-Codd 15:52 Normal Form and then we'll 15:54 add to that properties specified as 15:56 multi-value dependencies, and from 15:57 there the system will guarantee fourth 15:59 normal form, which is even 16:00 stronger than Boyce-Codd Normal Form and 16:02 is generally thought to be good relational design.