0:00 In this sequence of videos, we'll 0:02 learn about designing good schemas for relational databases. 0:06 So let's suppose we're building a 0:07 database for an application or 0:08 set of applications and we 0:09 have to figure out what schema we want to store our data. 0:13 Usually there are many different 0:14 possible schema designs for a 0:16 database, and databases do tend to get quite complicated. 0:20 And some designs are much better than others. 0:22 So how do we choose what design to use? 0:25 Now the reality is that people 0:26 often use higher level tools 0:28 to design relational databases and 0:30 don't design the schemas directly themselves. 0:32 But some designers do go 0:34 straight to relations, and furthermore, 0:36 it's useful to understand why 0:38 the relations that are produced 0:39 by design tools are what 0:41 they are. 0:43 Furthermore, from an academic point of view, 0:44 it turns out there's a very 0:45 nice theory for relational data base design. 0:48 So let's consider the process of 0:49 designing the schema for our 0:51 database about students applying to colleges. 0:54 Specifically, for a given 0:55 student, let's suppose we have 0:56 their social security number and 0:58 their name, the colleges that 1:00 student is applying to, the 1:01 high schools they attended and 1:02 what city those high schools were in, and the student's hobbies. 1:06 So if that's what we want we 1:07 can create a single relation 1:08 called apply, that has one 1:10 attribute for each of those pieces of information. 1:13 Now let's take a look at how that database would be populated. 1:16 Let's suppose that we have 1:17 a student, Anne, with Social Security 1:18 number 123, she went to 2 1:20 different high schools in Palo 1:22 Alto, she plays tennis and 1:23 the trumpet, and she's applying 1:25 to Stanford, Berkeley, and MIT. 1:27 So let's look at some of 1:28 the tuples that we would be 1:30 having in the apply relation 1:31 to represent this information about Anne. 1:33 So we'll have 1,2,3, Anne, 1:36 her name, she's applying to 1:37 Stanford, she went to 1:39 Palo Alto High School, and 1:42 that's in Palo Alto, and one of her hobbies is tennis. 1:47 And then we also have 1 1:48 2 3 and she applied to 1:51 Berkeley and went to 1:53 Palo Alto High School in 1:54 Palo Alto and tennis there as well. 1:58 Of course she also has 2:00 a tuple representing the fact 2:02 that she's applying to Berkeley and 2:04 and we'll stick with Palo 2:05 Alto High School, and she played the trumpet. 2:08 And as you can see we'll 2:09 have more tuples, we'll have 2:11 various Stanford and Berkeleys, we'll 2:12 have some for her other high 2:14 schools called Gunn High School 2:15 also in Palo Alto, and so on. 2:18 So if we think about 2:19 it we will need a total 2:21 of 12 tuples to represent 2:23 this information about Ann. 2:26 Now do we think that's a good design? 2:29 I'm going to argue no, it's not a good design. 2:31 There are several types of anomalies in that design. 2:35 First of all, we capture information 2:37 multiple times in that 2:40 design, and I'll give some examples of that. 2:42 For example how many times 2:43 do we capture the fact that 2:45 1 2 3 the Social Security 2:47 number is associated with a student named Ann? 2:49 We capture that twelve times in our twelve tuples. 2:53 How many times do we 2:54 capture that Anne went to Palo Alto High School? 2:56 We're going to capture that six times. 2:58 And we're going to capture 2:59 the fact that she plays tennis six times. 3:02 And we're going to capture the fact 3:03 that she went to apply to 3:04 MIT four times, so for 3:07 each piece of information, in fact, 3:08 we're capturing it many, many times. 3:10 So that doesn't seem like a good feature of the design. 3:13 The second type is an 3:14 update anomaly, and that's really a direct effect of redundancy. 3:18 What update anomalies say 3:20 is that you can update facts 3:21 in some places but not 3:23 all all or differently in different places. 3:26 So let's take the fact for 3:27 example that Ann plays the trumpet. 3:30 I might decide to call 3:31 that the coronet instead but I 3:33 can go ahead and I 3:35 can modify, say, three of 3:37 the incidences where we captured 3:38 the fact about her playing the 3:39 trumpet and not the fourth 3:41 one and then we end up 3:42 with what's effectively an inconsistent database. 3:45 And the third type of 3:47 anomaly is called a deletion anomaly, and 3:49 there's a case where we could inadvertently 3:51 completely do a complete 3:52 deletion of somebody in the database. 3:54 Let's say for example that we 3:56 decide that surfing is an 3:58 unacceptable hobby for our 4:00 college applicants, and we go 4:01 ahead and we delete the tuples about surfing. 4:04 If we have students who have 4:06 surfing as their only hobby, then those students will be deleted completely. 4:10 Now you may argue that's the right 4:11 thing to do, but probably that isn't what was intended. 4:15 So now let's take a look at a very different design for the same data. 4:18 Here we have five different 4:20 relations, one with the 4:21 information about students and their 4:22 names, one where they've applied 4:24 to colleges, one where they 4:25 went to high school, where their 4:27 high schools are located and what hobbies the students has. 4:30 In this case we have no anomalies. 4:32 If we go back and look at 4:33 the three different types, they don't occur in this design. 4:36 We don't have redundant information, we 4:39 don't have the update anomaly or the deletion anomaly. 4:42 Furthermore, we can reconstruct all 4:45 of the original data from our 4:47 first design, so we haven't 4:49 lost any information by breaking it up this way. 4:52 So in fact this looks like a much better design. 4:54 Now let me mention a couple 4:56 of modifications to this design that might occur. 4:58 Let's suppose, for example, that 5:00 the high school name alone is not a key. 5:03 So when we break up the 5:04 high school name and high school 5:05 city, we no longer can identify the high school. 5:08 In that case, the preferred 5:09 design would be to move 5:11 the high school up here so 5:13 we'll have that together with 5:15 the high school name and then we don't need this relation here. 5:17 And actually that's a fine design. 5:19 It does not introduce any anomalies, 5:21 that's just based on the 5:22 fact that we need the name 5:23 of the high school together with the city to identify it. 5:26 As another example, suppose a 5:28 student doesn't want all of 5:30 their hobbies revealed to all 5:31 of the colleges that they are applying to. 5:33 For example, maybe they don't want Stanford to know about their surfing. 5:37 If that's the case then we 5:38 can modify the design again, 5:40 and in that case we would 5:41 put the hobby up here with where they're applying to college. 5:44 And so that would include the hobbies 5:46 that they want to reveal 5:47 to those particular colleges, and we'll take away this one. 5:50 So it looked like we were 5:51 taking our nice, small relations, 5:53 and moving back to a design that had bigger relations. 5:56 But in this case it was very well motivated. 5:58 We needed these attributes together to 6:00 identify the high school and 6:01 we want it to have our hobbies specific to the colleges. 6:04 So what that shows is 6:06 that the best design, for an 6:08 application for relational databases 6:10 depend not only on constructing 6:12 the relations well, but also 6:14 in what the data is representing in the real world.