0:00 This is the first of two 0:01 videos where we learn about relational algebra. 0:04 Relational Algebra is a formal language. 0:07 It's an algebra that forms 0:09 the underpinnings of implemented languages like SQL. 0:12 In this video we're going to 0:14 learn the basics of the 0:15 Relational Algebra Query Language and a few of the most popular operators. 0:19 In the second video we'll learn 0:20 some additional operators and some 0:22 alternate notations for relational algebra. 0:26 Now, let's just review first from 0:28 our previous video on relational 0:29 querying that queries over 0:31 relational databases operate on 0:33 relations and they also produce relations as a result. 0:36 So if we write a query 0:38 that operates say on the three 0:39 relations depicted here, the 0:41 result of that query is going to be a new relation. 0:44 And, in fact, we can 0:46 post queries on that 0:47 new relation or combine that 0:49 new relation with our previous relations. 0:52 So let's start out with Relational Algebra. 0:54 For the examples in 0:56 this video we're going to be 0:57 using a simple college admission relations database with three relations. 1:00 The first relation, the college 1:02 relation, contains information about the 1:03 college name, state, and enrollment of the college. 1:06 The second relation, the student 1:08 relation, contains an ID 1:10 for each student, the student's name, 1:11 GPA and the size of the high school they attended. 1:14 And, finally, the third relation contains 1:16 information about students applying to colleges. 1:18 Specifically, the student's ID, the 1:20 college name where they're 1:21 applying, the major they're 1:22 applying for and the decision of that application. 1:25 I've underlined the keys for these three relations. 1:28 As a reminder, a key is 1:30 an attribute or a set of 1:31 attributes whose value is 1:33 guaranteed to be unique. 1:35 So, for example, we're going 1:36 to assume the college names are unique, 1:38 student IDs are unique and that 1:40 students will only apply to 1:41 each college for a particular major one time. 1:46 So, we're going to have a 1:47 picture of these three relations at 1:48 the bottom of the slides throughout the video. 1:52 The simplest query in relational 1:54 algebra is a query 1:55 that is simply the name of a relation. 1:57 So, for example, we can 1:58 write a query, "student" and that's 2:01 a valid expression in relational algebra. 2:04 If we run that query on 2:05 our database we'll get as 2:07 a result a copy of the student relation. 2:09 Pretty straightforward . 2:11 Now what happens next 2:12 is that we're going to use 2:13 operators of the relational algebra 2:15 to filter relations, slice relations, and combine relations. 2:19 So, let's through those operators. 2:23 The first operator is the select operator. 2:26 So, the select operator is used 2:27 to pick certain rows out of a relation. 2:30 The select operator is denoted by 2:31 a Sigma with a subscript--that's 2:33 the condition that's used to 2:35 filter the rows that we extract from the relations. 2:37 So, we're just going through three examples here. 2:40 The first example says that we 2:41 want to find the students 2:42 whose GPA is greater than 3.7. 2:44 So to write that 2:45 expression in relational algebra, we write 2:47 the sigma which is the 2:48 selection operator as a 2:50 subscript the condition that we're 2:51 filtering for--GPA greater than 2:54 3.7--and the relation over which we're finding that selection predicate. 2:58 So, this expression will return 3:00 a subset of the student 3:02 table containing those rows 3:04 where the GPA is greater 3.7. 3:07 If we want to filter for two 3:08 conditions, we just do 3:09 an "and" of the conditions in the subscript of the sigma. 3:12 So if we want, say, students 3:13 whose GPA is greater than 3.7 3:15 and whose high school size 3:16 is less than a thousand, we'll 3:18 write select GPA greater than 3.7. 3:23 We used a logical 3:24 and operator--a caret, high school 3:27 size is less than a 3:28 thousand, and again we'll apply that to the student relation. 3:32 And once again, the result of 3:33 that will be a subset of 3:34 the student relation containing the rows that satisfy the condition. 3:37 If we want to find 3:39 the applications to Stanford for 3:41 a CS major, then we'll be 3:42 applying a selection condition to the apply relation. 3:45 Again, we write the sigma 3:46 and now the subscript is 3:48 going to say that the college 3:49 name is Stanford and the major is CS. 3:54 Again, the and operator, and 3:57 that will be applied 3:59 to the apply relation and 4:01 it will return as a 4:02 result, a subset of the apply relation. 4:06 So the general case of the 4:07 select operator is that we have the sigma. 4:10 We have a condition as a 4:12 subscript and then we have a relation name. 4:14 And we return as a result the subset of the relation. 4:17 Our next operator is the Project Operator. 4:21 So the select operator picks certain 4:23 rows, and the project operator picks certain columns. 4:26 So let's say we're 4:28 interested in the applications, but all 4:29 we wanted to know was the 4:30 list of ID's and the decisions for those applications. 4:34 The project operator is written 4:35 using the Greek pi symbol, 4:37 and now the subscript is 4:39 a list of the column names that we would like to extract. 4:42 So we write ID, sorry, 4:43 student ID and decision, and 4:46 we apply that to the apply relation again. 4:49 And now what we'll get 4:51 back is a relation that has just two rows. 4:54 It's going to have all the 4:55 tuples of apply, but it's 4:56 only going to have the 4:57 student ID and the decision columns. 5:00 So the general case of 5:02 a project operator is the 5:04 projection, and then a 5:05 list of attributes, can be 5:07 any number, and then a relation name. 5:13 Now, what if we're interested in 5:14 picking both rows and columns at the same time. 5:16 So we want only some of 5:18 the rows, and we want only some of the columns. 5:20 Now we're going to compose operators. 5:23 Remember that relational queries produce relations . 5:26 So we can write a 5:27 query, say, with the 5:29 select operator of the 5:31 students whose GPA is greater than 3.7. 5:33 And this is how we do that. 5:35 And now, we can take 5:37 that whole expression which produces a 5:39 relation, and we can 5:40 apply the project operator to that, and 5:42 we can get out the student 5:43 ID and the student name. 5:49 Okay. 5:50 So, what we actually see 5:51 now is that the general 5:54 case of the selection and 5:55 projection operators weren't quite what I told you at first. 5:58 I was deceiving you slightly. 5:59 When we write the select 6:01 operator, it's a select 6:03 with the condition on any 6:06 expression of the relational 6:07 algebra, and if it's 6:08 a big one we might want 6:09 to put parenthesis on it, 6:11 and similarly the project operator is 6:13 a list of attributes from 6:16 any expression of the relational algebra. 6:18 And we can compose these as much as we want. 6:20 We can have select over project, over 6:22 select, select, project, and so on. 6:24 Now let's talk about 6:27 duplicate values in the results of relational algebra queries. 6:31 Let's suppose we ask 6:33 for a list of the majors that 6:34 people have applied for and the decision for those majors. 6:37 So we write that as the 6:37 project of the major 6:40 and the decision on the applied relation. 6:45 You might think that when we get 6:46 the results of this query, we're going to have a lot of duplicate values. 6:49 So we'll have CS yes, CS 6:51 yes, CS no, EE yes, EE no, and so on. 6:54 You can imagine in a 6:55 large realistic database of applications, 6:57 there's going to be hundreds of 6:59 people applying for majors and having a yes or a no decision. 7:02 The semantics of relational 7:04 algebra says that duplicates are always eliminated. 7:08 So if you run a query 7:09 that would logically have a 7:10 lot of duplicate values, you just 7:11 get one value for each result. 7:14 That's actually a bit of a difference with the SQL language. 7:16 So, SQL is based on 7:18 what's known as multi-sets or 7:20 bags and that means 7:21 that we don't eliminate duplicates, whereas 7:24 relational algebra is based 7:26 on sets themselves and duplicates are eliminated. 7:29 There is a multi-set or 7:30 bad relational algebra defined as 7:32 well but we'll be fine by 7:33 just considering the set relational algebra in these videos. 7:38 Our first operator that combines two 7:39 relations is the cross-product operator, 7:41 also known as the Cartesian product. 7:43 What this operator does, is it 7:45 takes two relations and kinda 7:46 glues them together so that 7:48 their schema of the result 7:49 is the union of the 7:50 schemas of the two relations and 7:52 the contents of the result are every 7:54 combination of tuples from those relations. 7:56 This is in fact the normal 7:57 set cross product that you 7:59 might have learned way back in the elementary school. 8:01 So let's talk about, say, 8:03 doing the cross products of students and apply. 8:06 So if we do this 8:08 cross products, just to 8:09 save drawing, I'm just gonna glue 8:11 these two relations together here. 8:14 So if we do the 8:15 cross product we'll get at the 8:16 result a big relation, here, which 8:18 is going to have eight attributes. 8:20 The eight attributes across the student 8:21 and apply now the only 8:24 small little trick is that 8:25 when we glue two relations together 8:27 sometimes they'll have the same 8:28 attribute and we can see we have SID on both sides. 8:31 So just as a notational convention, 8:33 when cross-product is done 8:35 and there's two attributes that are 8:36 named, they're prefaced with the name of the relation they came from. 8:39 So this one would be referred 8:40 to in the cross-product as 8:42 the student dot SID where this 8:44 one over here would be referred 8:45 to as the apply dot SID. 8:48 So, again, we glue together in 8:49 the Cartesian product the two 8:51 relations with four attributes each, 8:53 we get a result with eight attributes. 8:55 Now let's talk about the contents of these. 8:57 So let's suppose that the student 9:00 relation had s-tuples in it 9:02 and that's how many tuples, while 9:04 the apply had 8 tuples in 9:05 it, the result of the 9:07 Cartesian products is gonna 9:08 have S times A tuples, 9:10 is going to have one tuple 9:12 for every combination of tuples 9:14 from the student relation and the apply relation. 9:19 Now, the cross-product seems 9:20 like it might not be that 9:21 helpful, but what is 9:23 interesting is when we use 9:24 the cross-product together with other operators. 9:26 And let's see a big example of that. 9:28 Let's suppose that we want 9:30 to get the names and GPAs of 9:32 students with a high school 9:33 size greater than a thousand who 9:35 applied to CS and were rejected. 9:37 Okay, so let's take a look. 9:40 We're going to have to access 9:41 the students and the apply 9:43 records in order to run this query. 9:45 So what we'll do is we'll 9:46 take student cross apply as our starting point. 9:49 So now we have 9:51 a big relation that contains 9:53 eight attributes and all of those tuples that we described previously. 9:57 But now we're going to 9:58 start making things more interesting, because 10:00 what we're going to do is 10:01 a big selection over this relation. 10:03 And that selection is first 10:04 of all going to make sure 10:06 that it only combines student and 10:08 apply tuples that are referring to the same student. 10:11 So to do that, we write 10:12 student dot SID equals 10:15 apply dot SID. 10:17 So now we've filtered the result 10:18 of that cross-product to only 10:20 include combinations of student 10:22 and apply by couples that make sets. 10:24 Now we have to do a little bit of additional filtering. 10:26 We said that we want the 10:28 high school size to be 10:29 greater than a thousand, so 10:31 we do an "and" operator in the high school. 10:33 We want them to have applied 10:35 to CS so that's and major equals CS. 10:37 We're getting a nice big query here. 10:40 And finally we want them to 10:41 have been rejected, so "and 10:42 decision" equals, we'll just be using R for reject. 10:47 So now, we've got that gigantic query. 10:50 But that gets us exactly what 10:51 we want except for one more thing, 10:53 which is, as I said, all we want is their names and GPAs. 10:56 So finally we take a 10:57 big parentheses around here and 10:59 we apply to that the 11:00 projection operator, getting the 11:02 student name and the GPA. 11:06 And that is the relational 11:07 algebra expression that produces 11:09 the query that we have written in English. 11:12 Now we have seen how the 11:13 cross product allows us to 11:14 combine tuples and then 11:16 apply selection conditions to 11:18 get meaningful combinations of tuples. 11:21 It turns out that relational algebra 11:22 includes an operator called 11:24 the natural join that is 11:25 used pretty much for the exact purpose. 11:27 What the natural join does is 11:29 it performs a cross-product 11:31 but then it enforces equality 11:33 on all of the attributes with the same name. 11:35 So if we set up our 11:36 schema properly, for example, 11:38 we have student ID and student 11:40 ID here, meaning the same 11:42 thing, and when the cross 11:43 product is created, it's only 11:45 going to combine tuples where the student ID is the same. 11:48 And furthermore, if we add 11:49 college in, we can 11:51 see that we have the college name here and the college name here. 11:54 If we combine college and apply 11:56 tuples, we'll only combine tuples that are talking about the same college. 11:59 Now in addition, one more 12:01 thing that it does is it 12:02 gets rid of these pesky attributes that have the same names. 12:06 So since when we combine, 12:07 for example, student and apply 12:09 with the natural join, we're only 12:11 combining tuples where the 12:13 student SID is the same as the apply SID. 12:16 Then we don't need to keep two 12:17 copies of that 12:19 column because the values are always going to be equal. 12:23 So the natural join operator 12:25 is written using a bow 12:28 tie, that's just the convention. 12:30 You will find that in your text editing programs if you look carefully. 12:34 So let's do some examples now. 12:37 Let's go back to our same 12:38 query where we were finding 12:39 the names and GPAs of students 12:42 from large high schools who applied to CS and were rejected. 12:45 So now, instead of using 12:46 the cross-product we're gonna 12:47 use the natural join, which, as I said, was written with a bow tie. 12:51 What that allows us to 12:53 do, once we do that natural 12:54 join, is we don't have 12:56 to write that condition, that enforced 12:57 equality on those two 12:59 attributes, because it's going to do it itself. 13:01 And once we have done that then 13:03 all we need to do is 13:04 apply the rest of our conditions, 13:05 which were that the high school 13:06 is greater than a thousand and the 13:08 major is CS and the decision 13:11 is reject, again we'll 13:14 call that R. And then, 13:16 since we're only getting the names 13:17 and GPAs, we write the 13:19 student name and the GPA. 13:24 Okay. 13:24 And that's the result of the query using a natural join. 13:26 So, as you can see that's a 13:27 little bit simpler than the original 13:29 with the cross-product and by setting 13:30 up schemas correctly, natural join can be very useful. 13:33 Now let's add one more complication to our query. 13:37 Let's suppose that we're only interested 13:38 in applications to colleges where the enrollment is greater than 20,000. 13:41 So, so far in 13:43 our expression we refer to the 13:44 student relation and the apply 13:46 relation, but we haven't used the college relation. 13:48 But if we want to have a 13:50 filter on enrollment, we're going to have 13:51 to bring the college relation into the picture. 13:55 This turns out to perhaps be easier than you think. 13:57 Let's just erase a couple 13:59 of our parentheses here, and what 14:01 we're going to do is we're 14:02 going to join in the 14:04 college relation, with the two relations we have already. 14:07 Now, technically, the natural 14:11 join is the binary operator, people 14:13 often use it without parentheses 14:15 because it's associative, but if 14:16 we get pedantic about it we could add that and then we're in good shape. 14:20 Now we've joined all three relations together. 14:22 And remember, automatically the natural 14:24 join enforces equality on the shared attributes. 14:27 Very specifically, the college 14:29 name here is going to 14:30 be set equal to the apply college name as well. 14:33 Now once we've done that, we've got all the information we need. 14:36 We just need to add one 14:37 more filtering condition, which is 14:39 that the college enrollment is greater than 20,000. 14:42 And with that, we've solved our query. 14:47 So to summarize the 14:48 natural join, the natural join combines relations. 14:52 It automatically sets values equal 14:54 when attribute names are the 14:55 same and then it removes the duplicate columns. 14:58 The natural join actually does not 15:01 add any expressive power to relational algebra. 15:04 We can rewrite the natural join without it using the cross-product. 15:08 So let me just show that rewrite here. 15:10 If we have, and now I'm 15:12 going to use the general case of two expressions. 15:14 One expression, natural join 15:17 with another expression, that is 15:19 actually equivalent to doing 15:21 a projection on the 15:23 schema of the first 15:25 expression - I'll just 15:27 call it E1 now - union 15:29 the schema of the second expression. 15:30 That's a real union, so that 15:32 means if we have two copies we 15:33 just keep one of them. Over 15:36 the selection of. Now we're 15:39 going to set all the shared attributes 15:40 of the first expression to 15:42 be equal to the shared attributes of the second. 15:44 So I'll just write E1, A1 15:45 equals E2, A1 15:48 and E1, A2 equals E2 dot A2. 15:54 Now these are the cases where, 15:57 again, the attributes have the same names, and so on. 16:00 So we're setting all those equal, 16:02 and that is applied over 16:04 expression one cross-product expression two. 16:07 So again, the natural 16:10 join is not giving us 16:12 additional expressive power, but it is very convenient notationally. 16:18 The last operator that I'm 16:20 going to cover in this video is the theta join operator. 16:23 Like natural join, theta join is 16:24 actually an abbreviation that doesn't add expressive power to the language. 16:28 Let me just write it. 16:28 The theta join operator takes 16:30 two expressions and combines them 16:32 with the bow tie looking 16:34 operator, but with a subscript theta. 16:37 That theta is a condition. 16:39 It's a condition in the style 16:40 of the condition in the selection operator. 16:43 And what this actually says 16:45 - it's pretty simple - 16:47 is it's equivalent to applying 16:49 the theta condition to the 16:51 cross-product of the two expressions. 16:53 So you might wonder why 16:55 I even mention the theta 16:56 join operator, and the reason I 16:57 mention it is that most 16:59 database management systems implement the 17:01 theta join as their basic 17:03 operation for combining relations. 17:05 So the basic operation is 17:07 take two relations, combine all tuples, 17:08 but then only keep the combinations 17:10 that pass the theta condition. 17:12 Often when you talk to 17:13 people who build database systems or 17:15 use databases, when they 17:16 use the word join, they really mean the theta join. 17:21 So, in conclusion, relational algebra is a formal language. 17:25 It operates on sets of 17:27 relations and produces relations as a result. 17:29 The simplest query is just 17:30 the name of a relation and 17:32 then operators are used 17:33 to filter relations, slice them, and combine them. 17:36 So far, we've learned the 17:37 select operator for selecting rows; 17:40 the project operator for selecting 17:41 columns; the cross-product 17:43 operator for combining every possible 17:45 pair of tuples from two 17:46 relations; and then two 17:48 abbreviations, the natural join, 17:50 which a very useful way to combine 17:51 relations by enforcing a equality 17:53 on certain columns; and the theta join operator. 17:56 In the next video, we'll learn 17:58 some additional operators of relational 18:00 algebra and also some alternative 18:01 notations for relational algebra expressions.