0:00 Now we'll learn about the JOIN family of operators in SQL. 0:04 Going to our select from where statement, in the from clause we list tables, 0:10 separated by commas, and that's implicitly a cross product of those tables. 0:15 But it's also possible to have explicit join of 0:19 tables, and this follows the relational algebra style of join. 0:23 There is a few different types. 0:25 One of them is what's called the inner join on a condition. 0:29 And we'll see that that's equivalent to what in relational algebra we 0:32 were calling the theta join where the theta here is a condition. 0:36 So it's effectively taking the cross product but then applying a condition and 0:39 only keeping the tuples in the cross pro, product that satisfy the condition. 0:44 The separate type of join we can use in SQL is the natural join. 0:48 And that is in fact exactly the natural join in relational 0:52 algebra where it equates columns across tables of the same name. 0:57 So it requires the values in those columns to be 0:59 the same to keep the tuple in the cross product. 1:02 And then it also eliminates the duplicate columns that are created. 1:05 And we'll see this very clearly when we see the demo. 1:08 The third type of join operator in SQL is again inner join. 1:12 But with a special clause called using and listing attributes. 1:16 And that's kind of again like the natural join, except 1:19 you explicitly list the attributes that you want to be equated. 1:23 And finally, the fourth type, and actually 1:26 the most interesting type is the outer join. 1:28 And there is a left outer join, right outer join and full outer join. 1:32 And this is again combining tuples similar to the theta join except when tuples don't 1:39 match the theta condition, they're still added 1:42 to the result and padded with no values. 1:45 Now I will say right off that none of 1:47 these operators are actually adding expressive power to SQL. 1:51 All of them can be expressed using other constructs, 1:54 but they can be quite useful in formulating queries, 1:57 and especially the outer join, is, is a fairly 1:59 complicated to express without the o, outer join operator itself. 2:04 So as usual we'll be doing our demo with our simple 2:07 college admissions database, with college table, student table, and apply table. 2:11 So let's move ahead to the demo. 2:14 As usual, we'll have four colleges, a 2:17 bunch of students and students applying to colleges. 2:21 Let's start with a simple query that we've seen before, 2:23 which matches student names with majors to which they've applied. 2:28 So that combines the student and apply relation making sure 2:31 the student ID is the same across the two relations. 2:34 And gives us the name in major back. 2:36 Now if you remember your relational algebra, you can see clearly 2:40 that this is a join of the Student and Apply relation. 2:43 Actually a natural join we will come to that in a moment. 2:46 We are going to first rewrite it using the equivalent of 2:48 a theta join operator which is called inner join and SQL. 2:52 And so this does the theta join or the 2:54 combination of student and apply on a specific condition. 2:58 So we'll change the where to on and it 3:01 is effectively the cross product of the two tables but 3:05 then when it does the cross product, it checks this 3:07 condition and only keeps the tuples that satisfy the condition. 3:11 So let's run that query. 3:13 And of course, we get the same result. 3:14 The two queries we saw are exactly equivalent, 3:17 we're just expressing them a little bit differently. 3:19 Now the inner join is the default join operator 3:23 in SQL, so we can actually take away the word 3:26 inner, and when we run that, we again get the 3:28 same result because join is an abbreviation for inner join. 3:32 Now let's see what happens when we have a join 3:34 operator with additional conditions besides the one on the two tables. 3:40 So, this is also a query that we've seen before. 3:42 This finds the name and GPA of students who 3:45 came from a high school with less than 1,000 students. 3:48 They've applied to major in computer science at Stanford. 3:52 So we run the query and we find just two students in our result. 3:55 So now let's rewrite that using the join operator. 3:59 So we type join instead of comma, the comma being the cross product, and the 4:04 join condition is again combining the student 4:07 and apply records where the student ID matches. 4:11 And the rest of this becomes our where condition. 4:13 Go ahead and run the query. 4:15 And we get the same result. 4:17 Now, it turns out that we can actually 4:19 put all of these conditions into our on clause. 4:22 So we can make this where back into an and. 4:26 And our on clause now is the and of all three conditions. 4:30 We run the query and we get the same result. 4:32 Now you're probably thinking how do I know what to put 4:35 in the on clause and what to put in the where clause. 4:39 Because these queries are obviously equivalent. 4:41 Well first of all there are many equivalent queries 4:43 in SQL, we can right things in different ways. 4:45 In theory, a SQL query processor should execute them 4:49 all in the most efficient possible way, but the join 4:52 clause, in particular, is often used as a hint 4:55 to the query processor on how to execute the query. 4:58 So if we put all of these in the on condition, we're sort of 5:01 saying, as the query processor does the 5:04 join, it should be applying all the conditions. 5:07 When we make this a where, it's sort 5:09 of a hint saying, here's the condition that really 5:11 applies to the combination of the tuples, and 5:13 the rest of the conditions apply to separate attributes. 5:17 Now let's take a look at what happens when we have 3 relations that we want to join. 5:21 So this query is just finding a bunch of information about student applications. 5:26 So it's combing all three of our tables and right 5:30 here, we're doing it with a comma, which is a cross-product. 5:33 It's combining tuples only when the student sID matches the 5:37 Apply sID, and the Apply cName matches the column College cName. 5:41 In other words it's the same student who's, that we're talking about 5:44 in the apply relation and we're capturing the same college as well. 5:48 So this is a typical query. 5:49 From that we're going to take the student ID, 5:51 student name, GPA, the college name, and the enrollment. 5:54 So, common form of query and here's the result. 5:57 Now let's rewrite that query using the join operator. 6:01 So we'll change our commas to joins. 6:04 And we'll change our join condition, our, our where condition to on, since these are 6:11 the two conditions that are actually joining 6:13 apply and student and joining apply with college. 6:16 And we run the query, and as a matter of fact we get an error. 6:19 Now this query actually would run in a 6:22 couple of instances we've been using in this course. 6:24 Right now, I'm running the PostRisk system, because it 6:27 actually supports the most of what I want to show today. 6:31 But it does not support this particular three way 6:34 join, while MySQL and SQLite actually happen to do that. 6:37 PostRisk requires all join operators to 6:39 be binary, explicitly just joining two relations. 6:43 So we can change this, query right here, to, to be binary. 6:47 By joining apply and student first, and then joining college. 6:52 But we do need to be careful with our conditions, so we need 6:55 to move the, a join condition for applying students into this join here. 7:01 So we've moved it in there, we take it away here, and 7:04 here's the join condition for joining the result of this join with college. 7:09 And we go ahead and run that, and now we get our result. 7:13 Now I wanted to remark again briefly about the interaction between 7:16 the expression of a query in SQL and the query processor. 7:19 Query, SQL systems do tend to actually follow the structure that 7:24 is provided when we have join operators and we parenthesize them. 7:28 So if we say first join apply and student and then join the result, with college. 7:33 Typically that's exactly the order in which the query processor will do things. 7:37 It doesn't have to do them in that order and it 7:38 can get the same result by doing them in a different order. 7:42 But, you might play around and you may 7:43 discover that by changing the order that things are 7:46 done in or even just by moving the parentheses 7:49 here, you'll get different performance out of this system. 7:51 And again, people do use SQL to tune performance by changing the order around. 7:57 Theoretically, that shouldn't happen, because the system is supposed to find 8:01 the best execution plan, but in practice that does actually happen. 8:05 Now let's move on to different types of joins. 8:07 So let's go back to our first query where we used the inner join 8:10 to combine the student and apply to find the majors that students applied to. 8:15 So here's that query result. 8:18 Now, obviously this is in fact a natural join. 8:21 So as a reminder of relational algebra, the natural join 8:25 takes two relations that have column names in co, in common. 8:29 And then it performs a cross product that only keeps the tuples 8:33 where the, the tuples have the same value in those common attribute names. 8:37 So student and apply have the sID name in column. 8:41 In common. 8:41 And so if I change this to the natural join, the system will automatically apply 8:48 this equality between the student ID, in 8:51 the student relation and in the apply relations. 8:53 So I can just simply take that condition away, 8:55 and run the query, and I get the same result. 8:59 Now the other thing that the natural 9:00 join does automatically, aside from equating values 9:03 and common attribute names, is it eliminates 9:06 those duplicate attribute names in it's result. 9:08 So if I do a select star here from student natural join apply, 9:13 I will see that there's only one column named sID in the result. 9:17 Rather than the column sID from the student and from the apply relation. 9:22 Let's be even more explicit about that, let's just take the sID column. 9:27 And here we go, and there's just one sID column in the result. 9:30 And if we tried to do the same thing without the 9:32 natural join, if we were doing, for example, the cross product. 9:36 Then we would actually get an error, because the column 9:38 sID in the cross product of student apply is ambiguous. 9:41 There's one from student and one from apply. 9:43 But when we do the natural join, 9:45 it's automatically equating those columns, and then 9:47 we don't have to worry about ambiguity 9:49 because the duplicate columns have been eliminated. 9:52 Now let's take a look at the natural join with additional conditions. 9:56 Let's go aback to our query using the regular inner 9:58 join, that finds the name and GPA of students from 10:02 a high school less than 1000 who have applied to 10:04 stanford to major in CS, and here is those students. 10:07 And we again have a join condition on attributes that have the same name, 10:12 so we can just change this to the natural join of student and apply. 10:16 And that allows us to eliminate the on condition. 10:19 We run the query and we get the same result. 10:23 Now there's a feature that goes along with the join operator in SQL that's 10:26 considered actually better practice than using the 10:29 natural join and it's called the using clause. 10:32 And the using clause explicitly lists the attributes 10:35 that should be equated across the two relations. 10:38 So, we're going to take away the natural here, we're 10:41 just going to use the regular inner join, but then 10:44 we're going to specify that the student ID is the 10:46 attribute that should be equated across student and apply. 10:49 We can only put in the using clause any attributes that appear in both relations. 10:55 It would be an error if we tried to use 10:56 one that was in the only one relation or the other. 10:59 We'll run the query, and we'll get the same result. 11:01 The reason this is considered better practice, to make this explicit, is 11:05 that because the natural join implicitly combines columns that have the same name. 11:10 It's possible, say, to add a column to a relation that has the same name 11:14 as the other relation, or not realize that two relations have the same column name. 11:19 And the system will, will sort 11:20 of underneath the covers, equate those values. 11:24 Where as, when we put the attribute name in the query we're saying 11:27 explicitly that this attribute does mean the same thing across the two relations. 11:32 And one thing I might add by the way, in realist applications there 11:35 can also be often 40, 50, even a hundred attributes in a relationship. 11:39 So, it's, it's, there's a higher chance that you really could have 11:42 attributes that have the same name but aren't meant to be equated. 11:45 And again, by putting e, explicitly the equated attributes in the column, 11:48 it's just sort of, in the query, it's just better software engineering practice. 11:53 So most of my future examples are actually 11:54 going to have a using to express that natural join. 11:59 Now let's take a look at using the join operator 12:01 when we have more than one instance of the same relation. 12:04 So this is actually a query we've see in previous videos. 12:06 It's combining two instances of the student relation in order 12:10 to find pairs of students that have the same GPA. 12:14 And, in order to get each pair of students only once. 12:17 We add the condition that the student ID on the 12:19 left side is less than the one on the right side. 12:21 Again that's just to, to not have duplicates. 12:24 So let's go ahead and run this query. 12:26 And now let's try to convert this to using the join, using the join operator. 12:31 So first of all, we see that we want to 12:33 equate the GPAs, so we're going to make this a join. 12:39 Where we use a using clause, that tells us that we want the GPA 12:42 on the, the wa, left side relation to equal the GPA on right side relation. 12:47 And then we can take that condition out of our where clause. 12:50 And then we can make the other part the on condition of our join. 12:55 Let's go ahead and run that query and we get an error. 12:58 It turns out that most SQL systems, every one I've tried, doesn't allow you 13:02 to have both a using clause and an on clause in combination with the join. 13:06 You can have one or the other but not both. 13:09 So, then we can just go ahead and take away the on, convert 13:12 it to a where and run the query and we get the right answer. 13:16 Let's do one more query that involves the self join of Student with itself. 13:21 And this is just the natural join of two instances of Student. 13:25 So if you remember your relational algebra, it is a property 13:29 that any relation joined with itself gives you back the original relation. 13:34 So we'll run this query, and you can verify that 13:37 this is in fact exactly equivalent to the student relation. 13:40 In fact, why don't we just verify that ourselves. 13:43 Let's just run select star from student. 13:46 It actually changed the order of the result, 13:47 but the result tuples are exactly the same. 13:51 So we've seen the inner join, we've seen 13:53 the natural join, we've seen the using clause. 13:56 Now let's move to what I consider the most interesting part 13:59 of the join family of operators, which is the outer join. 14:02 We'll start with a query that's joining with the inner join, student 14:06 and apply on the matching sID and it's returning a few attributes. 14:09 The students name, ID and the college they're applying to, and the major. 14:13 And here's the query. 14:15 Now suppose that we also wanted in our result, 14:18 the information about students who haven't yet applied anywhere. 14:21 And in our database, we actually have a couple 14:23 of students, who do not appear in the apply relation. 14:26 If we're interested in adding those to our result. 14:29 We do what's called, the left outer join. 14:33 Let's just go ahead and run that and see what we get. 14:36 We get our same tuples as we had with the inner join. 14:39 But we get in addition, these tuples here that have blanks. 14:43 Let me remind you in this interface blank is actually a null value. 14:47 So this tuple here is Dorris 456 null, null. 14:50 What the left outer join does, is it takes any tuples on the left side of the join. 14:55 So any tuples from the relation that's on the left. 14:57 And if they don't have a matching tuple from the right. 15:00 And that's called a dangling tuple, by the way. 15:02 If there's a dangling tuple with no right matching tuple, it's 15:05 still added to the result and it's padded with null values. 15:09 As with the inner join, we can actually abbreviate this. 15:12 So when we have left or, as we'll see, right, 15:15 or full, we can take a word, away the word 15:18 outer and that will implicitly do the left outer join 15:20 so we run this and we get, again, the same result. 15:23 Just an abbreviation. 15:25 Now we can also do a natural outer join. 15:28 So we can do the natural left outer join. 15:32 We could leave the word outer out. 15:34 And that's the implicit version where we match the student IDs. 15:38 Again not that recommended from a 15:39 software engineering standpoint, but it's worth seeing. 15:42 And we will again get exactly the same result. 15:46 Now I should mention that everything we've seen so far with 15:49 the join operators can be rewritten using regular SQL without join operators. 15:54 And up until the outer join that was probably very obvious, what we were doing 15:57 was very similar to just doing your basic 16:00 select from where with cross products and conditions. 16:03 The outer join is slightly less obvious that you can do it in SQL, but you can. 16:07 In fact, that's one of my, favorite little SQL exercises, 16:10 is to rewrite the outer join without using the outer join. 16:14 And let's just go straight to the 16:15 solution to that exercise for this particular query. 16:18 And here it is. 16:20 So what we do is we take the inner join, or in 16:23 this case I've just I've just done it in the cross product form. 16:25 We take the inner join of student and 16:27 apply, so all the tuples that actually match. 16:29 And then we add to that result student 16:32 tuples that don't have a matching tuple in apply. 16:35 So where their student ID does not appear in the apply table. 16:39 Okay, I hope you remember your subqueries in the where clause. 16:41 And for those we return the student name, student ID null and null. 16:46 And we do the union of those two results. 16:51 And that will give us exactly the same answer, and here it is. 16:56 Now let's go back to our original query. 16:58 In our query, we retained all student tuples 17:01 whether or not they matched an apply tuple. 17:03 What if we wanted to do things the other way around? 17:05 What if we wanted to retain apply tuples 17:08 whether or not they matched a student tuple? 17:10 Well, you can probably guess that all we need to do is swap 17:13 the order of apply and student and put apply on the left instead. 17:16 It so happens that our database actually doesn't have any apply 17:19 tuples that don't match, so we'll add those tuples right now. 17:23 We've added a student 321 applying to MIT for a couple of 17:27 majors and we don't have a matching 321 in our student table. 17:31 So going back to the original query and swapping the order so 17:36 that we keep all tuples that apply whether or not they match. 17:40 Let's go ahead and run the query. 17:42 And we take a look down here and we see that we 17:44 have those two applications of 321 even though there's no matching student name. 17:49 Now not surprisingly SQL actually has a companion to 17:52 the left outer join called the right outer join. 17:55 So I don't actually have to swap my relations, I can 17:59 actually use the right outer join with apply on the right-hand side. 18:03 And what this says is we're going to find 18:06 the student tuples that match the apply tuples, and 18:08 then if there are tuples in apply that don't 18:11 match the students, we'll add those padded with nulls. 18:14 So here goes. 18:15 And that's again the same result that we saw before with 18:18 the two non-matching apply tuples and null values for the student name. 18:22 Now I"m sure what you're thinking is what if we want to have 18:26 unmatched tuples from both the left and the right appear in our result. 18:30 And that's what's called the full outer join. 18:33 So let's just change this right now to the full outer join. 18:38 And let's move back to using the using clause which I like better, like that. 18:45 And let's go ahead and run the query and now we see that we have the 18:49 unmatched apply tuples with null values for the 18:51 name and we also have the unmatched student tuples. 18:55 With null values for the college name and major. 18:58 Not surprisingly the full outer join can be expressed in other ways. 19:02 We're going to start by expressing the full outer join 19:04 using a combination of the left and right outer join. 19:08 So, here we go. 19:09 Here we've got the left outer join where 19:12 we're going to retain student tuples that don't match apply. 19:15 We're going to do the right outer join, where we retain apply tuples 19:18 that don't match student, and then we're going to do the union of those. 19:22 Now you might be thinking we're going to get a lot of 19:24 duplicates because all the properly matching tuples are going to appear twice. 19:28 Once in the first one, and once in the second. 19:30 That is true, but remember, the 19:32 union operator automatically eliminates any duplicates. 19:35 So we'll go ahead and run this query. 19:37 And when we look at the result, we see that we get exactly the same 19:40 result we got when we ran, did the full outer join of the two relations. 19:45 Now we can also simulate, without using 19:48 any join operators at all, which you probably 19:50 figured out, because we're able to simulate 19:52 the left outer join, we already saw that. 19:55 Here we go. 19:55 This query has three parts. 19:57 The first part finds all of the matching student and apply records. 20:01 The second part adds to that those students who aren't present in apply. 20:05 So it gives us their name and ID and the 20:07 two null values for the attributes from the apply relation. 20:10 And then the third part finds those apply 20:13 records where there's not a matching student, and so 20:15 it gives us the three attributes from apply, 20:18 but then gives us a null for student name. 20:20 So we go ahead and run this one, and we 20:23 again get that you can verify, we get the same result. 20:28 The grand finale of this demonstration might seem a little esoteric, 20:31 but it does demonstrate a interesting pitfall of the outer join. 20:35 Let me start by re, reviewing, for many 20:37 of you, the concepts of commutativity and associativity. 20:40 Commutativity says when you do some operation, a 20:43 op b, that's always equivalent to b op a. 20:47 And most operations, most of the join and the cross 20:50 product, and so on, in SQL or relational algebra are commutative. 20:55 One exception, of course being the left outer join or the right outer join. 20:58 Associativity says when you have these binary operations. 21:02 You can do them in either order first. 21:05 In other words, you could first, do the operation on 21:09 a and b, and then combine the result with c. 21:11 Or you could first do, the operation on b and c and then combine the result with a. 21:17 Even if you're doing the same left to right order. 21:20 Associativity is also very common to cross product 21:23 and the natural join and so on are associative. 21:26 What's interesting is the outer join is not associative. 21:29 And I'm going to demonstrate that, not with the 21:31 college database but with a very simple set 21:33 of tables, each with one tuple, that I'm 21:35 going to set up specifically to make my point. 21:38 So I'm going to create three tables, each with two 21:41 attributes, and put just one tuple in each one. 21:44 So we've got, table T1 with attributes A and B and tuple 1, 2. 21:48 We've got T2 with attributes B and C and a tuple 2, 3 that matches the T1 tuple. 21:53 And then table T3 has attributes A and C. 21:57 So attribute A from table T1 are matching T1. 22:00 And C from T2, but the values don't match at all. 22:04 I would highly recommend you pull out a pencil and paper and 22:06 actually write these tables and tuples down so you can see what's happening. 22:11 Now let's look at this question of associativity. 22:13 So we're going to do the natural full outer join of the three 22:16 tables and we're going to start by putting the parenthesis on the left. 22:19 So we're first going to do the outer join of T1 and T2. 22:22 And then do the outer join of that result with T3. 22:25 And let's run the re, query, and we see that 22:28 we got two tuples 1, 2, 3, and 4 null 5. 22:32 We got 1, 2, 3, because T1 and T2 had 22:35 that nice pair of tuples that combine with each other. 22:38 But then when we took that result and did the outer 22:40 join with T3, that gave us an unmatching tuple in T3. 22:44 So we got four and five with a null value in B. 22:48 Now let's try the other associativity. 22:50 Let's just change our parentheses so that T2 is outer joined with 22:56 T3 first and then the result of that is outer joined with T1. 23:00 We run the query, and we get a different result. 23:03 And that's because we started with our unmatching 23:05 tuples which then didn't match correctly with T1. 23:09 Again, the best way to do this is to take your pencil 23:12 and paper and look at what's happening or try running the queries yourself. 23:15 To see that the outer jo, join is not associative. 23:18 And by the way the left outer join 23:20 and the right outer join are not associative either. 23:23 So none of the outer joins are associative. 23:24 The full outer join, however, is communicative while the others aren't. 23:28 Ok, so why is this important? 23:29 It's important because if you do queries that do have outer joins of 23:33 more than one relation, then it is important to think about the order. 23:37 It's not a performance thing, it's actually what result you get may 23:40 be different with different orders and 23:42 parentheses, parenthesizations of your outer joins. 23:47 So, in conclusion the family of join operators is used in SQL 23:51 to combine relations where we're explicitly 23:53 doing operations like relation, relational algebra. 23:57 The inner join is the relational algebra theta join, 23:59 where we explicitly combine relations on a specific condition. 24:03 We have a natural join operation which implicitly combine, equates column, the 24:08 values in columns that have the same name and eliminates those duplicate columns. 24:12 Both of those, can be expressed in fairly similar ways without the join operator. 24:17 The outer join, however, is sort of significantly different because 24:21 it does add the tuples that don't match the join condition. 24:25 And pads them with null values and it, it's quite useful. 24:28 People use join operators quite a lot. 24:31 The one thing I mentioned is that join operators 24:34 can be used to kind of tell the query processor how 24:37 it should execute the query and that can sometimes be 24:40 used to improve the performance of queries, of equivalent queries. 24:44 Though in theory, again, database systems should always find the best 24:48 way to execute their queries, but that's a tough thing to do. 24:51 And so people do sometimes give hints in the way they formulate 24:55 their queries sometimes using the join family of operators to do that.