0:00 This SQL video, which as 0:02 usual will consist mostly of 0:04 running live queries, introduces sub-queries in the where clause. 0:08 As usual, we start with our 0:10 basic select from where expression, 0:13 and we're going to add to it 0:14 now the ability in the 0:15 condition part of the select 0:17 from where expression to include sub-queries. 0:19 Sub-queries are nested, select 0:21 statements within the condition, 0:23 and we'll see they're actually quite powerful. 0:27 As usual, we'll be using a sample demonstration database that 0:29 consists of colleges, students, and students applying to colleges. 0:33 As always, we'll have our 0:34 four colleges, a bunch of 0:36 students, and a bunch of 0:38 records that show students applying to colleges. 0:40 So let's launch right 0:42 into our first query that shows 0:44 a sub-query in the where clause. 0:46 What this query finds is the 0:49 ID's and names of all students 0:51 who have applied to major in CS to some college. 0:54 So here's the sub-query in the where clause. 0:57 This expression right here in 0:58 the where clause finds the 1:00 ID's of all students who 1:01 have applied to major in CS. 1:05 So now we have that set of ID's, 1:06 our outer query says let's 1:07 take from the students those 1:09 students whose ID is in 1:11 that set, and let's select 1:13 their ID and their name. 1:14 So we go ahead and execute 1:16 the query and we find out 1:17 that five students have applied to major in CS. 1:22 Now we actually can do 1:23 this query without a sub-query in the where clause. 1:26 So let's take a look. 1:29 We can do it instead by joining 1:30 the student relation with the apply relation. 1:33 So here we do what we learned in previous videos. 1:36 We take student and apply, we 1:38 write the joined condition to make 1:39 sure we're talking about the same 1:40 student, we make sure 1:42 they're majoring in CS, and 1:43 we get their ID and their name. 1:45 Let's run the query. 1:48 Whoops, an error. 1:49 I knew that was coming. 1:51 This is just to remind you about disambiguating attributes. 1:54 The ID here in the 1:56 select clause could have come 1:57 from student or apply, and 1:59 even though the value is equal, 2:00 we do have to disambiguate by putting one of those. 2:03 So let's put student dot SID 2:05 and let's run the query. 2:07 Okay. 2:07 Now we see we got more 2:09 students back than we 2:10 got back when we ran the query using the sub-query. 2:12 Let's go back and look. 2:13 We got five results here and 2:16 we got several more here, 2:17 but the additional results are actually duplicate values. 2:21 So we have two copies, for example, of 1-2-3 Amy. 2:25 The reason for that is 2:26 that Amy actually applied to 2:28 major in CS at multiple colleges. 2:30 So if we go back and we 2:31 look at the apply data, we'll 2:33 see that Amy who is 2:35 1-2-3 applied to major in CS 2:36 at Stanford as well as Berkeley. 2:38 Let's hope she selects Stanford. 2:40 In any case, that's 2:42 why we got Amy twice in 2:43 the join because she applied twice. 2:46 Back here where we used 2:47 the sub-query, we were just 2:48 looking at students and whether their ID was in the set. 2:52 Okay, so when we do 2:53 the join we get basically the 2:55 same answer, but we have some duplicate values. 2:57 Of course, we can fix that by adding distinct. 3:01 We run the query and now 3:03 we have the same result that we got when we used the sub-query. 3:08 Let's look at some other 3:09 similar queries and let's 3:11 focus again on the duplicates 3:12 issue because it gets a little tricky, as we'll see. 3:16 This query is very similar 3:17 to the previous one, finding students 3:19 who are applying to major in 3:21 CS, but this time 3:22 we're only getting the names of 3:23 the students and not their ID's. 3:25 So we run the query and we find our same five students. 3:29 Just want to mention that these 3:30 two Craigs are two different Craigs. 3:33 If we go back to our 3:34 original result, there's three-four-five Craig 3:36 and five-four-three Craig. 3:38 So coming back here, we 3:40 find the names of the 3:41 students who majored in CS. 3:43 Now similarly to what we 3:44 did previously, let's write this 3:46 query using a join instead 3:48 of using the sub-query in the where clause. 3:51 So here's the join. 3:52 We're joining student and apply 3:54 on the student's ID majoring in CS, as always. 3:56 The only difference is that we're just selecting the name. 3:59 We run the query and we 4:01 get again many more results 4:03 than we got previously because we 4:05 get two copies when a 4:06 student has applied to to 4:07 major in CS at two different places. 4:09 And just as we 4:11 did before, we can add distinct to get rid of the duplicates. 4:18 Well, something different happened this time. 4:20 This time, when we get 4:21 rid of the duplicates, we only have 4:23 four results where previously we had five. 4:26 And the reason is that previously 4:27 we included the student ID 4:29 in the result, and so the 4:31 two instances of Craig were 4:33 two different Craigs and didn't 4:35 cause duplicates to be eliminated. 4:36 We can see that back here: Craig and Craig. 4:39 But in this result, because we 4:41 only kept the names, the two 4:42 copies of Craig turned into one result. 4:45 Now we might wonder why 4:47 do we care about duplicates so much. 4:49 Let's see an example where duplicates really do matter quite a bit. 4:53 We're going to do exactly the same 4:55 query again, finding students who 4:57 have applied to major in CS, 4:59 but now we're not retrieving the 5:00 IDs or names, we're retrieving 5:02 the GPAs of those students. 5:03 Presumably what we're interested in 5:05 doing is some analysis of the 5:06 GPA's of students who choose to apply for CS. 5:10 So let's run the query. 5:11 As always, we get our five results. 5:13 And here's the GPA's of the 5:15 five students who have applied to major in CS. 5:18 Once again, this will be 5:19 the last time, I promise, we'll 5:21 do this query using a join instead of the sub-query. 5:24 So here we go. 5:25 We've got student and apply join 5:27 on SID, majoring in CS, and returning the GPA. 5:31 Once again, because we 5:32 have students who applied multiple 5:34 times for CS, we're getting more than our five results. 5:37 So we get a rather large number 5:38 of results here and again we have duplicates. 5:42 So here's where the problem comes in. 5:44 If we use this result to 5:46 compute the average GPA, we'll be 5:47 counting some students multiple times. 5:49 And presumably, that's not what we want to do. 5:51 Presumably, we want to count 5:53 each student who's applied to 5:54 CS once in the computation 5:56 of, say, the average GPA. 5:58 That worked in the previous query 6:00 when we got the five results for the five students who applied to CS. 6:02 When we do the join, we get too many results. 6:06 But this time, again, we're 6:07 going to have a problem when we 6:08 do select distinct, because some 6:10 of these students have the same GPA. 6:12 And now we only have 6:13 four GPAs instead of the five that we should have. 6:16 And if we compute the 6:18 average GPA now, then we're 6:19 not factoring in one of the student's GPAs. 6:22 So in this case, neither the 6:23 version with distinct nor the 6:25 version without distinct gives us 6:26 the right number of GPAs. 6:29 Neither of those will give us the correct average. 6:32 The only way to get the 6:33 correct number of duplicates is 6:34 to use the version of the 6:35 query that has the sub-query in the where clause. 6:39 Now let's move to some different 6:41 examples that also use subqueries in the where clause. 6:44 You might remember from 6:46 the previous video when we 6:47 were learning the difference operator that 6:49 we had a query that we 6:50 could write using the difference operator, 6:52 which in SQL is called accept, 6:54 but we were unable to 6:55 write that query without the difference operator. 6:57 And the query we were trying to 6:59 write is to find students 7:00 who have applied to major in 7:02 CS, but have not applied to major in EE. 7:05 Now that we have sub-queries in 7:06 the where clause, we can write 7:08 that query without using the except operator. 7:11 And here it is. 7:12 The query looks for students where 7:14 their ID is among the 7:15 set of ID's of students 7:17 who have applied to CS, but 7:18 their ID is not among 7:20 the set of ID's of students 7:22 who are applying to major in EE. 7:25 So let's run the query. 7:26 And we discover that 7:28 there are three students who have 7:29 applied to major in CS, 7:30 but not applied anywhere in EE. 7:33 By the way, let me just show 7:34 you a slightly different way to write exactly the same query. 7:37 You can see that we use this 7:39 key word combination "not in" 7:41 to specify that the idea is not in this set. 7:44 We can actually write it by 7:46 writing the SID is 7:48 in the set and then applying a "not" to that result. 7:52 We'll execute and we'll get exactly the same result. 7:54 It's fairly common in SQL 7:56 for there to be multiple ways to 7:57 write the same query, and we'll 7:58 see more examples of that later in this video. 8:01 So far, our examples 8:04 of sub-queries have used "in" 8:05 and "not in" for testing 8:07 membership in the sets that are produced by sub-queries. 8:10 Our next examples are going 8:11 to show that we can apply 8:13 the exists operator to sub-queries 8:15 just to test whether they're empty or not empty. 8:18 Here's the query. 8:19 This query uses exists to check 8:22 whether a subquery is empty 8:23 or not empty rather than checking 8:25 whether values are in the subquery. 8:28 The other new construct that's 8:29 begin to introduce here's what's known as a correlated reference. 8:33 But inside the subquery we're going 8:34 to refer to a value, 8:36 C1, that comes from outside the subquery. 8:39 So let's talk about exactly how this query works. 8:42 First let me tell you what the query is trying to return. 8:45 The query is going to find 8:47 all colleges, such that there's 8:48 some other college that is in the same state. 8:52 Okay? 8:52 So, in our example, just a 8:54 reminder, we have Stanford, Berkeley, MIT and Cornell. 8:56 So, the two Colleges we should 8:57 get back are Stanford and Berkeley 8:59 because in each case there's another 9:00 college in the same state. 9:02 So how does this query work? 9:04 It says we're gonna to take 9:05 the colleges, and for 9:07 each college we're going to 9:08 check whether their exists another 9:10 college and we're going to 9:11 call that one C2, where the 9:13 state of C2 is the same as the state of C1. 9:16 This is sort of similar 9:17 to what we saw when were 9:19 doing self-joins with table variables, 9:21 but now, the variables are 9:22 appearing in the outer query and the inner query. 9:25 Okay. 9:25 So, let's run the query and 9:26 we see that we get the wrong answer. 9:29 That was again intentional. 9:31 Well, here's the problem. 9:32 When we were in this query, C1 9:34 and C2 could be bound to the same college. 9:38 So every college is in 9:39 the same state as another college 9:41 of the other college could be the same college. 9:43 What we need to do is 9:44 add inside the sub-query a 9:46 further condition that says that C1 and C2 are different colleges. 9:50 Specifically, C1.Cname 9:53 is not equal to C2.Cname. 9:54 Let's run the query 9:58 and now, we get the correct answer. 10:02 Now, let's look at some 10:03 other uses of the exists, 10:05 construct and sub-query. 10:08 If you happen to already know SQL 10:09 a little bit and someone 10:10 asks you to write a query 10:11 where you were going going to 10:12 get a largest value of some type, 10:14 the first thing you would think of 10:15 is probably using the max operator, 10:18 but we don't know max yet, we'll be learning that later videos. 10:21 And as it happens, a number of 10:22 queries that are effectively computing 10:24 a max can be written using sub-queries. 10:27 And here's our first example. 10:29 This example is finding the college that has the largest enrollment. 10:32 And we're going to 10:33 do that with the sub-query and with the not-exists operator. 10:36 Specifically, we're going to 10:38 find all colleges where there 10:40 does not exist another college 10:42 whose enrollment is higher than the first college. 10:45 So let's go ahead and run the query. 10:47 And, not surprisingly, we get Berkeley as a result. 10:51 So this is a form of 10:52 query that we can write any 10:53 time we're looking for something that's the largest or the smallest. 10:56 Let's, for example, look for 10:58 the student with the highest GPA. 11:00 So we'll change it to 11:02 the student name, and we'll 11:04 be looking instead of colleges at students. 11:07 Otherwise this form of this query is going to stay very similar. 11:11 We're going to look for students here as well. 11:13 And finally, we're going to have GPA instead of enrollment. 11:18 So, the way this query 11:20 works is it says "I want 11:21 to find all students, such that 11:24 there does not exist another student 11:26 who's GPA is higher". 11:28 We run the query and we get four results. 11:31 Seems odd. 11:32 Actually it's that odd. 11:34 Let's just add the GPA to our query and we'll see exactly what happened. 11:39 We can see that these four 11:40 students all have a GPA 11:42 of 3.9, so they're all tied for 11:44 having the highest GPA and very 11:46 specifically it faithfully runs the 11:48 query that there does not 11:49 exist another student whose 11:51 GPA is higher than these students. 11:52 Now, let's see if 11:54 we can write this same query without using a sub query. 11:59 So as usual, if we 12:00 want to write a query without 12:02 a sub query, we'll need to do some type of joint. 12:04 So we're looking for students with the highest GPAs. 12:07 So, we need to join two 12:09 instances of the student relation 12:10 as we've done here and then 12:12 we'll apply a condition that 12:14 the GPA of the first one 12:15 is greater than the GPA of the second one. 12:18 We run the query. 12:19 Wow, we got a lot of answers there. 12:23 Maybe our problem is just 12:24 one of duplicates. So first thing 12:25 we do when it looks too big 12:27 we add select distinct. Nope, 12:29 that doesn't solve the problem either. 12:31 Actually this query is fundamentally wrong 12:34 and we cannot write the query 12:35 that finds the student with the 12:36 highest GPA just by using joins. 12:38 What this query actually does is 12:40 it finds all students such 12:43 that there is some other student whose GPA is lower. 12:47 In other words, it's finding all 12:48 students except those who have the lowest GPA. 12:50 Now let's see another 12:54 new construct we can use 12:56 with sub-queries in the where 12:57 clause and we'll continue with 12:58 the same query of finding the student with the highest GPA. 13:02 This query uses the all keyword. 13:04 What all tells us is that 13:05 instead of checking whether a 13:07 value is either in or 13:08 not in the result of a 13:09 sub-query, we're going to 13:11 check whether the value has a 13:12 certain relationship with all the results of the sub-query. 13:14 And here, in this case, we're 13:16 checking to see if the 13:18 GPA is greater than or 13:19 equal to all elements 13:21 of the sub-query that returns the GPA's of all students. 13:25 If the student's GPA is indeed 13:26 greater than or equal to all 13:27 GPA's then the student has the highest GPA in the database. 13:31 We'll run the query, and we'll 13:33 get the same four students with the 3.9 GPA. 13:36 Now, let's try writing 13:38 this in a slightly different fashion 13:40 just for fun to illustrate some concepts. 13:43 Once again, we're going to 13:44 try to find the students whose GPA is highest in the database. 13:47 Here's how we're going to do it this time. 13:49 Instead of using greater than or 13:50 equal to all, we're going to use greater than all. 13:53 So, we're going to find all students 13:55 where their GPA is higher than every other student 13:57 by saying GPA is greater than 13:59 all GPA's of students 14:01 in the database who are not the student we're looking at. 14:04 By saying that the ID's are not equal. 14:06 Let's run the query. 14:08 Well, we got an empty result. 14:10 Let's think about it for a second. 14:11 That is actually the correct result to the query. 14:14 The query itself was incorrect. 14:16 The query is looking for all 14:17 students where nobody else 14:19 has the same GPA as that student. 14:21 Everybody else's GPA is lower. Well, 14:23 remember, we had four students 14:25 with a 3.9 GPA, so 14:27 none of those students are going 14:28 to satisfy the query and nor will any other student. 14:30 So this is an incorrect formulation of the query that we wanted. 14:34 Now this query would be 14:36 correct if we knew 14:37 that every student had the same GP...I'm sorry. 14:40 This query would be correct if 14:41 we knew that every student's GPA was 14:43 unique, because there would be 14:45 then one student who had the highest GPA. 14:47 So let's modify the query 14:49 to instead look for the 14:50 colleges that have the highest enrollment. 14:52 Because it turns out in our database every college has a unique enrollment. 14:56 So we'll change it to 14:58 getting the C name of college instead of student. 15:04 And we'll want the enrollment to 15:08 be greater than all other enrollments. 15:10 So we'll...enrollment, college, almost done here. 15:15 Grab a cup of coffee if you want. 15:17 I just have to make this 15:18 be C name and this 15:22 one be C name and we're all set. 15:24 So what are we asking for here? 15:26 We're asking for all colleges whose 15:28 enrollment is greater than all 15:30 of the enrollments of colleges that 15:32 are different than the one we're looking at. 15:34 We run the query and we get 15:35 Berkeley as a result 15:37 which is exactly what we'd expect. 15:39 So far we've seen the keyword 15:41 all for checking whether a 15:43 value has relationship with all of the results of a sub query. 15:46 There's also a key word 15:48 any that is a 15:49 companion but instead of 15:51 having to satisfy a condition 15:52 with all of the elements of 15:54 a set any says you 15:56 must satisfy the condition with 15:57 at least one element of the set. 15:59 So what we're going to do 16:00 now is going to do 16:01 the same query but we're gonna write it a little differently. 16:04 Let me just type it in and then explain. 16:09 Here what we're going to 16:10 say is get me all colleges 16:12 where it's not the case 16:14 that the enrollment is less 16:15 than or equal to any other college. 16:18 Okay? 16:19 So in other words there's no other colleges have bigger enrollment. 16:22 Just think about it for 16:23 a second, this is if you happen to know predicate logic. 16:26 This is an equivalence where when 16:28 you say for all it's equivalent 16:30 to not exists not. 16:31 In any case if you didn't follow that don't let that bother you. 16:34 Let's run the query and we 16:36 see we again get the result 16:37 Berkeley, so again all 16:39 tests a condition against every 16:41 element in the result of a 16:42 sub query, and the condition 16:44 is true if it's satisfied with 16:45 every element; whereas any is 16:48 true if the condition is satisfied with one or 16:50 more elements of the sub-query. 16:53 Let's use any for another query. 16:55 This query finds all students 16:57 who are not from the smallest high school in the database. 17:01 So, what we are looking for 17:02 here is all students, where the 17:04 size of their high school 17:05 is greater than any high school size. 17:08 In other words, a student is 17:09 going to be returned if there's 17:11 some other student whose size 17:13 high school is smaller than this student. 17:15 We run the query, and there's the result. 17:17 And you can verify if you 17:18 look back at the data, but 17:20 there is in fact, a few, 17:21 there are students who are 17:22 from high schools that have 200 17:24 students, so here, we 17:25 have in our result every 17:27 student that comes from a high school bigger than that. 17:31 Some systems, notably at 17:33 the current time SQLite, do 17:35 not support the any and the all operators. 17:38 We do not lose expressive power, we 17:40 just have to write those queries 17:41 a little differently typically using exist 17:43 or not exists, so let's 17:45 look at the same query written without using any. 17:49 Here's what we do. 17:50 We look for students where there 17:52 exists some other student, 17:56 okay, whose high school is 17:57 smaller than the student we returned. 18:00 So this is going to give us 18:01 exactly the same result and 18:02 we will see we will get the same set of students. 18:05 No students from a high 18:06 school with two hundred but we 18:07 get all the rest of the 18:08 students without using any or all. 18:11 And just to emphasize, any or 18:13 all are very convenient for writing queries, but they aren't necessary. 18:17 It turns out we can always write 18:18 a query that would use 18:20 any or all by using the 18:21 exists operator or not exists instead. 18:25 As a grand finale query, let's 18:27 revisit the query that finds 18:29 students who have applied to major 18:30 CS and have not applied to major in EE. 18:33 And now we're going to write 18:34 that query using the any 18:36 operator, two instances of it. 18:38 Now let's go back and look 18:39 when we wrote the query using in 18:41 and not in, and we see 18:42 that there are three results, Helen, Irene and Craig. 18:45 And now let's look at 18:46 our query using any and not equal to any. 18:49 So we find students where their 18:50 SID is in the set 18:53 of SID's of students who have 18:54 applied to major in CS and 18:56 their SID is not equal to 18:57 any of the students who 18:58 have applied to major in EE and let's run the query. 19:03 Well we got the wrong 19:05 answer and that was once again intentional. 19:07 Let's look very closely at what this query asks for. 19:10 This query asks for students 19:12 students where their ID is in 19:14 the set of ID's that 19:15 applied to CS, that's all fine 19:17 and good but this is where things got tricky. 19:20 What we're saying here is 19:22 that we want 19:24 the condition to check whether 19:27 there's any element in the 19:28 set of EE's that are 19:30 not equal to this SID. 19:32 So, in fact, this second condition 19:34 is satisfied as long as 19:35 there's anybody who applied to 19:37 EE that's not equal to 19:38 the student we're looking at and 19:40 that of course is commonly the case, 19:42 so we made a mistake 19:43 here and it's actually quite tricky 19:45 to use the any and 19:46 all operators and always get the right answer. 19:49 It's especially tricky when you 19:50 can't just eyeball the answer 19:51 and check but the correct 19:54 formulation that we want 19:55 here is that it's not 19:57 the case that the SID 19:59 is equal to any member of the set. 20:02 In other words, for each member 20:03 of the set of the EE 20:05 application, that value is 20:07 not equal to the SID 20:09 of the student we're going to retrieve. 20:10 And now we run the query and we get the right answer. DBMS by Stanf