0:00 The next set of queries will show introduced the concept of the aggregation in SQL. 0:04 Once again we start with the basic select from where construct, 0:08 and this time we're going to introduce what are known as aggregate or aggregation 0:11 functions. 0:13 These are function that will appear in the select clause initially 0:16 and what they do is they perform computations over 0:19 sets of values in multiple rows of our relations, 0:22 and the basic aggregation functions supported by every SQL system are 0:26 minimum, maximum, some, average and count. 0:29 Now once we've introduced the aggregation functions 0:32 we can also add two new clasues to the SQL select from where statement, 0:36 the group by and having clause. 0:38 The group by allows us to partition our relations into groups 0:42 and then will compute aggregated aggregate functions over each group independently. 0:47 The having condition allows us to test 0:50 filters on the results of aggregate values. 0:53 The where condition applies to single rows at a time. 0:56 The having condition will apply to the groups that we generate from the group 0:59 by clause. 1:00 All of these constructs will become very clear when we show some examples. 1:04 As usual our examples we use are simple college admissions database, 1:08 that has a table of colleges, 1:10 a table of students 1:11 and information about students applying to colleges. 1:15 As usual we have four colleges, 1:18 a bunch of students, 1:20 and applications of students to colleges for specific majors. 1:24 Our first aggregation query is a very simple one computes the average GPA of 1:28 the students in the database. 1:30 Here is the query, but just to explain the query let me first replace the aggregation 1:33 with star. 1:35 When we run select star from students we get our result here. 1:38 Specifically when we look at the GPA column, what the aggregation is going to 1:42 do is perform a computation over all values in the column, 1:46 and then produce just one tuple 1:49 in the results that has that value. 1:51 So we'll write average GPA. 1:56 We'll run the query, 1:57 and we see now the average GPA in the database. 2:00 Our second querie is a bit more complicated. It involves a join. 2:04 What it finds is the minimum GPA of students who have applied for a 2:08 CS major. 2:09 So we do the join of student and apply on student ID. 2:12 Filter for major equal CS. 2:14 Once again let me just remove the aggregation first, so we can see the 2:17 result without aggregation. 2:20 Here we see all of the information about students who've applied to a 2:23 CS major. 2:25 The aggregation is going to look at the GPA column and it's going to take the 2:28 lowest value in that column. 2:30 So we write min GPA. 2:33 Run the query, 2:34 and we discover that the lowest GPA is three point four. 2:38 Now let's go back to the average aggregate function again. So let's compute 2:41 the average GPA of the students who have applied to computer science, 2:45 run the query and we see the result is about three point seven. 2:49 In fact this result is probably not precisely what we were looking for. 2:53 In a previous video we talked in some detail about this particular form of query 2:57 versus using a sub query 2:59 to find the GPA of students who have applied to CS. 3:02 The issue with this particular query, and let's go back to select star version, 3:07 is that if a student 3:08 applied to CS 3:09 multiple times for example student 123 Amy applied to both Stanford 3:13 and Berkeley. 3:15 Then When we compute their average GPA we're going to be counting their GPA twice. 3:20 Presumaly what we actually want is to count the GPA once for 3:23 each student who applied to CS, no matter how many times they applied. 3:27 so in order to do that we use the sub query form where we select from student 3:33 and then we just check for each student whether their ID is among those who 3:36 apply to CS. 3:37 So let's just do a bit of editing here, 3:43 from apply where major equals CS. 3:49 I'm not a very good typist. 3:52 Let's just first take the sub query form, 3:55 and we see that we made a mistake here, 3:57 and it 3:59 says that we forgot a word here which is in. 4:03 Apologize for that. 4:04 Okay, now we have the 4:06 students who applied to CS 4:08 and in this case we only have one instance of each student. So now if 4:11 we run the aggregation 4:13 when we compute the average GPA we'll correctly count students GPA only 4:17 one time. 4:18 So remember with three point seven approximately before 4:21 now we run the query and we see that the correct result is three point six eight 4:25 not that different but this one is the numerically correct result. 4:30 So far we've seen the average and min aggregate functions. 4:33 This query shows the count function. 4:35 So not surprisingly the count function just returns the number of tuples 4:39 in the result. 4:40 So this 4:41 particular query finds the number of colleges in our database whose enrollment 4:44 is greater than fifteen thousand, just for it to be thorough let's run the 4:48 query without the aggregate function 4:50 and we see that there are two of them 4:52 and the aggregate function simply counts those number of tuples. 4:55 We run the query 4:57 and the result is two as expected. 5:00 Let's do another count query that looks very similar. 5:03 In this case we are counting the number of students who have applied to Cornell. 5:06 So the query looks the same 5:08 will run the query and we discover that six students have applied to Cornell. 5:12 Well in reality what we've actually counted in the number of applications to 5:15 Cornell, 5:16 not the number of students who have applied to Cornell. If a student applied 5:19 three times, then we're counting them three times in this result. 5:23 So in some sense 5:23 this is similar the previous one we saw, we were over counting, and we 5:27 could try to fix it with the sub query and the in and so forth but actually SQL 5:31 provides a very nice way for us to perform the query that we want. 5:35 In the count function 5:37 we can put a special keyword distinct and then the name of one or more 5:41 attributes 5:42 and in this case what the count will do, is will look at the result 5:45 and then it will count the distinct values for the particular attribute so we run the 5:49 query and we are find that there are 5:51 three distinct 5:53 student IDs. If we ran it without 5:55 just like this. 5:57 When we see, we would look and again we would find that there are three distinct 6:00 student IDs. 6:02 So count distinct actually turnes out to be a very useful feature in SQL. 6:07 Here's a rather complicated looking query and I have to admit it computes 6:10 something fairly obscure but it does demonstrate some features. 6:14 What this query computes 6:16 is, it gives us back all students 6:18 where the number of other students who have the same GPA as the student 6:23 is equal to the number of other student that have the same high school size at 6:26 the student. 6:27 Again I admit it's a little bit of obscure but let's take a look at how it works. 6:30 So it looks at the student relation and for each student it counts 6:35 the number of other students, that's by testing that the ID is different, 6:38 that have the same GPA. 6:40 It also counts the number of other students that have the same size 6:43 high school 6:44 and if those two values, 6:46 and the sub queries both produce a single value which is why would contest 6:49 equality, if those two values are the same then the student comes out in the 6:53 result. 6:54 Let's run the query. 6:55 We get our answer. 6:56 I'll leave it to you to verify by looking at the data that this is indeed 7:00 the correct result. 7:02 Here's another complicated looking query although this one computes something 7:05 quite a bit more intuitive. 7:07 What this query computes is the amount by which the average GPA of students 7:11 who apply to computer science exceeds the average GPA of students who don't 7:16 apply to computer science and we are going to assume it does in fact exceed. 7:20 So we are using in this example sub queries in the from clause which I hope 7:24 you remember from the previous video. 7:25 A sub query in the from clause allows you to write a select from where expression 7:29 and then use the result of that expression as if it were an actual table in the 7:33 database. 7:35 So we are going to compute two sub queries in the from clause, one of them which 7:39 is the average GPA of 7:40 CS applicants 7:41 and one the average GPA of non CS applicants. Let's look a little closer 7:45 here so this query here says let's find 7:48 the students 7:49 who have applied to major in CS, 7:51 let's compute their average GPA and we'll call it average GPA. 7:55 We'll take the whole result and we'll name it CS. 7:58 Similarly the second 8:00 new relation that were computing in the from clause 8:03 computes the average GPA 8:05 of students who did not apply to CS so their student ID is not in 8:09 the set of students who applied to CS. 8:11 We'll call that one non CS. 8:13 So now with this from clause we have a relation called CS with an attribute 8:17 called average GPA 8:19 and a relation called non CS 8:21 with an attribute called average GPA 8:23 and in the select clause we'll simply do the subtraction of 8:27 the non CS GPA from the CS GPA. 8:30 We run the query 8:31 and we discover that the CS applicants exceed on average 8:34 the non CS by point one nine. 8:38 Now just for demonstration purposes let me show that we can write the same query 8:42 but using sub queries and the select clause. 8:45 You might remember from the previous video 8:47 that in a select clause we can write a sub query as long as it returns a single 8:51 value. We're going to go even further we're going to write two sub queries that 8:55 return single values and subtract them. 8:57 So I'm going to replace this from with select. 9:00 I'm going to take up this whole first line here. 9:05 And then I'm going to... 9:08 Let's see. What else do I need to do? I'm going to take out the as here 9:11 and I'm going to replace it with a minus. 9:14 I'm going to take out the as here. So now 9:16 what we've got is we've got the average GPA of CS student in the select clause 9:20 producing a value. 9:22 The average GPA of non CS students. We perform the subtraction as part of the 9:26 select clause. 9:27 The last thing we need is something in the from clause so we'll say from 9:30 student 9:33 will call the result of this subtraction d as in for difference we run the 9:37 query and we get 9:39 almost the same result except for a bunch of duplicates. 9:42 The reason for the duplicates is that we compute this result once for each tuple in 9:47 student. 9:48 We can add distinct, 9:52 and now we get the single result. 9:55 Now let's learn about the group by clause 9:57 which is only used in conjunction with aggregation. 10:00 Our first query is going to find the number of applicants to each college and 10:04 it's going to do so 10:05 by using grouping. 10:07 Effectively what grouping does, is it takes a relation 10:10 and it partitions it by values of a given attribute or set of attributes. 10:14 Specifically in this query we're taking the apply relation 10:17 and we're breaking into groups 10:19 where each group has one of the college names, so it'll be the Stanford group, the 10:23 Berkeley group and so forth 10:24 and then for each group we will return one tuple in the result 10:28 containing the 10:29 college name for that group and the number of tuples in the group. 10:33 So to illustrate what's happening with grouping very clearly, let me start by 10:36 replacing the select clause with 10:39 uh... select star, removing the aggregation 10:42 and doing an order by 10:44 on the college name. So that illustrates the groups that we're going 10:47 to be using in in the actual query 10:50 so we see there are three tuples for Berkeley 10:52 there six tuples for Cornell and so forth. 10:56 Within each group then for the particular query we're doing we're simply going to 10:59 count the number of tuples 11:01 So going back to the group by form 11:05 we return the college name 11:08 and we return the count of a number of tuples. 11:12 Now before I actually run the query 11:14 one thing to notice is what it makes sense to put in that select clause here. 11:18 We're grouping by the college name so the college name is going to be the same for all of 11:22 the tuples of the group so it's sensible to include that, but it wouldn't 11:25 really be sensible to include when the other attributes that differs although 11:29 we'll see an example later 11:30 where we include that and interesting things happen. 11:33 For now we'll just put the name of the grouping attribute and then we can 11:36 perform aggregation over that. In this case count. 11:39 We run the query 11:40 and we see that there are three applicants to Berkeley, six to Cornell 11:43 and so forth. 11:45 Here's a very similar query, just as a second example, 11:48 and this example we're finding the total enrollment of college students 11:52 for each state. 11:53 Now we only happen to have three states and four colleges in our database 11:56 but what this query does is it takes the college relation it 12:00 breaks it up into partitions by state 12:02 and then within each partition or group it will return the state for that 12:06 partition and the sum of the enrollment. 12:08 We run the query 12:09 and we see 12:10 the expected result. 12:12 Here is a more complicated group by query. 12:14 In this case we're grouping by two attributes. 12:17 We also have a join involved 12:19 and we're going to compute two aggregate functions in our result. 12:22 What this query computes is for each college and major combination 12:27 the minimum and maximum GPAs for the students who've applied to that college. 12:32 As usual before we do the aggregation let's replace the query with one where 12:35 we can see the groups explicitly. So I'll put GPA here 12:39 and instead of grouping I'll order by 12:43 and we run the query 12:45 and we see for Berkeley biology we actually only have one applicant 12:49 for Berkeley CS we have two. 12:51 I think Stanford CS has the largest number of applicants at three. 12:55 So once we put the aggregation back each of these combinations of college and 13:00 major is going to be considered individually 13:02 and for each one will compete the minimum and maximum GPA. So let's go 13:06 ahead and do that. 13:12 All those parentheses. 13:14 Ok, change this to group by. 13:18 Run the query 13:20 and here we see 13:21 for Berkeley biology the minimum and maximum are the same because we saw that there 13:25 was only one 13:26 uh... student uh... who applied to Berkeley biology. 13:29 Stanford CS we had three we can see the spread of their GPA's and so 13:34 forth for each college major combination. 13:37 Now what if we were interested in finding information about the spread of 13:41 GPAs for each college and major 13:44 uh... what the differences between the minimum and the maximum. 13:48 So let's suppose in fact what we wanted to find was largest spread, but let's take that 13:51 in steps. 13:52 I'm going to take this current query here and I'm just gonna slap it into the from 13:56 clause. 13:58 So now in the from clause I'm going to have a relation, we'll call it M, 14:03 that's going to have the 14:05 college major combinations and the minimum and maximum GPA. This M 14:08 is going to now have the result that we see below. 14:11 Now once we have that result to work with 14:14 in the select clause 14:16 we can access the attributes here. Let me call this one 14:21 mn and this one 14:23 mx and if we write 14:24 just mx minus mn here 14:27 what we're going to get is a list of the spreads 14:29 of the min and max GPA for 14:32 each collage and major. 14:34 So we see many of them are zero. Those are all of the combination that had just 14:37 one student, but 14:38 in some cases we actually have a spread, 14:40 but what we want to find it is the largest spread, watch this it's very 14:44 cute, we just simply put a max function right here. 14:48 Run the query and 14:50 we find that the largest spread between min and max is approximately one. 14:55 Our next sequence of queries introduces some subtleties of the group by clause. 15:00 What our query finds is the number of colleges that have been applied for the 15:04 number of colleges that have been applied to 15:06 by each student. 15:07 So the query join student and apply. 15:10 It groups the result for the student in the student ID 15:14 and then for each student ID finds the number of distinct colleges in that 15:18 group. 15:19 So once again just to illustrate let's take away the aggregation, 15:23 let's replace the group by by an order by, just so we can see the data that 15:27 we're working on, 15:28 and run the query and we see that student one two three has four 15:32 applications to Stanford, Berkeley and Cornell. Three distinct colleges. 15:36 Two three four has one, 15:37 and so forth. 15:38 So now let's return to our original query group by student 15:42 we'll count the 15:43 distinct college names. 15:49 Run the query, 15:51 and we get the expected answer 15:53 student one two three applied to three 15:54 distinct colleges two three four to one 15:57 and so forth. 15:58 Maybe we'd like to have the student's name in the result as well. 16:01 That's easy to do. We just add the student name to the select clause 16:05 and we run query 16:06 and now we've added Amy is one two three, Bob is two three four and so forth. 16:11 That worked out fine in this case but we're gonna see in a moment where simply 16:15 adding an attributes to the select clause in the case of a group by does something a 16:18 little strange. 16:19 In fact the only reason that it worked nicely in this case is that when we 16:24 group buy the student ID 16:25 within each group 16:27 we have the same student name. Let's see that... 16:30 Let's just take a look at that. 16:33 Again we'll replace 16:34 the group by with an order by. 16:38 We're sorting by student ID 16:41 and we see for the first group which is one two three we have all Amys 16:45 for the three four five group we have all Craigs 16:48 and so on. So when we put student name in 16:51 the select clause for our group by query it was ok because for each group 16:55 we had the same name of the student. 16:57 When that's not the case 16:59 we should 16:59 presumably get an error 17:02 but let's see what happens. 17:03 So let's say that we're interested now not only in finding the uh... student 17:08 name in our query but we're further gonna add a college name. So let's go back to 17:12 the original query. 17:15 Distinct college name. Let's add college name in our result 17:19 and let's go back to grouping by student. So now for each student 17:23 we're gonna return the I.. for each group, 17:25 which is identified by the student ID, 17:27 we're gonna return the ID, the name, 17:30 the number of distinct colleges and then 17:32 a college. 17:33 Let's run the query 17:35 and we do get a result we don't get an error. So we've got 17:38 Amy one two three. 17:39 She applied to three distinct colleges but for whatever reason the system 17:43 decided to tell us about Stanford. 17:45 Craig applied to two colleges and the system tells us about MIT. 17:50 Let's go back again, revert now to our order by form of this query 17:55 and take away 17:57 the aggregation 18:00 and when we run the query we see there's Amy and she did apply to 18:03 four different colleges but when we put the college name in the grouping query 18:08 it chose to tell us about Stanford. 18:10 What this system is actually doing 18:13 is when we include in the select clause of a grouping query, again if we go back to group 18:18 by, and we put in the select clause an attribute that's not one of the grouping 18:22 attributes. 18:23 It actually chooses a random value from the group to include 18:27 in the case of the student name the random value was always the same because the 18:31 student name is always the same for the same student ID. 18:35 In the case of the college it shows a random value among the colleges. 18:39 I'm personally not crazy about this behavior. Among 18:42 three open-source database systems that we've been using for our demos 18:46 two of them allow this behavior, 18:48 SQLite and MySQL. I happen to be using MySQL today. 18:52 If we were using Postgre it would actually throw an error in this case. 18:56 Let's go back to the original query that finds a number of colleges applied to 19:00 by each student. Here it is as we originally wrote it, 19:03 and here is the result. 19:05 Now there are may some students in the database who haven't applied to any 19:09 college at all and they're not going to be represented in our result because 19:12 they're not going to be in the results of the join and we don't see any zero 19:16 counts here. 19:17 What if we wanted to actually lift the student IDs of students who haven't 19:21 applied anywhere and include a zero in the result. 19:24 This is actually one of my favorite queries to give as an exercise or an exam 19:28 question in my class 19:30 because it takes a little innovation to figure out how to write it. 19:33 The way we write it, actually, is by using the union operator, which I hope you 19:37 remember from a previous video. 19:40 We're going to take the union of the students who have applied somewhere and the number of 19:44 places they've applied 19:46 together with the students who haven't applied anywhere and for those students 19:49 we're going to 19:50 put a cero in the result directly. 19:52 So let me just write the beginnings of this so we're going to find those 19:55 students who haven't applied anywhere at all so those are the students whose 20:00 sID is not in at all 20:03 the set of sIDs in the apply relation. 20:08 So that finds the students who didn't apply anywhere for each student will 20:12 return their student ID 20:14 and then we want to have a zero in the result. 20:16 Very simple to do, we simply write zero. 20:19 We run the query, 20:20 and now we scroll down we'll see that in addition to all the students 20:24 who have applied somewhere 20:26 we get zeros in the result for the students who haven't applied anywhere. 20:31 The last clause that we're going to introduce in this video is the having clause and the 20:34 having clause is also only used in conjunction with aggregation. 20:38 What the having clause allows us to do is apply conditions to the results of 20:43 aggregate functions. 20:45 So the having clause is applied 20:46 after the group by clause and it allows us to check conditions that involve the 20:50 entire group. 20:51 In contrast the where clause applies to one tuple at a time. 20:55 So this particular query is finding colleges that have 20:58 fewer than five applicant. 21:01 So we look at the apply relation we group it by college name so we're going to 21:05 have one group 21:06 for each college. 21:07 Then we're only going to keep those groups 21:10 were the number of tuples in the group is less than five indicating that we 21:14 have fewer than five applicants to the college. 21:17 We'll run the query, 21:18 and we'll see that Berkeley and MIT each have fewer than five applicants. 21:22 So presumably Cornell and Stanford 21:24 have greater than or equal to five applicants and you can check that by 21:27 examining the data. 21:29 Now it is 21:30 possible to write the same query 21:33 without that having clause and in fact without the grouping clause 21:36 and let me show you how that's done. 21:39 We're going to select again the cName from apply 21:43 and here's what we're going to do. We're going to add 21:45 an attribute there, sorry a table variable. 21:49 We're gonna erase this whole business, 21:51 and we're going to find all 21:53 applicants, so all applications where 21:56 let me just type this out for you 21:58 five is greater than the count 22:02 from apply A2 22:06 where A2 dot cName equals A1 dot cName and let's take a look at what 22:11 this is doing. 22:15 So what we're doing here 22:17 is we're looking at each application record and we're checking 22:21 whether the number of other applications to the same college, were gonna find all 22:26 the applications to the same college, we're going to count them 22:28 and we're going to 22:29 see if that number is less than five. So it's a little bit non intuitive. 22:34 Now remember the result should be Berkeley MIT. Let's run the query 22:38 and we do get Berkeley MIT. We again get duplicates however. 22:42 The reason is that we're running 22:43 uh... we're checking this condition for each application record. 22:47 So for each application to Berkeley we checked the condition that will be 22:50 many of them and for each one to MIT and so on. 22:54 Of course as usual we can add distinct. 22:58 We can run the query and now we get the result. 23:01 It does happen actually 23:03 that uh... every query that can be written with the group by and a having 23:06 clause can be written in another form sometime it can be in extremely 23:11 contorted 23:12 but another sort of 23:13 complicated exercise I like to give sometimes in my classes to prove the 23:17 fact 23:18 that you can write every group by having query 23:21 without using those clauses. 23:24 Now let's go back to our original form of the query were we used the group by and 23:27 the having. 23:29 So we'll just get rid of all this. 23:33 I'm sorry I'm such a terrible typist. 23:35 Group by college name 23:37 having count star less than 23:40 five 23:44 and we'll run the query will get the same result. 23:46 What if we were interested not in those colleges that have fewer than five 23:50 applications but rather those colleges that have fewer than five applicants? 23:55 It's a subtle point but instead of counting the 23:59 number of apply tuples for each college 24:02 what we want count is the number of 24:05 distinct student ID's 24:06 that appear in the group of apply tuples for that college. 24:09 So again, now we're looking for the number of colleges who have fewer than five 24:12 distinct people, distinct students applying to them, and now we see that one 24:17 more college joins the group so Cornell had more than five applications 24:22 but at least five applications but it has fewer than five applicants. 24:28 Here's our grand finale query 24:30 it has a having clause and group by 24:32 and in addition to that is has a sub query within the having clause 24:35 that also uses aggregation. 24:37 What this query finds is all majors represented in the database where the 24:42 maximum GPA of a student applying for that major is lower than the average 24:47 GPA in the database. 24:49 So we join student and apply and we group by the major so we can look at each major 24:53 one-at-a-time. 24:55 Within the major we find the maximum GPA of a student applying for that major 24:59 and we check whether it's lower than the average GPA in the student relation. 25:04 We run the query 25:05 and we discover that bioengineering and psychology, at this point in time, are 25:10 garnering applicants whose uh... highest GPA is lower than the average 25:14 in the database. Of course that does not represent reality.