0:00 Good day, everyone. This is Dr. Soper, here. 0:05 And today we will be exploring our seventh topic in our series of database lectures 0:11 with this lecture focusing on the database indexes. 0:18 To begin, I want to emphasize the importance of indexes to the overall performance of database queries. 0:27 Of all of the strategies that we have for improving database performance, many people, including myself, 0:36 would argue that database indexes are the single most critical tool that is available to a database 0:44 administrator for improving the query performance of the database. An index itself is simply a data structure 0:53 of some sort that contains an organized copy of some of the data from one or more 1:01 of the tables in the database. And this data structure can be used to vastly improved 1:07 query performance. Conceptually speaking, you can think of a database index 1:13 just as you might think of an index at the back of a textbook. Imagine, for example, that I gave you a chemistry textbook 1:22 and I asked you to find information about beryllium, which is the fourth element 1:29 on the periodic table. Without an index or some other form of organizational framework for finding 1:37 the information in which you are interested, it would take an extremely long time in order 1:44 for you to locate information on beryllium that I had requested. In this case, without an index, your strategy 1:52 for locating the desired information would probably involve flipping through the pages 1:58 of the chemistry textbook, one by one, scanning the content of each page, 2:04 and looking for information about the element beryllium. If, however, you had an organizational framework 2:12 for the content of the textbook, such as an index that you might 2:18 find at the back of the book, you could simply look through the index, which in this case 2:25 would almost certainly be organized alphabetically, find beryllium, and then the index 2:31 would tell you which pages within the textbook contained information about beryllium. 2:38 You could then immediately skip directly to the pages in the textbook which 2:44 contain the information in which you are interested. I hope you can intuitively understand 2:51 how an index can thus vastly improve the speed with which 2:56 SQL queries can be answered. When teaching people about database indexes, 3:02 I always like to begin with an intuitive overview. Consider the table of names that are 3:09 shown on the right side of your screen. You can see that we have 22 rows of data. 3:15 In the first column we see the row position, row 22. And in the second column we see a person's last name. 3:25 And again, the last names here are randomly ordered. Now imagine that we choose a name at random from this list. 3:32 If we were to begin at the top of the list and move toward the bottom, examining one row at a time, 3:39 how many rows would we need to inspect before we would be able to find our randomly chosen name? 3:47 Well the answer, of course, is, it depends. And in this case it depends upon the name. 3:53 If, for example, we had chosen Salehian as our randomly selected name, and we started at the top, 4:01 moving down, we would find our desired name after inspecting only three rows. 4:08 If, on the other hand, the randomly chosen name was Pham, 4:13 we would need to inspect 20 rows before we would locate our desired name. 4:20 Thus, the performance of our search strategy varies widely and depends upon the name which is chosen 4:29 and the random order in which the names happen to appear. A more useful metric for assessing this search strategy 4:36 then, would be to ask the question, what is the average search time if we were to repeat 4:44 this process many, many times? So if we were to choose one of these names randomly 4:50 and go through this linear search process of starting at the top, working our way toward the bottom, searching for our target name, 4:59 if we were to repeat that process over and over and over again, constantly just choosing random names, 5:05 what would the average search time be? And in this case, by search time, 5:11 I'm referring to the number of rows that we would need to inspect in order to find our desired name. 5:21 Well this search strategy is known as a linear search. And the answer to our question is, that the average search 5:29 time would be equal to n plus 1 divided by 2, 5:35 where n is the number of rows in the table. In our example here, we have 22 rows in the table 5:43 and therefore, the average search time to find any randomly chosen name would 5:48 be equal to 22 plus 1 divided by 2. Which works out to 11.5 rows. 5:57 So on average then, using this linear search process, we would need to inspect 11.5 rows in order 6:05 find the name in which we are interested. Another good metric for assessing the efficiency 6:11 of this search strategy would be to ask, what is the maximum search time? 6:18 That is, at most, how many rows would we 6:23 need to inspect to find a randomly chosen name. And in this sort of linear search strategy, 6:30 if, by chance, we happen to randomly select the last name in the list, in this case, 6:37 the last name in the list is Israr, then we will need to inspect all 22 rows before we 6:46 locate the name in which we are interested. Thus, in this sort of linear search strategy, 6:51 the maximum search time will be n, where n is the number of rows. 6:56 In our example, the maximum search time would thus be 22. Next, let's explore a different sort of search strategy. 7:05 In this case, rather than having a table in which the names were 7:10 ordered randomly, we instead have a table in which the names are ordered alphabetically. 7:16 And as we will see, when we have this sort of organizational structure imposed on the data, 7:23 in this case, an alphabetical sort, that one change alone and vastly improved 7:29 the speed with which we can locate any randomly chosen name. 7:34 Now that we know that the names in our table are listed in alphabetical order, what 7:40 sort of search strategy could we use that could improve our performance. If we were to rely on the same linear search strategy 7:50 that we used in the previous example, we would not experience any gains 7:55 in our average search time. And the reason why is, with a linear search, we 8:00 are not capitalizing on the fact that our table now has 8:07 an organizational structure. That is, the names are listed in alphabetical order. 8:13 Because we have an ordered list of data, we can instead employ something called a binary search 8:21 strategy. And a binary search strategy is much more efficient at allowing us to locate any randomly chosen name 8:30 because it capitalizes on the knowledge that the names in the table are in alphabetical order. 8:37 So let's see how this binary search strategy might work. Imagine that we randomly choose a name. 8:43 Again, we will choose the name Salehian. Strategy involved in a binary search 8:50 might be described as divide and conquer. So knowing that the names are in alphabetical order 8:58 and knowing how many rows there are in the table, the best place for us to begin our searches 9:05 in the middle of the table. In this case, because we have an even number of rows, 9:11 there is no precise middle row in the table. So we may select the row location at n over 2 plus 1 9:21 to begin our search. Since we have 22 rows here, that would mean that we begin our search by examining 9:27 the value of row 12, which in this case, is the last name, Mishra. 9:34 So when we examine this row, we will ask the question, is this the row that we are looking for. 9:41 And since we know we're looking for Salehian and the row that we are currently examining is Mishra, 9:47 the answer to that question is, no. But because we know that the names are 9:53 in alphabetical order, we also now know that the row which we are seeking 10:01 is below row 12 in the table. Thus, after examining a single row, 10:09 we were able to eliminate more than half of the possibilities of where our randomly 10:16 selected name might appear. That is, after inspecting our first row, 10:22 we now know that the name we are seeking exists somewhere between rows 13 and 22. 10:31 We next repeat the process again. We look at the row which lies in the middle of our remaining 10:38 set of rows. In this case, we again have an even number of rows remaining. 10:44 So we might look at the row that is at location n over 2 plus 1. 10:51 And since we have 10 rows remaining, that would mean the next place that we will look will be at row six 10:59 among our remaining set of rows. In this case, that means the second location, where 11:07 we would look for our target name, would be the row which contains the last name Spievak. 11:14 And again, we would ask our self the same question. Now we asked before to wit, is this 11:21 the row that we're seeking. Again, since we are seeking the name Salehian, the answer to that question is, no. 11:28 And because the names are in alphabetical order, we are now able to eliminate Spievak and all of the rows 11:36 that follow as possible locations for our target name. Thus, after inspecting just two rows in this table, 11:46 we've been able to narrow down the set of possibilities for where our target name resides to just five rows. 11:55 We then repeat the process once again. Selecting the row in the middle of our remaining rows. 12:02 And since in this case, we have five rows remaining, the row in the middle is Salehian. 12:09 We would again, ask the question is this the row that we are seeking. 12:14 The answer is yes. And therefore, we have completed the search process. 12:20 In this case, we only needed to inspect three rows in the table 12:25 in order to find the randomly chosen name in which we were interested. 12:31 But again, a better metric of the relative performance 12:36 of the search strategy is to ask, what is the average search time if we were to repeat this process many, 12:45 many times, choosing a name randomly each time for which to search. 12:51 With this sort of binary search strategy, mathematicians and computer scientists 12:57 have been able to prove that the average search time will be log base 2 n minus 1, 13:08 where n is the number of rows. In our example, we have 22 rows so the average search 13:16 time would be log base 2 of 22 minus 1, 13:21 which means on average, we need to inspect approximately 3.5 rows in order to find 13:28 any randomly chosen name using this binary search strategy with an ordered table of data. 13:36 That's much, much better than the average from our previous linear search strategy, which if you recall, 13:43 required us to inspect 11.5 rows on average. As with our previous example, we can also 13:50 consider the maximum search time required. And what is remarkable about this binary search strategy 13:59 is that the maximum search time is simply equal to log base 2 n. 14:07 So to whereas the average is log base 2 n minus 1, here the maximum search time is log base 2 n. 14:16 That is just one more search than the average. 14:21 Thus, with 22 records, the maximum number of rows that we would need to inspect before we would locate 14:30 any randomly chosen name, would be log base 2 of 22, 14:35 or approximately 4.5 rows that we would need to inspect. 14:41 So the purpose of comparing these two search strategies, a linear search versus a binary search, 14:48 is to demonstrate the massive gain in search performance 14:53 that we can achieve if we impose an organizational structure 14:58 on the data through which we are searching. In this case, that organizational structure 15:04 was alphabetizing the list of names. Next, let's consider a few of the major concepts associated 15:14 with database indexes. As we saw in the previous intuitive examples, 15:21 without an indexing strategy in place, the database has no choice but to perform 15:27 a table scan when it is asked to locate one or more rows within a table. 15:33 That is, in the absence of an index, the database simply has to adopt a naive approach, where 15:40 it starts at the first row in the table and says, is this the row I'm looking for. 15:46 No, it moves on to the next row. Is this the row I'm looking for. No. And so forth until it finally locates 15:53 the row has been seeking. With an index in place, the DBMS no longer 16:01 needs to rely upon such a nice strategy. And it can locate the desired information much more rapidly. 16:09 Indexes then could be created on one or more columns within a database table. 16:17 Each table can contain potentially in many different indexes. And each column within a table might belong 16:26 to many different indexes. As an example, imagine that we have a table 16:32 and we want to create an index on the primary key column. So we instruct the DBMS to create the index. 16:42 And the contents of the index will be the primary key value 16:47 for each row in the table, along with the row's ordinal position. 16:53 That is, its row number within the table. The primary key values in this case 17:00 would be stored in a sorted order. So let's say that they are stored 17:06 in an ascending numerical order. Thus, whenever a query is run against the database which 17:14 involves the primary key for this table, the DBMS can search through the index rather than 17:22 the table itself to find the primary key value using the sort of binary search strategy 17:29 that we described previously. And then it can quickly locate the row position 17:36 that is the ordinal position of the target row within the table. 17:41 So again, this is just like looking in the index in the back of a textbook. 17:46 If you know that the values in the index are in alphabetical order or numerical order, 17:53 you can easily locate the information you want. And there will be a pointer which tells you 18:00 on which page of the textbook you should look for your desired information. 18:06 Again, although table can contain potentially many different indexes and although certain columns might be 18:15 involved in many different indexes, there are certain types of columns upon which an index 18:23 cannot be created. And the determination as to whether an index can 18:29 be created on a particular column depends on the column's data type. 18:35 Columns which have what we call a large object data type cannot be indexed directly. 18:43 There are some strategies, such as using hashing algorithm, which I will describe later, that 18:49 can be used to index these sorts of large objects. 18:55 However, generally speaking, they cannot be indexed directly. In SQL Server, the large object data types 19:03 include text, end text, which as we know 19:09 is a data type which supports unicode text, the image data 19:14 type, varchar(max), nvarchar(max), 19:20 which, again, is a variable length character string that supports unicode characters. 19:27 And of course, varbinary(max), which can be used to store binary encoded data directly 19:34 in the database. So these six data types are all considered large object data 19:40 types and columns which have one of these data types cannot be directly indexed. 19:46 Another important concept to understand about indexes is that we are making a trade off 19:53 that is, although when we properly use an index, we can vastly improve the performance of our queries, 20:01 the cost that we incur for that performance increase are additional storage space requirements. 20:09 And of course, all of the additional maintenance requirements that come along with the index. 20:16 Put simply, the reason that using an index increases the storage space requirements of the database 20:23 is that the index contains a copy of some of the data within a table. 20:29 And we need to store that copy of the data on the file system. 20:35 Let's see an example using our list of names from earlier in this lecture. 20:42 Let's imagine that the table on the left is an index table. While the table on the right is our actual table of data 20:51 from within the database. As you can see, the index table is 20:56 storing a copy of some of the data from the actual table. In this case, it's storing a copy of all of the last names 21:05 because we had instructed the database to construct an index on the last name attribute. 21:12 Although our database now requires more storage space, we can reasonably argue that the performance gains realized 21:22 through the expense of this extra storage space are well worth it. 21:27 As we saw previously, if we were to use a binary search against this index, 21:35 we could locate any randomly chosen name by inspecting just 3.5 rows within the table. 21:42 And then the index would contain a pointer, which tells us precisely where to look in the actual data 21:48 table for the row that we've been seeking. 21:53 Because indexes consume extra storage space, it's important for a database administrator 21:59 to know how to calculate how much extra storage space and index it will require. 22:05 A simple way of estimating the required storage space for an index is simply to multiply 22:11 the number of rows in a table by the average number of bytes that are required per row for the indexed columns. 22:19 As an example, imagine that we want to create an index on two columns. 22:27 Named, Last Name, and Department ID. And we know that within our table, 22:33 Last Names require an average of 16 bytes per row while Department IDs require an average of two bytes per row. 22:42 And let's say that our table contains 98,000 rows. We can then easily estimate the storage space requirements 22:50 for our index by multiplying the number of rows, 98,000, 22:55 times the average number of bytes required per row for the indexed columns, in this case, 23:03 at 16 bytes for each last name on average. And two bytes for each Department ID on average. 23:10 So we multiply 98,000 times 18 and we determined that our index will require about 1,764,000 bytes 23:22 of extra storage space. Next, we'll talk about a few specific types 23:29 of index structures. By far, the commonest type of index uses a B-Tree structure. 23:37 And the B here stands for balanced. So this is balanced tree. 23:42 In a B-Tree structure, we use pointers and layers of nodes 23:48 in order to organize our data. And using this structure, we can quickly locate any data that we desire. 23:56 Conceptually speaking, traversing a B-Tree index is exactly the same as the binary search strategy 24:05 that we described earlier. So a B-Tree is arranged around several layers of nodes. 24:12 We begin at the root node level. And then we might pass through one or more layers 24:20 of intermediate nodes until ultimately, we arrive at the leaf nodes. And it is at the leaf nodes where 24:26 we are able to get the information that we need for answering whatever query has been asked. 24:34 Let's see an example of how this works. So here we have a sample balanced tree index. 24:39 And again, we can give ourselves the task of locating any randomly selected name. 24:46 So I will, once again, choose Salehian as my randomly chosen name. 24:54 Beginning at the root node, we then can immediately traverse to the right side of the tree 25:00 because we know that our randomly selected name begins with a letter which falls between the range 25:06 M to W. We can then move immediately to the intermediate node for S. After which 25:12 we arrive at the leaf nodes and we can locate our desired name. 25:18 Recall that a B-Tree index is a balanced tree. And I hope you can see in this example 25:26 where a term balanced comes in. The objective, when constructing B-Tree index, 25:32 is to subdivide the data as evenly as possible 25:38 in a balanced way. And in following this strategy, we can maximize the search performance. 25:45 Before we move on to our next index structure, I want to speak briefly about clustered 25:50 versus nonclustered indexes. In a clustered index, we can store the actual data rows 25:58 of themselves, that is, the data rows that together, comprise 26:04 the actual data table in the database, at the leaf level of the index. 26:09 And this can improve search time because the database will not need to follow a pointer to the actual data 26:16 row within the table. At the leaf level of the index, the data rows 26:22 were already there. Because we are storing the actual data rows that comprise the data table at the leaf 26:30 level of the index however, this imposes a constraint such 26:36 that we can have only one clustered index per table. Primary key columns are, therefore, excellent candidates 26:45 for clustered indexes. Returning to our previous example of a balanced tree 26:51 index, imagine that at the level of the leaf nodes, 26:56 we are not simply storing pointers here to the rows within the actual database table. 27:04 But instead, we are storing the actual data rows themselves. Another way of thinking about this 27:11 is that in a clustered index, we are breaking the actual table apart and storing its rows in a specific way 27:22 such that we can maximize our search performance. In this case, the rows become the leaf nodes 27:29 at the bottom of the B-Tree index. By contrast, in a non clustered index leaf nodes 27:37 contain the values of the index columns and a pointer, which tells the database were 27:44 to look in the actual database table, for the target row or rows. 27:49 And of course, the actual data row itself might be stored as a leaf node of a clustered index, 27:57 as we saw previously. Or it might just be stored in something called a heap. 28:03 Where a heap is just the simple ordinary table that does not use a clustered index. 28:11 A few important points to note about non clustered indexes are that nonclustered indexes are slower than 28:18 clustered indexes. Because again, in a nonclustered index, the database must follow a pointer to the actual data row. 28:28 An advantage of nonclustered indexes however, is that each table in a database can contain more than one 28:35 nonclustered index. And another advantage of a nonclustered index 28:41 is that the leaf nodes of a nonclustered index are allowed to contain values from non-indexed columns. 28:51 And using this strategy can often substantially improve query performance. 28:56 Because an artfully designed nonclustered index, which contains values from non-indexed columns, 29:05 can be used by the DBMS to answer certain queries 29:10 without actually having to look at the real table of data 29:15 itself. For example, imagine that I have a product table where I have indexed the product name attribute. 29:25 Now let's imagine, as well, that I have included the price attribute as a part 29:31 of the nonclustered index. Now if I run a query where I'm attempting 29:36 to find the price of a particular product, the database can locate the product within the index. 29:45 And then it will immediately know the price of that product without actually having to look in the real data table itself. 29:54 So it will be able to answer that SQL query without ever having to look in the actual data table. 30:02 I hope you can understand how this can improve query performance. 30:07 Next, let's talk about a different index structure. And in this case, we'll look at bitmap index. 30:14 So in a bitmap index, we create a table 30:22 where we have the values of one attribute listed along the horizontal axis. 30:28 And the values of another attribute along the vertical axis. Each cell value contains a bit, that is, simply 30:37 a value of 1 or 0, which indicates whether the value of one attribute 30:43 is associated with the value of another. Important things to note about bitmap indexes 30:49 include that bitmap indexes work best when one or both of the attributes involved 30:57 in the bitmap index has only a relatively small number of unique values. 31:05 Also, when they are used properly, a bitmap index can require only about one quarter 31:14 of the storage space of a tree-based index and can also be about 10 times faster. 31:21 Let's see an example. So here we have two attributes, Student ID and Student Grade, 31:28 where a Student Grade exists along the horizontal axis of the bitmap index 31:33 and Student ID exists along the vertical axis of the bitmap index. 31:39 And we can see all the possible values for Student Grade are listed, A, B, C, D, and F. And the bits, 31:46 that is, the 1s or 0s, which exist at the intersection of a Student Grade 31:51 and a Student ID indicate to us whether or not that student is associated with that particular grade. 32:00 So if I write a query which says, give me the Student IDs of all of the students who 32:06 received A's, the database could use a bitmap index, such as the one depicted here, 32:13 and performance bit-wise comparisons, which are computationally extremely fast, in order 32:20 to answer our query. Next. Let's look at a third type of index, which 32:25 we can call a hashed index. Conceptually speaking, the idea with a hashed index 32:31 is that we use a hashing algorithm in order to convert some kind of input value or input data 32:39 into a location within an index. And that location within the index, in turn, 32:46 will contain a pointer to the actual data row within a data table. 32:52 These hashing algorithms typically work by using prime numbers as input along with something 32:59 called a modulo operation, which returns a remainder in order 33:06 to generate a location within an index. These types of hash indexes are useful in several situations. 33:17 Notably, in very large databases where parallel processing or a distributed database model 33:25 is being used. A hashing algorithm can be used to quickly determine 33:32 which database server or which processor is associated with the desired row or rows. 33:39 And we can also use hashed indexes in situations where we need to index complex objects, 33:47 such as images or other types of large objects. Here, we see an example of how a hashing algorithm can 33:55 be used to allow us to index complex objects, such as images. 34:01 So we begin by running the image through the hash algorithm. And the result of that operation is a location within an index. 34:12 So let's say that we run in image through hash algorithm and it gives us a location within the index of five. 34:22 We can then traverse our balanced tree until we get down to indexed location, which then tells 34:29 us to look in row nine of the actual data table for the data in which we are interested. 34:36 To finalize our lecture today, I just want to draw your attention to some index considerations 34:42 and some guidelines for using and implementing indexes. 34:47 First, remember that indexes require additional storage space. 34:53 And because of this, we should generally only create indexes on columns that 35:00 are involved in common queries. That is, columns which commonly appear in the WHERE clause 35:08 of a SQL query. Or perhaps in an order by or group by clause. 35:15 This also means that the database designer must know which sort of queries will 35:22 be run against the database if he or she is to design an effective indexing strategy. 35:28 A second point to consider is that indexes should be used sparingly on tables that are subjected 35:37 to frequent updates. To understand why, consider that whenever 35:44 an insert operation, an update operation, or a delete operation occurs which affects an index column, 35:54 then the index for that column has to be rebuilt by the DBMS. 36:00 Thus, if we have a table which is updated frequently, then we are also having to frequently rebuild the index. 36:08 Because rebuilding indexes takes time, if we create a lot of indexes on tables that are frequently 36:16 updated, the effort required by the database to constantly rebuild those indexes 36:23 may actually slow the overall performance of the database. 36:28 Finally, just a few indexing guidelines that I hope you will find useful when designing an indexing strategy 36:36 for your database. Again, if a table is heavily updated, 36:41 try to indexes few columns as possible. If you over-index a heavily updated table, 36:50 it may act really slow the overall performance of the database rather than improving the performance. 36:58 If, however, table is updated rarely, then, assuming that you have the storage space available, 37:05 feel free to use as many indexed columns as necessary to support the common types of queries that 37:11 are run against that table. Doing this will substantially improve the performance 37:18 of those queries. With respect to clustered versus nonclustered indexes, 37:24 remember that clustered indexes are most effectively used on columns that do not allow null values 37:32 and whose values are unique. Because, by definition, primary key column 37:37 it's contained unique values and values that are not allowed to be null, primary key columns 37:45 are therefor good targets for clustered indexes. And finally, the performance benefits 37:53 that you can achieve through the use of an index are related to the uniqueness of the values 37:59 in the indexed column. That is, if an indexed column contains 38:05 many, many duplicate values, the performance of the index will be quite poor. 38:12 On the other hand, if the indexed column contains many unique values, that's when you can achieve the maximum benefit from using an index. 38:21 Well my friends, thus ends our overview of database indexing. 38:31 I hope that you learned something interesting. And until next time, have a great day. 38:37