0:00 Good day, everyone. This is Dr. Soper here. And today we will be exploring the eighth and final topic 0:08 in our series of lectures on databases, with today's topic focusing on big data, data warehouses, 0:17 and business intelligence systems. Although the objectives for today's lecture 0:23 are varied and manifold, broadly speaking, our goals are to extend our knowledge of database systems 0:33 beyond relational databases by exploring some of the additional ways in which database technologies can 0:41 be used to achieve organizational objectives. Along the way, we will explore the basic concepts of data 0:49 warehouses and data marts. And we'll learn about the basic concepts and architecture 0:54 associated with dimensional or multi-dimensional databases. We will also learn about business intelligence systems, 1:03 online analytical processing, and data mining, all of which 1:08 are very useful for helping managers to make decisions. And then toward the end of today's lecture, 1:15 we will explore some of the fastest-growing topics in the database world, including 1:21 big data, the NoSQL movement, structured storage, 1:26 and the MapReduce process. So without further ado, let's begin. 1:33 I thought a good way of introducing some of the topics in this lecture 1:38 would be through an exploration of human mathematical intuition, that is, do we as human beings 1:47 have a natural intuition for finding truth in numbers? 1:53 To examine this question, consider the following example. Here we have two people, whom I've named Bubba and Daisy. 2:04 And both Bubba and Daisy are interested in purchasing new vehicles. 2:11 Currently, Bubba drives a very inefficient vehicle. Let's say that he drives an old truck. 2:17 And he gets 10 miles per gallon. Whereas Daisy drives a reasonably efficient car. 2:26 And in her case, her current vehicle gets 30 miles per gallon. 
2:32 Now let's say that both Bubba and Daisy are interested in purchasing new vehicles that will have better 2:40 fuel economy than their current vehicles with the goal of needing to purchase less gasoline every year, 2:48 thereby saving money. Bubba, however, does not want to give up 2:54 all of the benefits of his big truck, so the vehicle he's looking at provides just 12 miles 3:01 per gallon, as opposed to his current vehicle which provides 10. Daisy, on the other hand, is very 3:08 interested in fuel economy. And so she is planning to purchase a vehicle which 3:14 delivers 50 miles per gallon, as opposed to her current vehicle, which has 30. 3:21 Another way of considering these numbers is by looking at the percentage increase in miles per gallon. 3:28 In Bubba's case, he is moving from a vehicle which has 10 miles per gallon to one which has 12 miles per gallon, 3:36 and that is an increase of 20% in the miles per gallon. Daisy, by contrast, is moving from a vehicle 3:43 with 30 miles per gallon to one which provides 50 miles per gallon. And that is an increase of 66.7%. 3:52 Next, for the sake of offering a fair comparison between Bubba and Daisy, let's say that both drivers drive 4:00 10,000 miles per year. Now at this point, I want to appeal to your mathematical intuition. 4:07 Based upon the situation that I've described, I would ask you a simple question: 4:13 which of these drivers, Bubba or Daisy, is going to save more money on gasoline every year? 4:21 Well, if you are like most people, you will look at these numbers and you'll say that clearly Daisy's vehicle has a much greater increase 4:31 in miles per gallon, therefore, she will be saving the most money every year on gasoline. 4:38 Unfortunately, this is the wrong answer. In fact, Bubba will be saving more money 4:45 every year on gasoline. And let's see why. 
4:50 Here we have a table which shows us how much gasoline both Bubba and Daisy 4:56 are currently consuming based upon their driving habits versus how much they will consume if they purchase 5:04 their new vehicles. In Bubba's case, he drives 10,000 miles per year. 5:10 And in his current vehicle, he gets 10 miles per gallon. Therefore he will consume 1,000 gallons of gasoline per year. 5:19 In his new vehicle, Bubba will still drive 10,000 miles per year, but now will 5:25 get 12 miles per gallon, and therefore will consume 833.33 gallons of gasoline per year. 5:35 By contrast, in Daisy's current vehicle, she gets 30 miles per gallon. 5:40 And driving 10,000 miles per year, she consumes 333.33 gallons of gas. 5:48 Whereas with her new vehicle, which gets 50 miles per gallon, she will consume 200 gallons of gasoline every year. 5:58 Simple subtraction, then, shows that Bubba will save 166.67 gallons of fuel per year 6:08 if he purchases his new vehicle. Whereas Daisy will only save 133.33 gallons 6:16 of fuel per year. That is, Bubba will enjoy more fuel savings per year 6:23 by buying a new truck, which has 12 miles per gallon, versus his old truck, which had 10 miles per gallon, 6:31 than will Daisy by buying a new car which has 50 miles per gallon, versus her old car which 6:38 provided 30 miles per gallon. So if you're like most people, in this problem 6:44 your mathematical intuition has failed. And it is for this reason that I would 6:49 argue that managers should not rely on their intuition 6:55 when making important managerial decisions, but rather should rely on data. 7:02 Many managers have begun to recognize the deficiencies in human intuition, 7:08 and are hence adopting technologies like business analytics, business intelligence, data 7:14 warehousing, and data mining, so as to better support their decision-making activities with the data 7:21 that the organization has at its disposal. 
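The arithmetic behind this comparison is easy to verify for yourself. Here is a short sketch that reproduces the figures from the table above:

```python
# Annual gallons consumed = miles driven per year / miles per gallon.
def gallons_per_year(miles_per_year, mpg):
    return miles_per_year / mpg

MILES = 10_000  # both drivers cover the same distance each year

# Bubba: trading a 10 mpg truck for a 12 mpg truck.
bubba_savings = gallons_per_year(MILES, 10) - gallons_per_year(MILES, 12)

# Daisy: trading a 30 mpg car for a 50 mpg car.
daisy_savings = gallons_per_year(MILES, 30) - gallons_per_year(MILES, 50)

print(round(bubba_savings, 2))  # 166.67 gallons saved per year
print(round(daisy_savings, 2))  # 133.33 gallons saved per year
```

Because fuel consumed is the *reciprocal* of miles per gallon, gains at the low end of the mpg scale matter far more than gains at the high end, which is exactly what our intuition misses.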
By the way, if you're interested, 7:27 this problem is known as the miles per gallon illusion, and stems from the fact that we here in the United 7:35 States commonly measure fuel economy in miles per gallon, as opposed to the much more appropriate measure which 7:44 would be gallons per mile. Now that we've seen for ourselves 7:49 how human intuition can fail in even very simple mathematical problems, we can 7:55 begin discussing some of the systems and technologies that have been developed to help managers make better decisions 8:04 and avoid the failures that are commonly associated with relying upon intuition alone. 8:10 First among these technologies are business intelligence, or BI, systems. 8:16 BI systems are information systems that are intentionally designed to allow managers to capitalize 8:25 upon organizational data for the purpose of improving their decision making. 8:31 It's important to note that these business intelligence systems do not support the real-time operational 8:38 activities of the organization. Those activities, by contrast, are supported by transaction 8:46 processing systems. Instead, these BI systems are designed 8:51 to support managerial assessment activities, analysis, planning, 8:57 controlling, et cetera. Broadly speaking, we can classify these BI systems 9:04 into two categories. First are simple reporting systems, the purpose of which 9:10 is to provide managers with the ability to look at data in a flexible, real-time way, that 9:19 is, these reporting systems support simple data organization capabilities, such as sorting, filtering, 9:29 and grouping data. And they often provide the ability to make simple calculations on data in real time. 9:38 These simple calculations might include operations such as a sum, or an average, or a count. 9:48 By contrast, the second category of BI systems consists of data mining applications. 
9:55 Data mining applications are typically designed not to rely on real-time data, 10:02 but rather to rely on archived historical data. And the reason for this is that data mining applications 10:11 typically allow for sophisticated analyses on an organization's data. 10:17 Because these analyses involve complex statistical and mathematical processing, they typically cannot be 10:23 conducted in real time. The advantage, of course, of such statistical and 10:29 mathematical techniques is that they can deliver insights and create predictive models that simply 10:36 are not possible with the simple types of calculations that are available within reporting systems. 10:45 This figure depicts the relationships that exist between operational and BI applications. 10:52 Whereas our operational applications are used by functional users and rely upon the operational DBMS, 11:02 business intelligence applications are used by managerial users and can 11:08 rely upon the operational DBMS, or a specialized DBMS that is made just for the business intelligence 11:16 applications. And by extension, these BI applications 11:21 can hence rely directly upon the operational database by way of the operational DBMS, for example, 11:30 in the case of reporting applications. They might also rely upon archived historical data, 11:37 or other data sources, which are typically made available in the form of a data warehouse or a data mart. 11:45 As a quick review, just remember that simple BI reporting applications typically rely on an organization's 11:54 operational data. And they provide the ability to look at data in a simple way in real time. 12:03 By contrast, data mining applications typically rely on archived historical data, and as such, 12:11 do not provide a real time view of the organization. 
The trade-off for this time lag is 12:19 that data mining applications can use sophisticated statistical and mathematical techniques 12:26 to create models which allow managers to perform what-if analyses, make predictions about the future, 12:34 and, generally speaking, improve their decision making. As I noted earlier, business intelligence applications 12:43 that provide capabilities which go beyond basic reporting 12:49 typically rely upon extracts of the organization's operational database, along with data acquired 12:56 from other sources, all of which can be aggregated and stored 13:01 in a data warehouse. Thus a data warehouse commonly contains data from many different sources. 13:08 Not only does it contain data from the organization's operational databases, but it can 13:14 contain other internal and external data as well. In the case of external data, an organization 13:21 may be able to obtain such data from publicly available sources. 13:26 Or they may purchase data. Examples of these external data sets 13:32 might include information about what competitors are doing, what the market is doing, or expectations 13:39 about future global trends. Together, data from all of these various sources 13:46 are run through an ETL system, where ETL stands for extract, transform, and load, 13:54 so as to clean and prepare the data for inclusion in the data warehouse. After this process, the data can actually 14:01 be added to the data warehouse itself. And then our more complex business intelligence 14:07 applications will have a source of data upon which they can rely when performing their tasks. 14:16 Despite our best efforts at designing relational databases 14:21 that will ensure the quality and integrity of the data that they contain, it is, unfortunately, 14:27 still possible for problematic data to appear in the database. 
14:32 Further, because our business intelligence applications may rely upon these data in support of managerial decision 14:41 making, it is critically important that the data be of the highest quality possible, such 14:48 that managers will have the greatest chance possible of making good decisions. 14:53 Here, we're simply referring to the classic and cliched concept of garbage in, garbage out, that is, 15:01 if we are providing our managers with low quality or problematic data with which to support their decision making, 15:09 then we must expect the resulting decisions to be similarly problematic. So let's examine some of the problems that can 15:16 arise in operational databases. A very common problem is what we call dirty data. 15:23 And dirty data refers to a data value which is obviously incorrect. 15:29 As an example, we might have the letter v stored as a gender code, instead of the more common m or f, for male and female. 15:39 Or we might have a value of age stored as 213, 15:45 which would be a remarkably old human being if that number was correct. 15:50 Other problems with operational data include missing values and inconsistent data, 15:56 where inconsistent data refer to data values that have changed 16:02 and are not the same across all of our data sources. So perhaps a customer's phone number 16:09 was updated in the operational database. And the previous value in the data warehouse is, therefore, incorrect. 16:16 Additional problems with operational data include non-integrated data, that is, 16:22 when we have data from two or more sources that we need to merge together in some way 16:28 so that they can be added into the data warehouse. We may also have data in an incorrect format, 16:35 in which case, it will need to be transformed into the format that is required by the data warehouse. 16:43 And of course, we may simply have too much data. There is a general concept in science known as parsimony. 
16:52 And this concept tells us that simplicity is often preferred. 16:58 For example, suppose I construct a predictive model that is 95% correct, and it relies upon three predictor 17:08 variables in order to achieve that 95% accuracy. I might improve the accuracy to 96% 17:18 by adding 20 additional variables, but in most cases, the additional complexity involved in order 17:26 to achieve such a marginal gain in predictive power would not be worth it. 17:32 So in that case, we would prefer the three-predictor model over the model which contains 23 predictors. 17:40 As I mentioned earlier, because a data warehouse often contains data from multiple data sources, 17:46 the input data commonly need to be run through an ETL process 17:51 before they can be stored in the data warehouse. Again, ETL here stands for extract, transform, and load. 18:00 The extract step is simply pulling data from these various data sources. 18:07 The transform step is cleaning, or modifying, or processing the data in such a way 18:15 that they are made appropriate for inclusion in the data warehouse. 18:21 And then the load step refers to the process of taking the transformed, cleaned, processed data 18:29 and actually storing it in the data warehouse so that they can be used by whichever business intelligence 18:37 applications might rely upon that data warehouse. Simple examples of such transformations 18:43 might be transforming a country code into a country name. 18:50 So we may have the country code US, for example. And in the data warehouse, we need 18:55 to transform that into the name of the country, which might be United States. 19:01 Or we may have a customer's email address, such as dan@dan.com. 19:07 And we actually just want to store the email domain in the data warehouse for purposes of aggregation. 
19:14 In that case, we would want to transform the customer's email address simply into the domain, which in this case 19:22 would be dan.com, and store the result in the data warehouse. 19:28 Next, I would like to talk about the concept of a data mart. And I think the best way to understand 19:34 a data mart is simply that it is a subset of the organization's 19:41 data warehouse. Data marts are constructed to support a specific need 19:48 within the organization. So this might be a subset of the data warehouse 19:53 that is needed to support a particular project, or a particular functional area within the business, 20:01 like advertising or marketing, for example. Or perhaps we need to create a data mart 20:07 to support a specific group of employees within our organization, like a sales team. 20:14 Regardless of the specific reason why we create a data mart, the general principle 20:20 underlying their creation is simply that not all personnel, or not all managers, 20:26 within the organization will need access to all of the organization's archived historical data. 20:34 Personnel within the organization who perform new product development, for example, 20:41 probably will not need access to data associated with human resources, such as employee salaries or employee 20:53 benefits. Instead, we might create a data mart just 20:58 for the new product development team, which contains only those data that 21:03 directly support their needs. At some point in your adventures in the database world, 21:10 you may have heard the term dimensional database, or multi-dimensional database. 21:15 And I'd like to take a few moments to talk about some of the concepts associated with these types of databases. 21:23 To begin, it's important to note that dimensional databases are designed and implemented using exactly the same sort 21:32 of database technology that we use to create our operational databases. 21:38 That is, dimensional databases contain tables. 
They are related to each other using 21:44 primary key-foreign key links. We have the same concepts of cardinality, 21:51 such as one-to-one relationships, one-to-many relationships, et cetera. 21:57 So hopefully, operating within the familiar framework of the relational database world, 22:02 understanding these dimensional databases will be reasonably easy. 22:07 Broadly speaking, the idea with a multi-dimensional database is that we want to implement a non-normalized database 22:16 structure for the purpose of vastly improving query speed. 22:22 That is, in an operational database, we typically implement a database design 22:28 that is largely normalized, that is, it might be in third normal form, 22:34 or Boyce-Codd normal form, with perhaps a few tables being denormalized for the purpose of improving efficiency. 22:42 And what a normalized database allows us to do is to store large amounts of data 22:50 very quickly, while still preserving the quality and the integrity of the data in the database. 22:58 The problem with this sort of rigid, normalized design, however, is that if we want to extract data from the database, 23:07 we commonly need to perform computationally expensive join operations in order to get the information that we want. 23:16 So a normalized relational database is very good for quickly storing information, 23:23 but is very bad for quickly extracting information that we want. 23:28 By contrast, with the sort of design that is implemented in a multi-dimensional database, 23:34 storing the data in the database can be a very slow and laborious process. 23:40 However, extracting data from the database is very fast. 23:46 And the reasons for this are that we implement a non-normalized database structure, 23:52 while simultaneously storing data in pre-aggregated levels of granularity 24:00 within the dimensional database. 
An important point to note here is 24:05 that these dimensional databases are used to track an organization's historical data, 24:12 and therefore they almost always contain a date or time 24:18 dimension. And it is this date or time dimension that provides us with the ability 24:25 to store the same information aggregated at different levels of granularity 24:31 within the multi-dimensional database. For example, imagine that we have operational data which 24:38 represents all of the sales transactions for a store. So every time a sale is made, we may 24:46 generate a sale record, which records information about that transaction. 24:52 Now, this information is useful; however, it may not be as useful to a manager 24:59 as it would be if it were aggregated up to a coarser level of granularity. 25:05 Consider that if I were to take all of the sales data for one day and add them all together, then 25:13 I have a daily total. Similarly, if I take the daily totals 25:19 for seven consecutive days, then I have a weekly total. I can, in such a way, continue to create 25:26 monthly totals, quarterly totals, yearly totals, et cetera. 25:32 It is the same information that is available in the individual sales transactions, 25:40 except that it has been pre-aggregated and stored in the database in that form, 25:47 such that we can vastly improve query speed. That is, when a query is run where 25:54 we want to look at the data in the form of weekly totals, or monthly totals, or quarterly totals, the database at 26:03 that time does not need to aggregate all of the individual sales transactions 26:10 in order to produce the result. The result already exists in the dimensional database 26:16 because it has been pre-processed prior to being added into the database. 26:22 So again, the purpose of these dimensional databases, then, is to intentionally implement 26:29 redundant data in a non-normalized design, such that we can vastly improve query speed. 
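The roll-up idea described here can be sketched in plain Python with hypothetical transaction data. In a real dimensional database these pre-aggregated totals would be computed once, at load time, and stored in tables; here the dictionaries simply stand in for those tables:

```python
from collections import defaultdict
from datetime import date

# Hypothetical individual sales transactions: (sale date, sale amount).
sales = [
    (date(2023, 1, 2), 120.0),
    (date(2023, 1, 2), 80.0),
    (date(2023, 1, 3), 200.0),
    (date(2023, 1, 9), 50.0),   # this one falls in the next ISO week
]

# Pre-aggregate once, at load time, rather than at query time.
daily = defaultdict(float)
weekly = defaultdict(float)
for d, amount in sales:
    daily[d] += amount
    weekly[tuple(d.isocalendar())[:2]] += amount  # key: (ISO year, ISO week)

# A query for a daily or weekly total is now a simple lookup,
# with no aggregation work performed at query time.
print(daily[date(2023, 1, 2)])  # 200.0
print(weekly[(2023, 1)])        # 400.0 -- the total for ISO week 1 of 2023
```

Monthly, quarterly, and yearly totals would be built the same way, each as another pre-computed level of granularity keyed by the time dimension.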
26:36 We want to be able to extract data very quickly from the database. 26:42 The most common data model for a dimensional database is known as a star schema. 26:48 And the general characteristics of a star schema are that we have several dimension tables. 26:56 In this case, we have a time dimension table, a customer dimension table, and a product dimension table. 27:03 And at the intersection of all of those dimension tables, we have something called a fact table. 27:10 Philosophically speaking, the reason that the intersection table at the center of the dimension 27:16 tables is called a fact table, is because a fact in very real terms 27:23 is the intersection of information. For example, imagine that I'm interested in knowing 27:31 how many of a particular product a specific customer purchased during the month of December. 27:39 Well, the answer to that question is the intersection of the customer, 27:45 the product in which I'm interested, and the specific time frame that I specify, in this case, 27:51 the month of December. And the answer might be 14 units. So a specific customer purchased 14 units 27:59 of a particular product during the month of December. The intersection of those three values is a fact. 28:06 And it is for this reason that we label the table at the center of a star schema, a fact table. 28:13 To help you better conceptualize this concept, let's consider the intersection of two dimensions. 28:20 And in this example, we're looking at the customer dimension contrasted with the product dimension in the form of a two 28:27 dimensional matrix. The value contained in each cell within this matrix, 28:34 then, is a fact. And it expresses to us the quantity of a particular product that was purchased 28:41 by a particular customer. And you will notice, of course, if you recall from our last lecture, that this structure is 28:50 very similar to a bitmap index. 
Extending this concept out into a third dimension, 28:58 we can see here that we're representing a fact as the intersection of three different dimensions in a three 29:05 dimensional matrix. Along the horizontal axis I again have customers. 29:11 Along the vertical axis I again have products. But now, along the z-axis, I have a time dimension. 29:20 Therefore, the value contained in any of the cells in this three dimensional matrix 29:25 will tell me the quantity of a given product that was purchased by a particular customer 29:33 during a particular date or time frame. Now unfortunately, human beings have a great deal 29:40 of difficulty envisioning higher order spaces beyond three dimensions. 29:47 However, this concept scales very easily up to higher dimensional spaces. 29:54 So we might consider, for example, a fact to be the intersection of four dimensions, or five dimensions. 30:02 And although it is not easy to depict such a situation, conceptually, it is just a natural extension 30:09 of the two-dimensional and three-dimensional examples we saw here. In either case, I hope you can now 30:15 understand why databases designed in this way are called dimensional or multi-dimensional databases. 30:23 Next, I'd like to talk briefly about OLAP and data mining technologies. 30:29 If you recall from earlier in the lecture, we said that, generally, there are two broad categories of business intelligence applications. 30:38 And they were reporting applications and data mining applications. 30:44 Online analytical processing, or OLAP, then, is a technique that supports these reporting applications. 30:53 That is, OLAP allows us to dynamically examine database 30:59 data in real time, and apply simple transformations like sorting, filtering, grouping, et cetera. 31:08 And it allows us to perform simple arithmetic functions, such as summing values together, finding an average, a count, 31:17 the standard deviation, et cetera. 
And again, this is intended to be used in real time. 31:24 By contrast, data mining techniques support the data mining category of business intelligence 31:31 applications. And data mining, broadly, refers to a collection 31:36 of mathematical and statistical methods that can be used to gain deep insights into an organization's 31:44 data. Again, remember that the level of sophistication of these techniques generally requires that they not 31:52 be executed in real time, so as to avoid interfering with the real time operations of the organization. 32:01 OLAP systems, then, when used in support of simple BI reporting 32:06 needs, produce something called an OLAP report, which some people will refer to as an OLAP cube. 32:14 And the general idea here is that our inputs are a set of dimensions, while our outputs are 32:21 a set of measures. So recalling the two dimensional and three dimensional matrices 32:28 that we saw just a few moments ago, a manager might select a series of dimensions. 32:35 And the OLAP system might allow him or her to perform simple transformations 32:41 or drill-down operations on the data which lie at the intersection of those dimensions 32:48 so as to gain real time insights into the organization. And here we see that these OLAP cubes 32:56 can be constructed using our standard SQL SELECT queries. 33:02 In this case, we're selecting a number of different dimensions. We are then performing a join on four separate tables, 33:09 and are imposing some group by and order by requirements. 33:14 The result of this query in OLAP terminology, then, would be a result set which 33:20 represents a collection of measures that a manager could use to gain some insights 33:27 into his or her organization. 
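To make the dimensions-in, measures-out idea concrete, here is a small stand-alone sketch in Python rather than SQL. The customers, products, and quantities are invented for illustration; the grouping loop plays the role of a SELECT with SUM and GROUP BY over a fact table:

```python
from collections import defaultdict

# Hypothetical rows extracted from a star schema:
# (customer, product, month, quantity sold).
facts = [
    ("Alice", "Widget", "Dec", 8),
    ("Alice", "Widget", "Dec", 6),
    ("Alice", "Gadget", "Dec", 3),
    ("Bob",   "Widget", "Nov", 5),
]

# Dimensions in: (customer, product, month). Measure out: total quantity.
# This is the equivalent of:
#   SELECT customer, product, month, SUM(quantity)
#   FROM facts GROUP BY customer, product, month;
cube = defaultdict(int)
for customer, product, month, qty in facts:
    cube[(customer, product, month)] += qty

# The "fact" at one intersection of the three dimensions:
print(cube[("Alice", "Widget", "Dec")])  # 14
```

The value 14 mirrors the lecture's example: a specific customer purchased 14 units of a particular product during a particular month, and that intersection of dimensions is exactly what the fact table records.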
And of course, rather than constructing these SQL commands 33:35 repeatedly, we might take advantage of the capability of relational databases to create views, 33:42 so as to save the SQL statements, which are used to produce common OLAP cubes in the database itself. 33:53 Data mining, then, can be viewed as the convergence of many different disciplines. 34:00 A skilled data miner needs not only working knowledge of databases, but also needs statistical and mathematical 34:09 knowledge, perhaps knowledge of artificial intelligence, or machine learning algorithms, knowledge of data management 34:16 technologies, and so forth. In the modern world, many people become highly specialized 34:23 in one particular area. But the people who are most valuable to an organization 34:29 often have expertise in two or more areas. And this is certainly the case with people 34:35 who are experts at data mining. To conclude our overview of data mining, 34:41 I just wanted to briefly describe some of the most common techniques that are used to perform data mining against an organization's data. 34:50 Among these techniques are cluster analysis, in which the objective is to group sets of entities 34:58 together according to their level of similarity along one or more dimensions. 35:03 We also have decision tree analysis, in which we can process a large quantity of historical data 35:11 and generate a decision tree, which tells us what to do under different circumstances, 35:16 in order to achieve some kind of desired result. We also have regression available to us 35:22 as a very powerful data mining tool, the goal of which is to produce mathematical equations, 35:29 or mathematical models, that not only describe the relationships 35:36 between variables, but also provide us a basis for predicting future events based 35:43 upon past observations. 
Data mining applications might also 35:49 rely on sophisticated artificial intelligence, or machine 35:54 learning algorithms, such as neural networks or support vector machines. 36:00 And recently, we've seen a rise in a technique known as market basket analysis, or affinity analysis, which 36:10 allows us to look for patterns of co-occurrence, for example, determining which products are commonly 36:18 purchased together. And the results of these affinity analyses can then be used as the foundation of a recommendation 36:26 engine, which can suggest to you movies that you might like, books that you might like, et cetera. 36:35 Now I'd like to move into the final topic in our course 36:40 on databases, and that is the rise of the big data paradigm. 36:48 Scientists and researchers have recently noted an exponential increase in the quantity of data being 36:55 produced by the human species. If my memory serves correctly, I believe 37:00 the current rate of growth is that the amount of data doubles every 40 months. 37:07 At present, the world is generating many exabytes of new data every single day. 37:15 And if you're unfamiliar with the term exabyte, consider that one exabyte is slightly more than one 37:23 million terabytes. So you may have a computing device at home 37:28 that stores several terabytes of data. But consider that several million terabytes of new data 37:36 are being generated by the human species every single day. And this situation creates a vast array 37:43 of new and interesting problems for organizations. 37:49 The term big data, then, refers to the rapidly expanding amount of data that is being stored and used by organizations. 37:59 These data sets can be very large and very complex. And because of their size and complexity, 38:06 they can be extremely difficult to process using traditional database technologies. 
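The scale claims above are easy to check for yourself. A quick sketch, assuming binary units (1 terabyte = 2^40 bytes and 1 exabyte = 2^60 bytes, which is what makes an exabyte "slightly more than" a million terabytes):

```python
TERABYTE = 2 ** 40  # bytes
EXABYTE = 2 ** 60   # bytes

# One exabyte is slightly more than one million terabytes:
print(EXABYTE // TERABYTE)  # 1048576

# Doubling every 40 months compounds quickly: over 10 years (120 months)
# the volume of data doubles three times, i.e. grows eightfold.
doublings = 120 / 40
print(2 ** doublings)  # 8.0
```

In decimal (SI) units an exabyte is exactly one million terabytes; the "slightly more" phrasing matches the binary interpretation used here.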
38:11 And an important point to note is that much of what is considered big data 38:18 is being generated by web 2.0 applications, and the emerging collection of web 3.0 applications. 38:27 Traditional examples of web 2.0 applications might include social networking sites, video sharing sites, 38:36 blogs, discussion forums, et cetera. So this rapidly accumulating quantity of data 38:43 presents many challenges for organizations. Among these are simply capturing all of the data 38:50 and storing it, and maintaining the data once we have it. This is also commonly referred to as curation: 38:58 in the same way that the curator of a museum must maintain all of the ancient artifacts, so, too, 39:06 must the curator of a big data set be able to maintain the quality and the integrity of the data 39:12 in light of things like failing hardware, and the demand for the data to be used by many people from all 39:19 over the world simultaneously. Additional challenges include things such as search. 39:25 How does one search efficiently through such an enormous quantity of data? 39:31 Data transfer: consider, for example, that if you have a 100 megabit network connection 39:39 you can transfer approximately one terabyte of uncompressed data per day. At this speed, it would take you more than a million days 39:47 to transfer one exabyte of data. Further challenges include analyzing 39:53 these massive data sets, visualizing the massive quantities of data, and so forth. 40:00 In the past few years, a term has arisen in the area of big data that 40:05 is used to describe the movement toward using non-relational databases in order 40:12 to support these huge and highly distributed, highly replicated collections of data. 40:19 And this term is NoSQL. Although the name, to many people, 40:25 implies that SQL is not involved in these databases, that is, they do not support SQL-like queries, 40:33 this assumption is actually wrong. 
As it is used in contemporary database circles, 40:39 NoSQL means not only SQL. That is, although these very large databases 40:46 may not be based on relational algebra in the same way that a relational database is, 40:54 they nevertheless support querying through a SQL-like query language. 41:01 Unlike the relational database world, where the relational model is fixed and predominates 41:08 across all relational database vendors, in the NoSQL world, 41:13 there are many different architectures for non-relational databases that are currently being used. 41:20 These include architectures which rely upon a key value store, a wide columnar 41:27 store, a document store. There are databases that rely upon graph theory, 41:33 and so forth. Collectively, all of these different types of very large data stores are commonly 41:40 referred to as structured storage. And they have a few attributes in common. 41:46 First, they arguably employ simpler designs, as compared 41:52 to relational databases. And second, they almost always have a looser consistency model 41:59 than one will find in a relational database. Another way of saying that is that these structured storage 42:07 databases do not provide ACID guarantees. If you remember, ACID is an acronym 42:14 which stands for Atomicity, Consistency, Isolation, and Durability. 42:21 And ACID guarantees are the hallmark of a normalized relational database. 42:26 We cannot expect to have that level of consistency in these massive, highly distributed, highly replicated, 42:35 structured storage databases. When discussing the sort of data models that are actually used in these structured storage databases, 42:44 I like to use the data model that is employed by the Apache Cassandra database 42:50 as an example. And the reason for this is that it is one of the most popular structured storage 42:56 database management systems. And it is currently the most popular wide columnar store 43:03 that is available.
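To make the "simpler design" point concrete, here is a minimal in-memory sketch of the simplest structured storage architecture, a key value store. Real systems distribute and replicate these operations across many nodes; this hypothetical class only shows the core interface, in which values are opaque to the store and are looked up solely by key.

```python
class KeyValueStore:
    """A toy key value store: opaque values, addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Dan", "email": "dan@dan.com"})
print(store.get("user:42")["name"])  # Dan
```

Note what is missing compared to a relational database: no schema, no joins, no constraints, and no multi-item transactions. That simplicity is what makes these designs easy to distribute.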
Broadly speaking, the Apache Cassandra database 43:09 can be classified as a hybrid, key value slash wide columnar 43:15 database. So its architecture contains elements of both a key value store and a wide columnar store. 43:24 The Apache Cassandra database itself was originally created at Facebook by two of their software architects, 43:33 after which it was transferred to the Apache Foundation, where it now resides as an entirely open source and free database. 43:43 Apache Cassandra has cross-platform support, the reason for this being that it was written in Java. 43:51 So it can run on Linux-based machines, Windows-based machines, Unix, et cetera. 43:58 And Cassandra supports a massively distributed database environment. 44:04 That is, it allows us to subdivide our database among dozens, or hundreds, or even 44:10 thousands of separate database servers, potentially spread out all over the world. 44:16 The database is highly scalable and decentralized. By scalable here, I mean it's extremely easy 44:23 to add an extra node, that is, an extra server, to the cluster, 44:30 thereby expanding the size of the database. And by decentralized, what I mean here 44:36 is that all of the nodes, that is, all of the database servers, that are involved in an Apache Cassandra database 44:44 have the same role. And this provides the very desirable characteristic 44:49 of there being no single point of failure. Another very valuable characteristic 44:55 of the Apache Cassandra model is that it provides for automatic data replication. 45:01 That is, the database itself can automatically make copies of data and store those copies 45:09 in different locations throughout the cluster. This makes the database highly fault tolerant, such 45:15 that if an individual node, that is, an individual database server, were to fail, the redundant data stores take over 45:24 instantaneously. There's no downtime for the database at all.
Further, Apache Cassandra supports the MapReduce process, 45:33 which is a computational model for solving data processing problems in a highly distributed environment. 45:42 And I'll talk more about the MapReduce process here in a few minutes. And to illustrate the legitimacy of the Apache Cassandra model, 45:51 consider that it is currently used by organizations such as CERN, Constant Contact, Digg, Instagram, Netflix, Reddit, 46:01 Walmart, Twitter, et cetera. Now let's talk about the Cassandra data model itself. 46:09 As you know, in a relational database management system, related data for an application are stored together 46:17 in a container which is referred to as a database, or sometimes as a schema. 46:23 And within that database or schema, we have one or more tables. The analogous concept in Cassandra 46:32 is something called a key space. That is, data for an application are stored together 46:38 in a container known as a key space. And inside of that key space, instead of tables, 46:45 we have something known as column families. So just as in the relational database world, 46:51 a single DBMS might contain many databases, each of which contains many tables, 46:58 in the world of Apache Cassandra, the Cassandra database might contain many key spaces, each of which contains many column families. 47:07 The column families, then, contain columns. But this is not the same as a column 47:15 in a relational database. In Cassandra, a column consists of a name, 47:21 that is, the name of the column; a value, that is, the data value for that column; 47:27 and a timestamp, where the timestamp indicates the point in time at which the data value was changed. 47:35 Related columns, then, are all stored within the same row. 47:41 And each row is identified using a unique row key. 47:46 This notion of a unique row key is directly analogous to the idea of a primary key 47:53 in the relational database world.
Rows in the Cassandra model, however, 47:58 are not required to contain the same set or number of columns. 48:03 That is, different rows within the same column family might have a different number of columns. 48:11 And the number of columns in a particular row is allowed to expand or contract on an as-needed basis. 48:18 A few additional important differences to note between the Cassandra data 48:24 model and the relational data model are that in the Cassandra data model 48:30 there are no formal foreign key relationships between column 48:35 families. That is, we cannot establish formal relationships between 48:40 column families within the same key space. And what's more, it is not possible to join 48:47 column families together using a query. So whereas in the relational database model 48:53 we can write queries that will join tables together, it is not possible in the Cassandra model 49:00 to join column families together. Now, I know that this sort of a verbal description 49:06 of the Cassandra data model can be a bit challenging to follow. So let's look at a picture which I hope 49:12 will help to make some of these concepts clearer. Here we see a graphical representation 49:19 of the Cassandra data model. The outermost container represents all of the key spaces for the Cassandra database. 49:28 And in this case, we have just two key spaces, one of which is labeled as the blog key space, 49:34 and the other of which is labeled as the store key space, where the details of the store key space 49:41 are not elaborated in this diagram. Again, the idea here is that a key space 49:47 is roughly analogous to a database within the relational database world. 49:52 This means that a key space, then, is typically oriented toward a particular need 49:58 or a particular application. Within each key space, we can have one or more column 50:04 families. In this case, we have a column family for users, and we have a column family for blog entries.
50:13 Next, let's look at an example of a column family. And to begin, I will refer to the user column family. 50:20 So here, we can see that we have three rows within the column family. 50:25 And each row represents a unique user within the blog key space. 50:33 A user, then, is represented as a collection of one or more columns. 50:39 And remember that, in the Cassandra data model, the number of columns per row can vary from row to row. 50:49 So in this first row, that is, the Dan 42 row, we see that we have three columns. 50:55 The first column is the name column. Its value is Dan. And we have a timestamp. 51:02 The second column is the email column. Its value is dan@dan.com. 51:08 Again, it has a timestamp. And the third column is the phone column, which 51:13 has a value and a timestamp. For the next user, we have only the name column. 51:20 And for the third user, we have only the name and email columns. So there is no requirement that each row contain 51:27 the same number, or even the same types, of columns. Next, let's look at the blog entry column family 51:34 within this blog key space. Again, we see that each row within the column family 51:41 contains multiple columns. In this case, both rows contain the same columns. 51:47 But again, that is not a requirement. Here, the columns are the text column, the category column, 51:53 and the user column. Note, particularly, that the values stored in the user 52:01 column can be used to determine which blog entries were written 52:07 by which users. However, remember that formal relationships 52:13 between column families do not exist in Cassandra. That is, we do not formally establish 52:20 primary key, foreign key relationships. So I hope that looking at this diagram of the Apache Cassandra 52:28 data model demystifies things a little bit for you.
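The structure just described can be sketched with plain Python dictionaries: each column is a (value, timestamp) pair, each row is keyed by its unique row key, and rows may hold different numbers of columns. Beyond the Dan 42 row from the diagram, the row keys and data values below are hypothetical, and this is an illustration of the shape of the model, not the actual Cassandra API.

```python
import time

now = time.time()

blog_keyspace = {
    "users": {  # a column family: row key -> {column name: (value, timestamp)}
        "Dan42": {"name": ("Dan", now), "email": ("dan@dan.com", now),
                  "phone": ("555-0100", now)},          # three columns
        "Pat9":  {"name": ("Pat", now)},                # only one column
        "Sam77": {"name": ("Sam", now), "email": ("sam@sam.com", now)},
    },
    "blog_entries": {  # a second column family in the same key space
        "entry1": {"text": ("Hello!", now), "category": ("news", now),
                   "user": ("Dan42", now)},
        "entry2": {"text": ("My trip", now), "category": ("travel", now),
                   "user": ("Sam77", now)},
    },
}

# There are no joins between column families, so the application itself
# must follow the user value back to the users column family.
author_key = blog_keyspace["blog_entries"]["entry1"]["user"][0]
print(blog_keyspace["users"][author_key]["name"][0])  # Dan
```

The last two lines show the practical consequence of having no foreign keys: resolving which user wrote an entry is the application's job, done with a second lookup rather than a join in a query.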
52:33 I know that learning about these structured storage data models for the first time can be intimidating, 52:39 but I hope that through this diagram, you can see that it's really not that complicated. 52:46 And I hope that is encouraging for you. As I mentioned earlier, these structured storage databases 52:53 are often highly distributed and highly replicated. That is, they may be spread across many, many different 53:00 nodes or database servers. Now this structure has substantial advantages. 53:07 Not only does it provide fault tolerance, but it allows for data requests to be handled 53:15 by the nearest available node that is able to service the request. So for example, if you are in Tokyo, 53:24 and it happens that a copy of the data in which you are interested is stored on one of my database nodes, 53:32 which is located near Tokyo, it's much more efficient for that node to handle your request 53:37 than it would be to route the request to a distant geographical node, say, one 53:43 which might be located in Berlin. The problem with this model, however, 53:50 is that it can cause problems with consistency. Consider what happens when a data item is updated. 53:56 So if I update a data item on one node, 54:01 it will take time for that update to cascade to the other nodes within the cluster that 54:08 contain a copy of the data. So imagine that my distributed structured storage database 54:15 contains 1,000 nodes, spread all over the world, 54:20 and the data item I'm interested in updating is replicated across 100 of those nodes. 54:29 I may then perform the update on one of the nodes. And until that update cascades throughout all 54:39 of the other nodes in the cluster, any requests for that same data item 54:45 that are made of those other nodes will return values that are out of date. 54:51 And again, this is due to the fact that the database is widely distributed 54:57 and widely replicated.
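The stale-read scenario, and the role of per-write timestamps in resolving it, can be sketched as follows. The Replica class and the "last write wins" rule shown here are a simplified illustration of the general idea, not the actual mechanism of any particular product.

```python
class Replica:
    """One node's copy of a single data item, with a write timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = 0.0

    def write(self, value, timestamp):
        # Accept a write only if it is newer than what we already hold,
        # so late-arriving stale updates cannot overwrite fresher data.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

tokyo, berlin = Replica(), Replica()
tokyo.write("v1", timestamp=1.0)
berlin.write("v1", timestamp=1.0)

# An update lands at the Tokyo node first; Berlin briefly serves stale data.
tokyo.write("v2", timestamp=2.0)
print(berlin.value)  # v1 (out of date)

# The update eventually cascades; the timestamp resolves the conflict.
berlin.write("v2", timestamp=2.0)
print(berlin.value)  # v2
```

Until the second write reaches Berlin, a request serviced by that node returns the old value, which is exactly the window of inconsistency the lecture describes.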
This is also because we typically do not enforce 55:02 an ACID level of consistency. Thus, in these replicated data environments, 55:08 we commonly use a consistency model that is referred to as eventual consistency. 55:15 And what eventual consistency means is that if no new updates are made to a specific data 55:24 item for a period of time, eventually all of the requests for that data item 55:31 will return the most up to date value, regardless of which node is servicing the request. 55:37 And the timestamps that are recorded during each item update are the key that allows us 55:44 to reconcile any inconsistencies in replicated data 55:49 values between nodes. Finally, I would just like to take a few moments to discuss 55:55 the MapReduce process. Broadly speaking, MapReduce is a programming model 56:03 that relies on parallelization in order to perform data processing tasks on these huge data sets that 56:13 may be distributed across many different servers, or many different nodes. 56:19 So conceptually speaking, then, the MapReduce process involves two different types of nodes. 56:26 There will be a master node and worker nodes. Put simply, the master node is usually 56:33 the node which receives the data processing request from the user, 56:38 while the worker nodes are nodes which are assigned to complete part of the processing task 56:45 by the master node. So this MapReduce process, then, unfolds in two steps. 56:51 The first step is called the map step. And in the map step, the master node 56:57 will take the data processing problem and subdivide it into a series of sub problems. 57:04 And each of these sub problems is then assigned to, and carried out by, a worker node. 57:11 The second step in the MapReduce process, then, is the reduce step. So after the worker nodes have completed their assigned tasks, 57:21 they pass the results of their work back to the master node.
57:26 The master node will then do the final processing, or final combining, of those results 57:33 in order to produce the overall answer to the problem, which is then returned back to the user. 57:39 Again, I know that concepts such as this can be difficult to understand in verbal terms, 57:45 so let's see if we can get a better idea using an image. So toward the top of this figure we have the master node. 57:54 And toward the bottom, we have various worker nodes, which here are labeled one, two, three, and n, 58:01 up to however many worker nodes we need to solve the problem. So the MapReduce process unfolds as follows. 58:09 As input, the data processing problem is passed into the master node. 58:14 The master node will then divide that data processing problem into sub problems, which are then assigned to 58:22 and carried out by the various worker nodes. After completing their tasks, the worker nodes 58:29 will return their results back to the master node, which performs the final combining and processing of the worker nodes' 58:38 results in order to produce the final answer, which is then 58:43 the output of the MapReduce process. Let's consider an example. Imagine 58:49 that we are a wireless service provider, and that we use a highly distributed, structured storage 58:58 database, which has 1,000 different servers all over the world. 59:04 Let's further assume that our 100 million 59:09 customers are equally subdivided among our 1,000 servers. So that means we would have data for 100,000 customers per node 59:21 within our database environment. Now let's imagine that our data processing problem 59:27 is that we want to figure out the average number of text 59:33 messages sent during the month of November. 59:38 And we want those results organized by age.
So we would like to know what is the average number of text 59:46 messages sent by 18-year-olds, and 19-year-olds, and 20-year-olds, and 21-year-olds, and so 59:54 forth, all the way up to our oldest customers. Now let's see how the MapReduce process 1:00:01 can be used to solve this data problem. First, the problem is passed to the master node. 1:00:08 And the master node might subdivide the problem such that it instructs each of the 1,000 worker 1:00:15 nodes within our database environment to count the total number of text messages sent 1:00:23 by each customer during the month of November, and aggregate those results by age. 1:00:31 The results of the worker nodes' tasks, then, would be a table of data, which 1:00:37 might contain three columns. First would be all of the distinct ages of the customers 1:00:44 whose data reside on that particular worker node. The second column might be the number 1:00:50 of customers who are that age. So we might have 1,000 18-year-olds, 714 19-year-olds, 1:01:00 235 20-year-olds, et cetera. And the third column would be the total number of text messages 1:01:07 sent by customers of each particular age. So perhaps, 18-year-olds sent 10 million text messages, 1:01:16 19-year-olds sent 9,800,000 text messages, and so forth. 1:01:23 So each worker node performs this task for all of the customers whose data are stored on that node. 1:01:31 And those results, then, are returned back to the master node. The master node will then combine the results. 1:01:38 So it will calculate, for example, the total number of 18-year-olds and the total number of text messages sent 1:01:46 by 18-year-olds, after which it can divide those two numbers in order 1:01:52 to produce the average number of text messages sent by 18-year-olds. 1:01:58 That process is simply repeated for customers of every age. 1:02:03 And we then have the results, which we can send back to the user who requested them.
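The steps of this worked example can be sketched in miniature. Here two tiny hypothetical nodes stand in for the 1,000 servers, and the customer records are invented; the point is only the shape of the computation: workers aggregate locally, and the master combines the partial (count, total) tables before dividing to get averages.

```python
from collections import defaultdict

def map_step(node_customers):
    """Worker task: per age, count customers and total their November texts."""
    partial = defaultdict(lambda: [0, 0])  # age -> [customer_count, text_total]
    for age, texts_sent in node_customers:
        partial[age][0] += 1
        partial[age][1] += texts_sent
    return dict(partial)

def reduce_step(partials):
    """Master task: combine the per-node tables, then divide to get averages."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for age, (count, texts) in partial.items():
            totals[age][0] += count
            totals[age][1] += texts
    return {age: texts / count for age, (count, texts) in totals.items()}

# Two hypothetical nodes, each holding (age, texts_sent_in_november) records.
node_1 = [(18, 300), (18, 500), (19, 200)]
node_2 = [(18, 400), (19, 100), (19, 300)]

# Map: each worker aggregates only its own customers, in parallel.
partials = [map_step(node) for node in (node_1, node_2)]

# Reduce: the master combines the partial results into global averages.
averages = reduce_step(partials)
print(averages[18])  # 400.0
print(averages[19])  # 200.0
```

Notice why the workers return counts and totals rather than averages: averages of averages would be wrong whenever nodes hold different numbers of customers of a given age, whereas counts and totals combine correctly.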
1:02:10 So I hope that you can appreciate that this MapReduce process is a very clever way of efficiently handling data 1:02:18 processing problems in distributed database environments by taking advantage of parallelization 1:02:26 in order to solve the problem. Well, my friends, thus ends our overview 1:02:35 of big data, data warehousing, and business intelligence applications. 1:02:40 And more broadly speaking, thus ends our series of lectures on database technologies. 1:02:47 It is my sincere hope that you have found these lectures useful, and most importantly, 1:02:55 that you now have the self-confidence to go out and start creating and working with these databases. 1:03:02 It has been a great pleasure to lead you on this journey, 1:03:08 and I hope that you have a great day.