0:00 Good day, everyone. This is Dr. Soper here. And today we will be exploring the eighth and final topic 0:08 in our series of lectures on databases, with today's topic focusing on big data, data warehouses, 0:17 and business intelligence systems. Although the objectives for today's lecture 0:23 are varied and manifold, broadly speaking, our goals are to extend our knowledge of database systems 0:33 beyond relational databases by exploring some of the additional ways in which database technologies can 0:41 be used to achieve organizational objectives. Along the way, we will explore the basic concepts of data 0:49 warehouses and data marts. And we'll learn about the basic concepts and architecture 0:54 associated with dimensional or multi-dimensional databases. We will also learn about business intelligence systems, 1:03 online analytical processing, and data mining, all of which 1:08 are very useful for helping managers to make decisions. And then toward the end of today's lecture, 1:15 we will explore some of the fastest-growing topics in the database world, including 1:21 big data, the NoSQL movement, structured storage, 1:26 and the MapReduce process. So without further ado, let's begin. 1:33 I thought a good way of introducing some of the topics in this lecture 1:38 would be through an exploration of human mathematical intuition, that is, do we as human beings 1:47 have a natural intuition for finding truth in numbers? 1:53 To examine this question, consider the following example. Here we have two people, whom I've named Bubba and Daisy. 2:04 And both Bubba and Daisy are interested in purchasing new vehicles. 2:11 Currently, Bubba drives a very inefficient vehicle. Let's say that he drives an old truck. 2:17 And he gets 10 miles per gallon. Whereas Daisy drives a reasonably efficient car. 2:26 And in her case, her current vehicle gets 30 miles per gallon. 
2:32 Now let's say that both Bubba and Daisy are interested in purchasing new vehicles that will have better 2:40 fuel economy than their current vehicles with the goal of needing to purchase less gasoline every year, 2:48 thereby saving money. Bubba, however, does not want to give up 2:54 all of the benefits of his big truck, so the vehicle he's looking at provides just 12 miles 3:01 per gallon, as opposed to his current vehicle which provides 10. Daisy, on the other hand, is very 3:08 interested in fuel economy. And so she is planning to purchase a vehicle which 3:14 delivers 50 miles per gallon, as opposed to her current vehicle, which has 30. 3:21 Another way of considering these numbers is by looking at the percentage increase in miles per gallon. 3:28 In Bubba's case, he is moving from a vehicle which has 10 miles per gallon to one which has 12 miles per gallon, 3:36 and that is an increase of 20% in the miles per gallon. Daisy, by contrast, is moving from a vehicle 3:43 with 30 miles per gallon to one which provides 50 miles per gallon. And that is an increase of 66.7%. 3:52 Next, for the sake of offering a fair comparison between Bubba and Daisy, let's say that both drivers drive 4:00 10,000 miles per year. Now at this point, I want to appeal to your mathematical intuition. 4:07 Based upon the situation that I've described, I would ask you a simple question: 4:13 which of these drivers, Bubba or Daisy, is going to save more money on gasoline every year? 4:21 Well, if you are like most people, you will look at these numbers and you'll say that clearly Daisy's vehicle has a much greater increase 4:31 in miles per gallon, therefore, she will be saving the most money every year on gasoline. 4:38 Unfortunately, this is the wrong answer. In fact, Bubba will be saving more money 4:45 every year on gasoline. And let's see why. 
4:50 Here we have a table which shows us how much gasoline both Bubba and Daisy 4:56 are currently consuming based upon their driving habits versus how much they will consume if they purchase 5:04 their new vehicles. In Bubba's case, he drives 10,000 miles per year. 5:10 And in his current vehicle, he gets 10 miles per gallon. Therefore he will consume 1,000 gallons of gasoline per year. 5:19 In his new vehicle, Bubba will still drive 10,000 miles per year, but now will 5:25 get 12 miles per gallon, and therefore will consume 833.33 gallons of gasoline per year. 5:35 By contrast, in Daisy's current vehicle, she gets 30 miles per gallon. 5:40 And driving 10,000 miles per year, she consumes 333.33 gallons of gas. 5:48 Whereas with her new vehicle, which gets 50 miles per gallon, she will consume 200 gallons of gasoline every year. 5:58 Simple subtraction, then, shows that Bubba will save 166.67 gallons of fuel per year 6:08 if he purchases his new vehicle. Whereas Daisy will only save 133.33 gallons 6:16 of fuel per year. That is, Bubba will enjoy more fuel savings per year 6:23 by buying a new truck, which has 12 miles per gallon, versus his old truck, which had 10 miles per gallon, 6:31 than will Daisy by buying a new car which has 50 miles per gallon, versus her old car which 6:38 provided 30 miles per gallon. So if you're like most people, in this problem 6:44 your mathematical intuition has failed. And it is for this reason that I would 6:49 argue that managers should not rely on their intuition 6:55 when making important managerial decisions, but rather should rely on data. 7:02 Many managers have begun to recognize the deficiencies in human intuition, 7:08 and are hence adopting technologies like business analytics, business intelligence, data 7:14 warehousing, and data mining, so as to better support their decision-making activities with the data 7:21 that the organization has at its disposal. 
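The arithmetic behind this comparison is easy to verify for yourself. Here is a short sketch that reproduces the figures from the table above:

```python
# Annual gallons consumed = miles driven per year / miles per gallon.
def gallons_per_year(miles_per_year, mpg):
    return miles_per_year / mpg

MILES = 10_000  # both drivers cover the same distance each year

# Bubba: trading a 10 mpg truck for a 12 mpg truck.
bubba_savings = gallons_per_year(MILES, 10) - gallons_per_year(MILES, 12)

# Daisy: trading a 30 mpg car for a 50 mpg car.
daisy_savings = gallons_per_year(MILES, 30) - gallons_per_year(MILES, 50)

print(round(bubba_savings, 2))  # 166.67 gallons saved per year
print(round(daisy_savings, 2))  # 133.33 gallons saved per year
```

Because fuel consumed is the *reciprocal* of miles per gallon, gains at the low end of the mpg scale matter far more than gains at the high end, which is exactly what our intuition misses.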
By the way, if you're interested, 7:27 this problem is known as the miles per gallon illusion, and stems from the fact that we here in the United 7:35 States commonly measure fuel economy in miles per gallon, as opposed to the much more appropriate measure which 7:44 would be gallons per mile. Now that we've seen for ourselves 7:49 how human intuition can fail in even very simple mathematical problems, we can 7:55 begin discussing some of the systems and technologies that have been developed to help managers make better decisions 8:04 and avoid the failures that are commonly associated with relying upon intuition alone. 8:10 First among these technologies are business intelligence, or BI, systems. 8:16 BI systems are information systems that are intentionally designed to allow managers to capitalize 8:25 upon organizational data for the purpose of improving their decision making. 8:31 It's important to note that these business intelligence systems do not support the real-time operational 8:38 activities of the organization. Those activities, by contrast, are supported by transaction 8:46 processing systems. Instead, these BI systems are designed 8:51 to support managerial assessment activities, analysis, planning, 8:57 controlling, et cetera. Broadly speaking, we can classify these BI systems 9:04 into two categories. First are simple reporting systems, the purpose of which 9:10 is to provide managers with the ability to look at data in a flexible, real-time way, that 9:19 is, these reporting systems support simple data organization capabilities, such as sorting, filtering, 9:29 and grouping data. And they often provide the ability to make simple calculations on data in real time. 9:38 These simple calculations might include operations such as a sum, or an average, or a count. 9:48 By contrast, the second category of BI systems consists of data mining applications. 
9:55 Data mining applications are typically designed not to rely on real-time data, 10:02 but rather to rely on archived historical data. And the reason for this is that data mining applications 10:11 typically allow for sophisticated analyses on an organization's data. 10:17 Because these analyses involve complex statistical and mathematical processing, they typically cannot be 10:23 conducted in real time. The advantage, of course, of such statistical and 10:29 mathematical techniques is that they can deliver insights and create predictive models that simply 10:36 are not possible with the simple types of calculations that are available within reporting systems. 10:45 This figure depicts the relationships that exist between operational and BI applications. 10:52 Whereas our operational applications are used by functional users and rely upon the operational DBMS, 11:02 business intelligence applications are used by managerial users and can 11:08 rely upon the operational DBMS, or a specialized DBMS that is made just for the business intelligence 11:16 applications. And by extension, these BI applications 11:21 can hence rely directly upon the operational database by way of the operational DBMS, for example, 11:30 in the case of reporting applications. They might also rely upon archived historical data, 11:37 or other data sources, which are typically made available in the form of a data warehouse or a data mart. 11:45 As a quick review, just remember that simple BI reporting applications typically rely on an organization's 11:54 operational data. And they provide the ability to look at data in a simple way in real time. 12:03 By contrast, data mining applications typically rely on archived historical data, and as such, 12:11 do not provide a real time view of the organization. 
The trade-off for this time lag is 12:19 that data mining applications can use sophisticated statistical and mathematical techniques 12:26 to create models which allow managers to perform what-if analyses, make predictions about the future, 12:34 and, generally speaking, improve their decision making. As I noted earlier, business intelligence applications 12:43 that provide capabilities which go beyond basic reporting 12:49 typically rely upon extracts of the organization's operational database, along with data acquired 12:56 from other sources, all of which can be aggregated and stored 13:01 in a data warehouse. Thus a data warehouse commonly contains data from many different sources. 13:08 Not only does it contain data from the organization's operational databases, but it can 13:14 contain other internal and external data as well. In the case of external data, an organization 13:21 may be able to obtain such data from publicly available sources. 13:26 Or they may purchase data. Examples of these external data sets 13:32 might include information about what competitors are doing, what the market is doing, or expectations 13:39 about future global trends. Together, data from all of these various sources 13:46 are run through an ETL system, where ETL stands for extract, transform, and load, 13:54 so as to clean and prepare the data for inclusion in the data warehouse. After this process, the data can actually 14:01 be added to the data warehouse itself. And then our more complex business intelligence 14:07 applications will have a source of data upon which they can rely when performing their tasks. 14:16 Despite our best efforts at designing relational databases 14:21 that will ensure the quality and integrity of the data that they contain, it is, unfortunately, 14:27 still possible for problematic data to appear in the database. 
14:32 Further, because our business intelligence applications may rely upon these data in support of managerial decision 14:41 making, it is critically important that the data be of the highest quality possible, such 14:48 that managers will have the greatest chance possible of making good decisions. 14:53 Here, we're simply referring to the classic and cliched concept of garbage in, garbage out, that is, 15:01 if we are providing our managers with low quality or problematic data with which to support their decision making, 15:09 then we must expect the resulting decisions to be similarly problematic. So let's examine some of the problems that can 15:16 arise in operational databases. A very common problem is what we call dirty data. 15:23 And dirty data refers to a data value which is obviously incorrect. 15:29 As an example, we might have the letter v stored as a gender code, instead of the more common m or f, for male and female. 15:39 Or we might have a value of age stored as 213, 15:45 which would be a remarkably old human being if that number was correct. 15:50 Other problems with operational data include missing values and inconsistent data, 15:56 where inconsistent data refer to data values that have changed 16:02 and are not the same across all of our data sources. So perhaps a customer's phone number 16:09 was updated in the operational database. And the previous value in the data warehouse is, therefore, incorrect. 16:16 Additional problems with operational data include non-integrated data, that is, 16:22 when we have data from two or more sources that we need to merge together in some way 16:28 so that they can be added into the data warehouse. We may also have data in an incorrect format, 16:35 in which case, it will need to be transformed into the format that is required by the data warehouse. 16:43 And of course, we may simply have too much data. There is a general concept in science known as parsimony. 
16:52 And this concept tells us that simplicity is often preferred. 16:58 For example, suppose I construct a predictive model that is 95% correct, and it relies upon three predictor 17:08 variables in order to achieve that 95% accuracy. I might improve the accuracy to 96% 17:18 by adding 20 additional variables, but in most cases, the additional complexity involved in order 17:26 to achieve such a marginal gain in predictive power would not be worth it. 17:32 So in that case, we would prefer the three-predictor model over the model which contains 23 predictors. 17:40 As I mentioned earlier, because a data warehouse often contains data from multiple data sources, 17:46 the input data commonly need to be run through an ETL process 17:51 before they can be stored in the data warehouse. Again, ETL here stands for extract, transform, and load. 18:00 The extract step is simply pulling data from these various data sources. 18:07 The transform step is cleaning, or modifying, or processing the data in such a way 18:15 that they are made appropriate for inclusion in the data warehouse. 18:21 And then the load step refers to the process of taking the transformed, cleaned, processed data 18:29 and actually storing it in the data warehouse so that they can be used by whichever business intelligence 18:37 applications might rely upon that data warehouse. Simple examples of such transformations 18:43 might be transforming a country code into a country name. 18:50 So we may have the country code US, for example. And in the data warehouse, we need 18:55 to transform that into the name of the country, which might be United States. 19:01 Or we may have a customer's email address, such as dan@dan.com. 19:07 And we actually just want to store the email domain in the data warehouse for purposes of aggregation. 
19:14 In that case, we would want to transform the customer's email address simply into the domain, which in this case 19:22 would be dan.com, and store the result in the data warehouse. 19:28 Next, I would like to talk about the concept of a data mart. And I think the best way to understand 19:34 a data mart is simply that it is a subset of the organization's 19:41 data warehouse. Data marts are constructed to support a specific need 19:48 within the organization. So this might be a subset of the data warehouse 19:53 that is needed to support a particular project, or a particular functional area within the business, 20:01 like advertising or marketing, for example. Or perhaps we need to create a data mart 20:07 to support a specific group of employees within our organization, like a sales team. 20:14 Regardless of the specific reason why we create a data mart, the general principle 20:20 underlying their creation is simply that not all personnel, or not all managers, 20:26 within the organization will need access to all of the organization's archived historical data. 20:34 Personnel within the organization who perform new product development, for example, 20:41 probably will not need access to data associated with human resources, such as employee salaries or employee 20:53 benefits. Instead, we might create a data mart just 20:58 for the new product development team, which contains only those data that 21:03 directly support their needs. At some point in your adventures in the database world, 21:10 you may have heard the term dimensional database, or multi-dimensional database. 21:15 And I'd like to take a few moments to talk about some of the concepts associated with these types of databases. 21:23 To begin, it's important to note that dimensional databases are designed and implemented using exactly the same sort 21:32 of database technology that we use to create our operational databases. 21:38 That is, dimensional databases contain tables. 
They are related to each other using 21:44 primary key-foreign key links. We have the same concepts of cardinality, 21:51 such as one-to-one relationships, one-to-many relationships, et cetera. 21:57 So hopefully, operating within the familiar framework of the relational database world, 22:02 understanding these dimensional databases will be reasonably easy. 22:07 Broadly speaking, the idea with a multi-dimensional database is that we want to implement a non-normalized database 22:16 structure for the purpose of vastly improving query speed. 22:22 That is, in an operational database, we typically implement a database design 22:28 that is largely normalized, that is, it might be in third normal form, 22:34 or Boyce-Codd normal form, with perhaps a few tables being denormalized for the purpose of improving efficiency. 22:42 And what a normalized database allows us to do is to store large amounts of data 22:50 very quickly, while still preserving the quality and the integrity of the data in the database. 22:58 The problem with this sort of rigid, normalized design, however, is that if we want to extract data from the database, 23:07 we commonly need to perform computationally expensive join operations in order to get the information that we want. 23:16 So a normalized relational database is very good for quickly storing information, 23:23 but is very bad for quickly extracting information that we want. 23:28 By contrast, with the sort of design that is implemented in a multi-dimensional database, 23:34 storing the data in the database can be a very slow and laborious process. 23:40 However, extracting data from the database is very fast. 23:46 And the reasons for this are that we implement a non-normalized database structure, 23:52 while simultaneously storing data in pre-aggregated levels of granularity 24:00 within the dimensional database. 
An important point to note here is 24:05 that these dimensional databases are used to track an organization's historical data, 24:12 and therefore they almost always contain a date or time 24:18 dimension. And it is this date or time dimension that provides us with the ability 24:25 to store the same information aggregated at different levels of granularity 24:31 within the multi-dimensional database. For example, imagine that we have operational data which 24:38 represents all of the sales transactions for a store. So every time a sale is made, we may 24:46 generate a sale record, which records information about that transaction. 24:52 Now, this information is useful; however, it may not be as useful to a manager 24:59 as it would be if it were aggregated up to a coarser level of granularity. 25:05 Consider that if I were to take all of the sales data for one day and add them all together, then 25:13 I have a daily total. Similarly, if I take the daily totals 25:19 for seven consecutive days, then I have a weekly total. I can, in such a way, continue to create 25:26 monthly totals, quarterly totals, yearly totals, et cetera. 25:32 It is the same information that is available in the individual sales transactions, 25:40 except that it has been pre-aggregated and stored in the database in that form, 25:47 such that we can vastly improve query speed. That is, when a query is run where 25:54 we want to look at the data in the form of weekly totals, or monthly totals, or quarterly totals, the database at 26:03 that time does not need to aggregate all of the individual sales transactions 26:10 in order to produce the result. The result already exists in the dimensional database 26:16 because it has been pre-processed prior to being added into the database. 26:22 So again, the purpose of these dimensional databases, then, is to intentionally implement 26:29 redundant data in a non-normalized design, such that we can vastly improve query speed. 
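The roll-up idea described here can be sketched in plain Python with hypothetical transaction data. In a real dimensional database these pre-aggregated totals would be computed once, at load time, and stored in tables; here the dictionaries simply stand in for those tables:

```python
from collections import defaultdict
from datetime import date

# Hypothetical individual sales transactions: (sale date, sale amount).
sales = [
    (date(2023, 1, 2), 120.0),
    (date(2023, 1, 2), 80.0),
    (date(2023, 1, 3), 200.0),
    (date(2023, 1, 9), 50.0),   # this one falls in the next ISO week
]

# Pre-aggregate once, at load time, rather than at query time.
daily = defaultdict(float)
weekly = defaultdict(float)
for d, amount in sales:
    daily[d] += amount
    weekly[tuple(d.isocalendar())[:2]] += amount  # key: (ISO year, ISO week)

# A query for a daily or weekly total is now a simple lookup,
# with no aggregation work performed at query time.
print(daily[date(2023, 1, 2)])  # 200.0
print(weekly[(2023, 1)])        # 400.0 -- the total for ISO week 1 of 2023
```

Monthly, quarterly, and yearly totals would be built the same way, each as another pre-computed level of granularity keyed by the time dimension.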
26:36 We want to be able to extract data very quickly from the database. 26:42 The most common data model for a dimensional database is known as a star schema. 26:48 And the general characteristics of a star schema are that we have several dimension tables. 26:56 In this case, we have a time dimension table, a customer dimension table, and a product dimension table. 27:03 And at the intersection of all of those dimension tables, we have something called a fact table. 27:10 Philosophically speaking, the reason that the intersection table at the center of the dimension 27:16 tables is called a fact table, is because a fact in very real terms 27:23 is the intersection of information. For example, imagine that I'm interested in knowing 27:31 how many of a particular product a specific customer purchased during the month of December. 27:39 Well, the answer to that question is the intersection of the customer, 27:45 the product in which I'm interested, and the specific time frame that I specify, in this case, 27:51 the month of December. And the answer might be 14 units. So a specific customer purchased 14 units 27:59 of a particular product during the month of December. The intersection of those three values is a fact. 28:06 And it is for this reason that we label the table at the center of a star schema, a fact table. 28:13 To help you better conceptualize this concept, let's consider the intersection of two dimensions. 28:20 And in this example, we're looking at the customer dimension contrasted with the product dimension in the form of a two 28:27 dimensional matrix. The value contained in each cell within this matrix, 28:34 then, is a fact. And it expresses to us the quantity of a particular product that was purchased 28:41 by a particular customer. And you will notice, of course, if you recall from our last lecture, that this structure is 28:50 very similar to a bitmap index. 
Extending this concept out into a third dimension, 28:58 we can see here that we're representing a fact as the intersection of three different dimensions in a three 29:05 dimensional matrix. Along the horizontal axis I again have customers. 29:11 Along the vertical axis I again have products. But now, along the z-axis, I have a time dimension. 29:20 Therefore, the value contained in any of the cells in this three dimensional matrix 29:25 will tell me the quantity of a given product that was purchased by a particular customer 29:33 during a particular date or time frame. Now unfortunately, human beings have a great deal 29:40 of difficulty envisioning higher order spaces beyond three dimensions. 29:47 However, this concept scales very easily up to higher dimensional spaces. 29:54 So we might consider, for example, a fact to be the intersection of four dimensions, or five dimensions. 30:02 And although it is not easy to depict such a situation, conceptually, it is just a natural extension 30:09 of the two-dimensional and three-dimensional examples we saw here. In either case, I hope you can now 30:15 understand why databases designed in this way are called dimensional or multi-dimensional databases. 30:23 Next, I'd like to talk briefly about OLAP and data mining technologies. 30:29 If you recall from earlier in the lecture, we said that, generally, there are two broad categories of business intelligence applications. 30:38 And they were reporting applications and data mining applications. 30:44 Online analytical processing, or OLAP, then, is a technique that supports these reporting applications. 30:53 That is, OLAP allows us to dynamically examine database 30:59 data in real time, and apply simple transformations like sorting, filtering, grouping, et cetera. 31:08 And it allows us to perform simple arithmetic functions, such as summing values together, finding an average, a count, 31:17 the standard deviation, et cetera. 
And again, this is intended to be used in real time. 31:24 By contrast, data mining techniques support the data mining category of business intelligence 31:31 applications. And data mining, broadly, refers to a collection 31:36 of mathematical and statistical methods that can be used to gain deep insights into an organization's 31:44 data. Again, remember that the level of sophistication of these techniques generally requires that they not 31:52 be executed in real time, so as to avoid interfering with the real time operations of the organization. 32:01 OLAP systems, then, when used in support of simple BI reporting 32:06 needs, produce something called an OLAP report, which some people will refer to as an OLAP cube. 32:14 And the general idea here is that our inputs are a set of dimensions, while our outputs are 32:21 a set of measures. So recalling the two dimensional and three dimensional matrices 32:28 that we saw just a few moments ago, a manager might select a series of dimensions. 32:35 And the OLAP system might allow him or her to perform simple transformations 32:41 or drill-down operations on the data which lie at the intersection of those dimensions 32:48 so as to gain real time insights into the organization. And here we see that these OLAP cubes 32:56 can be constructed using our standard SQL SELECT queries. 33:02 In this case, we're selecting a number of different dimensions. We are then performing a join on four separate tables, 33:09 and are imposing some group by and order by requirements. 33:14 The result of this query in OLAP terminology, then, would be a result set which 33:20 represents a collection of measures that a manager could use to gain some insights 33:27 into his or her organization. 
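To make the dimensions-in, measures-out idea concrete, here is a small stand-alone sketch in Python rather than SQL. The customers, products, and quantities are invented for illustration; the grouping loop plays the role of a SELECT with SUM and GROUP BY over a fact table:

```python
from collections import defaultdict

# Hypothetical rows extracted from a star schema:
# (customer, product, month, quantity sold).
facts = [
    ("Alice", "Widget", "Dec", 8),
    ("Alice", "Widget", "Dec", 6),
    ("Alice", "Gadget", "Dec", 3),
    ("Bob",   "Widget", "Nov", 5),
]

# Dimensions in: (customer, product, month). Measure out: total quantity.
# This is the equivalent of:
#   SELECT customer, product, month, SUM(quantity)
#   FROM facts GROUP BY customer, product, month;
cube = defaultdict(int)
for customer, product, month, qty in facts:
    cube[(customer, product, month)] += qty

# The "fact" at one intersection of the three dimensions:
print(cube[("Alice", "Widget", "Dec")])  # 14
```

The value 14 mirrors the lecture's example: a specific customer purchased 14 units of a particular product during a particular month, and that intersection of dimensions is exactly what the fact table records.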
And of course, rather than constructing these SQL commands 33:35 repeatedly, we might take advantage of the capability of relational databases to create views, 33:42 so as to save the SQL statements, which are used to produce common OLAP cubes in the database itself. 33:53 Data mining, then, can be viewed as the convergence of many different disciplines. 34:00 A skilled data miner needs not only working knowledge of databases, but also needs statistical and mathematical 34:09 knowledge, perhaps knowledge of artificial intelligence, or machine learning algorithms, knowledge of data management 34:16 technologies, and so forth. In the modern world, many people become highly specialized 34:23 in one particular area. But the people who are most valuable to an organization 34:29 often have expertise in two or more areas. And this is certainly the case with people 34:35 who are experts at data mining. To conclude our overview of data mining, 34:41 I just wanted to briefly describe some of the most common techniques that are used to perform data mining against an organization's data. 34:50 Among these techniques are cluster analysis, in which the objective is to group sets of entities 34:58 together according to their level of similarity along one or more dimensions. 35:03 We also have decision tree analysis, in which we can process a large quantity of historical data 35:11 and generate a decision tree, which tells us what to do under different circumstances, 35:16 in order to achieve some kind of desired result. We also have regression available to us 35:22 as a very powerful data mining tool, the goal of which is to produce mathematical equations, 35:29 or mathematical models, that not only describe the relationships 35:36 between variables, but also provide us a basis for predicting future events based 35:43 upon past observations. 
Data mining applications might also 35:49 rely on sophisticated artificial intelligence, or machine 35:54 learning algorithms, such as neural networks or support vector machines. 36:00 And recently, we've seen a rise in a technique known as market basket analysis, or affinity analysis, which 36:10 allows us to look for patterns of co-occurrence, for example, determining which products are commonly 36:18 purchased together. And the results of these affinity analyses can then be used as the foundation of a recommendation 36:26 engine, which can suggest to you movies that you might like, books that you might like, et cetera. 36:35 Now I'd like to move into the final topic in our course 36:40 on databases, and that is the rise of the big data paradigm. 36:48 Scientists and researchers have recently noted an exponential increase in the quantity of data being 36:55 produced by the human species. If my memory serves correctly, I believe 37:00 the current rate of growth is that the amount of data doubles every 40 months. 37:07 At present, the world is generating many exabytes of new data every single day. 37:15 And if you're unfamiliar with the term exabyte, consider that one exabyte is slightly more than one 37:23 million terabytes. So you may have a computing device at home 37:28 that stores several terabytes of data. But consider that several million terabytes of new data 37:36 are being generated by the human species every single day. And this situation creates a vast array 37:43 of new and interesting problems for organizations. 37:49 The term big data, then, refers to the rapidly expanding amount of data that is being stored and used by organizations. 37:59 These data sets can be very large and very complex. And because of their size and complexity, 38:06 they can be extremely difficult to process using traditional database technologies. 
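The scale claims above are easy to check for yourself. A quick sketch, assuming binary units (1 terabyte = 2^40 bytes and 1 exabyte = 2^60 bytes, which is what makes an exabyte "slightly more than" a million terabytes):

```python
TERABYTE = 2 ** 40  # bytes
EXABYTE = 2 ** 60   # bytes

# One exabyte is slightly more than one million terabytes:
print(EXABYTE // TERABYTE)  # 1048576

# Doubling every 40 months compounds quickly: over 10 years (120 months)
# the volume of data doubles three times, i.e. grows eightfold.
doublings = 120 / 40
print(2 ** doublings)  # 8.0
```

In decimal (SI) units an exabyte is exactly one million terabytes; the "slightly more" phrasing matches the binary interpretation used here.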
38:11 And an important point to note is that much of what is considered big data 38:18 is being generated by web 2.0 applications, and the emerging collection of web 3.0 applications. 38:27 Traditional examples of web 2.0 applications might include social networking sites, video sharing sites, 38:36 blogs, discussion forums, et cetera. So this rapidly accumulating quantity of data 38:43 presents many challenges for organizations. Among these are simply capturing all of the data 38:50 and storing it, and maintaining the data once we have it. This is also commonly referred to as curation: 38:58 in the same way that the curator of a museum must maintain all of the ancient artifacts, so, too, 39:06 must the curator of a big data set be able to maintain the quality and the integrity of the data 39:12 in light of things like failing hardware, and the demand for the data to be used by many people from all 39:19 over the world simultaneously. Additional challenges include things such as search. 39:25 How does one search efficiently through such an enormous quantity of data? 39:31 Data transfer: consider, for example, that if you have a 100 megabit network connection 39:39 you can transfer approximately one terabyte of uncompressed data per day. At this speed, it would take you more than a million days 39:47 to transfer one exabyte of data. Further challenges include analyzing 39:53 these massive data sets, visualizing the massive quantities of data, and so forth. 40:00 In the past few years, a term has arisen in the area of big data that 40:05 is used to describe the movement toward using non-relational databases in order 40:12 to support these huge and highly distributed, highly replicated collections of data. 40:19 And this term is NoSQL. Although the name, to many people, 40:25 implies that SQL is not involved in these databases, that is, they do not support SQL-like queries, 40:33 this assumption is actually wrong. 
As it is used in contemporary database circles, 40:39 NoSQL means not only SQL. That is, although these very large databases 40:46 may not be based on relational algebra in the same way that a relational database is, 40:54 they nevertheless support querying through a SQL-like query language. 41:01 Unlike the relational database world, where the relational model is fixed and predominates 41:08 across all relational database vendors, in the NoSQL world, 41:13 there are many different architectures for non-relational databases that are currently being used. 41:20 These include architectures which rely upon a key value store, a wide columnar 41:27 store, a document store. There are databases that rely upon graph theory, 41:33 and so forth. Collectively, all of these different types of very large data stores are commonly 41:40 referred to as structured storage. And they have a few attributes in common. 41:46 First, they arguably employ simpler designs, as compared 41:52 to relational databases. And second, they almost always have a looser consistency model 41:59 than one will find in a relational database. Another way of saying that is that these structured storage 42:07 databases do not provide ACID guarantees. If you remember, ACID is an acronym 42:14 which stands for Atomicity, Consistency, Isolation, and Durability. 42:21 And ACID guarantees are the hallmark of a normalized relational database. 42:26 We cannot expect to have that level of consistency in these massive, highly distributed, highly replicated, 42:35 structured storage databases. When discussing the sort of data models that are actually used in these structured storage databases, 42:44 I like to use the data model that is employed by the Apache Cassandra database 42:50 as an example. And the reason for this is that it is one of the most popular structured storage 42:56 database management systems. And it is currently the most popular wide columnar store 43:03 that is available.
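To make the "simpler design" point concrete, here is a minimal in-memory sketch of the simplest structured storage architecture, a key value store. Real systems distribute and replicate these operations across many nodes; this hypothetical class only shows the core interface, in which values are opaque to the store and are looked up solely by key.

```python
class KeyValueStore:
    """A toy key value store: opaque values, addressed only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Dan", "email": "dan@dan.com"})
print(store.get("user:42")["name"])  # Dan
```

Note what is missing compared to a relational database: no schema, no joins, no constraints, and no multi-item transactions. That simplicity is what makes these designs easy to distribute.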
Broadly speaking, the Apache Cassandra database 43:09 can be classified as a hybrid, key value slash wide columnar 43:15 database. So its architecture contains elements of both a key value store and a wide columnar store. 43:24 The Apache Cassandra database itself was originally created at Facebook by two of their software architects, 43:33 after which it was transferred to the Apache Foundation, where it now resides as an entirely open source and free database. 43:43 Apache Cassandra has cross-platform support, the reason for this being that it was written in Java. 43:51 So it can run on Linux-based machines, Windows-based machines, Unix, et cetera. 43:58 And Cassandra supports a massively distributed database environment. 44:04 That is, it allows us to subdivide our database among dozens, or hundreds, or even 44:10 thousands of separate database servers, potentially spread out all over the world. 44:16 The database is highly scalable and decentralized. By scalable here, I mean it's extremely easy 44:23 to add an extra node, that is, an extra server, to the cluster, 44:30 thereby expanding the size of the database. And by decentralized, what I mean here 44:36 is that all of the nodes, that is, all of the database servers, that are involved in an Apache Cassandra database 44:44 have the same role. And this provides the very desirable characteristic 44:49 of there being no single point of failure. Another very valuable characteristic 44:55 of the Apache Cassandra model is that it provides for automatic data replication. 45:01 That is, the database itself can automatically make copies of data and store those copies 45:09 in different locations throughout the cluster. This makes the database highly fault tolerant, such 45:15 that if an individual node, that is, an individual database server, were to fail, the redundant data stores take over 45:24 instantaneously. There's no downtime for the database at all.
Further, Apache Cassandra supports the MapReduce process, 45:33 which is a computational model for solving data processing problems in a highly distributed environment. 45:42 And I'll talk more about the MapReduce process here in a few minutes. And to illustrate the legitimacy of the Apache Cassandra model, 45:51 consider that it is currently used by organizations such as CERN, Constant Contact, Digg, Instagram, Netflix, Reddit, 46:01 Walmart, Twitter, et cetera. Now let's talk about the Cassandra data model itself. 46:09 As you know, in a relational database management system, related data for an application are stored together 46:17 in a container which is referred to as a database, or sometimes as a schema. 46:23 And within that database or schema, we have one or more tables. The analogous concept in Cassandra 46:32 is something called a key space. That is, data for an application are stored together 46:38 in a container known as a key space. And inside of that key space, instead of tables, 46:45 we have something known as column families. So just as in the relational database world, 46:51 a single DBMS might contain many databases, each of which contains many tables, 46:58 in the world of Apache Cassandra, the Cassandra database might contain many key spaces, each of which contains many column families. 47:07 The column families, then, contain columns. But this is not the same as a column 47:15 in a relational database. In Cassandra, a column consists of a name, 47:21 that is, the name of the column; a value, that is, the data value for that column; 47:27 and a timestamp, where the timestamp indicates the point in time at which the data value was changed. 47:35 Related columns, then, are all stored within the same row. 47:41 And each row is identified using a unique row key. 47:46 This notion of a unique row key is directly analogous to the idea of a primary key 47:53 in the relational database world.
Rows in the Cassandra model, however, 47:58 are not required to contain the same set or number of columns. 48:03 That is, different rows within the same column family might have a different number of columns. 48:11 And the number of columns in a particular row is allowed to expand or contract on an as-needed basis. 48:18 A few additional important differences to note between the Cassandra data 48:24 model and the relational data model are that in the Cassandra data model 48:30 there are no formal foreign key relationships between column 48:35 families. That is, we cannot establish formal relationships between 48:40 column families within the same key space. And what's more, it is not possible to join 48:47 column families together using a query. So whereas in the relational database model 48:53 we can write queries that will join tables together, it is not possible in the Cassandra model 49:00 to join column families together. Now, I know that this sort of a verbal description 49:06 of the Cassandra data model can be a bit challenging to follow. So let's look at a picture which I hope 49:12 will help to make some of these concepts clearer. Here we see a graphical representation 49:19 of the Cassandra data model. The outermost container represents all of the key spaces for the Cassandra database. 49:28 And in this case, we have just two key spaces, one of which is labeled as the blog key space, 49:34 and the other of which is labeled as the store key space, where the details of the store key space 49:41 are not elaborated in this diagram. Again, the idea here is that a key space 49:47 is roughly analogous to a database within the relational database world. 49:52 This means that a key space, then, is typically oriented toward a particular need 49:58 or a particular application. Within each key space, we can have one or more column 50:04 families. In this case, we have a column family for users, and we have a column family for blog entries.
50:13 Next, let's look at an example of a column family. And to begin, I will refer to the user column family. 50:20 So here, we can see that we have three rows within the column family. 50:25 And each row represents a unique user within the blog key space. 50:33 A user, then, is represented as a collection of one or more columns. 50:39 And remember that, in the Cassandra data model, the number of columns per row can vary from row to row. 50:49 So in this first row, that is, the Dan 42 row, we see that we have three columns. 50:55 The first column is the name column. Its value is Dan. And we have a timestamp. 51:02 The second column is the email column. Its value is dan@dan.com. 51:08 Again, it has a timestamp. And the third column is the phone column, which 51:13 has a value and a timestamp. For the next user, we have only the name column. 51:20 And for the third user, we have only the name and email columns. So there is no requirement that each row contain 51:27 the same number, or even the same types, of columns. Next, let's look at the blog entry column family 51:34 within this blog key space. Again, we see that each row within the column family 51:41 contains multiple columns. In this case, both rows contain the same columns. 51:47 But again, that is not a requirement. Here, the columns are the text column, the category column, 51:53 and the user column. Note, particularly, that the values stored in the user 52:01 column can be used to determine which blog entries were written 52:07 by which users. However, remember that formal relationships 52:13 between column families do not exist in Cassandra. That is, we do not formally establish 52:20 primary key, foreign key relationships. So I hope that looking at this diagram of the Apache Cassandra 52:28 data model demystifies things a little bit for you.
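The structure just described can be sketched with plain Python dictionaries: each column is a (value, timestamp) pair, each row is keyed by its unique row key, and rows may hold different numbers of columns. Beyond the Dan 42 row from the diagram, the row keys and data values below are hypothetical, and this is an illustration of the shape of the model, not the actual Cassandra API.

```python
import time

now = time.time()

blog_keyspace = {
    "users": {  # a column family: row key -> {column name: (value, timestamp)}
        "Dan42": {"name": ("Dan", now), "email": ("dan@dan.com", now),
                  "phone": ("555-0100", now)},          # three columns
        "Pat9":  {"name": ("Pat", now)},                # only one column
        "Sam77": {"name": ("Sam", now), "email": ("sam@sam.com", now)},
    },
    "blog_entries": {  # a second column family in the same key space
        "entry1": {"text": ("Hello!", now), "category": ("news", now),
                   "user": ("Dan42", now)},
        "entry2": {"text": ("My trip", now), "category": ("travel", now),
                   "user": ("Sam77", now)},
    },
}

# There are no joins between column families, so the application itself
# must follow the user value back to the users column family.
author_key = blog_keyspace["blog_entries"]["entry1"]["user"][0]
print(blog_keyspace["users"][author_key]["name"][0])  # Dan
```

The last two lines show the practical consequence of having no foreign keys: resolving which user wrote an entry is the application's job, done with a second lookup rather than a join in a query.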
52:33 I know that learning about these structured storage data models for the first time can be intimidating, 52:39 but I hope that through this diagram, you can see that it's really not that complicated. 52:46 And I hope that is encouraging for you. As I mentioned earlier, these structured storage databases 52:53 are often highly distributed and highly replicated. That is, they may be spread across many, many different 53:00 nodes or database servers. Now this structure has substantial advantages. 53:07 Not only does it provide fault tolerance, but it allows for data requests to be handled 53:15 by the nearest available node that is able to service the request. So for example, if you are in Tokyo, 53:24 and it happens that a copy of the data in which you are interested is stored on one of my database nodes, 53:32 which is located near Tokyo, it's much more efficient for that node to handle your request 53:37 than it would be to route the request to a distant geographical node, say, one 53:43 which might be located in Berlin. The problem with this model, however, 53:50 is that it can cause problems with consistency. Consider what happens when a data item is updated. 53:56 So if I update a data item on one node, 54:01 it will take time for that update to cascade to the other nodes within the cluster that 54:08 contain a copy of the data. So imagine that my distributed structured storage database 54:15 contains 1,000 nodes, spread all over the world, 54:20 and the data item I'm interested in updating is replicated across 100 of those nodes. 54:29 I may then perform the update on one of the nodes. And until that update cascades throughout all 54:39 of the other nodes in the cluster, any requests for that same data item 54:45 that are made of those other nodes will return values that are out of date. 54:51 And again, this is due to the fact that the database is widely distributed 54:57 and widely replicated.
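The stale-read scenario, and the role of per-write timestamps in resolving it, can be sketched as follows. The Replica class and the "last write wins" rule shown here are a simplified illustration of the general idea, not the actual mechanism of any particular product.

```python
class Replica:
    """One node's copy of a single data item, with a write timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = 0.0

    def write(self, value, timestamp):
        # Accept a write only if it is newer than what we already hold,
        # so late-arriving stale updates cannot overwrite fresher data.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

tokyo, berlin = Replica(), Replica()
tokyo.write("v1", timestamp=1.0)
berlin.write("v1", timestamp=1.0)

# An update lands at the Tokyo node first; Berlin briefly serves stale data.
tokyo.write("v2", timestamp=2.0)
print(berlin.value)  # v1 (out of date)

# The update eventually cascades; the timestamp resolves the conflict.
berlin.write("v2", timestamp=2.0)
print(berlin.value)  # v2
```

Until the second write reaches Berlin, a request serviced by that node returns the old value, which is exactly the window of inconsistency the lecture describes.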
This is also because we typically do not enforce 55:02 an ACID level of consistency. Thus, in these replicated data environments, 55:08 we commonly use a consistency model that is referred to as eventual consistency. 55:15 And what eventual consistency means is that if no new updates are made to a specific data 55:24 item for a period of time, eventually all of the requests for that data item 55:31 will return the most up to date value, regardless of which node is servicing the request. 55:37 And the timestamps that are recorded during each item update are the key that allows us 55:44 to reconcile any inconsistencies in replicated data 55:49 values between nodes. Finally, I would just like to take a few moments to discuss 55:55 the MapReduce process. Broadly speaking, MapReduce is a programming model 56:03 that relies on parallelization in order to perform data processing tasks on these huge data sets that 56:13 may be distributed across many different servers, or many different nodes. 56:19 So conceptually speaking, then, the MapReduce process involves two different types of nodes. 56:26 There will be a master node and worker nodes. Put simply, the master node is usually 56:33 the node which receives the data processing request from the user, 56:38 while the worker nodes are nodes which are assigned to complete part of the processing task 56:45 by the master node. So this MapReduce process, then, unfolds in two steps. 56:51 The first step is called the map step. And in the map step, the master node 56:57 will take the data processing problem and subdivide it into a series of sub problems. 57:04 And each of these sub problems is then assigned to, and carried out by, a worker node. 57:11 The second step in the MapReduce process, then, is the reduce step. So after the worker nodes have completed their assigned tasks, 57:21 they pass the results of their work back to the master node.
57:26 The master node will then do the final processing, or final combining, of those results 57:33 in order to produce the overall answer to the problem, which is then returned back to the user. 57:39 Again, I know that concepts such as this can be difficult to understand in verbal terms, 57:45 so let's see if we can get a better idea using an image. So toward the top of this figure we have the master node. 57:54 And toward the bottom, we have various worker nodes, which here are labeled one, two, three, and n, 58:01 up to however many worker nodes we need to solve the problem. So the MapReduce process unfolds as follows. 58:09 As input, the data processing problem is passed into the master node. 58:14 The master node will then divide that data processing problem into sub problems, which are then assigned to 58:22 and carried out by the various worker nodes. After completing their tasks, the worker nodes 58:29 will return their results back to the master node, which performs the final combining and processing of the worker nodes' 58:38 results in order to produce the final answer, which is then 58:43 the output of the MapReduce process. Let's consider an example. Imagine 58:49 that we are a wireless service provider, and that we use a highly distributed, structured storage 58:58 database, which has 1,000 different servers all over the world. 59:04 Let's further assume that our 100 million 59:09 customers are equally subdivided among our 1,000 servers. So that means we would have data for 100,000 customers per node 59:21 within our database environment. Now let's imagine that our data processing problem 59:27 is that we want to figure out the average number of text 59:33 messages sent during the month of November. 59:38 And we want those results organized by age.
So we would like to know what is the average number of text 59:46 messages sent by 18-year-olds, and 19-year-olds, and 20-year-olds, and 21-year-olds, and so 59:54 forth, all the way up to our oldest customers. Now let's see how the MapReduce process 1:00:01 can be used to solve this data problem. First, the problem is passed to the master node. 1:00:08 And the master node might subdivide the problem such that it instructs each of the 1,000 worker 1:00:15 nodes within our database environment to count the total number of text messages sent 1:00:23 by each customer during the month of November, and aggregate those results by age. 1:00:31 The results of the worker nodes' tasks, then, would be a table of data, which 1:00:37 might contain three columns. First would be all of the distinct ages of the customers 1:00:44 whose data reside on that particular worker node. The second column might be the number 1:00:50 of customers who are that age. So we might have 1,000 18-year-olds, 714 19-year-olds, 1:01:00 235 20-year-olds, et cetera. And the third column would be the total number of text messages 1:01:07 sent by customers of each particular age. So perhaps, 18-year-olds sent 10 million text messages, 1:01:16 19-year-olds sent 9,800,000 text messages, and so forth. 1:01:23 So each worker node performs this task for all of the customers whose data are stored on that node. 1:01:31 And those results, then, are returned back to the master node. The master node will then combine the results. 1:01:38 So it will calculate, for example, the total number of 18-year-olds and the total number of text messages sent 1:01:46 by 18-year-olds, after which it can divide those two numbers in order 1:01:52 to produce the average number of text messages sent by 18-year-olds. 1:01:58 That process is simply repeated for customers of every age. 1:02:03 And we then have the results, which we can send back to the user who requested them.
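The steps of this worked example can be sketched in miniature. Here two tiny hypothetical nodes stand in for the 1,000 servers, and the customer records are invented; the point is only the shape of the computation: workers aggregate locally, and the master combines the partial (count, total) tables before dividing to get averages.

```python
from collections import defaultdict

def map_step(node_customers):
    """Worker task: per age, count customers and total their November texts."""
    partial = defaultdict(lambda: [0, 0])  # age -> [customer_count, text_total]
    for age, texts_sent in node_customers:
        partial[age][0] += 1
        partial[age][1] += texts_sent
    return dict(partial)

def reduce_step(partials):
    """Master task: combine the per-node tables, then divide to get averages."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for age, (count, texts) in partial.items():
            totals[age][0] += count
            totals[age][1] += texts
    return {age: texts / count for age, (count, texts) in totals.items()}

# Two hypothetical nodes, each holding (age, texts_sent_in_november) records.
node_1 = [(18, 300), (18, 500), (19, 200)]
node_2 = [(18, 400), (19, 100), (19, 300)]

# Map: each worker aggregates only its own customers, in parallel.
partials = [map_step(node) for node in (node_1, node_2)]

# Reduce: the master combines the partial results into global averages.
averages = reduce_step(partials)
print(averages[18])  # 400.0
print(averages[19])  # 200.0
```

Notice why the workers return counts and totals rather than averages: averages of averages would be wrong whenever nodes hold different numbers of customers of a given age, whereas counts and totals combine correctly.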
1:02:10 So I hope that you can appreciate that this MapReduce process is a very clever way of efficiently handling data 1:02:18 processing problems in distributed database environments by taking advantage of parallelization 1:02:26 in order to solve the problem. Well, my friends, thus ends our overview 1:02:35 of big data, data warehousing, and business intelligence applications. 1:02:40 And more broadly speaking, thus ends our series of lectures on database technologies. 1:02:47 It is my sincere hope that you have found these lectures useful, and most importantly, 1:02:55 that you now have the self-confidence to go out and start creating and working with these databases. 1:03:02 It has been a great pleasure to lead you on this journey, 1:03:08 and I hope that you have a great day.