The
generations of us living from 1984 to 2020 are witnessing the largest and most
rapid cycle of creation since the Big Bang. That is, one 36-year window
competing with the roughly 13.8 billion years of existence that humans know of.
This creation guides what information is available to us, our options, and our
choices in every area of life, thousands of times every day. It can
determine life or death, the transference of wealth, the geopolitics of Earth,
and the expansion of knowledge. We are its creators, and we can neither
stop its creation nor know it in any tangible way. It is: data.
So, what happened? How did this come about? What made Big Data big? It has largely come about because of the ubiquity of processors in the late 20th century; the creation and growth of the Internet (from roughly 1,000 users in 1984 to 2.7 billion by 2014, half of whom are on Facebook); and then the miniaturization of processors into smart devices such as phones and watches. Mostly, though, the exponential growth of data is now the result of unstructured, NoSQL-style databases (e.g., MongoDB) built to hold all our digital interactions and behaviors. This unstructured behavioral data currently accounts for about 75% of the data being created. But we haven't seen anything yet, because the growth of data will explode even more over the next 15 years from the Internet-of-Things. (Wall, 2014)
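For readers who want to see what "unstructured" means in practice, here is a minimal Python sketch of two behavioral records of different shapes landing in the same collection, the way a document store such as MongoDB would accept them; the field names and values are purely illustrative, not drawn from any real system.

```python
import json
from datetime import datetime, timezone

# Two behavioral "events" with different shapes. A schemaless document
# store such as MongoDB will accept both in the same collection, which is
# what lets unstructured interaction data pile up so quickly.
# (All field names and values here are illustrative assumptions.)
click_event = {
    "user_id": "u-1001",
    "type": "click",
    "page": "/products/smart-watch",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

sensor_event = {
    "device_id": "fridge-42",
    "type": "temperature_reading",
    "celsius": 3.7,
    "door_open_count_today": 11,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# In MongoDB this would be roughly db.events.insert_many([...]); here we
# simply serialize the documents to show their free-form shape.
for doc in (click_event, sensor_event):
    print(json.dumps(doc, indent=2))
```

The point is simply that no schema has to be agreed in advance, which is exactly why this kind of behavioral data accumulates so easily.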
So, how big is "Big"? According to IBM, in 2012, we created 2.5 gigabytes of data per day. At this rate, in another much-cited statistic attributed to IBM but originally written by the Norwegian Åse Dragland of the think-tank SINTEF in 2013, 90% of all the data in the world had been created in the prior two years. (Dragland, 2013) However, this data growth rate (the speed of creation) is actually accelerating. According to research data scientist Richard Ferres from the Australian National Data Service (ANDS), we are creating data 10x faster every two years. (Ferres, 2015) In other words, starting from a speed of 1 in 1985, we were at a speed of 1×10¹⁵ in 2015 (i.e., one quadrillion "miles per hour"), and in 2017 our speed of data creation is 1×10¹⁶ (i.e., ten quadrillion "miles per hour").
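To make that arithmetic concrete, the following short Python sketch reproduces the compounding described above; the 10x-every-two-years figure is Ferres's, while the function name and the sampled years are illustrative choices.

```python
# Reproduces the back-of-the-envelope compounding above: if the speed of
# data creation grows 10x every two years from a baseline of 1 in 1985,
# the relative speed in a given year is 10 ** ((year - 1985) / 2).
# (Growth figure from the text; function name and years are illustrative.)

def relative_speed(year, base_year=1985, factor=10, period_years=2):
    """Speed of data creation relative to the base year."""
    return factor ** ((year - base_year) / period_years)

for year in (1985, 1995, 2005, 2015, 2017):
    print(f"{year}: {relative_speed(year):.0e} times the 1985 rate")
```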
If that acceleration weren't fast enough, we're soon going to be creating data a lot faster still because of the Internet-of-Things (IoT). The IoT is the collective name for the billions of devices that are being embedded with sensors to communicate data over networks: think of the "smart" refrigerator by Samsung that tracks what groceries are inside it, or the car, home alarm system, or baby monitor that you can control with your mobile phone. Technology research group Gartner estimated there were 6.4 billion such devices or sensors online in 2016 (Gartner, 2015), and competitor research group IDC estimates there will be 30 billion by 2020. (IDC, 2014) Recall that there are approximately 2.7 billion Internet users. So the prediction is that the number of data creators will increase by roughly 10x within the next three years alone. Simplistic math would suggest that 10x the number of data creators, each accelerating at 10x every two years, may mean that within three to four years the speed of our annual data creation will accelerate 100x every two years.
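That "simplistic math" can be spelled out in a few lines of Python; the device and user counts are the Gartner/IDC and Internet-user figures quoted above, and the rounding to a neat 10x is, as in the text, a deliberate simplification.

```python
# Spells out the paragraph's "simplistic math". The 2.7 billion Internet
# users and ~30 billion connected devices by 2020 are the figures quoted
# above; variable names and the rounding to 10x are illustrative.
internet_users = 2.7e9        # approximate human data creators today
iot_devices_2020 = 30e9       # projected connected "things" by 2020 (IDC)

creator_growth = iot_devices_2020 / internet_users
print(f"Data creators grow roughly {creator_growth:.1f}x (the text calls it 10x)")

# Each two-year period is already assumed to bring a 10x jump in creation
# speed, so the text multiplies the two factors: 10 * 10 = 100x.
print(f"Rough combined acceleration: {10 * 10}x every two years")
```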
But these numbers are at such a scale as to make them difficult for human brains to understand or imagine. Two gigabytes, which was 80% of the amount of data we collectively created every day in 2012, is about 20 yards (60 feet) of average-length books on a shelf, or about 6.67 kilometers (4.15 miles) of books per year. But because we were creating data 10x faster by 2016, we were up to 66.7 kilometers (41.5 miles) of books, face-to-back, per year. If we accelerate another 10x by 2018, as predicted, and 100x as suggested above by 2020, we will create 667 kilometers (415 miles) of books on a shelf in 2018, and will be creating 6,667 kilometers (4,150 miles) of books on a shelf, every year, by 2020. At this rate, if we were publishing books instead of electronic data, the annual output would be enough to encircle the Earth at the equator sometime before 2022, only six years away.
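For anyone who wants to check the shelf arithmetic, here is a small Python sketch that reproduces it; the conversion rate (two gigabytes to roughly 60 feet of average-length books) and the 10x and 100x multipliers are the text's own assumptions, and the function name is just an illustrative label.

```python
# Reproduces the shelf-of-books arithmetic above. The conversion rate
# (2 GB of data to about 60 feet of average-length books) and the 10x /
# 100x multipliers are the text's own assumptions; the function name and
# the feet-per-kilometer constant are illustrative conveniences.
FEET_PER_2GB = 60.0
FEET_PER_KM = 3280.84

def shelf_km_per_year(gigabytes_per_day, multiplier=1):
    """Kilometers of shelved books equivalent to a year of data creation."""
    feet_per_day = (gigabytes_per_day / 2) * FEET_PER_2GB * multiplier
    return feet_per_day * 365 / FEET_PER_KM

print(f"2012 baseline : {shelf_km_per_year(2):8,.1f} km of books per year")
print(f"2016 (10x)    : {shelf_km_per_year(2, 10):8,.1f} km per year")
print(f"2018 (100x)   : {shelf_km_per_year(2, 100):8,.1f} km per year")
print(f"2020 (1,000x) : {shelf_km_per_year(2, 1000):8,.1f} km per year")
```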
Imagine if
meaningful knowledge or discovery in Big Data were diamonds in the Earth.
To mine or find them, we have to collect tens of thousands of cubic yards of
soil. Then someone comes along with an invention that enables us to
collect billions of cubic yards of soil, premised on the theory that we will
find orders of magnitude more diamonds in orders of magnitude more dirt.
Maybe. But, for sure, it makes the mission of the diamond miners (the data
scientists, in our case) orders of magnitude harder too.
Worse yet, unless and until we become proficient in its use, the statistics we apply to Big Data often create more false knowledge than true knowledge. The most common thing a researcher does in trying to discover meaningful new relationships in this data is to calculate correlations (e.g., every time X changes, Y also changes); however, these correlations are often "false" because we presume they are causal (that X changing causes Y to change), which leads to misinformation. Determining a causal relationship requires Bayesian statistics, a rather advanced statistical toolbox with which many data scientists, let alone executive decision makers, are unfamiliar. However, the error-prone process doesn't end there, because there are two major categories of Bayesian statistics: naïve (which assumes the data points function independently of each other) and network (which assumes the data points influence each other). Even when a data scientist is familiar with Bayes, half the time they apply the wrong form of the formula. The bottom line is that most of the correlations and basic statistics people initially apply to Big Data give false or misleading information.
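The correlation trap described above is easy to demonstrate. In the Python sketch below, a hidden variable Z drives both X and Y, so X and Y correlate strongly even though neither causes the other; it is also exactly the kind of situation in which the naïve independence assumption would mislead us. All names and noise levels are illustrative.

```python
import random

# A hidden confounder Z drives both X and Y, so X and Y correlate strongly
# even though neither causes the other, which is the trap described above.
# It is also a case where naively assuming the variables behave
# independently would be wrong. (Names and noise levels are illustrative.)
random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

z = [random.gauss(0, 1) for _ in range(10_000)]       # the hidden driver
x = [zi + random.gauss(0, 0.3) for zi in z]           # e.g., ice-cream sales
y = [zi + random.gauss(0, 0.3) for zi in z]           # e.g., sunburn cases

print(f"corr(X, Y) = {pearson(x, y):.2f}  # strong, yet X does not cause Y")
```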
In our analogy, not only is our speed of creating Big Data increasing at an increasing rate, meaning we have to move millions of cubic yards of soil one year, billions the second year, and tens of billions the third, but the soil we're "processing" is also riddled with fake diamonds. While the observation is often mistakenly attributed to physicist Stephen Hawking or to US Librarian of Congress Daniel Boorstin, it was actually historian Henry Thomas Buckle, in the second volume of his 1861 work History of Civilization in England, who first observed that "the greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." (Buckle, 1861)
We don't necessarily need bigger data, although we're certainly going to get it. We need more meaningful data. Therefore, the exponential amassing of data that is underway, at a rate unprecedented in the history of humankind and about to accelerate even more, is creating more noise to sift through to find meaningful knowledge than there was before the Big Data era. It is becoming harder to identify what is important. The evolution of humankind via the discovery of knowledge will be accelerated not by the gluttonous creation of ever-bigger data, but by focusing on the most meaningful data and creating and sequestering that.
Works Cited
Buckle, H. T. (1861). An Examination of the Scotch Intellect During the 18th Century. In H. T. Buckle, History of Civilization in England (p. 408). New York: D. Appleton & Co.
Dragland, Å. (2013, May 22). Big Data - For Better or Worse. Retrieved from SINTEF: www.sintef.no/en/latest-news/
Ferres, R. (2015, July 14).
The Growth Curve of Data. Australia: Quora.
Gartner. (2015, November
10). Gartner Says 6.4 Billion Connected "Things" Will Be in Use
in 2016, Up 30 Percent From 2015. Retrieved from Gartner:
http://www.gartner.com/newsroom/id/3165317
IDC. (2014, April). The
Digital Universe of Opportunities: Rich Data & The Increasing Value of the
Internet-of-Things. Retrieved from IDC - EMC:
https://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm
Wall, M. (2014, March 4). Big Data: Are You Ready for Blast-Off? Retrieved from BBC News: www.bbc.com/news/business-26383058