Tuesday, January 17, 2017

Big Data will Slow, Not Accelerate, Discovery

Those of us living between 1984 and 2020 are witnessing the largest and most rapid cycle of creation since the Big Bang: a single 36-year window competing with the roughly 13.8 billion years of existence that humans know of.  This creation guides what information is available to us, our options, and our choices in every area of life, thousands of times every day.  It can determine life or death, the transfer of wealth, the geopolitics of Earth, and the expansion of knowledge.  We are its creators, yet we can neither stop its creation nor know it in any tangible way.  It is:  data.

So, what happened?  How did this come about?  What made Big Data big?  It has largely come about because of the ubiquity of processors in the late 20th century, the creation of the Internet and the growth of its users (from roughly 1,000 in 1984 to 2.7 billion in 2016, about half of whom are on Facebook), and then the miniaturization of processors into smart devices such as phones and watches.  Now, however, the exponential growth of data is mostly the result of unstructured, NoSQL databases (e.g., MongoDB) built to hold all of our digital interactions and behaviors.  This unstructured behavioral data currently accounts for about 75% of the data being created.  But we haven’t seen anything yet, because the growth of data will explode even more over the next 15 years as a result of the Internet-of-Things. (Wall, 2014)

So, how big is “Big?”  According to IBM, in 2012 we were creating 2.5 quintillion bytes (about 2.5 exabytes) of data per day.  At this rate, according to another much-cited statistic attributed to IBM but originally reported in 2013 by Åse Dragland of the Norwegian research organisation SINTEF, 90% of all the data in the world had been created in the prior two years. (Dragland, 2013) However, this data growth rate – the speed of creation – is actually accelerating.  According to research data scientist Richard Ferres from the Australian National Data Service (ANDS), we are creating data 10x faster every two years. (Ferres, 2015) In other words, indexing our speed of data creation to 1 in 1985, we had reached a speed of 10^15 by 2015 (one quadrillion “miles per hour”), and in 2017 our speed of data creation is 10^16 (ten quadrillion “miles per hour”).
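To see how quickly that compounds, here is a minimal Python sketch of the same growth curve.  The function and the 1985 = 1 baseline are simply the assumptions from the paragraph above, not anything taken from the cited sources’ own models.

# Relative speed of data creation under the "10x faster every two years"
# growth rate cited above (Ferres, 2015), indexed to 1985 = 1.

def relative_speed(year, base_year=1985, factor=10, period_years=2):
    """Speed of data creation relative to the base year."""
    return factor ** ((year - base_year) / period_years)

for year in (1985, 1995, 2005, 2015, 2017):
    print(f"{year}: {relative_speed(year):.0e}x")
# 1985: 1e+00x ... 2015: 1e+15x, 2017: 1e+16x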

If that acceleration weren’t fast enough, we’re soon going to be creating data a lot faster still because of the Internet-of-Things (IoT).  The IoT is the collective name for the billions of devices being embedded with sensors that communicate data over networks – think of the “smart” refrigerator by Samsung that tracks what groceries are inside it, or the car, home alarm system, or baby monitor that you can control from your mobile phone.  Technology research group Gartner estimates there were 6.4 billion such devices or sensors online in 2016, (Gartner, 2015) and competitor research group IDC estimates there will be 30 billion by 2020. (IDC, 2014) Recall that there are approximately 2.7 billion Internet users.  So the prediction is that the number of data creators will increase roughly tenfold within the next three years alone.  Simplistic math would suggest that 10x the number of data creators, each accelerating at 10x every two years, may mean that within three to four years the speed of our annual data creation will be accelerating 100x every two years.

But these numbers are at such a scale as to make them difficult for human brains to understand or imagine.  Two gigabytes is about 20 yards (60 feet) of average-length books on a shelf.  At 2.5 exabytes – 2.5 billion gigabytes – per day, the data we were already creating in 2012 works out to roughly 23 million kilometers (about 14 million miles) of shelved books every single day, or more than 8 billion kilometers per year: enough to wrap around the Earth’s equator hundreds of times daily.  And if, as described above, our speed of creation keeps multiplying by 10 every two years, those shelves had already grown a hundredfold by 2016 and will be ten thousand times longer by 2020.  Whatever yardstick we choose, the volume quickly outruns anything we can picture.
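For readers who want to check the arithmetic, here is a back-of-the-envelope sketch in Python.  The only assumption beyond IBM’s figure is the shelf conversion used above (2 gigabytes ≈ 60 feet of average-length books); the constants are labeled so they can be swapped for other estimates.

# Back-of-the-envelope conversion: bytes of data -> shelf length of books.
# Assumes the conversion used above: 2 GB of data ~ 60 feet of shelved books.

BYTES_PER_DAY_2012 = 2.5e18     # IBM's 2.5 quintillion bytes per day (2012)
BYTES_PER_SHELF_UNIT = 2e9      # 2 GB of data ...
FEET_PER_SHELF_UNIT = 60.0      # ... corresponds to ~60 feet of books
METERS_PER_FOOT = 0.3048
EQUATOR_KM = 40_075             # circumference of the Earth at the equator

shelf_feet_per_day = BYTES_PER_DAY_2012 / BYTES_PER_SHELF_UNIT * FEET_PER_SHELF_UNIT
shelf_km_per_day = shelf_feet_per_day * METERS_PER_FOOT / 1000

print(f"Shelf length per day : {shelf_km_per_day:,.0f} km")            # ~22.9 million km
print(f"Shelf length per year: {shelf_km_per_day * 365:,.0f} km")      # ~8.3 billion km
print(f"Equator laps per day : {shelf_km_per_day / EQUATOR_KM:,.0f}")  # ~570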

Imagine if meaningful knowledge or discovery in Big Data were diamonds in the Earth.  To mine or find them, we have to collect tens of thousands of cubic yards of soil.  Then someone comes along with an invention that enables us to collect billions of cubic yards of soil, premised on the theory that we will find orders of magnitude more diamonds in orders of magnitude more dirt.  Maybe.  But it certainly makes the mission of the diamond miners (our data scientists) orders of magnitude harder, too.

Worse yet, unless and until we become proficient at its use, Big Data statistics often creates more false knowledge than true knowledge.  The most common thing a researcher does when trying to discover meaningful new relationships in this data is calculate correlations (e.g., every time X changes, Y also changes); however, these correlations often become “false knowledge” because we presume they are causal (that the change in X causes the change in Y), which leads to misinformation.  Determining a causal relationship requires Bayesian statistics, a rather advanced statistical toolbox with which many data scientists, let alone executive decision makers, are unfamiliar.  And the error-prone process doesn’t end there, because there are two major categories of Bayesian statistics – naïve (assuming data points function independently of each other) and network (assuming data points influence each other).  Even a data scientist who is familiar with Bayes may, as often as not, apply the wrong form of the formula.  The bottom line is that most of the correlations and basic statistics people initially apply to Big Data give false or misleading information.
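To make the correlation trap concrete, here is a small, purely illustrative Python sketch (the variable names and numbers are invented): two quantities that have nothing to do with each other, but that both happen to trend upward over the same years, produce a near-perfect correlation coefficient.

# Illustration of the "false knowledge" trap described above: two series
# that have nothing to do with each other, but both trend upward over time,
# show a near-perfect Pearson correlation. (Synthetic data, for illustration only.)
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2017)

# Two unrelated quantities that happen to grow over the same period.
ice_cream_sales = 100 + 5 * (years - 2000) + rng.normal(0, 2, years.size)
smartphone_users = 10 + 30 * (years - 2000) + rng.normal(0, 20, years.size)

r = np.corrcoef(ice_cream_sales, smartphone_users)[0, 1]
print(f"Pearson correlation: {r:.2f}")   # typically > 0.95

# A high r only says the two move together; it says nothing about one causing
# the other -- here, both are simply driven by the passage of time.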

In our analogy, not only is our speed of creating Big Data increasing at an increasing rate – meaning we have to move millions of cubic yards of soil one year, billions the second year, and tens of billions the third – but the soil we’re “processing” is also riddled with fake diamonds.  While often mistakenly attributed to physicist Stephen Hawking or US Librarian of Congress Daniel Boorstin, it was actually the historian Henry Thomas Buckle, in the second volume of his 1861 work “History of Civilization in England,” who first observed that “the greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” (Buckle, 1861)

We don’t necessarily need bigger data, although we’re certainly going to get it.  We need more meaningful data.  The exponential amassing of data now underway – unprecedented in the history of humankind, and about to accelerate even more – is therefore creating more noise to sift through to find meaningful knowledge than there was before the Big Data era.  It is becoming harder to identify what is important.  The evolution of humankind via the discovery of knowledge will therefore be accelerated not by the gluttonous creation of ever-bigger data, but by focusing on the most meaningful data, and on creating and sequestering that.

Works Cited

Buckle, H. T. (1861). An Examination of the Scotch Intellect During the 18th Century. In History of Civilization in England (Vol. 2, p. 408). New York: D. Appleton & Co.
Dragland, Å. (2013, May 22). Big Data - For Better or Worse. Retrieved from SINTEF: www.sintef.no/en/latest-news/
Ferres, R. (2015, July 14). The Growth Curve of Data. Australia: Quora.
Gartner. (2015, November 10). Gartner Says 6.4 Billion Connected "Things" Will Be in Use in 2016, Up 30 Percent From 2015. Retrieved from Gartner: http://www.gartner.com/newsroom/id/3165317
IDC. (2014, April). The Digital Universe of Opportunities: Rich Data & The Increasing Value of the Internet-of-Things. Retrieved from IDC - EMC: https://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm
Wall, M. (2014, March 4). Big Data: Are You Ready for Blast-Off? BBC News. Retrieved from www.bbc.com/news/business-26383058



