Here’s a Big Book on Big Data

“Big data” is definitely one of the more commonly heard business buzz terms these days.

But beyond the general impression that “big data” represents the ability to collect and analyze lots and lots of information in some efficient manner, most people have a difficult time explaining with any specificity what the term really means.

Moreover, for some people “big data” isn’t very far removed from “big brother” – and for that reason, there’s some real ambivalence about the concept. Consider these recent “man on the street” comments about big data found online:

“Big data: Now they can crawl all the way up your *ss.”

“The scary thing about big data is knowing [that] Big Brother can know every single thing you do – and realizing your life is too unimportant for Big Brother to even bother.”

“Big data is what you get after you take a big laxative.”

But now we have a recently published book that attempts to demystify the concept. It’s titled Big Data: A Revolution That Will Transform How We Live, Work, and Think, and it’s authored by two leading business specialists – Viktor Mayer-Schönberger, a professor of internet governance and regulation at Oxford University, and Kenneth Cukier, data editor at The Economist magazine.

The book explores the potential for creating, mining and analyzing massive information sets while also pointing out the potential pitfalls and dangers, which the authors characterize as the “dark side of big data.”

The book also exposes the limitations of “sampling” as we’ve come to understand it and work with it over the past decades.

Authors Viktor Mayer-Schönberger (l) and Kenneth Cukier (r).

Cukier and Mayer-Schönberger note that sampling works fine for basic questions, but is far less reliable or useful for more “granular” evaluation of behavioral intent. That’s where “big data” comes into play big-time.

The authors are quick to note that advancements in data collection tend to come along, shake things up, and then quickly become routine.

Mayer-Schönberger calls this “datafication,” and describes how it works in practice:

“At first, we think it is impossible to render something in data form. Then somebody comes up with a nifty and cost-efficient idea to do so, and we are amazed by the applications that this will enable – and then we come to accept it as the ‘new normal.’ A few years ago, this happened with geo-location, and before it was with web browsing data gleaned through ‘cookies.’ It is a sign of the continuing progress of datafication.”

Causality is another aspect that may be changing how we go about treating the data we collect.

According to Cukier and Mayer-Schönberger, making the most of big data means “shedding some of the obsession for causality in exchange for simple correlations: not knowing why but only what.”

So we may see fewer instances in which we come up with a hypothesis and then test it. Instead, we simply use the data to determine what is important, and act on whatever information is revealed in the process.

One example of this practice that’s cited in the book is how Wal-Mart determined that Kellogg’s® Pop-Tarts® should be positioned at the front of the store in selected regions of the country during hurricane season to stimulate product sales.

It wasn’t something anyone had thought about in advance and then decided to verify; it was something the retailer discovered by mining product purchase data and simply “connecting the dots.”

Mayer-Schönberger explains further:

“There is a value in having conveniently placed Pop-Tarts, and it isn’t just that Wal-Mart is making more money. It is also that shoppers find faster what they are likely looking for. Sometimes ‘big data’ gets badly mischaracterized as just a tool to create more targeted advertising … but UPS uses ‘big data’ to save millions of gallons of fuel – and thus improve both its bottom line and the environment.”

One area of concern covered by the authors is the potential for using “big data predictions” to single out people based on their propensity to engage in certain behaviors, rather than on what they have actually done. In other words, all sorts of conditions or possibilities could be treated the way we treat sex offender lists today.

Author Kenneth Cukier believes that discussion of the implications of a practice like this – focusing on the use of data as much as on its collection – is “sadly missing from the debate.”

This book fills a yawning gap in the business literature. And for that, we should give Dr. Mayer-Schönberger and Mr. Cukier their due. If any readers have become acquainted with the book and would care to weigh in with observations, please share your thoughts here.

One thought on “Here’s a Big Book on Big Data”

Nelson M. Nones says:

September 25, 2013 at 8:11 am

“Shedding some of the obsession for causality” might work out fine when it comes to figuring out where to place Pop-Tarts in a store, but it has potentially grave implications when dealing with bigger questions — like climate change.

For those questions, the only tried-and-true approach is the scientific method that seeks to build theories through the characterization of unknowns, hypotheses to suggest explanations, prediction, experimentation to confirm or contradict hypotheses, confirmation through reproducibility of experiments … and eventual acceptance of hypotheses that have been repeatedly confirmed as theories.

Most generally accepted theories for explaining events rely on causal (i.e. cause-and-effect) hypotheses.

Laments that growing use of quantitative analysis will undermine the scientific method are nothing new. Barry Wellman, in “Doing It Ourselves: The SPSS Manual as Sociology’s Most Influential Recent Book” (April, 1998) put it very succinctly: “I have no doubt that [users of counter-sorter machines] would have happily embraced statistical packages, but I am also confident that they would have urged the careful specification of variables and relationships beforehand. With statistical packages and multivariate routines, it is easy to pour in a heap of variables into the regression and stir wildly to see what sticks to what. Many spurious and silly things have come out of such stews.”

SPSS is an acronym for “Statistical Package for the Social Sciences.” Conceived in the 1960s, it is today owned by IBM and is still in widespread use. One of its creators, Norman Nie, believes the common denominator for the application of SPSS was that “hard data drives model building and model testing. Empirical model building is how data scientists approach the world.”

On big data, the whatsthebigdata.com blog (February 8th, 2013) says, “Big data means that a lot has changed in the intervening years. Specifically, Nie argues, with more data and better tools – both more powerful computers and statistical analysis programs – we have more sophisticated models. The limitation of the technologies of the past forced the use of limited-size samples and approximation methods. Today, says Nie, ‘we can move beyond linear approximation models’ and achieve greater precision and accuracy in forecasts.”

But the fact remains that no matter how sophisticated quantitative analysis and model-building become, people will always be skeptical of predictions that weren’t deduced through causal models, and tested and confirmed according to the scientific method. Excessive reliance on quantitative methods inevitably creates fragile hypotheses yielding predictions that cannot be repeatedly confirmed through observation and experimentation.

One needs to look no further than climate science – a field well-suited for quantitative analysis using big data technology. The central climate change hypothesis — that man-made carbon dioxide emissions cause global warming — seems to rest more upon extensive correlation analysis than on precise explanations of the underlying physics.

Yet recent news reports point to a global warming “pause” since 1998 that existing hypotheses did not predict, and cannot explain. Might the apparent inability to repeatedly confirm global warming predictions be a symptom of over-reliance on big data and empirical model-building?

With so much at stake, it’s a question worth asking.