Five Misconceptions about Data Science - Knowing What You Don't Know

Data science has made its way into practically all facets of society – from retail and marketing, to travel and hospitality, to finance and insurance, to sports and entertainment, to defense, homeland security, cyber, and beyond. It is clear that data science has successfully sold its claim of "actionable insights from data," and truth be told, it often delivers on that claim, adding value that would otherwise go untapped. As a result, data science is often looked to as a panacea, a Swiss army knife, a silver bullet, a must-have, [insert your own cliché here]. This has implications for both data scientists and the organizations they work with. On one hand, data scientists are now beginning to face a new set of challenging problems, problems that even the most advanced machine learning algorithms have yet to solve: managing expectations. And on the other hand, many businesses and organizations are grappling with shifting learning curves, the latest shiny object, and the pressure to keep pace. As the data science bandwagon fills up, there are many individuals that do not fully, or even marginally, understand what data science is, what it can do, and when it is relevant. In what follows, I present what I have encountered to be five of the most common misconceptions about data science – misconceptions that will proliferate and morph as the data science wave rolls on. Recognizing these misconceptions, and avoiding the pitfalls associated with each, will go a long way toward empowering you (and your organization) when it comes to "deriving value from data."

The interchangeable use of the terms "data science" and "big data" is not uncommon these days [1]. It could be argued that the so-called big data revolution provided the impetus for the field now labeled as data science. Regardless of the origins of their entanglement, big data and data science are quite different. Big data refers to the collection, management, and processing of incredibly large amounts of data (terabytes, at a minimum). But the idea of big data goes beyond just a lot of 1s and 0s, which is why it is more properly characterized by the "Three Vs": volume, variety, and velocity. In addition to sheer quantity, big data often consists of different types of data (structured, unstructured, numeric, textual, imagery, video, and so on) [2]. Data can also become "big" when the rate at which it is generated and requires handling becomes excessive. Take Twitter, for example: a single tweet is only a few hundred bytes (the 140 characters), but considering 350,000 tweets are sent per minute (on average) [3], you quickly have a big data issue.
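For a rough sense of scale, the Twitter figures above can be turned into a back-of-the-envelope estimate. This is only a sketch: the 300-byte payload is an assumption standing in for "a few hundred bytes," and real tweets carry metadata that would push the number higher.

```python
# Back-of-the-envelope estimate of raw tweet volume.
TWEETS_PER_MINUTE = 350_000   # average rate quoted in the text
BYTES_PER_TWEET = 300         # assumption: stand-in for "a few hundred bytes"
MINUTES_PER_DAY = 60 * 24


def daily_volume_bytes(tweets_per_minute: int, bytes_per_tweet: int) -> int:
    """Return the number of bytes generated in one day at a constant rate."""
    return tweets_per_minute * bytes_per_tweet * MINUTES_PER_DAY


tweets_per_day = TWEETS_PER_MINUTE * MINUTES_PER_DAY
volume_gb = daily_volume_bytes(TWEETS_PER_MINUTE, BYTES_PER_TWEET) / 1e9
print(f"{tweets_per_day:,} tweets/day, roughly {volume_gb:.0f} GB/day of text")
```

Under these assumptions the text alone is on the order of 150 GB per day. Each item is tiny; it is the rate of arrival (velocity) that makes the data "big."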

In contrast, data science deals with everything from the mining, transforming, modeling, and storing of data, to exploring and analyzing data, to building models and algorithms around data, to visualizing and interacting with the results. Big data should be thought of as an aspect of data science – it describes the situation where the data involved is characterized by one or more of the Three Vs.

When organizations talk about needing big data solutions or big data technology, often, what they really need is a data-science solution. Most businesses do not have petabytes of data. Quite the contrary – many businesses are able to work with their data using Excel, which has traditionally been a satisfactory mode of operation; however, with the explosion of data collection and data availability, the challenges that we see today are not so much the amount of data, but rather, the variety of data. Data (much of it unstructured) is becoming more heterogeneous, and it is often scattered across various systems (some old, some new). There is also the reality that data can be incomplete, inconsistent, and even plain wrong. Rather than dealing with big data, organizations are wrangling with "non-traditional," "messy," or "difficult" data, and being able to work with these less-friendly data incarnations has become the real challenge. To successfully extract meaning and insight from data, one must be able to mine data from online and mobile sources, integrate disparate data sets, and transform raw data into usable information.

Fundamentally, the misconceptions around big data and data science are mostly lexical, and it usually only requires a review of what one truly needs. The bottom line is that by understanding some key data science fundamentals, businesses become more informed and can better understand their challenges and how they can be met.

Many of us are familiar with machine learning and may not even realize it. Simple linear regression is a form of machine learning. Linear regression is an example of a supervised learning algorithm, wherein the observations given to the algorithm include both the dependent and independent variables. By providing the algorithm with the "right answers" in advance, one can build a model that can then predict answers for new observations. The key in this trivial example, as well as the most complex and sophisticated machine learning algorithms, is that the machine needs to learn a relationship, however convoluted, between the inputs and outputs in order to develop a useful model. How does it do that? Quite simply, it is taught. The problem that analysts are encountering more and more is a failure to understand that critical piece – in other words, there is an expectation that one can feed data into a computer, the computer does its machine learning magic, and voilà, useful answers pop out. [4]
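To make the idea of "teaching" concrete, here is a minimal sketch of simple linear regression as supervised learning, written with only the Python standard library. The data (hours studied versus exam score) is made up purely for illustration, and `fit_line` is a hypothetical helper name, not a library function.

```python
# Simple linear regression as supervised learning, standard library only.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: each input is paired with its "right answer".
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 61.0, 68.0, 79.0, 90.0]

# "Learning" here means estimating the slope and intercept from labeled examples.
slope, intercept = fit_line(hours_studied, exam_score)

# The learned relationship is then applied to an observation not seen in training.
predicted = slope * 6.0 + intercept
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
print(f"prediction for 6 hours: {predicted:.1f}")
```

Nothing useful would come out of `fit_line` if `exam_score` were left out or were unrelated to `hours_studied`; the paired answers are what the algorithm learns from.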

To illustrate this fallacy, consider the following example and the two "blackboard" pictures below. Suppose we would like an average 7-year-old to learn arithmetic. One approach, depicted in Scenario A, would be to randomly write all the numbers and relevant operators on a blackboard, and ask the youngster to study what is on the blackboard. Another approach, depicted in Scenario B, would be to write a series of equalities that illustrated the rules of addition, subtraction, multiplication, and division. Again, the youngster would be asked to study the information on the blackboard. Thankfully, arithmetic is not taught to children in either of these ways. But, it should be clear that a student would have no chance of learning to add, subtract, multiply, and divide if presented with Scenario A. On the other hand, in Scenario B, he/she might have a chance. At the least, if the student memorized the equalities shown, he/she would know the answers to those problems; whether they could extend that knowledge to other, not seen, problems would be hard to say.

Presumably, you have realized that Scenario A is intended to portray the myth that machines (i.e., computers) see data and just learn, while Scenario B is closer to reality. Throwing data at a computer and expecting it to divine golden nuggets of insight is a hopeless endeavor – and no amount of additional data will help. It should be pointed out, however, that there is a class of machine learning known as unsupervised learning (see below), which is actually not unlike Scenario A, although the objective would be different.

Returning to our example, the intent is to illustrate that if a computer is to develop a useful model, it requires the right kind of inputs along with a suitable set of outcomes. In other words, there needs to exist some sort of relationship between the set of inputs, often called features, and the outcome, or target. Some relationships are straightforward and can be uncovered with relatively simple algorithms (e.g., linear or logistic regression). Others can be deeply hidden and complex and require much more sophisticated analytics along with a savvy data scientist. But even the most sophisticated model (and there are many) will fail miserably if a relationship does not exist. In some cases, an organization will be swimming in data, but the nature of that data is not immediately amenable to building a useful, machine-learned model. In such cases, the raw data may require transformation or aggregation (e.g., individual words in a body of text → word counts). Or, one may need to develop new features that are better-suited to a particular problem (e.g., individual words in a body of text → counts of positive- and negative-sentiment words). Or, considering the outcome, the target variable may need to be "bucketed" (e.g., dollar amounts a household spends on groceries per week → [$0 – $50, $50 – $100, $100 – $150, $150 – $250, ...]). This is where the data scientist comes in. Someone needs to understand the data, the objective, and how to best get there.
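The three transformations mentioned above (word counts, sentiment-word counts, and bucketing a target) can be sketched in a few lines. This is an illustrative sketch only: the tiny word lists are hypothetical stand-ins for a curated sentiment lexicon, and treating each bucket as a half-open range is my assumption for resolving the shared edges in the grocery example.

```python
from collections import Counter

# Hypothetical sentiment lexicons; a real project would use a curated word list.
POSITIVE = {"great", "love", "reliable"}
NEGATIVE = {"broken", "slow", "refund"}

# Bucket edges from the grocery-spend example in the text (dollars per week).
BUCKET_EDGES = [0, 50, 100, 150, 250]


def word_counts(text):
    """Raw text -> word counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def sentiment_counts(text):
    """Raw text -> (positive word count, negative word count)."""
    counts = word_counts(text)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos, neg


def bucket(amount):
    """Dollar amount -> label of the half-open bucket [low, high) containing it."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    for low, high in zip(BUCKET_EDGES, BUCKET_EDGES[1:]):
        if low <= amount < high:
            return f"${low}-${high}"
    # Anything at or above the last edge falls into an open-ended top bucket.
    return f"${BUCKET_EDGES[-1]}+"


review = "Love the product but shipping was slow and the box arrived broken"
print(word_counts(review).most_common(3))
print(sentiment_counts(review))
print(bucket(72.40))
```

Each function turns raw data into something a model can actually use; deciding which of these transformations fits the problem is the data scientist's call, not the algorithm's.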

For organizations, this misconception boils down to having realistic and grounded expectations. Understanding the basics of machine learning such that you can appreciate its strengths, weaknesses, and limitations will go a long way toward knowing what golden nuggets can be extracted from your data.

Perhaps this one is not a pure myth. Maybe this one should be called, "All data has the value you need." Nevertheless, while data and analytics are proving to be game-changers, just because you have data does not mean you can solve your problems, increase profits, detect the next cyber intrusion in real-time, or predict the next Billboard hit. The most promising data science endeavors look to answer a targeted question. Unfortunately, many (perhaps most) real-world data science projects lack that well-posed question. Instead, they start with having data and a desire for that data to provide some sort of value. So, they grab a data scientist, point them at the data, and say, "Find useful stuff." While exploratory data analysis (EDA), which is what this situation amounts to, may not be the "right" way to do data science, it's not a bad way to get started and can provide value. Ideally, outcomes of EDA can be used to develop one or more well-posed questions. Alternatively, just working through the data exploration process can reveal questions that can be explored with data science.

Well-posed questions aside, many organizations are looking for ways to exploit their data; however, a vital component to being successful at extracting value or useful insights from data is having data that contains information germane to your objective. A common hurdle is that such information is often hidden or difficult to extract (if it wasn't, you wouldn't need data science). For instance, it may be scattered in huge volumes of chaff. Or, it may be that the information of value is only realized through a combination of data elements. A common scenario is where the information that an organization wants to extract lives in unstructured data – e.g., articles, web sites, tweets, audio, video, images. Data science has a plethora of tools that any good analyst can use to get at the relevant information ... assuming it exists. In some cases, an organization is looking to answer a question that their data simply cannot answer, or cannot answer well. For instance, a retail business may be interested in gauging customer sentiment toward their brand, and they turn to a crack data science team with all the website traffic data they have collected. Visits, return visits, unique visitors, conversion rates, time-on-site, etc. – all of that is interesting data, but it cannot be used to measure sentiment. Data that would be relevant would include product reviews (on the retailer's site or elsewhere), other web content mentioning the retailer, tweets that mention the retailer, any transcripts from help- or support-line calls, and so on. This is perhaps an overly simple example, so consider this more subtle scenario: a heavy-machinery manufacturer has a great deal of support-line data (many years of transcripts, an array of products involved, a broad swath of customers, you name it), and they are looking to measure customer sentiment for their best-selling products. Seems doable, right? And it is.
The problem is that the sentiment that would be measured and reported to the manufacturer via support-line channels would be biased. The reason is that people call support lines when they have a problem, so they are generally unhappy. They are frustrated, angry, disappointed. People don't typically call support lines to provide positive feedback. As a result, the sentiment that you would capture from any analysis of such data would contain much more negative sentiment than positive and would be an inaccurate measure.
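The size of this bias is easy to underestimate, so here is a small sketch with made-up numbers. Both the customer counts and the call rates are assumptions chosen only to illustrate the mechanism; they do not come from any real manufacturer.

```python
# Made-up customer base, used only to illustrate selection bias.
customers = {"satisfied": 800, "dissatisfied": 200}

# Assumed share of each group that ever calls the support line.
call_rate = {"satisfied": 0.02, "dissatisfied": 0.40}


def negative_share(counts):
    """Fraction of a group that is dissatisfied."""
    return counts["dissatisfied"] / (counts["satisfied"] + counts["dissatisfied"])


# Who actually shows up in the support-line transcripts.
callers = {group: customers[group] * call_rate[group] for group in customers}

print(f"dissatisfied share, all customers: {negative_share(customers):.0%}")
print(f"dissatisfied share, callers only:  {negative_share(callers):.0%}")
```

With these assumed rates, one in five customers is dissatisfied overall, yet roughly five in six callers are. A sentiment model trained on the transcripts could be perfectly accurate about the callers and still badly wrong about the customer base.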

Another common scenario relates to prediction. Prediction is generally a very difficult task. As a wise man once said, "Prediction is very difficult, especially about the future." [5] Think about the weather. Think about your fantasy football players. Think about the stock market. Think about terrorist attacks. These things are hard to predict and they represent the kinds of problems that data science is being asked to tackle. Consistent, accurate prediction is confounded by many things, including complex system dynamics and thorny sensitivities (weather, stocks), insufficient data relative to the problem scope (terrorism), and elusive intangibles that defy measurement (sports, see below). Challenging problems in prediction require rich data, robust algorithms, and smart data scientists.

All data has value, that is true – but does it have the value you need? And do you realize that? This is the crux of this misconception. The challenge is to determine what data is needed to address your needs, where to find that data, and how to best exploit it.

Back in 2012, the sexiest job of the 21st century was proclaimed to be the data scientist. [6] While some of the hype may have waned, there is no doubt that the title of data scientist still carries a great deal of weight. But, what exactly is a data scientist? Many have tried to answer that question with Venn diagrams:

As one can see from these diagrams – save the last two – if we are to consider them as plausible representations, the field of data science, and the "data scientist," requires a mix of various skills, areas of expertise, and knowledge. Our understanding of this mix has evolved and matured over time, with the thoughtful diagram by Kolassa reflecting that progression. Consider a data science job posting from 2012 that likely would have required experience with Hadoop, HDFS, big data, and MapReduce. Contrast that with a data science position in 2018 that might not mention any of these keywords; in their place are keywords such as deep learning, streaming analytics, and blockchain. Replacing these very soon might be explainable artificial intelligence (XAI), hybrid human-machine intelligence, intelligence augmentation, and who-knows-what.

All of this becomes important when you find yourself asking questions such as: Do we have data scientists? Do we need data scientists? Where do we find data scientists? Why are data scientists so expensive? The problem with these questions is the term, "data scientist." The analogy between the data scientist and a purple unicorn is still apt – finding an individual that satisfies any one of the top four diagrams above is rare. So, who are all these people that call themselves data scientists? They are analysts, they are coders, they are machine learning specialists, they are data engineers, Hadoop SMEs, Python and R gurus that can google Stack Overflow, graduates of data science boot-camps, and so on. The title of data scientist is being thrown around like cat videos on YouTube, and such liberal use of the label does a disservice to the data science community and those who engage with it.

To help un-muddy the data science waters, leading training provider General Assembly, in partnership with a number of major tech employers, including Bloomberg, Booz Allen Hamilton, Nielsen, and Spotify, is developing data science industry standards and a data science certification. General Assembly's Data Science Standards Board "aims to establish a clear set of standards across data science tools and technologies to help guide individuals and employers towards the universal and unbiased competencies that drive success." [7] Hopefully, this kind of industry collaboration will help take the guess-work out of determining the skills behind a data scientist, while also providing aspiring data scientists with a better understanding of the pathways to achieve their career goals.

In the end, labels are only labels and what organizations really need are people with the right skills, the relevant experience, and the creative know-how to solve the problems they need solved. This will almost always mean a team of individuals that collectively make up the purple unicorn.

Given that data science is a field born and bred on data, algorithms, computing, mathematics, and coding, there is a natural tendency to regard the field as exacting, where subjectivity has no place. "Get data and turn the crank," one could say. If it were that simple, then some basic analysis would reveal (see figure below) that we could save thousands of US pedestrians from death if we just stopped consuming high-fructose corn syrup.

Clearly, this is a ridiculous inference and is purely for illustration purposes, but it should demonstrate the dangers of naïve data science.
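To see how easily "turning the crank" produces this kind of result, consider the sketch below. The two series are made up for illustration (they are not the data behind the figure); the only thing they share is that both happen to decline over the same years. `pearson_r` is a hand-rolled helper, included so the example needs no external libraries.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly values for two unrelated quantities that both trend downward.
series_a = [62.0, 60.5, 58.9, 57.0, 55.2, 53.1, 52.0, 50.3]
series_b = [4900, 4870, 4790, 4700, 4650, 4500, 4480, 4400]

print(f"r = {pearson_r(series_a, series_b):.2f}")
```

The correlation comes out very close to 1, yet nothing in the calculation says anything about cause. Any two quantities that drift in the same direction over time will score highly, which is why the number needs a human to interpret it.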

Everything – from identifying the questions to answer, deciding what data will be used to answer those questions, collecting/sampling the data, modeling the data, storing the data, cleaning/reducing the data, analyzing the data, developing models from the data, assessing those models, inferring results, drawing conclusions, visualizing the data and presenting one’s findings – can involve a degree of subjectivity that draws on intuition, experience, judgment, and sometimes, plain luck. Data science is as much an art as it is a science (not unlike many other "sciences"), and it needs to be understood and appreciated in that light. For any given problem, there may be a myriad of potential ways to get to an answer, and which combination of steps will yield an "optimal" answer is rarely known. Even deciding what constitutes "optimal" is often unclear. There are usually all sorts of trade-offs that come into play, e.g., balancing time and resources available to develop a solution against the performance of that solution, weighing complexity against what can be operationalized, and being sensitive to performance-as-tested versus performance-in-the-field. This last consideration relates to how well a solution generalizes and it pervades the field of machine learning. To illustrate this, suppose your objective is to develop a model for the selling price of a home as a function of home size, as measured in square feet. Using data that you get from your town records department, you find a nice relationship between the selling price of a home and its size (see figure below, blue dots are your data). It's a great fit, with an R-squared of 0.9. But is this really a great model? Let's see. Imagine a (wealthy) friend of yours, from a town 30 miles away, tells you that she's moving, and asks you what price she should list her home for. She gives you all the details of the home, including the size of the home (5000 square feet).
Your awesome model tells you that the selling price is well predicted by home size, so you take that number and run it through the model. It predicts a selling price of $1.75M (red X in the figure). Your friend lists her home at $1.8M, but after 3 months, she's received no offers! Why? How could that be? Your model is spot on! The reason is that your model was developed using data that does not capture your friend's situation, and consequently, the model is unable to accurately predict your friend's selling price – this is the essence of generalization. It turns out that above a certain house size (around 3500 square feet), the relationship between selling price and size is no longer linear. A more realistic prediction of the selling price is $1.4M (the blue star), which is 20% lower than the number you gave your friend. This nonlinear relationship could be unique to the town your friend lives in, or it could be indicative of a more general phenomenon. Regardless, it is quite clear that your model is limited in its usefulness and it would be hard to consider it great.
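The home-price story can be reproduced with a small sketch. Everything here is an assumption made to match the numbers quoted above: the "true" price curve is a hypothetical piecewise-linear function ($350 per square foot up to 3500 square feet, flatter beyond that), and the training data is noise-free, so the fit is perfect rather than the R-squared of 0.9 in the story.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


KINK_SQFT = 3500  # size above which the text says the relationship is no longer linear


def assumed_true_price(sqft):
    """Hypothetical price curve chosen to reproduce the figures quoted in the text."""
    if sqft <= KINK_SQFT:
        return 350.0 * sqft
    # Flatter slope above the kink, chosen so that 5000 sq ft comes to $1.4M.
    return 350.0 * KINK_SQFT + (175_000 / 1500) * (sqft - KINK_SQFT)


# Training data covers only home sizes below the kink, as in the town records.
train_sizes = [1000, 1500, 2000, 2500, 3000, 3500]
train_prices = [assumed_true_price(s) for s in train_sizes]
slope, intercept = fit_line(train_sizes, train_prices)

friend_size = 5000
predicted = slope * friend_size + intercept
realistic = assumed_true_price(friend_size)

# A basic guard: flag inputs that lie outside the range the model was fit on.
if friend_size > max(train_sizes):
    print(f"warning: {friend_size} sq ft exceeds the largest training home ({max(train_sizes)} sq ft)")

print(f"model prediction:        ${predicted:,.0f}")
print(f"assumed realistic price: ${realistic:,.0f}")
print(f"overestimate: {(predicted - realistic) / predicted:.0%} of the prediction")
```

The model is flawless on the data it saw and still overshoots by about a fifth on a home outside that range. The range check is the cheap part; knowing when the equivalent check is needed in a model with hundreds of features is where experience comes in.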

This example is highly oversimplified and most of us would know better than to make such a flawed extrapolation. Reality is not so simple, and this is where the challenge lies and the art of data science becomes so important. Today's datasets can be massive (lots of records with lots of fields) and the resulting models developed to extract information from these datasets are complex and high-dimensional. As a result, potential model shortcomings and various other gotchas become hard to see, and acknowledging and managing such potential pitfalls is where experience and intuition become important. It takes time to develop this kind of experience and intuition, and even the savviest data scientist is not immune to getting sideswiped by unforeseen curve balls. The key is recognizing that it can happen.

The topics considered here represent a baseline of understanding the world of data science. Beyond this, things only get more complicated. More daunting is the fact that a new set of misconceptions and questions is right around the corner. Is AI the same as machine learning? Will robot workers replace human workers? Doesn't deep learning solve everything? Bitcoin and blockchain are the same thing, right? And so on. It can be hard to keep up, but knowing what you don't know is always a good start.

Sean McKenna, PhD is the Lead Data Scientist in Booz Allen Hamilton's Boston iHub Office. As part of the Strategic Innovation Group, he is currently developing ways in which advanced analytics and machine learning can be leveraged to solve challenging problems in areas such as operations research, modeling and simulation, and predictive analytics. He has extensive experience with all aspects of data analysis and interpretation and has developed solutions to an eclectic array of technical problems for academic, government, and commercial clients. Dr. McKenna leverages broad technical expertise and analytical rigor to develop innovative solutions across the spectrum of data science – from data ingest and transformation to dynamic visualization of results. In his role as the Lead Data Scientist in Boston, Dr. McKenna works to advance the office's data science capabilities and serves as a technical mentor and resource for the junior staff.

[1] A few years back, there was a similar mishmash of the terms big data, Hadoop, and MapReduce.

[2] Structured data refers to data that adheres to a specific format, and the manner in which it is stored, processed, and accessed is predefined (e.g., relational database tables, spreadsheets); unstructured data refers to data that lacks a predefined format or organization (e.g., text, images, audio).

[4] One aspect of machine learning where it could be argued that machines can learn on their own is deep learning. While the subject of deep learning is beyond the scope of this paper, its recent success and explosive popularity make it hard to gloss over. Deep learning is currently being implemented in a myriad of ways, and the poster child has to be the convolutional neural network (aka convNet or CNN). CNNs have made incredible progress in the field of computer vision, specifically, image recognition (crushing the transcendent, "Is this a dog or a cat?" problem). Cute animals aside, the power of CNNs, and other deep learning methods, is in their ability to ingest high-level raw data (e.g., images or human speech audio) and obviate the need for traditional feature engineering – arguably one of the most challenging, important, and overlooked aspects of machine learning. Through an intricate cascade of multiple nonlinear layers, CNNs "learn" various levels of abstraction – what we might think of as features. For instance, in image recognition, these layers represent structures such as edges, blobs, textures, and so on. In this sense, CNNs are regarded as learning the underlying features of a data set; however, these networks still need to be designed, and that still requires a smart analyst.

[5] While this quote is often attributed to Niels Bohr, Bohr himself usually attributed the saying to Robert Storm Petersen (1882 – 1949), also called Storm P., a Danish artist and writer; however, the saying did not originate from Storm P. The original author remains unknown (although Mark Twain is often suggested).

[6] Davenport, Thomas H., and D. J. Patil. "Data Scientist: The Sexiest Job of the 21st Century." Harvard Business Review 90, no. 10 (October 2012): 70–76.


The Data Daily
