Big Questions Before Big Data

 

By Ephraim Baron

 

I have a love/hate relationship with information technology. On the love side, computers and networks have made knowledge readily accessible to billions of people. On the hate side, we’re constantly inundated with new buzzwords whose apparent aim is to make those who coin them feel smart and the rest of us feel like dullards. And what’s the current buzz-term du jour? Big Data.

So just what is Big Data, anyway? Well, as with most things tech, the answer depends on whom you ask. Let’s start with a definition from Wikipedia, which describes Big Data as…

“a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools”

Some say Big Data is necessarily unstructured; others say that very large data warehouses qualify. To the non-practitioner, the distinction is far less important than the much more central question: “why should I care?” The implied answer is that everyone is doing it, so you should, too.

To answer why Big Data matters, let’s start with the DIKW pyramid, which describes the progression from data to information to knowledge to wisdom. Simply put, data alone is a mere building block. We can query it to generate information based on facts. The results are then interpreted to form knowledge, which is a clearer understanding of what the information implies. Finally, the knowledge is evaluated against experience, and if it’s found to be useful it becomes wisdom.

I was once asked to give a presentation on data analysis. What I ended up giving was a DIKW overview. I broke it into 10 steps, as follows:

 

#1 Ask a big, interesting question

This is by far the most important step. After all, who cares how clever or impressive an analysis is if the results aren’t interesting? Some believe they can simply sift through large amounts of data and discover things. This is the bottom-up approach to data analysis, and in my experience it’s rarely successful. I prefer to start with a high-level goal and then work through the steps to support it. This is the top-down approach I’m now describing.

example: What public health initiative would provide the most benefit to humanity?

 

#2 Identify data sources

At this point you need to find out if and where data exists to help you answer your big question. This is likely to be the most time-consuming part of the entire exercise. Because we’ve asked such a big question, the list of possible data sources may likewise be big. If you’re lucky, there may be a single, consistent, authoritative source. More likely, you’ll find data scattered across multiple sources, possibly in different formats and with a variety of owners. Additionally, you must determine how reliable the data is. If you don’t trust it, you can keep looking, generate your own data, or abandon your big question.

example: You can find epidemiological data in many public health databases. Possible sources include the National Institutes of Health and the Centers for Disease Control and Prevention in the U.S.; data from other government-funded research; data from non-governmental organizations (NGOs); and data from companies involved in health research.

 

#3 Get access to the data

Just because you’ve identified the data you need doesn’t mean you can use it. Many data sources are proprietary. Others require paid subscriptions. Still others are available only to specific groups of users, such as members of an academic community or a professional society. It’s possible that as a precondition to getting access, you may be required to share the results of your research.

example: You find that some public health data sources are publicly available. Others require you to contact the data owner and explain why you want access and how you plan to use the data. Still others are simply not available to you.

 

#4 Extract, Transform, Load (ETL)

This term applies to the integration of data from multiple sources. The first step, Extract, requires that you understand how the data is organized. With traditional, structured data, this means knowing the database schema. With unstructured data, it means knowing where and how the data is stored. The Transform phase typically involves mapping data from multiple sources into a common format. The Load phase involves compiling and assembling all data into a common repository.

example: Some of the data you need is stored in an online database and can be downloaded in the form of comma-separated values (CSV) files. Other data is in SQL databases, and some is unstructured and requires you to define key-value pairs and run MapReduce jobs to extract what you need. You plan to extract all data and load it into a SQL database for further processing and analysis.
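To make the three phases concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the field names, the `health` table, the helper functions, and the sample rows are all made up, and a real pipeline would read from actual files and databases rather than in-memory strings.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from one source (values invented for illustration).
CSV_EXPORT = """region,condition,cases
North,diabetes,120
South,heart disease,95
"""

# Hypothetical records from a second source that uses different field names.
OTHER_SOURCE = [
    {"area": "East", "diagnosis": "Diabetes", "count": "80"},
]


def extract_csv(text):
    """Extract: read rows from CSV text as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(row, mapping):
    """Transform: rename source fields to a common schema and normalize values."""
    return {
        "region": row[mapping["region"]].strip(),
        "condition": row[mapping["condition"]].strip().lower(),
        "cases": int(row[mapping["cases"]]),
    }


def load(conn, rows):
    """Load: insert rows in the common format into one SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS health "
        "(region TEXT, condition TEXT, cases INTEGER)"
    )
    conn.executemany(
        "INSERT INTO health VALUES (:region, :condition, :cases)", rows
    )
    conn.commit()


def run_etl():
    """Run all three phases against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    csv_map = {"region": "region", "condition": "condition", "cases": "cases"}
    other_map = {"region": "area", "condition": "diagnosis", "cases": "count"}
    rows = [transform(r, csv_map) for r in extract_csv(CSV_EXPORT)]
    rows += [transform(r, other_map) for r in OTHER_SOURCE]
    load(conn, rows)
    return conn
```

The per-source mapping is the heart of the Transform phase: each source keeps its own field names, and only the mapping knows how they line up with the common format.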

 

#5 Clean and Validate the Data

As the saying goes, “garbage in, garbage out”. So before you put a lot of time into data analysis, you need to make sure the data is clean and consistent. This will likely include steps such as data de-duplication and a search for suspicious entries.

example: You de-dupe your health data, and you validate location information using geographic information systems (GIS) software to remove spurious records.
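As a rough sketch of what de-duplication and a sanity check on locations might look like, consider the Python below. It is not GIS software: it only rejects coordinates that fall outside the valid latitude and longitude ranges, which is a much cruder test. The record fields (`patient_id`, `condition`, `lat`, `lon`) and the choice of duplicate key are assumptions made up for this illustration.

```python
def clean_records(records):
    """Drop duplicate records and records with impossible coordinates.

    Two records count as duplicates when they share the same
    (patient_id, condition) pair; the first valid one is kept.
    """
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["patient_id"], rec["condition"])
        if key in seen:
            continue  # duplicate of a record we already kept
        if not (-90 <= rec["lat"] <= 90 and -180 <= rec["lon"] <= 180):
            continue  # coordinates cannot exist on Earth
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Hypothetical sample records (invented values).
sample = [
    {"patient_id": 1, "condition": "diabetes", "lat": 40.7, "lon": -74.0},
    {"patient_id": 1, "condition": "diabetes", "lat": 40.7, "lon": -74.0},
    {"patient_id": 2, "condition": "stroke", "lat": 123.4, "lon": 10.0},
    {"patient_id": 3, "condition": "stroke", "lat": 51.5, "lon": -0.1},
]
kept = clean_records(sample)
```

Here the second record is dropped as a duplicate and the third because a latitude of 123.4 is impossible, leaving two records.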

 

#6 Query the Data

Now that you’ve acquired your data and it’s sparkly-clean, you’re ready to turn it into information. This means creating queries against your data that speak to your big question.

example: You query your public health data to find out the diseases with the highest prevalence. You also characterize those afflicted by age, location, and other demographics.
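Queries of this kind might look something like the following sketch, which uses Python's built-in `sqlite3` module. The `cases` table, its columns, and the rows are invented for illustration. Note also that it counts cases rather than computing true prevalence, which would require dividing by the size of the population.

```python
import sqlite3

# Hypothetical table of individual case records (invented values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (condition TEXT, age INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        ("diabetes", 54, "North"),
        ("diabetes", 61, "South"),
        ("diabetes", 58, "East"),
        ("heart disease", 67, "North"),
        ("heart disease", 72, "South"),
        ("stroke", 70, "East"),
    ],
)


def top_conditions(conn, limit=3):
    """Return the most frequently recorded conditions with their case counts."""
    return conn.execute(
        "SELECT condition, COUNT(*) AS n FROM cases "
        "GROUP BY condition ORDER BY n DESC, condition LIMIT ?",
        (limit,),
    ).fetchall()


def by_age_band(conn, condition):
    """Count cases of one condition per ten-year age band (50 means 50-59)."""
    return conn.execute(
        "SELECT (age / 10) * 10 AS band, COUNT(*) FROM cases "
        "WHERE condition = ? GROUP BY band ORDER BY band",
        (condition,),
    ).fetchall()
```

The first function speaks to "which diseases are most common?" and the second to "who is afflicted?"; the same pattern extends to location or any other demographic column.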

 

#7 Analyze and Adjust Your Query

Based on your queries, you need to decide whether the information generated helps you answer your big question. This will probably include performing a statistical uncertainty analysis to establish how confident you are in the results.

example: Based on your analysis, you determine that heart disease, cancer, diabetes, and stroke are four of the top public health issues facing society.
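One simple form of uncertainty analysis is a confidence interval around an observed proportion. The sketch below uses the normal-approximation (Wald) interval, which is only one of several methods and behaves poorly for very small samples or proportions near 0 or 1. The function name and the numbers in the usage line are made up for illustration.

```python
import math


def proportion_ci(cases, sample_size, z=1.96):
    """Approximate confidence interval for a proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly
    95% confidence. The result is clipped to the range [0, 1].
    """
    if sample_size <= 0 or not 0 <= cases <= sample_size:
        raise ValueError("need 0 <= cases <= sample_size and sample_size > 0")
    p = cases / sample_size
    std_err = math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - z * std_err), min(1.0, p + z * std_err)


# Hypothetical survey: 50 of 1,000 people sampled have the condition.
low, high = proportion_ci(50, 1000)
```

If the interval is too wide to separate one condition's rate from another's, that is a signal to adjust the query or go back for more data.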

 

#8 Interpret the Results

Now that you have a possible answer to your big question, you need to determine what it means. This is where information is turned into knowledge.

example: You notice a correlation between body mass index and both heart disease and diabetes. This leads you to conclude that obesity is a contributing factor to both.
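For the curious, a correlation like this is commonly measured with the Pearson coefficient, which ranges from -1 to +1. Here is a minimal sketch in plain Python; the function name is my own, and the coefficient measures only linear association, so it cannot by itself show that one thing causes another.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 means the two measures rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.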

 

#9 Report Findings

Once you’re confident in the results of your analysis, it’s time to share your work. This is where you finally get to use tools such as pivot tables and graphs, since pretty pictures are easier to understand than tables of numbers.

example: You write a scholarly article on the importance of proper nutrition to public health. You also advocate for increased funding for public health education and programs focused on nutrition. Finally, you apply for grants to further your research.

 

#10 Track and Update

This is the final part of the DIKW process. Knowledge becomes wisdom when you’re able to judge how well you were able to answer your big question.

example: In subsequent studies over many years following the publication of your original findings, you are able to demonstrate that improvements in nutrition are positively correlated with decreases in heart disease and diabetes. (Based on your years of research, you are awarded a Nobel Prize in medicine.)

 

So remember, if you want to impress your techie friends, talk about Big Data. If you want a Nobel Prize, ask big questions.