One of the major reasons big data is a big deal is that it makes it possible to gain previously undiscoverable insights by analyzing incredibly large sets of information we once lacked the capability to collect, let alone analyze. For instance, the billions of connected devices in the Internet of Things are creating data sets we can mine for new knowledge about things like our bodies, our commerce, even package delivery.
But sometimes the promise of big data also comes in the ways it can link, organize and help us interpret massive data sets that already exist, but weren’t necessarily connected – so-called “small data.”
In this latest edition of our Big Data Diary, we’ll take a look at an effort to turn small data into interconnected big data that can help scientists study life on earth.
What’s in a Name?
Nearly everything we know in the study of organisms is contained in hundreds of millions of pages of research, gathered through the centuries and increasing all the time with the contributions of new projects and technologies. That data isn’t all in one place, and it isn’t yet possible for scientists to combine, index, organize and interconnect these vast and dispersed data sets. But a new big data analytics project is currently trying to make that happen, and it all starts with the names of the organisms.
The case of a particular starfish and fungus highlighted on phys.org illustrates why organism names can sometimes make things complicated. The genus of this starfish is called “Asterina,” and so is the genus of the fungus. This obviously causes confusion, but it’s far from the only case of duplicate naming, given that for most of the history of biology, scientists weren’t interconnected and had no way to know if they were using a name that had already been taken. The issue needs to be sorted out before these huge and growing data sets can be fully utilized. That’s where a project like the Global Names Architecture (GNA) comes in.
The GNA is a system that combines various Web services to compile biological scientific names so people can find, check, register and organize them, and link them to online information about various species.
Such efforts are possible because of a convention that far predates big data. In the 18th century, biologists began using Latin binomials, which are two-part names written in italics, with the first word capitalized (such as Homo sapiens). The conventions that binomials share allow big data name recognition tools to scan tens of thousands of data sources and discover duplicates or names not yet present in expert compilations.
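To make the idea of name recognition a little more concrete, here is a minimal Python sketch of how a tool might spot candidate binomials in plain text. It is only an illustration, not how the GNA services actually work: the pattern, the function name find_candidate_binomials and the tiny set of known genera are all assumptions made for this example, and italics are ignored because they don’t survive in plain text.

```python
import re

# Naive heuristic (an assumption for this sketch): a capitalized word (the genus)
# followed by a lowercase word of three or more letters (the species epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_candidate_binomials(text, known_genera):
    """Return (genus, epithet) pairs whose genus is in the known_genera set."""
    found = []
    for genus, epithet in BINOMIAL.findall(text):
        # The pattern alone also matches ordinary phrases such as
        # "Researchers compared", so filter against a list of known genera.
        if genus in known_genera:
            found.append((genus, epithet))
    return found


# Illustrative sample sentence and genus list, not real project data.
sample = "Researchers compared Homo sapiens with Asterina gibbosa in the new survey."
print(find_candidate_binomials(sample, {"Homo", "Asterina"}))
```

On its own, the pattern would flag any capitalized word followed by a lowercase one, which is why the sketch checks each hit against a list of known genera. A real tool would need a far larger reference list and more careful rules.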
The grand unification through names of all these sources of knowledge can lead to more chances for collaboration and new insights. Still, the effectiveness of such a cyberinfrastructure could be compromised by a few factors:
- changes to names over time because of ongoing research.
- misspellings or errors in the way names are presented.
- increasing numbers of species that have no names, but are distinguished by their molecular characteristics.
But a study in the Biodiversity Data Journal showed that when a names parser was used to break the scientific names into their component parts, name-matching accuracy ranged from 85% to 100%. To experts, it was confirmation of the potential of names management software to link this distributed “small data” into a big data cyberinfrastructure that can clear a path to new discoveries.
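What might it look like to break a name into component parts? The Python sketch below is a simplified illustration under stated assumptions, not the parser used in the study: ParsedName, parse_name and same_taxon_name are hypothetical names, and the pattern handles only a genus, an optional species epithet and whatever authorship text trails behind.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Hypothetical container for the parts of a scientific name."""
    genus: str
    epithet: Optional[str]
    authorship: str

    @property
    def canonical(self) -> str:
        # Canonical form: the name without authorship, e.g. "Homo sapiens".
        return " ".join(part for part in (self.genus, self.epithet) if part)


# Simplified assumption: capitalized genus, optional lowercase epithet,
# then everything else is treated as authorship (author and year).
NAME = re.compile(r"^\s*([A-Z][a-z]+)(?:\s+([a-z]+))?\s*(.*?)\s*$")


def parse_name(raw: str) -> ParsedName:
    match = NAME.match(raw)
    if match is None:
        raise ValueError(f"not a recognizable scientific name: {raw!r}")
    genus, epithet, authorship = match.groups()
    return ParsedName(genus, epithet, authorship)


def same_taxon_name(a: str, b: str) -> bool:
    """Compare two name strings by canonical form, ignoring authorship."""
    return parse_name(a).canonical == parse_name(b).canonical


print(parse_name("Homo sapiens Linnaeus, 1758"))
print(same_taxon_name("Homo sapiens Linnaeus, 1758", "Homo sapiens L."))
```

Comparing canonical forms lets two records match even when one spells out the author and year and the other abbreviates them. This sketch does not attempt to handle misspellings, name changes over time or authors whose names begin with a lowercase word, all of which a production parser would have to address.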
Equinix won’t be naming any starfish in the near future, but our Data Hub offering makes it easier to bring large data sets close to data sources, analytics and end users. The information stays secure and can be directly accessed for faster insights.
Check out the entries in our Big Data Diary series:
“Name That Organism!” (see post above)