Wikipedia broadly defines big data as “any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.” But exactly what counts as “large and complex” can differ by company. In other words, the “big” in “big data” is in the eye of the beholder.
Some companies need to adopt new techniques to handle modest-sized datasets in the gigabytes range. Other businesses aim to operate at “Internet scale,” which typically means handling petabytes and exabytes (for perspective, an exabyte is a quintillion bytes, or 1,000,000,000,000,000,000 bytes). And experiments at the CERN Large Hadron Collider generated 22 petabytes of data per day, and that’s after throwing away 99% of the data!
Whatever the “big” in big data means to an enterprise, that enterprise needs to prepare to deal with it, because its competitors already are. And Equinix can help it do so.
Traditionally, companies solved larger problems simply by purchasing a larger computer. But the costs per unit of computing tend to increase exponentially with scale, and there are also practical limits to the size of a single machine.
Some companies dealt with those limitations by looking at alternative computing architectures, including grid computing, which sees PCs work together in clusters of thousands to solve problems that exceed the capabilities of a single mainframe machine.
Others used techniques such as “massively parallel” computation. Google’s MapReduce, for instance, divides a large dataset into “shards,” which are distributed to computing nodes. The results from each node are then collated into a single result set. In this way, arbitrarily large data sets can be processed by a system that scales linearly with the addition of new PCs.
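To make the shard-and-collate idea concrete, here is a minimal single-process sketch in Python. It is not Google’s MapReduce implementation or API; the function names (`make_shards`, `map_shard`, `reduce_counts`, `word_count`) and the word-counting task are assumptions chosen purely for illustration, and the “nodes” are simulated by an ordinary loop.

```python
from collections import Counter
from functools import reduce


def make_shards(records, num_shards):
    """Split the records into num_shards roughly equal slices (hypothetical helper)."""
    return [records[i::num_shards] for i in range(num_shards)]


def map_shard(shard):
    """Map step: a node counts the words in its own shard only."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(records, num_shards=3):
    """Shard the input, map each shard independently, then collate the results."""
    shards = make_shards(records, num_shards)
    # In a real cluster each shard would be handled by a separate machine;
    # here the "nodes" are simulated one after another in a single process.
    partials = [map_shard(shard) for shard in shards]
    return reduce(reduce_counts, partials, Counter())
```

Calling `word_count(["big data", "big cluster", "data data"])` counts “data” three times, “big” twice and “cluster” once, and the answer is the same whatever value `num_shards` takes. Because each shard is processed independently, adding nodes lets the system take on more shards at once.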
Cloud architectures can incorporate data manipulation technologies and techniques from inception. But legacy enterprise systems may require a complete overhaul, which is rarely feasible for production systems. As such, it generally makes sense for enterprises to integrate high-performance computing and legacy systems with existing cloud infrastructure (“IaaS”) service providers, such as Amazon Web Services (AWS) or Azure. Colocating with them and cross-connecting to their systems, through services like AWS Direct Connect or Azure ExpressRoute, both available inside Equinix, is the best way to do so.
Why? Moving large datasets around requires massive amounts of bandwidth, and analyzing on-site datasets calls for low-latency connections. Furthermore, the fluctuations in performance found on the Internet can lead to application performance problems and outages. And many datasets contain sensitive data (personally identifiable information, for example) that should not be transmitted over public networks.
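A rough back-of-the-envelope calculation shows why bandwidth matters at this scale. The sketch below is illustrative only: the link speeds, the `efficiency` parameter and the function name are assumptions, not figures from any provider or from Equinix.

```python
PETABYTE = 10**15  # bytes, decimal definition
GBPS = 10**9       # bits per second


def transfer_time_hours(dataset_bytes, link_bits_per_second, efficiency=1.0):
    """Ideal time to move a dataset over a link, in hours.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    link speed that is actually usable; 1.0 means a perfect, uncontended link.
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


one_gbps_hours = transfer_time_hours(PETABYTE, 1 * GBPS)   # about 2,222 hours
ten_gbps_hours = transfer_time_hours(PETABYTE, 10 * GBPS)  # about 222 hours
```

Under these assumptions, a single petabyte takes roughly three months to move over a perfect 1 Gbps link and more than nine days over 10 Gbps, before accounting for any of the variability of the public Internet.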
Direct, private access to public cloud providers through Equinix completely bypasses the public Internet, which mitigates reliability and privacy concerns. And the recently debuted Equinix Cloud Exchange gives businesses the ability to simultaneously connect to multiple public cloud providers and access the array of services they need to make big data work for them.
The increased computing power comes at a lower cost. Microsoft recently found that the total cost of ownership of a 100,000-server cluster (typical of large cloud providers) is 80% lower than that of a 1,000-server cluster (typical of a large enterprise).
Furthermore, direct access to cloud services means it’s no longer necessary to engineer for peak loads. Consider a carrier that requires 1,000 servers on one day each month for a billing batch job, but only 100 servers on all the other days. It could reduce its in-house footprint by 90% by using a third-party cloud service for its “big data” tasks.
Just as the definition of big data is flexible across companies, companies must have flexibility to fully exploit big data. Colocating inside Equinix and accessing the Equinix Cloud Exchange give them exactly that.
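The carrier example can be worked through in a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, and the 30-day month is an assumption the text does not state.

```python
def footprint_reduction(peak_servers, baseline_servers):
    """Fraction of in-house servers saved by owning only the baseline fleet."""
    if peak_servers <= 0 or not 0 <= baseline_servers <= peak_servers:
        raise ValueError("expected 0 <= baseline_servers <= peak_servers, peak > 0")
    return (peak_servers - baseline_servers) / peak_servers


def peak_fleet_utilization(peak_servers, baseline_servers, peak_days, days_in_month=30):
    """Average utilization if the carrier owned enough servers for the peak day."""
    used = peak_servers * peak_days + baseline_servers * (days_in_month - peak_days)
    return used / (peak_servers * days_in_month)


reduction = footprint_reduction(1000, 100)           # 0.9, i.e. the 90% in the text
utilization = peak_fleet_utilization(1000, 100, 1)   # 0.13 with a 30-day month
burst_servers = 1000 - 100                           # servers rented on the peak day
```

With these numbers, a fleet sized for the peak would sit at about 13% average utilization, whereas a 100-server in-house fleet plus 900 cloud servers for one day covers the same workload.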