Data Lakes and Clouds: Flowing Together

Jim Poole


The push to leverage big data analytics for customer insights, process efficiencies, predictive maintenance and other new opportunities has spawned a brand new data repository architecture. Massive, expensive relational data warehouses, with their dedicated staff and painstaking data integration and cleansing, are giving way to equally massive “data lakes.” Within these lakes, structured and unstructured data – from customer relationship management, to social media, to machine sensor information – are gathered, combined and stored together in their original formats. They float there together, along with their metadata, waiting to be combined and harnessed at a later date for that small customer insight, 1% efficiency improvement, or predictive maintenance that saves the organization or boosts its revenues by millions of dollars.

Rather than expending all that time, money, and expertise cleansing everything up front, new big data solutions such as Hadoop and NoSQL databases allow organizations to start accessing, analyzing and harnessing anything and everything right away in its original format, with some loose integration at the back end at the time analytics are applied. This approach is better suited for gobs of unstructured data, such as social media, but can also work with structured data.
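To make the "store it raw, integrate it later" idea concrete, here is a minimal Python sketch. It is an illustration only, not how Hadoop or any particular NoSQL store works: the `LAKE` list, the sample records, and the `read_records` and `customer_view` helpers are all hypothetical. Each entry keeps its payload in its original format alongside a little metadata, and structure is applied only when a question is asked.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": every entry keeps its payload in its
# original format, plus metadata describing the source and format.
LAKE = [
    {"source": "crm", "format": "csv",
     "raw": "customer_id,name\n42,Ada\n43,Grace\n"},
    {"source": "social", "format": "json",
     "raw": '{"customer_id": 42, "text": "Love the new engine!"}'},
    {"source": "sensor", "format": "json",
     "raw": '{"customer_id": 42, "temp_c": 611.5}'},
]


def read_records(entry):
    """Parse one raw entry at read time, using its recorded format."""
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["raw"])))
    if entry["format"] == "json":
        return [json.loads(entry["raw"])]
    raise ValueError(f"unknown format: {entry['format']}")


def customer_view(lake, customer_id):
    """Loosely integrate everything known about one customer, on demand."""
    view = {}
    for entry in lake:
        for record in read_records(entry):
            # CSV yields strings and JSON yields ints, so the key is
            # normalized here, at query time, rather than at load time.
            if str(record.get("customer_id")) == str(customer_id):
                view.setdefault(entry["source"], []).append(record)
    return view


if __name__ == "__main__":
    print(customer_view(LAKE, 42))
```

Nothing was cleansed or modeled when the data landed. The only integration step, matching on `customer_id`, happens inside the query, which is the trade-off the approach makes: less work up front, more work each time analytics are applied.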

Fast and Flexible

This new data architecture brings significant advantages. Perhaps the greatest is flexibility, as storing all that varied information in one massive lake helps to break down business silos, not to mention that additional silo of the data warehouse management and integration staff. Preserving information in its original format rather than constraining it with predetermined data models makes it easier and more practical for different departments to harness it from different angles and in different combinations. And dispensing with all that up-front integration and rigidity lets business units start gleaning insights fast.

Much like a lake ecosystem, the data lake evolves and matures gradually as new data and metadata stream in, along with task-based integration and various other enhancements. Finally, data lakes are an order of magnitude less expensive to create and maintain than data warehouses.

Enter the Cloud

Add cloud-based Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) and organizations get even more flexibility and speed, including fast, low-cost ramp-up, instant scalability and a built-in platform for harnessing all that varied information. IaaS can quickly provide and scale the storage, processing and other resources for holding and analyzing all that information. Colocation and IaaS can provide storage and processing locations in close proximity to data collection for near real-time, low-latency analytics. PaaS and SaaS offer ready-made platforms to rapidly build or lease specialized, powerful analytics applications for different services, devices and channels, as well as essential features such as data security and disaster recovery. All this lets organizations leverage big data fast, flexibly and safely without having to invest in their own expensive infrastructure and expertise.

A perfect example in the Internet of Things (aka the Industrial Internet) category is General Electric’s (GE) emerging industrial data lake, a cloud-centric offering, in partnership with Pivotal, for storing and harnessing huge volumes of sensor information from GE devices. One of GE’s first projects with Pivotal is a giant data lake that ingests and tracks data from millions of airline engine sensors during and after flights. Thanks to locally placed data repositories and applications and services based on GE’s Predix platform, users can harness huge quantities of near-live jet engine sensor data for valuable predictive analytics that slash downtime and yield new operational efficiency insights. GE has estimated that a mere 1% efficiency gain from industrial Internet analytics would yield $150 billion in savings across the industries in which the company operates.

Learn more about GE’s “1% Rule.”

Read more about Equinix and big data.

Jim Poole, Vice President, Business Development at Equinix