GE and Pivotal said they built the first industrial-scale “data lake” system that could supercharge how companies store, manage and glean insight from information harvested from machines connected to the Industrial Internet.
The system, which has already tracked more than 3 million flights and gathered 340 terabytes of data, can analyze data 2,000 times faster than previous methods and cut costs tenfold. In one demonstration, it completed in just 20 minutes a complex task that would previously have taken a month to compute.
“Big Data is growing so fast that it is outpacing the ability of current tools to take full advantage of it,” said Bill Ruh, vice president of GE Software. Dave Bartlett, computer scientist and chief technology officer for GE Aviation, said that industrial data lakes will help companies predict future problems and run machines more efficiently, sustainably and profitably. They will also help GE maintain and service machines better. “We are getting the most life out of our assets,” he said.
The industrial data lake will have numerous applications across many industries and types of hardware, from jet engines and locomotives to medical scanners.
Bartlett says a data lake can swallow massive streams of data and store them in whatever form they arrive, much like a large body of water drinks from its tributaries.
This is different from a standard data warehouse, where data is classified and categorized at the point of entry. “Instead of slicing, dicing and classifying the data, we capture the metadata, which is data about the data,” Bartlett says. “Metadata provides a more robust and varied context at the time of analysis that’s been missing from conventional data storage.”
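This "store first, interpret later" approach is commonly called schema-on-read. As a minimal sketch (the function and field names here are hypothetical, not GE's actual system), an ingest step might keep the raw payload untouched and attach only metadata:

```python
import time
import uuid

def ingest(store: list, raw_payload: bytes, source: str) -> dict:
    """Append a record to the lake exactly as it arrived.

    Nothing is parsed, classified, or validated at write time; only
    metadata (data about the data) is attached, so the payload can be
    interpreted freely at analysis time (schema-on-read).
    """
    record = {
        "id": str(uuid.uuid4()),          # unique record id
        "source": source,                 # where the data came from
        "ingested_at": time.time(),       # when it entered the lake
        "size_bytes": len(raw_payload),   # cheap structural metadata
        "payload": raw_payload,           # untouched raw bytes
    }
    store.append(record)
    return record

lake = []
ingest(lake, b'{"egt_c": 612, "engine": "GE90-115B"}', "flight-telemetry")
```

A warehouse-style pipeline would instead parse the payload against a fixed schema at this point and reject or transform anything that did not fit.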
Bartlett says that a data lake allows companies to ask many more questions of a given data set than they could before. “A numeric sequence in a database is only as meaningful as the context that can be applied,” he says. “By itself, it is just a number that the data warehouse might translate to what you paid two years ago to overhaul a particular kind of jet engine. But a data lake can provide the metadata to drive numerous analytics associated with that event, including the reasons behind the overhaul and how to better avoid or predict such overhauls in the future.”
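To illustrate Bartlett's point with a toy example (the field names, values, and helper functions are assumptions for illustration only): on its own an overhaul cost is just a number, but once metadata such as the engine model and the reason for the overhaul is attached, the same records can answer many different questions:

```python
# Hypothetical overhaul events as they might sit in a data lake,
# each carrying metadata alongside the bare numeric cost.
events = [
    {"cost_usd": 1_400_000, "engine_model": "GE90-115B",
     "reason": "high_egt_margin_loss", "year": 2012},
    {"cost_usd": 1_150_000, "engine_model": "GE90-115B",
     "reason": "scheduled", "year": 2013},
    {"cost_usd": 1_600_000, "engine_model": "GEnx-1B",
     "reason": "high_egt_margin_loss", "year": 2013},
]

def overhauls_by_reason(events, reason):
    """One question: which overhauls shared the same underlying cause?"""
    return [e for e in events if e["reason"] == reason]

def total_cost(events, engine_model):
    """Another question, from the same records: spend per engine model."""
    return sum(e["cost_usd"] for e in events
               if e["engine_model"] == engine_model)

print(len(overhauls_by_reason(events, "high_egt_margin_loss")))  # 2
print(total_cost(events, "GE90-115B"))  # 2550000
```

Because the metadata travels with each record, new questions can be asked later without re-ingesting or re-classifying the data.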
Bartlett, who studied biology and ecosystems before he jumped into computer science, uses a biological metaphor to describe the data lake concept. “A data lake is like a pond in the woods – a richly diverse ecosystem,” he says. “You have complex food webs composed of millions of organisms, from algae and plants all the way up to top predators. Other factors such as water depth, available oxygen, nutrient levels, temperature, salinity and flow create the context of an intricate, interconnected ecosystem. If you throw a line in the water you never know what you will catch. It is an exciting place to fish! The questions and analytical opportunity are almost limitless.”
“On the other hand,” he says, “a more traditional database is more like a fish farm where all the species have been pre-classified and fed the same diet and health supplements. Some intensive tanks even employ biosecurity measures – a far cry from the rich, open natural ecosystem. If you throw a line in the water here, you have a pretty good idea of what you will catch! While useful, it has more limitations as to what it can teach us.”
Some 25 airlines are already streaming data into GE’s and Pivotal’s data lake system to better manage and maintain their fleets. The system lets service crews analyze performance anomalies more effectively. When a jet engine reports a temperature that’s higher than usual, for example, the system looks for similar events in the past, based on the type of engine, its age, service history and many other factors. “The magic happens when you marry the traditional engineering approach with the data science enabled by the data lake,” Bartlett says. “It opens up a whole new world of possible ‘what if’ questions.”
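The kind of lookup described above can be sketched as a similarity search over past events. This is a simplified illustration, not GE's actual method; the field names and tolerances are assumptions:

```python
def similar_past_events(history, engine_model, age_hours, temp_c,
                        age_tol=2_000, temp_tol=15.0):
    """Find past events on comparable engines with comparable readings.

    A real system would weigh many more factors (service history,
    flight conditions, sensor cross-checks); this sketch matches only
    on engine type, approximate age, and temperature exceedance.
    """
    return [
        e for e in history
        if e["engine_model"] == engine_model
        and abs(e["age_hours"] - age_hours) <= age_tol
        and abs(e["temp_c"] - temp_c) <= temp_tol
    ]

# Hypothetical historical events pulled from the lake.
history = [
    {"engine_model": "GE90-115B", "age_hours": 18_500, "temp_c": 610,
     "finding": "compressor fouling"},
    {"engine_model": "GE90-115B", "age_hours": 4_200, "temp_c": 612,
     "finding": "sensor drift"},
    {"engine_model": "GEnx-1B", "age_hours": 18_900, "temp_c": 611,
     "finding": "compressor fouling"},
]

# A GE90-115B at ~19,000 hours reading 612 °C matches only the first
# event: the second engine is far younger, the third is a different model.
matches = similar_past_events(history, "GE90-115B", 19_000, 612)
```

Narrowing the candidate set this way is what lets crews bring the "traditional engineering approach" to bear on a handful of genuinely comparable cases.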
The industrial data lake works with GE’s Predix industrial software platform and massively parallel processing architectures like the open-source Apache Hadoop. Bartlett says it is this combination that will put those applications within reach across industries.
“When you dive into the data lake, you start seeing questions you didn’t even know how to ask,” Bartlett says. “It gives a transformational ability to your business model.”