The elephant now does more than just never forget; it can tell you stories in any style you please.

Hadoop V2 is a major upgrade to Hadoop, the widely used big data distributed storage and processing framework. Hadoop's main batch data-processing component, MapReduce, has been a major catalyst for the rise of big data analytics. But times have changed since Hadoop debuted in 2005. Enterprise-level big data, plus the need to analyze data in real-time, interactive, and batch modes, has created demands that go beyond what Hadoop and MapReduce were originally equipped to handle. The ability to analyze data across these modes is critical for the Industrial Internet, where business outcomes depend on advanced scientific analysis of historical and real-time industrial-scale data.

With the release of Hadoop V2 last October, all that's changed. Hadoop V2 brings major architectural changes to its framework that provide foundational elements to enable Industrial Internet applications.

Here are a couple of Hadoop V2's most important features:


Hadoop's YARN (Yet Another Resource Negotiator) untangles much of what made the original combination of the Hadoop Distributed File System (HDFS) and MapReduce relatively inflexible. It does this by decoupling system-level services (resource management, security, etc.) from application frameworks like MapReduce. Why is this huge? Because MapReduce becomes just one of many frameworks that can run natively in Hadoop, so a wide range of applications can access data directly without having to go through MapReduce's batch processing first.


In terms of the Industrial Internet, this means developers can create solutions for real-time analytics, graph processing, in-memory processing, and interactive database queries. Whatever kind of data analytics a business requires, it will be able to architect solutions using Hadoop without the bottlenecks of before.
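To make the decoupling concrete, here is a minimal sketch in Python. It is a toy model, not the YARN API: the ResourceManager class, its methods, the memory figures, and the workload names are all assumptions made up for illustration. The point it tries to show is that one shared resource manager can hand out cluster capacity to different kinds of applications, with batch MapReduce being just one of them.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy stand-in for cluster-wide resource management (not the real YARN API)."""

    total_memory_mb: int
    allocations: dict = field(default_factory=dict)  # application name -> MB granted

    def free_memory_mb(self) -> int:
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app_name: str, memory_mb: int) -> bool:
        """Grant the request only if the cluster still has enough free memory."""
        if memory_mb > self.free_memory_mb():
            return False
        self.allocations[app_name] = self.allocations.get(app_name, 0) + memory_mb
        return True

    def release(self, app_name: str) -> None:
        """Return everything an application holds to the shared pool."""
        self.allocations.pop(app_name, None)


rm = ResourceManager(total_memory_mb=8192)

# Hypothetical workloads of different kinds competing for the same cluster.
granted = {
    "batch-mapreduce": rm.request("batch-mapreduce", 4096),
    "interactive-sql": rm.request("interactive-sql", 2048),
    "stream-analytics": rm.request("stream-analytics", 4096),  # more than is left
}
print(granted)
print("free MB:", rm.free_memory_mb())
```

In this sketch the resource manager knows nothing about what each application does with its memory. That separation is what would let a streaming or interactive engine sit beside MapReduce rather than on top of it.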


A Better, Faster, More Reliable, and Secured HDFS

The way Hadoop stores data across the machines of a cluster has been upgraded substantially in Hadoop V2. Just as YARN decouples resource management from application frameworks, HDFS Federation decouples the namespace (name nodes, directories, files, and blocks) from storage (data nodes, clusters, and physical storage). Federation allows multiple independent name nodes, each managing its own part of the namespace, so a single application that overloads one name node won't keep other applications from accessing data served by the others. Hadoop V2 also adds high availability for HDFS through redundant name nodes: a standby name node can take over if the active one fails. Further enhancing reliability are HDFS Snapshots, which create read-only point-in-time copies of the file system or of a directory tree within it. In addition, Kerberos as the authentication mechanism and Project Rhino’s open source effort to establish a comprehensive security framework are providing a foundation for a secured Hadoop ecosystem.
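The isolation that independent name nodes provide can be sketched in a few lines of Python. This is a toy model rather than HDFS code: the NameNode and FederatedNamespace classes, the path prefixes, and the overload threshold are hypothetical, and routing by path prefix is only loosely in the spirit of a client-side mount table.

```python
class NameNode:
    """Toy name node that owns one slice of the namespace (not the real HDFS API)."""

    def __init__(self, name: str, max_pending: int):
        self.name = name
        self.max_pending = max_pending
        self.pending = 0  # requests currently queued on this name node

    def lookup(self, path: str) -> str:
        if self.pending >= self.max_pending:
            raise RuntimeError(f"{self.name} is overloaded")
        return f"{self.name} resolved {path}"


class FederatedNamespace:
    """Routes each path to the name node that owns its longest matching prefix."""

    def __init__(self, mounts: dict):
        self.mounts = mounts  # path prefix -> NameNode

    def name_node_for(self, path: str) -> NameNode:
        matches = [
            prefix
            for prefix in self.mounts
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no name node owns {path}")
        return self.mounts[max(matches, key=len)]

    def lookup(self, path: str) -> str:
        return self.name_node_for(path).lookup(path)


# Two hypothetical name nodes, each serving its own part of the namespace.
sensors = NameNode("nn-sensors", max_pending=2)
reports = NameNode("nn-reports", max_pending=2)
fs = FederatedNamespace({"/sensors": sensors, "/reports": reports})

sensors.pending = 2  # simulate one application flooding the sensors name node

print(fs.lookup("/reports/2014/q1.csv"))  # still served by the other name node
try:
    fs.lookup("/sensors/turbine-7")
except RuntimeError as err:
    print("lookup failed:", err)
```

Only lookups under the overloaded prefix fail in this model. Paths owned by the other name node are unaffected, which is the isolation property described above.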

The biggest implication of the enhanced HDFS for the Industrial Internet is that Hadoop can now provide the building blocks for a robust data management solution, one that delivers the reliability and performance needed for mission-critical Industrial Internet workloads. As a result, applications can be built with both the flexibility and the robustness demanded by industrial projects.

The Industrial Internet relies on a backbone of information connecting more intelligent machines, advanced analytics, and people at work. As operational technology and information technology converge, and data-driven work flourishes, robust software platforms backed by high-performance, secure, scalable technologies like Hadoop will be critical to efficiency, profitability, and success.

About the author

GE Digital

Driving Digital Transformation

GE Digital connects streams of machine data to powerful analytics and people, providing industrial companies with valuable insights to manage assets and operations more efficiently. World-class talent and software capabilities help drive digital industrial transformation for big gains in productivity, availability, and longevity. We do this by leveraging Predix, our cloud-based operating system, purpose-built for the unique needs of industry.