This is the second in a series of guest blog posts from IDC on analytics and intelligent industrial applications. The previous post, Analytics and the Industrial Internet: Making the Value Case, described the value case. The current post deals with the data challenges.
Analytic applications have changed the way businesses operate, and it’s only the beginning. I started the analytics practice at IDC 18 years ago around the theme of "analytic applications." These are applications driven by line of business champions in order to gain visibility into the performance of key processes and products. The first of these applications were in finance, then marketing, then operations followed – with the goal of improving decisions that were of significant value to the business.
But no matter which business process or function was involved, one guideline nearly always proved true: 80% of the effort on an analytics project involves getting the data ready. Without this preparation, there can be no analysis. It's like a home painting job – most of the effort is on preparing the surfaces before you can apply the paint that will beautify the home's appearance. You don't see the preparation when the paint job is done well – but when preparation is not done well, you surely see the lack of preparation reflected in the poor results.
This 80% rule has been proven over and over again in analytics projects.
Operational analytics takes this a step further, adding complexity to data preparation that must be addressed before you can gain real insights that improve performance.
Data preparation for analytics in the traditional, structured data environments
To understand the added complexity, it is helpful to first take a look at traditional analytics projects, which leverage a data warehouse for structured data. The strategy is to move needed data from multiple sources to a single point of integration, the data warehouse. This involves the following steps:
- Data capture: Data required for analytics (to support line of business managers) has already been captured by transactional systems (i.e., systems of record such as billing, payroll, procurement) within the enterprise, run in the enterprise data center. Most of the data is managed in relational databases, the standard for the past 30 years, though older systems still in use may be relying on file systems or pre-relational databases.
- Data movement: The structured data from multiple source systems is offloaded into a dedicated analytical data store (data warehouse). First, selected data of interest is extracted from these source systems. Second, IT transforms the source data via custom programs or purpose-built ETL (extract/transform/load) tools to map each data set to the logical data model (schema) of the warehouse. The transformed data is then loaded into the data warehouse on a periodic, batch basis.
- Data aggregation: To provide good performance for analysis, data may be pre-aggregated into cubes for analytical processing (OLAP) following a dimensional model (e.g., revenue by account, region, time period). But the cost is latency – i.e., the data may not be up to date. Such data is only as current as the last time the data was moved to the warehouse and the last time it was aggregated into a cube.
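The extract/transform/load and aggregation steps above can be sketched in miniature. This is a hypothetical billing feed; the field names, values, and in-memory "warehouse" are illustrative stand-ins, not from any specific system:

```python
from collections import defaultdict

# --- Extract: rows pulled from a source transactional system (illustrative) ---
source_rows = [
    {"acct": "A-100", "rgn": "EMEA", "billed": "1200.50", "period": "2024-Q1"},
    {"acct": "A-100", "rgn": "EMEA", "billed": "980.00",  "period": "2024-Q2"},
    {"acct": "B-200", "rgn": "APAC", "billed": "450.25",  "period": "2024-Q1"},
]

# --- Transform: map each source record to the warehouse schema ---
def transform(row):
    return {
        "account": row["acct"],
        "region": row["rgn"],
        "period": row["period"],
        "revenue": float(row["billed"]),  # cast text amounts to numbers
    }

# --- Load: here, simply an in-memory list standing in for the warehouse ---
warehouse = [transform(r) for r in source_rows]

# --- Aggregate: pre-compute a "cube" of revenue by region and period ---
cube = defaultdict(float)
for fact in warehouse:
    cube[(fact["region"], fact["period"])] += fact["revenue"]

print(cube[("EMEA", "2024-Q1")])  # 1200.5
```

Note the latency trade-off described above: this cube is only as current as the last batch run that rebuilt it.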
Data preparation for analytics in the Industrial Internet environment
In the Industrial Internet environment, this data preparation process is significantly altered because the data and its location are different in important ways. There is a wide variety of data captured in different settings both within and outside of the enterprise. Contrast the steps in data preparation with the traditional data warehousing environment for structured data:
- Data capture: The data of interest is not all structured data, but rather a mix of structured, semi-structured, and unstructured data. The data comes from traditional IT systems but also from sensors on industrial machines, data historians (that organize time-series industrial data), and other forms of connected devices.
- Data movement: Industrial data must be collected locally, as these machines can be in remote locations where network availability and bandwidth are challenging. Because of the networking limitations and the sheer size of these data sets, it is not always possible to move the data to a centralized analytical environment. And because we may not know the data's structure until we examine it, managing such data is not within the sweet spot of conventional relational databases. Rather, loading the data into a more free-form database environment such as Hadoop or NoSQL (with access methods such as Hive or Spark) may be a better choice. Of course, the distributed nature of the data brings new security challenges that were not present in the centralized on-premises IT data center.
- Data aggregation: Industrial operation planners may value "cubed" data organized around the dimensions of the business or process. But the latency inherent in such batch-oriented aggregation processes will not work for front-line, real-time operations that require the monitoring of data coming in at high velocity. With very tight decision windows, there isn't time to pre-aggregate the data. In-memory capabilities can enable virtual cubes, combining the raw data on the fly in support of dimensional analysis.
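The in-memory, on-the-fly aggregation idea above can be illustrated with a minimal sketch: a running aggregate that is updated as each reading arrives, so no batch pre-aggregation step (and none of its latency) is needed. The `VirtualCube` class, machine name, and readings are hypothetical:

```python
from collections import defaultdict

class VirtualCube:
    """Aggregates high-velocity readings on arrival instead of in nightly batches."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def ingest(self, machine_id, metric, value):
        # Update running aggregates in memory; no batch rebuild required.
        key = (machine_id, metric)
        self.totals[key] += value
        self.counts[key] += 1

    def mean(self, machine_id, metric):
        key = (machine_id, metric)
        return self.totals[key] / self.counts[key]

cube = VirtualCube()
for temp in (71.2, 73.8, 75.1):  # readings arriving in real time
    cube.ingest("turbine-7", "temp_c", temp)

print(round(cube.mean("turbine-7", "temp_c"), 2))  # 73.37
```

A front-line operator querying this cube always sees the latest reading reflected, which is the point of avoiding batch aggregation in tight decision windows.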
The Industrial Internet environment therefore presents very different data challenges than the traditional, structured IT systems environment. The variety and size of the data sets, the highly distributed location of the data, and the need to support a variety of access paradigms, including real time, demand new strategies for managing and preparing data for analysis. An operational technology platform is needed to deal with this complexity.
From a data preparation perspective, a platform for Industrial Internet applications must provide a unified approach to manage machine data and its integration with traditional data sources such as enterprise asset management applications and demand planning systems.
Figure 1 shows the different types of data sets and data management environments that support the planners, front-line operators, data scientists, and business analysts who characterize operational technology use cases.
Data access for industry
The point of doing the preparation work is to make the data available for access and analysis. End users and application developers are typically not database administrators and should not have to learn myriad interfaces to get at the data they need for intelligent industrial applications. And it's not practical (given complexity, network issues, and real-time requirements) to move all of the data needed into a SQL-accessible data warehouse, as noted above. The goal is to leverage industrial data science and deliver a consistent, engaging user experience that provides visibility into operations and helps focus attention on the most impactful factors:
- Any device: Consistent, synchronized information should be available and accessible from any device, including all flavors of mobile and with access to any machine relevant for the operations process.
- Relevant in context: Information overload is a real concern. User interfaces must help the user focus on the information that is most relevant to their role and the decisions they are responsible to make, such as early indications of impending asset failure. In the future, wearable computers using technologies such as smart glass can help to support this goal in the real-time environment.
- Descriptive or prescriptive: The interface should enable an individual to monitor the performance of a machine (descriptive) and advise on the next best action (prescriptive) to correct or adjust a machine or machine-intensive process.
- Secure: Operational technology applications must be developed, deployed, customized, and extended in a secure environment either in the cloud or on premise for maximum flexibility. Security at a user access, resource, and data level should be supported.
From a data access perspective, a platform for Industrial Internet applications must provide an abstraction layer to mask the complexity of the heterogeneous data sets to simplify secure access to the right information in time to make an intelligent industrial operations decision that will make all the difference.
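One way to picture such an abstraction layer is a thin routing facade that presents a single read interface while dispatching each logical data set to whichever backend actually holds it. The class names, data set names, and returned values below are all hypothetical placeholders:

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Common interface masking each backend's native access method."""

    @abstractmethod
    def read(self, query: str) -> list: ...

class HistorianSource(DataSource):
    """Stand-in for a time-series data historian (illustrative values)."""

    def read(self, query):
        return [("2024-01-01T00:00", 71.2), ("2024-01-01T00:05", 73.8)]

class WarehouseSource(DataSource):
    """Stand-in for the SQL-accessible data warehouse."""

    def read(self, query):
        return [("turbine-7", "EMEA")]

class AccessLayer:
    """Routes a logical data set name to whichever backend holds it,
    so callers never deal with backend-specific interfaces."""

    def __init__(self):
        self.sources = {
            "sensor_readings": HistorianSource(),
            "asset_master": WarehouseSource(),
        }

    def read(self, dataset, query=""):
        return self.sources[dataset].read(query)

layer = AccessLayer()
print(len(layer.read("sensor_readings")))  # 2
```

The caller asks for "sensor_readings" or "asset_master" by name; security checks and backend-specific connection details would live behind the same facade.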
Check out the Industrial Internet infographic from IDC on optimizing operations with big data and analytics.