Is it a Data Lake that I need or a Data Warehouse?
Why not a Data Ocean!
Or a Data Platform?
And do we need the “OT version” of those, or the “IT” one?
In our relentless efforts to bridge the IT and OT worlds, it is now time to find some common ground in storing data at scale. Making data accessible is a crucial topic: it is the foundation from which all your data projects will take off. If your data is difficult to access, your digital ambitions will face immediate obstacles. Welcome to Part 1 in our Data in IT and OT series.
The IT Perspective on Data Lakes and Data Warehouses
The concepts of Data Lake and Data Warehouse originated from the IT world. Let’s first define them:
Data Lake
- A repository for storing data from different sources in its native format, including unstructured data.
- Stores raw data.
- Supports advanced analytics use cases due to the availability of raw, unprocessed data.
Data Warehouse
- A repository for storing structured data from different sources.
- Stores processed data (e.g., aggregations).
- Uses a clearly defined data schema per use case, though managing these schemas can be challenging.
- Ideal for querying with Business Intelligence tools.
- Holds processed data, which might lose some fidelity, making it less suitable for advanced data science projects.
Let’s consider an example using Sweet Harmony Treats, a cookie factory:
In Sweet Harmony Treats’ data warehouse, there’s a structured table with columns like “Batch ID”, “Date Produced”, “Cookie Type”, “Ingredients Used”, “Quantity Produced”, and “Production Line Operator.” This structured data makes it easy to query specific information, like the number of chocolate chip cookies produced last month.
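To make the warehouse example concrete, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for a real data warehouse. The table name production_batches, the snake_case column names, the helper cookies_produced, and every row of data are illustrative assumptions, not taken from an actual system.

```python
import sqlite3

# Hypothetical schema mirroring the columns named above; all rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE production_batches (
        batch_id TEXT PRIMARY KEY,
        date_produced TEXT,          -- ISO 8601 date, e.g. 2024-03-05
        cookie_type TEXT,
        ingredients_used TEXT,
        quantity_produced INTEGER,
        production_line_operator TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO production_batches VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("B-001", "2024-03-05", "Chocolate Chip", "flour, butter, chocolate", 1200, "Alex"),
        ("B-002", "2024-03-18", "Chocolate Chip", "flour, butter, chocolate", 900, "Sam"),
        ("B-003", "2024-03-20", "Oatmeal Raisin", "oats, raisins, butter", 700, "Alex"),
        ("B-004", "2024-04-02", "Chocolate Chip", "flour, butter, chocolate", 1100, "Sam"),
    ],
)


def cookies_produced(conn, cookie_type, start_date, end_date):
    """Sum the quantity produced for one cookie type in [start_date, end_date)."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(quantity_produced), 0)
        FROM production_batches
        WHERE cookie_type = ?
          AND date_produced >= ?
          AND date_produced < ?
        """,
        (cookie_type, start_date, end_date),
    ).fetchone()
    return row[0]


# "How many chocolate chip cookies did we produce in March 2024?"
march_total = cookies_produced(conn, "Chocolate Chip", "2024-03-01", "2024-04-01")
print(march_total)
```

Because the schema is fixed up front, a question like this becomes a single filtered aggregation, which is the kind of query Business Intelligence tools generate.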
In the data lake, there’s a diverse collection of raw data from sources like customer feedback, sensor readings from ovens, images of cookie types, and audio recordings of customer service calls. This raw, largely unstructured data can provide insights not easily captured in a structured data warehouse.
The OT Perspective on Data
Your control system captures and stores sensor (time-series) data from industrial processes and equipment. This includes measurements like temperature, pressure, flow, and voltage. These data points are timestamped and stored, creating a historical record of how the process evolves. Some values might be stored every second, others at irregular intervals, which makes storing them in a traditional relational database impractical at scale.
Time series data is usually stored in a Process Historian, a system designed for efficient storage and access that sits on Level 2 or 3 of the Purdue Model.
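As a rough illustration of what such timestamped data looks like, the sketch below keeps one list of (timestamp, value) pairs per tag and looks up the last known value at a given moment. The tag names, the readings, and the value_at helper are all hypothetical. Recording the belt speed only when it changes is an assumption about one way irregular intervals can arise, and real historians use far more efficient storage than a Python dictionary.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical tag names and made-up readings, stored as (timestamp, value) pairs per tag.
series = {
    "OVN1_TT_101": [  # oven temperature, sampled every second
        (datetime(2024, 3, 5, 8, 0, 0), 180.2),
        (datetime(2024, 3, 5, 8, 0, 1), 180.4),
        (datetime(2024, 3, 5, 8, 0, 2), 180.9),
    ],
    "OVN1_SC_201": [  # belt speed, recorded only when the value changes (assumption)
        (datetime(2024, 3, 5, 7, 55, 12), 1.20),
        (datetime(2024, 3, 5, 8, 0, 1), 1.25),
    ],
}


def value_at(tag, when):
    """Return the last recorded value at or before `when`, or None if there is none."""
    samples = series[tag]
    timestamps = [ts for ts, _ in samples]
    idx = bisect_right(timestamps, when)
    return samples[idx - 1][1] if idx > 0 else None


# Align both tags on a single moment in time, despite their different sampling patterns.
t = datetime(2024, 3, 5, 8, 0, 0)
snapshot = {tag: value_at(tag, t) for tag in series}
print(snapshot)
```

Aligning tags that are sampled at different rates is needed before almost any analysis, which is one reason this data fits poorly into a plain row-per-record relational table.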
Returning to Sweet Harmony Treats:
Consider the conveyor belt transporting raw dough through the oven. This process requires precise adjustments in belt speed, oven temperature, airflow, and cooling rates. The historian might store data such as:
- Temperature sensors in heating and cooling sections
- Power consumption of motors
- Gas or electricity consumption of heating elements
- Humidity levels within the oven
Comparing a Historian with Data Lakes and Warehouses:
- The historian stores raw, unstructured data.
- This sounds like a Data Lake, but it is more limited: historians typically contain only sensor data.
Towards an OT Data Platform
At this stage, our time series data remains raw and unstructured. Production context might be available to some extent in a Manufacturing Operations Management (MOM/MES) system, but it is difficult to combine with time series data. Encouraging widespread data usage, often referred to as “Citizen Data Science,” is hard when only a few people understand the technical identifiers of sensors. Typically, Asset Context is missing, and the data is often assumed to be usable when in practice it is not.
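The gap between technical identifiers and Asset Context can be sketched in a few lines of Python. Everything here is hypothetical: the tag names, the AssetContext fields, the ASSET_CONTEXT lookup table, and the describe helper are illustrative assumptions about what a context layer might contain, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetContext:
    """Human-readable context for one sensor tag (fields are illustrative)."""
    line: str
    asset: str
    measurement: str
    unit: str


# Hypothetical mapping from technical tag identifiers to asset context.
ASSET_CONTEXT = {
    "OVN1_TT_101": AssetContext("Line 1", "Oven 1", "Heating zone temperature", "°C"),
    "OVN1_SC_201": AssetContext("Line 1", "Oven 1", "Conveyor belt speed", "m/min"),
}


def describe(tag):
    """Translate a raw tag identifier into something a non-specialist can read."""
    ctx = ASSET_CONTEXT.get(tag)
    if ctx is None:
        return f"{tag} (no asset context available)"
    return f"{ctx.line} / {ctx.asset} / {ctx.measurement} [{ctx.unit}]"


print(describe("OVN1_TT_101"))
print(describe("OVN1_MT_305"))
```

Without such a mapping, a tag like OVN1_TT_101 means something only to the few people who know the naming convention. With it, the same data point can be found and interpreted by a much wider audience.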