Learning Algorithms in the Data Center

March 27, 2019

By Henrik Hasenkamp

Applications such as predictive maintenance combine Big Data with learning systems. This article explains how such a system learns and where it can be applied.

Current developments in Industry 4.0 and the Internet of Things (IoT) use data to optimize production processes. Of particular interest are the possibilities that data acquisition and analysis open up for implementing predictive maintenance.

Predictive maintenance for cloud infrastructures

How can this concept be applied to IT infrastructures in local data centers or cloud environments? Cloud environments in particular promise flexibility, agility and high availability. The underlying idea is simple: every extraordinary event (in maintenance terms, a device failure, an overload or an external influence such as a hacker attack) generates characteristic data. A ransomware attack, for example, is preceded by unusual activity in the network. If the various telemetry data from the hardware and its environment are put into the right context, such events become predictable. And if critical events are detected before they occur and appropriate measures are taken, the impact on operations can be minimized.

Normal or not?

The data is available: most hardware devices already have sensors that record numerous status and function metrics. Such telemetry data include the temperature of the device and its environment, latencies, the number of read and write accesses, log files and the like. Acquiring this data is the lesser problem; interpreting it is the real challenge. Just because the I/O rate spikes briefly does not mean a hacker attack is under way - perhaps a scheduled application test is legitimately causing the additional load. And just because device temperatures rise, failure is not necessarily imminent; it may simply be that the air conditioning in the server room is not working properly.
This means that the system must first learn what counts as "normal" operation and what does not. Defining these anomalies in advance makes little sense in practice - the possible causes and dependencies are too diverse.
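As a minimal sketch of this idea, a detector can learn a rolling baseline for a single metric and flag values that deviate strongly from it. The window size, deviation threshold and metric values below are illustrative assumptions, not details from the article:

```python
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    """Learns a rolling baseline for one metric and flags strong deviations.

    Hypothetical sketch: window size and threshold are illustrative
    assumptions, not values from the article.
    """
    def __init__(self, window=100, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if the value deviates strongly from the learned baseline."""
        if len(self.history) >= 10:  # need some history before judging
            mu = mean(self.history)
            sigma = stdev(self.history) or 1e-9  # avoid division by zero
            anomalous = abs(value - mu) / sigma > self.threshold
        else:
            anomalous = False  # still learning what "normal" looks like
        self.history.append(value)
        return anomalous

detector = BaselineDetector()
for t in range(200):
    io_rate = 50.0 + (t % 10)       # regular fluctuation: treated as normal
    assert not detector.observe(io_rate)
print(detector.observe(500.0))      # sudden I/O spike is flagged: True
```

An isolated spike is only flagged, not explained; as the article notes, whether it signals an attack or a harmless load test can only be decided in context.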

For the algorithm to learn, features must be defined: the attributes that influence the operation of the infrastructure in some way and therefore deserve attention. In reality, this leads to complexity that is hard to keep track of. Normal IT operations usually have specialists for individual software systems or IT components. What is needed, however, is a definition of normal operation for the entire IT landscape, which is influenced by every integrated system - from mail tools to production control applications.
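One hypothetical way to represent such features is as a timestamped snapshot across all monitored metrics. The metric names below are illustrative examples drawn from the telemetry mentioned above, not a fixed schema:

```python
import time

def capture_features(sensors):
    """Collect one timestamped feature vector across the monitored metrics.

    Hypothetical sketch: 'sensors' is assumed to be a dict of current
    readings; the keys are illustrative, not a real device API.
    """
    return {
        "timestamp": time.time(),
        "device_temp_c": sensors["device_temp_c"],
        "ambient_temp_c": sensors["ambient_temp_c"],
        "latency_ms": sensors["latency_ms"],
        "read_iops": sensors["read_iops"],
        "write_iops": sensors["write_iops"],
        "net_utilization": sensors["net_utilization"],
    }

sample = capture_features({
    "device_temp_c": 41.5, "ambient_temp_c": 24.0, "latency_ms": 3.2,
    "read_iops": 1200, "write_iops": 300, "net_utilization": 0.35,
})
print(sample["device_temp_c"])  # 41.5
```

Each additional feature widens the view of the landscape but also increases the complexity the learning system has to cope with.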

For example, the system captures available metrics such as network utilization and latency. Because the ERP system transfers data to the production system only at certain times, the transfer volume stays rather low throughout the day and rises suddenly in the late evening. In this case, the increase is normal behavior and is marked as a positive event for the system. Ideally, a value corridor is defined that must not be exceeded: the rise in traffic is normal, but it must not lead to an overload.
The system now stores not only the value of the transfer volume as an event, but also all other metrics measured at that moment. The algorithm thus learns which data, in which context, signals something that matters to the operator. The more features are defined and the more events form the basis for interpretation, the more accurate the algorithm's predictions become.
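The corridor idea and the context-aware event storage described above can be sketched as follows. The corridor bounds, metric names and context values are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Corridor:
    """A value corridor that a metric must stay within."""
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high

# Hypothetical corridor for network throughput; bounds are illustrative.
corridors = {"network_mbps": Corridor(low=0.0, high=800.0)}

def classify_event(metric, value, context):
    """Label a measurement and store the full metric context with it."""
    ok = corridors[metric].contains(value)
    return {
        "metric": metric,
        "value": value,
        "label": "normal" if ok else "overload",
        "context": dict(context),  # all other metrics at this moment
    }

# Evening ERP transfer: traffic rises but stays inside the corridor.
evening = classify_event("network_mbps", 650.0,
                         {"latency_ms": 12.0, "cpu_load": 0.4})
print(evening["label"])   # normal

# The same transfer pushing past the corridor would be flagged.
spike = classify_event("network_mbps", 950.0,
                       {"latency_ms": 48.0, "cpu_load": 0.9})
print(spike["label"])     # overload
```

Storing the surrounding context with each labeled event is what lets a learning system later distinguish, say, a scheduled ERP transfer from a genuine overload with the same raw throughput.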

The original article in German can be found here.
