| Author: Henrik Hasenkamp
IT operations are usually kept running through reactive maintenance. A learning system that analyzes operational data could predict, and thus avoid, many maintenance incidents.
In industry, predictive maintenance has already grown into a new paradigm. It promises to reduce maintenance costs and optimize maintenance cycles, and it is based on the collection and analysis of operating data. A machine with mechanical components, for example, generates vibrations, noises, resistances, energy load profiles and many other data points that allow conclusions to be drawn about its condition. This makes it easier to determine the optimum moment for maintenance - ideally just before a component fails.
In IT data center maintenance, this data-driven, predictive approach has hardly been used so far. Data-based, predictive operation is many times more complex, and often more expensive, than purely reactive operation: data must be collected, stored, processed and analyzed preventively, and the IT specialists who can do this are scarce and expensive. Nevertheless, initial approaches with self-learning algorithms are very promising and worthwhile in the long term.
Many incidents indicate themselves in advance
A lot can be read from the data that the infrastructure components of a data center provide during operation. The moment an online shop becomes unreachable, for example, its operator suffers the damage. Very likely, however, there were clear technical signs before the outage. Had they been detected in time, the downtime would have been avoidable.
Most incidents are preceded by noticeable anomalies in data center operations. Before the website crash, an unusual traffic peak may already have been visible, the CPU may have been running at its limit, and database accesses may have risen sharply. Hackers likewise leave traces before an attack, such as a high number of login attempts or other unusual activity in the network.
Obtaining such data is relatively easy. Most hardware devices in a typical IT infrastructure come with the necessary sensors, so numerous status and functional values can be recorded: device temperatures, latencies, the number of read and write accesses, log files and the like. The much harder question is how to put the data into the right context. An increased access rate could be caused by a hacker just as well as by a run on seasonal goods triggered by an advertising campaign.
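Such status data can be gathered with very simple means. The following minimal sketch shows one way to snapshot a few of the values mentioned above into a uniform record; the schema, the field names and the externally supplied traffic figures are hypothetical, and only the disk usage is read live from the system.

```python
import shutil
import time
from dataclasses import dataclass

@dataclass
class MetricSample:
    """One snapshot of infrastructure status data (hypothetical schema)."""
    timestamp: float
    disk_used_pct: float
    request_rate: float   # requests/s, would come from web server logs
    db_queries: int       # database accesses in the sampling window

def collect_sample(request_rate: float, db_queries: int) -> MetricSample:
    # Disk usage is read live; the traffic figures are passed in here,
    # standing in for values parsed from logs or monitoring agents.
    usage = shutil.disk_usage("/")
    return MetricSample(
        timestamp=time.time(),
        disk_used_pct=100 * usage.used / usage.total,
        request_rate=request_rate,
        db_queries=db_queries,
    )

sample = collect_sample(request_rate=120.0, db_queries=450)
print(sample)
```

Collecting the numbers is the easy part, exactly as the text says; a record like this only becomes useful once later stages attach context to it.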
A system becomes intelligent when it learns
The system must first learn what counts as an anomaly in a neutral sense. Imposing a fixed set of predefined situations on the algorithm is of little use, because it is hardly possible to specify in advance which change in which value carries which meaning: several interdependent measurements are always involved.
In concrete terms, the algorithm must be trained. To this end, a manageable number of features is defined, i.e. values and their possible states that matter for operation (in the example above, for the operation of the website). The more features are observed, the more accurate the analysis becomes - but the more complex the system becomes as well. During operation, every event that is special in any way is then flagged for the algorithm: desired seasonal load peaks, for example, or unwanted performance bottlenecks. Over time, the system learns to interpret situations and provides the basis for an intelligent warning system.
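One very simple way to learn "normal" from observed feature values is a per-feature statistical baseline. The sketch below, with invented training numbers and feature names, fits a mean and standard deviation per feature from samples labeled as normal operation, then scores a new sample by its largest z-score; a real system would use a proper multivariate model that also captures the dependencies between features.

```python
from statistics import mean, stdev

# Hypothetical training data: feature vectors recorded during normal
# operation (labels would be assigned by operators, as described above)
normal = [
    {"cpu": 0.35, "req_rate": 100, "db_access": 40},
    {"cpu": 0.40, "req_rate": 120, "db_access": 45},
    {"cpu": 0.30, "req_rate":  90, "db_access": 38},
    {"cpu": 0.38, "req_rate": 110, "db_access": 42},
]

def fit_baseline(samples):
    """Learn a per-feature (mean, stdev) baseline from normal samples."""
    keys = samples[0].keys()
    return {k: (mean(s[k] for s in samples),
                stdev(s[k] for s in samples)) for k in keys}

def anomaly_score(baseline, sample):
    """Largest absolute z-score across features: distance from normal."""
    return max(abs(sample[k] - m) / s for k, (m, s) in baseline.items())

baseline = fit_baseline(normal)
print(anomaly_score(baseline, {"cpu": 0.95, "req_rate": 400, "db_access": 200}))
```

A sample that looks like the training data scores near zero; a crash precursor like the one printed above scores far higher, which is the signal the warning system builds on.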
Predictive maintenance in data centers
The most important top level of such a warning system remains the immediate, reactive alarm: a value deviates so strongly that action must be taken at once. If, for example, the data stream from a hard disk breaks off abruptly, the disk may have failed. The second stage relies on the artificial intelligence of the learning algorithm. Based on the defined features, the development of their values and the learned correlations, the system can now work with foresight: if the recorded values indicate an unwanted anomaly under the defined conditions, the IT administrator is informed. The advantage: they can intervene and avert the incident, or plan the upcoming maintenance at reasonable cost.
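The two stages can be sketched as a single check: hard operational limits trigger an immediate alarm, and only if none fire does the learned baseline look for an unusual combination of values. All limits, baseline numbers and feature names below are invented for illustration.

```python
# Hypothetical learned baseline: feature -> (mean, stdev) from training
baseline = {"cpu": (0.35, 0.05), "req_rate": (100.0, 15.0)}
# Reactive stage: hard limits where immediate action is required
hard_limits = {"cpu": 0.99}

def alert_level(sample, hard_limits, baseline, z_warn=3.0):
    """Two-stage warning: reactive ALARM on a hard limit, otherwise a
    predictive WARN when the learned anomaly score is exceeded."""
    # Stage 1: reactive - a single value crosses a hard limit
    for key, limit in hard_limits.items():
        if sample[key] >= limit:
            return "ALARM"
    # Stage 2: predictive - distance from the learned normal state
    score = max(abs(sample[k] - m) / s for k, (m, s) in baseline.items())
    return "WARN" if score > z_warn else "OK"

print(alert_level({"cpu": 0.55, "req_rate": 180.0}, hard_limits, baseline))
```

The example sample is nowhere near the hard CPU limit, but it is several standard deviations from the learned baseline, so the administrator gets a predictive warning with time to react.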
In a third step, such an intelligent system can be developed into an infrastructure optimization system. Resources can, for example, be scaled successively, making genuine live scaling possible without user intervention. Automated infrastructure adaptations are also conceivable: if a device is permanently running at maximum load, another connected device can relieve it - before performance suffers. The algorithm could decide for itself which measure is the most practicable, the most cost-effective, or simply the most urgent.
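A minimal version of such an automated adaptation is a policy over recent load readings: scale out on sustained high load (before performance suffers), scale in when capacity is clearly idle. The thresholds, window length and action names below are assumptions for illustration, not a production policy.

```python
def scaling_action(load_history, high=0.85, low=0.2, sustained=5):
    """Hypothetical policy: decide an infrastructure adaptation from
    the most recent load readings (fractions of maximum load)."""
    recent = load_history[-sustained:]
    if len(recent) == sustained and all(l > high for l in recent):
        return "scale_out"   # device permanently near maximum: relieve it
    if len(recent) == sustained and all(l < low for l in recent):
        return "scale_in"    # sustained idle capacity: release resources
    return "hold"            # no clear trend: change nothing

print(scaling_action([0.9, 0.92, 0.88, 0.91, 0.95]))
```

Requiring the whole window to exceed the threshold keeps a single seasonal spike from triggering an adaptation, which matches the distinction the learning stage draws between desired peaks and real bottlenecks.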
Who needs that?
The size of the potential damage determines how much effort goes into securing IT operations. Underpinning the website of an agency or a medium-sized industrial company with an AI system is certainly oversized. But if, for example, the damage from a ransomware attack can be prevented because the unusually high read and write rate stood out, the AI effort can quickly pay off.
The original article in German can be found here.