When the cloud decides for itself

Published on 23.11.2018

AI for the Cloud

Author / Editor: Henrik Hasenkamp / Sebastian Human

For many companies, the cloud is already an essential part of daily work. The question therefore arises: how can the quality of service in the cloud be improved? Collecting and evaluating infrastructure telemetry data with self-learning algorithms could be one answer.

Digital transformation brings one thing above all else: lots of data. Current developments in Industry 4.0 and the Internet of Things already demonstrate what is possible with that data. Sensors on machines and equipment record temperatures, operating times, wear and other condition data and send them to an analysis system. Once systematically correlated, this data allows entire production environments to be controlled and optimized. And that is not all: if algorithms can be developed that make well-founded decisions based on previously learned correlations, countless application scenarios become conceivable.

Predictive Maintenance for Cloud Infrastructures

How can this concept be applied to IT infrastructures in local data centers or even cloud environments? Cloud environments in particular aim to offer flexibility, agility and high availability, and a predictive maintenance approach could automate and simplify their management. The idea behind it is simple: every extraordinary event - in the maintenance context this means device failures, overloads or external influences such as hacker attacks - generates characteristic data. A ransomware attack, for example, is preceded by unusual activity in the network. If the various telemetry data of the hardware and its environment are put into the right context, such events become predictable. The cloud provider can then offer a significantly higher quality of service thanks to a data-based, intelligent system: if critical events are detected before they occur and appropriate measures are taken, the impact on operations is minimal or even non-existent.

First the system has to learn a lot

The data is already there: most hardware devices have sensors that record numerous status and function values. Such telemetry data includes the temperature of the device and its environment, latencies, the number of read and write accesses, log files and the like. Acquiring the data is the minor problem; interpreting it is the real challenge. Just because the I/O rate spikes briefly does not mean a hacker attack is underway - perhaps a regular application test is legitimately causing the additional load. And just because device temperatures rise, failure is not necessarily imminent; it may simply be that the air conditioning in the server room is not working properly.
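To make this concrete, the sketch below shows what a single telemetry snapshot might look like once collected. It is a minimal illustration in Python; all field names, values and the device identifier are hypothetical and would depend on the actual hardware and monitoring stack.

```python
from dataclasses import dataclass, asdict
import time

@dataclass
class TelemetrySnapshot:
    """One point-in-time reading from a single device (hypothetical fields)."""
    timestamp: float        # Unix epoch seconds
    device_id: str          # identifier of the monitored device
    device_temp_c: float    # device temperature in degrees Celsius
    ambient_temp_c: float   # temperature of the surrounding environment
    latency_ms: float       # average request latency
    reads_per_s: float      # read accesses per second
    writes_per_s: float     # write accesses per second
    log_error_count: int    # errors seen in the log files since the last snapshot

def collect_snapshot(device_id: str) -> TelemetrySnapshot:
    """Placeholder for the actual sensor/agent query, which is vendor-specific."""
    return TelemetrySnapshot(
        timestamp=time.time(),
        device_id=device_id,
        device_temp_c=41.5,
        ambient_temp_c=24.0,
        latency_ms=3.2,
        reads_per_s=1500.0,
        writes_per_s=480.0,
        log_error_count=0,
    )

snapshot = collect_snapshot("disk-07")
print(asdict(snapshot))  # feature vector handed on for interpretation
```

Acquisition like this is straightforward; everything that follows in the article is about interpreting such vectors in context.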

This means the system must first learn what counts as "normal" operation and what does not. Defining these anomalies in advance makes little sense in practice - the possibilities and dependencies are too diverse. For the algorithm to learn, features must be defined: the attributes that influence the operation of the infrastructure in some way and on which the analysis should focus. If a problem occurs during operation, that moment is marked as a significant event. The algorithm then learns which data, in which context, signals something that matters to the operator. The more features are defined and the more events form the basis for interpretation, the more accurate the algorithm's predictions become.
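As a rough illustration of that training loop, the following sketch labels historical snapshots with whether an operator-marked event followed them and fits a standard classifier. It assumes scikit-learn and NumPy are available; the features, the labeling rule and all numbers are synthetic stand-ins, not real operational data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic history: columns = [device_temp_c, latency_ms, writes_per_s]
n = 5000
X = np.column_stack([
    rng.normal(40, 3, n),      # device temperature
    rng.normal(3, 0.5, n),     # latency
    rng.normal(500, 80, n),    # write rate
])

# Label each snapshot 1 if an operator-marked event followed it.
# Synthetic rule for this demo: hot device AND elevated write rate precede failure.
y = ((X[:, 0] > 45) & (X[:, 2] > 600)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
# In operation, model.predict_proba(new_snapshot) would yield an event risk score.
```

The more features and marked events the history contains, the better such a model can separate harmless fluctuations from genuine precursors.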

What an AI algorithm accomplishes in practice

The goal is for an algorithm trained as extensively as possible to make intelligent predictions and thus optimize infrastructure operation. It should, for example, predict the failure of a hard disk in good time, identify a hacker attack before major damage is done, or scale up additional resources in time. To cover all this and increase service quality, the cloud provider needs a multi-stage system that not only warns in an acute emergency but also provides data-based forecasts.

The top level of such a hierarchical model covers extreme situations, much like conventional monitoring: does a value in the collected data stand out in a way that requires immediate intervention? If, for example, the data stream to or from a database has stopped, there is sufficient reason to assume a problem. Countermeasures should be initiated immediately and as automatically as possible.
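In code, this top level is little more than hard rules over the most recent readings. The sketch below is one possible shape for it; the thresholds and the commented-out countermeasure hook are placeholders, not values from the article.

```python
import time

HEARTBEAT_TIMEOUT_S = 60   # no data for this long => assume the stream died
TEMP_CRITICAL_C = 70.0     # hard ceiling for device temperature

def check_extremes(last_seen: float, device_temp_c: float) -> list[str]:
    """Return the conditions that require immediate, automated action."""
    alerts = []
    if time.time() - last_seen > HEARTBEAT_TIMEOUT_S:
        alerts.append("data stream terminated")
    if device_temp_c >= TEMP_CRITICAL_C:
        alerts.append("critical temperature")
    return alerts

for alert in check_extremes(last_seen=time.time() - 120, device_temp_c=45.0):
    # trigger_countermeasure(alert)  # e.g., fail over or page on-call (placeholder)
    print("IMMEDIATE ACTION:", alert)
```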

The core of an intelligent system is the second hierarchy level. Based on the previously defined features and value ranges, as well as the learned relationships between the data, a system emerges that works with foresight: devices are maintained or exchanged shortly before they fail, in a time window that suits operations. This predictive maintenance approach holds a great deal of optimization potential, precisely because the dependencies between components and their mutual influences are taken into account.
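A minimal sketch of this second level might combine a model-derived risk score, such as the one from the classifier above, with a simple planning rule. The risk threshold, lead time and the fixed nightly slot below are invented for illustration; a real scheduler would consult actual load profiles.

```python
from datetime import datetime, timedelta

RISK_THRESHOLD = 0.7                # predicted failure probability that triggers action
PLANNING_LEAD = timedelta(days=3)   # how far ahead maintenance is scheduled

def plan_maintenance(device_id: str, failure_risk: float) -> str:
    """Turn a model-derived risk score into a maintenance decision."""
    if failure_risk < RISK_THRESHOLD:
        return f"{device_id}: keep monitoring (risk {failure_risk:.0%})"
    window = datetime.now() + PLANNING_LEAD
    # Pick a low-load slot before the predicted failure, e.g. a nightly window.
    slot = window.replace(hour=2, minute=0, second=0, microsecond=0)
    return f"{device_id}: schedule swap at {slot:%Y-%m-%d %H:%M} (risk {failure_risk:.0%})"

print(plan_maintenance("disk-07", failure_risk=0.85))
print(plan_maintenance("disk-12", failure_risk=0.10))
```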

In practical use, a third stage emerges. Building on the optimized operation of the cloud infrastructure, providers can set up proactive services. For example, additional resources could be scaled automatically exactly when they are needed, not only once a bottleneck has already arisen. The algorithm can calculate which migration suits which workload, taking risks and effort into account. Or the provider can advise on infrastructure sizing when the telemetry data shows that the database and memory are permanently operating at their performance limits.
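The third stage could, for instance, extrapolate recent load and scale before the predicted bottleneck. The sketch below uses a simple linear trend as a stand-in for a real forecasting model; the capacity figure, horizon and headroom factor are assumptions for the demo.

```python
import numpy as np

CAPACITY = 100.0      # currently provisioned capacity (hypothetical unit)
HORIZON_STEPS = 12    # how far ahead to look (e.g., 12 five-minute intervals)
HEADROOM = 0.8        # scale up once the forecast exceeds 80% of capacity

def forecast_peak(history: np.ndarray, horizon: int) -> float:
    """Extrapolate a linear trend over the recent load history."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, deg=1)
    future = intercept + slope * np.arange(len(history), len(history) + horizon)
    return float(future.max())

load = np.array([55, 58, 60, 63, 67, 70, 72, 76, 79, 83])  # steadily rising load
peak = forecast_peak(load, HORIZON_STEPS)
if peak > HEADROOM * CAPACITY:
    print(f"forecast peak {peak:.0f} > {HEADROOM:.0%} of capacity: scale up now")
else:
    print(f"forecast peak {peak:.0f}: no action needed")
```

Scaling on the forecast rather than on the current reading is what turns a reactive service into a proactive one.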

AI can support cloud services

Artificial intelligence may still be in its infancy. But even now, analyzing and interpreting data with a learning algorithm opens up possibilities that go far beyond conventional monitoring. Predictive maintenance, for example, is already gaining importance in industry because it demonstrably saves costs and minimizes downtime. Cloud computing thrives on the promise of flexible, cost-transparent resources and maintenance. Self-learning algorithms refine and optimize this concept.

The original German article can be found here.
