16.05.2019 | by Henrik Hasenkamp
How available is "highly available", and what does it cost? Two familiar questions to which artificial intelligence now offers new answers.
How high the availability of an IT infrastructure must be depends on the specific application. Some availability percentages refer only to a single server or to individual services. In practice, however, services are linked, so what matters is the high availability of the entire IT infrastructure, which in turn increases complexity. Every increase in availability, even in the decimal places, has a significant financial impact, both on a company's own operations and on the commissioning of infrastructure service providers. The question quickly arises as to how much additional expenditure an increase in availability is worth.
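What an improvement "in the decimal places" means becomes concrete when availability percentages are converted into permitted downtime per year, as in this short calculation:

```python
def downtime_per_year(availability_percent: float) -> float:
    """Return the maximum permitted downtime in hours per year
    for a given availability percentage."""
    hours_per_year = 365 * 24
    return hours_per_year * (1 - availability_percent / 100)

for a in (99.0, 99.9, 99.99, 99.999):
    print(f"{a}% availability -> {downtime_per_year(a):.2f} h downtime/year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten (from roughly 87.6 hours at 99 percent to under an hour at 99.99 percent), which is why each one is disproportionately more expensive to achieve.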
Achieving this economically is not possible without a certain degree of automation in the data center. A typical example is the outsourcing of IT operations to cloud service providers, who relieve companies of routine administrative tasks. Companies simply buy physical security, patch management, geo-redundancy and round-the-clock support. This is often more economical than running a data center themselves, especially when the requirements for availability, performance and security are high.
In addition, data center management systems should provide more automation. What is well thought out in theory often proves difficult to implement in practice: physical, virtual and cloud infrastructures must be managed centrally with a single system, and few tools offer this. And even where that works, routine processes from monitoring to change management still take up too much time.
AI in the data center
Artificial intelligence (AI) and machine learning (ML) can push the level of automation further toward a self-managing data center. The goal is to go beyond pure component management, optimizing the system and letting it act automatically. The much-discussed concept of predictive maintenance points in this direction but has not yet become established in everyday data center operation. Although components are routinely replaced before they fail, the replacement time is usually calculated from accumulated operating hours and empirical values. AI-based predictive maintenance could determine the maintenance time more accurately and cost-effectively, with the data center operator including further data in the calculation. At the same time, technical inefficiencies such as capacity bottlenecks, high electricity costs or other insufficiently optimized performance parameters can be detected and eliminated.
AI can take far more dimensions into account than a human ever could. From massive amounts of raw data, the system learns not only the obvious correlations but also those that no one has yet considered. This makes the analysis much more complex: the time for a hard disk replacement is no longer calculated solely from its operating hours and the I/O operations performed. Much more data is added, from the company's own data center and from other data centers, and correlated by the AI. Deliberately, the algorithm is given no specification of what it should recognize, such as the technically appropriate maintenance window for a hard disk. Instead, it is instructed to ensure economically optimal operation of the infrastructure, and this instruction can be refined, for example by defining exactly what counts as economically optimal. After a training phase in which the system works with historical data whose real-world effects are already known, it learns to deal with new data whose consequences have not yet occurred. From that moment on, the algorithm can work with foresight.
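The training idea described above can be sketched in miniature. The example below is not gridscale's system; it is a toy logistic-regression model trained on invented disk telemetry (operating hours, temperature, I/O load) with labels indicating whether a disk failed soon afterwards. The point is only the principle: the model is never told which feature matters, it learns the correlations from labeled historical data.

```python
import math
import random

# Hypothetical training data: (operating hours, avg. temperature in C,
# I/O load 0..1) per disk, labeled 1 if it failed within 30 days.
# The risk formula is an invented stand-in for real-world failure behavior.
random.seed(0)

def make_disk():
    hours = random.uniform(0, 50000)
    temp = random.uniform(25, 55)
    io = random.uniform(0, 1)
    risk = 1 / (1 + math.exp(-(hours / 10000 + temp / 10 + 2 * io - 9)))
    # Features are scaled to 0..1 so one learning rate fits all of them.
    return (hours / 50000, temp / 55, io), 1 if random.random() < risk else 0

data = [make_disk() for _ in range(2000)]

# Logistic regression trained by plain batch gradient descent.
w, b = [0.0, 0.0, 0.0], 0.0
lr = 0.5
for _ in range(300):
    gw, gb = [0.0, 0.0, 0.0], 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
        err = p - y
        for i in range(3):
            gw[i] += err * x[i]
        gb += err
    for i in range(3):
        w[i] -= lr * gw[i] / len(data)
    b -= lr * gb / len(data)

def failure_risk(hours, temp, io):
    """Predicted probability that a disk with this telemetry fails soon."""
    x = (hours / 50000, temp / 55, io)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# A heavily used, hot disk should score higher than a lightly used, cool one.
print(failure_risk(45000, 50, 0.9), failure_risk(5000, 30, 0.1))
```

A production system would correlate far more signals and far more data sources, but the structure is the same: historical data with known outcomes in, a foresighted risk score out.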
From anomaly detection to capacity management
The Cologne-based platform-as-a-service (PaaS) provider gridscale, for example, uses intelligent algorithms to bring availability as close as possible to 100 percent. Extensive telemetry data makes it possible to identify unwanted events at an early stage; the data center reacts automatically and carries out predefined actions. These range from simply alerting an administrator, through restarting a workload or adding resources, to migrating entire workloads to another server.
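A minimal sketch of this "detect early, react with predefined actions" pattern, using a simple z-score anomaly check on a telemetry metric. The escalation ladder mirrors the actions named above, but the thresholds, metric and action names are invented for illustration:

```python
import statistics

# Hypothetical escalation ladder: the further a reading deviates from the
# learned baseline (in standard deviations), the stronger the response.
ACTIONS = [
    (2.0, "alert_admin"),
    (3.0, "restart_workload"),
    (4.0, "add_resources"),
    (5.0, "migrate_workload"),
]

def choose_action(history, value):
    """Return the predefined action for a telemetry reading, or None."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    z = abs(value - mean) / stdev
    action = None
    for threshold, name in ACTIONS:
        if z >= threshold:
            action = name  # keep escalating while thresholds are exceeded
    return action

baseline = [50 + (i % 7) for i in range(100)]  # stable CPU-load history
print(choose_action(baseline, 54))   # within normal variation
print(choose_action(baseline, 80))   # extreme outlier, strongest response
```

An ML-based detector would replace the fixed z-score with a learned model of normal behavior, but the automation loop (observe, score, act) stays the same.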
Such dynamic, largely automated capacity management delivers on the cloud's promise of providing resources flexibly on demand. Thanks to AI-based predictions, a data center operator can initiate auto-scaling and real-time provisioning so that resources are available at the right moment rather than having to be requested first. Workloads can also be reallocated during operation, so maintenance work or even the failure of entire nodes no longer affects availability.
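The difference between reactive and predictive scaling is the forecast step: capacity is sized for the load that is expected next, not the load measured now. A deliberately simple sketch, using linear extrapolation as the stand-in for the prediction model; the capacity per node and the headroom factor are invented figures:

```python
import math

REQUESTS_PER_NODE = 1000  # assumed capacity of one node (invented)
HEADROOM = 1.2            # keep 20 percent spare capacity

def forecast_next(load_history):
    """Linear extrapolation: last value plus the average recent slope."""
    deltas = [b - a for a, b in zip(load_history, load_history[1:])]
    avg_slope = sum(deltas) / len(deltas)
    return load_history[-1] + avg_slope

def nodes_needed(load_history):
    """Provision for the *predicted* load, with headroom, never below 1."""
    predicted = forecast_next(load_history)
    return max(1, math.ceil(predicted * HEADROOM / REQUESTS_PER_NODE))

history = [3000, 3400, 3900, 4300, 4800]  # steadily rising request rate
print(nodes_needed(history))
```

A reactive scaler sizing for the current 4,800 requests would provision less than this predictive sketch, which already accounts for the rising trend; an AI-based predictor simply replaces the linear extrapolation with a model trained on much richer data.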
Another important step towards profitability is making it as easy as possible to connect additional resources. If specialist departments or medium-sized companies can procure high-quality IT resources without in-depth expert knowledge, that is a major gain. However, simple user interfaces are often at odds with the complexity of the infrastructure being set up.
Here, too, the AI algorithms described help resolve this dilemma. By taking both sides into account, the user's previous experience and goals on the one hand and the complexity and automation possibilities of the resulting data center on the other, the system can calculate the optimal user interface for each case. gridscale, for example, dynamically adapts its frontend to the user, who only sees the functions needed at that moment. The interface is oriented towards what the user wants rather than what is technically possible, so the user does not have to deal with the full complexity of the data center. The goal is to have IT resources made available for a specific workload; how many VMs, how much RAM and which VPN connection are suitable for this can remain largely irrelevant to the user. The service is based on predefined preferences regarding costs and availability.
AI makes more sustainable decisions
The idea behind an AI-optimized data center is to base decisions on more data. An algorithm that considers hundreds of factors and learns from past events makes better decisions. If these decisions are implemented automatically, they can also be readjusted at practically any time. This leads to high efficiency not only in the processes themselves but also in related parameters such as electricity or cloud resource costs.
The original article in German can be found here.