A masterstroke in the data center

Published on 08.08.2018

High availability in the data center is above all a question of money and resources. Specialized cloud service providers make more flexible implementations possible, but it often remains a major undertaking. Intelligent algorithms that optimize data center operation and maintenance, however, can increase efficiency.


In the data centers of medium-sized companies, IT operations are usually kept running quickly but reactively. Support contracts with defined response times of a few hours, plus spare components kept in stock, are meant to keep the impact on the IT infrastructure and its users as low as possible in the event of damage. The data centers of large IT infrastructure providers operate in much the same way, even if more powerful technology and redundant design ensure higher overall availability. Nevertheless, repairing damage remains reactive: if a component fails, it is replaced; if there has been a ransomware attack, security processes are in place to prevent further damage. When trying to increase the availability of IT resources in this way, a simple rule of thumb applies: the shorter the time to repair, or the more highly available the IT is designed to be, the more expensive it gets. High availability costs money.

Detect anomalies

There are reasons enough to make the IT infrastructure as highly available as possible: while a failed in-house service may only annoy users, downtime of a production system or an unavailable e-commerce website can cause considerable financial damage and lasting damage to a company's image. But how can high availability be implemented more economically and effectively?

What is already widely discussed in the context of Industry 4.0 and the Internet of Things (IoT), namely optimizing operations on the basis of data, can also be applied to IT infrastructure. The cloud infrastructure provider gridscale, for example, uses self-learning algorithms: data acquired by sensors, such as ambient temperatures, voltage levels or latencies, is interpreted in real time. Faults and bottlenecks can thus be detected before they cause an outage.
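As a rough illustration of the idea, and not gridscale's actual system, the following Python sketch trains an unsupervised model on historical sensor readings (temperature, voltage, latency) and flags new readings that deviate from the learned normal state; all metric names and values are invented for the example.

    # Minimal sketch: flag unusual combinations of sensor readings with an
    # unsupervised model. Metric names and values are illustrative only.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Historical "normal" readings: ambient temperature (degrees C), supply
    # voltage (V) and request latency (ms), one row per minute.
    rng = np.random.default_rng(42)
    history = np.column_stack([
        rng.normal(24.0, 0.5, 10_000),   # temperature
        rng.normal(230.0, 1.0, 10_000),  # voltage
        rng.normal(12.0, 2.0, 10_000),   # latency
    ])

    model = IsolationForest(contamination=0.001, random_state=0)
    model.fit(history)

    # New readings arriving in (near) real time.
    current = np.array([[24.1, 229.8, 11.5],   # looks normal
                        [31.0, 224.0, 48.0]])  # overheating, slow responses
    for reading, label in zip(current, model.predict(current)):
        if label == -1:  # -1 = anomaly, 1 = normal
            print("anomaly detected:", reading)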

Most faults are preceded by discernible anomalies. An application whose loading time users suddenly perceive as very slow usually shows deviations from its normal behavior beforehand. A ransomware attack always goes hand in hand with a measurably unusual spike in the read/write rate. An employee who copies internal data shortly before leaving the company, or an external attacker who attempts to gain unauthorized access to the company's IT systems, can likewise be identified as a deviation from normal operations.
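One of the anomaly classes mentioned above, an unusually high write rate such as ransomware produces while encrypting data, can already be caught with a very simple baseline comparison. The window size and threshold in the following sketch are illustrative assumptions; a production system would combine far more signals.

    # Minimal sketch: alert when the disk write rate deviates sharply from
    # recent behaviour. Window and threshold are illustrative assumptions.
    from collections import deque
    import statistics

    WINDOW = 60        # minutes of history used as the baseline
    THRESHOLD = 4.0    # alert if more than 4 standard deviations above the mean

    history = deque(maxlen=WINDOW)

    def check_write_rate(mb_per_s: float) -> bool:
        """Return True if the current write rate looks anomalous."""
        alert = False
        if len(history) >= 10:  # need some baseline first
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9
            alert = (mb_per_s - mean) / stdev > THRESHOLD
        history.append(mb_per_s)
        return alert

    # Normal background writes, then a burst that should trigger an alert.
    for rate in [20, 22, 19, 21, 20, 23, 18, 22, 21, 20, 19, 350]:
        if check_write_rate(rate):
            print(f"possible ransomware activity: {rate} MB/s written")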

Collecting the right data is one task. The greater challenge, however, is to establish links between different data sources and interpret them correctly using algorithms, because what exactly constitutes a deviation from normal operation depends strongly on the individual nature and the concrete load profile of the IT infrastructure. The algorithms must learn over a longer period which values are acceptable at which time and when they indicate a problem. Moreover, such predictive maintenance can only be implemented within a reasonable cost framework if most of the process is automated. This requires specialized know-how, not only in networks, communication protocols and IT systems, but also in artificial intelligence (AI) algorithms.
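The point that normal values depend on the time of day can be made concrete with a small example: the sketch below learns a separate baseline per hour from historical CPU utilization and only flags values that fall outside that hour's usual range. The data, column names and threshold factor are invented for illustration.

    # Sketch: learn what "normal" means per hour of the day instead of
    # using one global threshold. Data and thresholds are illustrative.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # A month of per-minute CPU utilisation; business hours are busier.
    idx = pd.date_range("2018-07-01", "2018-07-31", freq="min")
    busy = ((idx.hour >= 8) & (idx.hour < 18)).astype(float)
    train = pd.DataFrame({"cpu": rng.normal(20 + 50 * busy, 5)}, index=idx)

    # Learn per-hour baselines (mean and standard deviation).
    baseline = train.groupby(train.index.hour)["cpu"].agg(["mean", "std"])

    def is_anomalous(timestamp: pd.Timestamp, value: float, k: float = 4.0) -> bool:
        row = baseline.loc[timestamp.hour]
        return abs(value - row["mean"]) > k * row["std"]

    # 70 % CPU is normal at 10:00 but a clear deviation at 03:00.
    print(is_anomalous(pd.Timestamp("2018-08-01 10:00"), 70.0))  # False
    print(is_anomalous(pd.Timestamp("2018-08-01 03:00"), 70.0))  # True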

Automated notifications

Various hardware manufacturers also equip their devices with smart functions for on-premises IT operations. For example, it is common for SSDs to warn automatically when their memory cells have reached the limit of read and write cycles. In cloud environments, however, data-driven high availability offers additional advantages: operators can, for example, establish dynamic capacity management. It automatically shuts down infrastructure resources when no load is expected and brings them back up in time when a dynamic workload is imminent. This reduces costs, which in most cloud environments are charged by usage.
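Dynamic capacity management of this kind boils down to forecasting demand and scaling ahead of it. The following sketch shows the principle with a deliberately crude forecast; expected_load() and set_instance_count() are placeholders for a real demand model and a real provider API, not part of any specific cloud platform.

    # Sketch of dynamic capacity management: shut unused resources down and
    # bring them back before the expected load returns. The forecast and the
    # scaling call are placeholders, not a real provider API.
    from datetime import datetime

    MIN_INSTANCES = 2
    MAX_INSTANCES = 10

    def expected_load(hour: int) -> float:
        """Very rough demand forecast (requests/s) learned from history."""
        return 400.0 if 8 <= hour < 20 else 30.0

    def target_instances(load: float, capacity_per_instance: float = 50.0) -> int:
        wanted = int(load / capacity_per_instance) + 1
        return max(MIN_INSTANCES, min(MAX_INSTANCES, wanted))

    def set_instance_count(n: int) -> None:
        print(f"scaling to {n} instances")  # placeholder for a provider API call

    # Run e.g. every 15 minutes from a scheduler; scale ahead of the next hour.
    now = datetime.now()
    set_instance_count(target_instances(expected_load((now.hour + 1) % 24)))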

Such algorithms also support the IT administrator. One example: a handful of metrics such as CPU, RAM and network utilization of the virtual database servers is enough to determine the current state of a database. Is it able to handle requests quickly, or is it approaching its performance limit? An automated notification to the IT administrator, or even an autonomous intervention by an algorithm in the infrastructure, can then increase the capacity of the database in good time.
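In code, the database example could look roughly like this: a coarse state is derived from a few utilization metrics, and depending on the result either the administrator is notified or a scale-up is triggered. The thresholds and the notify()/add_replica() helpers are illustrative placeholders, not a specific product's interface.

    # Sketch: derive a coarse database state from a few metrics and either
    # notify the administrator or trigger a scale-up. All thresholds and
    # helper functions are illustrative placeholders.
    from dataclasses import dataclass

    @dataclass
    class DbMetrics:
        cpu_percent: float
        ram_percent: float
        net_mbit_s: float

    def assess(m: DbMetrics) -> str:
        if m.cpu_percent > 90 or m.ram_percent > 95:
            return "critical"      # at its performance limit
        if m.cpu_percent > 70 or m.ram_percent > 80 or m.net_mbit_s > 800:
            return "warning"       # approaching the limit
        return "ok"                # handling requests comfortably

    def notify(message: str) -> None:
        print("NOTIFY:", message)             # e.g. mail/chat alert to the admin

    def add_replica() -> None:
        print("ACTION: adding read replica")  # autonomous intervention

    state = assess(DbMetrics(cpu_percent=93.0, ram_percent=78.0, net_mbit_s=650.0))
    if state == "critical":
        add_replica()
        notify("database scaled up automatically")
    elif state == "warning":
        notify("database approaching its performance limit")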

Decentralized and flexible

In the discussion about high availability, the question of more or less (de-)centralization always plays a role. While central, monolithic systems are often seen as less expensive and easier to maintain, they run counter to the concept of high availability. Decentralized systems are far more flexible, especially in times of cloud computing, where users are already accustomed to agile, demand-driven IT resources. With intelligent algorithms built on extensive automation of IT operations, decentralized IT systems can be managed with foresight, and even monolithic systems can be operated with high availability on a decentralized architecture without adaptation. In practice, this results in a high cost-saving potential and an increase in the quality of IT services.


The original article in German can be found here.
