iCooling@AI: Smart cooling for data centers
Deep neural networks can reverse spiraling energy use in data centers and cut PUE.
Power usage effectiveness (PUE) is a KPI that measures the energy efficiency of data centers. Cooling, a key component of a data center, is closely related to equipment heat dissipation, equipment configuration, facility environment, and external climate conditions. Because so many factors interact, hardware-based energy savings and optimizations based on human expertise alone can no longer reduce power consumption much further.
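For reference, PUE is conventionally defined as total facility energy divided by the energy consumed by IT equipment, so a value closer to 1.0 means less overhead. A minimal sketch of that ratio follows; the function name and the daily readings are illustrative assumptions, not figures from any real site.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical daily readings in kWh; cooling is the largest non-IT consumer.
it_kwh = 10_000.0
cooling_kwh = 3_500.0
distribution_and_other_kwh = 1_000.0

daily_pue = pue(it_kwh + cooling_kwh + distribution_and_other_kwh, it_kwh)
print(f"PUE = {daily_pue:.2f}")
```

In this toy case, every kWh saved on cooling lowers the numerator directly, which is why cooling is the natural target for optimization.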
Based on its extensive experience in data center construction, Huawei launched the iCooling@AI solution powered by big data and AI. The solution further reduces the energy consumption of data centers while enabling smart cooling of large data centers and cutting PUE.
The chilled water cooling system of a data center saves energy in two ways: design and O&M.
Energy-saving through design comes from designing the right cooling systems and selecting the right equipment, which focuses on using hardware to save energy. However, energy-efficient hardware does not necessarily result in the most energy savings because energy efficiency is closely related to the O&M of a data center.
Traditional O&M depends on an experienced O&M team. Based on their experience, the team determines how to adjust the parameters of a cooling system for different seasons, ambient temperatures, and load rates to maximize the energy efficiency of the cooling system. However, experience varies between team members, so this approach does not always produce accurate or consistent results.
For a complex chilled water cooling system, a new control algorithm is needed to achieve overall optimal performance. That’s where big data and AI come in. AI can be used to determine the relationships between the PUE and the data of different features and then predict a PUE value. With the predicted PUE value, the data center can adjust its cooling parameters for the current climate and load conditions to meet the energy-saving target.
Powered by AI and big data technologies, Huawei's iCooling@AI solution enables smart cooling systems for data centers. The key technologies used in this solution include:
Big data collection: Given the complexity of data center cooling systems, information about the power supply system, cooling system, and environment parameters must be collected.
Data governance and feature engineering: First, a mathematical tool is used to perform data governance on the raw data collected, providing high-quality data for subsequent model training. Second, feature engineering is performed on large amounts of raw data to identify the key parameters that affect PUE. Selecting too many or too few parameters will affect the accuracy of the final model. Too many parameters will lead to overfitting: the trained model fits the training data better than the test data and generalizes poorly. Too few parameters will lead to underfitting: the trained model performs poorly on both the training dataset and the test dataset.
Creating a PUE model using a neural network: Neural networks are a set of machine learning algorithms that simulate the interactions between neurons. Deep neural networks can play a role in increasing the cooling efficiency of data centers. The machine learning algorithms of these networks can find the relationships between parameters of different pieces of equipment and systems. A mathematical model of the data center, the PUE model, is created based on large amounts of data from sensors.
Inference and decision-making using genetic algorithms: Based on the input PUE model and the operating data collected in real time, the algorithms find the best policy in four steps: parameter traversal and combination, service rule assurance, calculating the energy consumption of the cooling system, and selecting the optimal policy.
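The four inference steps can be illustrated with a small sketch. It enumerates a tiny grid exhaustively rather than using a genetic algorithm, and everything in it is an assumption made for illustration: the candidate setpoints, the 500 kW chiller capacity, the service rules, and the toy formula standing in for the trained PUE model.

```python
from itertools import product

# Hypothetical candidate setpoints; real sites would use their own ranges.
CANDIDATES = {
    "chillers_on": [1, 2, 3],
    "chilled_water_supply_c": [10.0, 12.0, 14.0, 16.0],
    "pump_speed_pct": [60, 80, 100],
}


def meets_service_rules(policy: dict, it_load_kw: float) -> bool:
    """Step 2: reject combinations that violate (made-up) operating rules."""
    capacity_kw = policy["chillers_on"] * 500.0  # assumed 500 kW per chiller
    enough_flow = (
        policy["chilled_water_supply_c"] < 16.0 or policy["pump_speed_pct"] >= 80
    )
    return capacity_kw >= it_load_kw and enough_flow


def predicted_cooling_kw(policy: dict, it_load_kw: float, outdoor_c: float) -> float:
    """Step 3: toy stand-in for the trained model's energy prediction."""
    chiller_kw = it_load_kw * (0.30 - 0.01 * (policy["chilled_water_supply_c"] - 10.0))
    chiller_kw *= 1.0 + 0.02 * max(outdoor_c - 20.0, 0.0)
    standby_kw = 15.0 * policy["chillers_on"]
    pump_kw = 40.0 * (policy["pump_speed_pct"] / 100.0) ** 3
    return chiller_kw + standby_kw + pump_kw


def best_policy(it_load_kw: float, outdoor_c: float) -> dict:
    keys = list(CANDIDATES)
    # Step 1: traverse every combination of candidate values.
    combos = (dict(zip(keys, values)) for values in product(*CANDIDATES.values()))
    feasible = [p for p in combos if meets_service_rules(p, it_load_kw)]
    if not feasible:
        raise RuntimeError("no policy satisfies the service rules")
    # Step 4: select the feasible policy with the lowest predicted consumption.
    return min(feasible, key=lambda p: predicted_cooling_kw(p, it_load_kw, outdoor_c))


print(best_policy(it_load_kw=900.0, outdoor_c=28.0))
```

Exhaustive traversal only works because this grid has 36 combinations; a heuristic search such as a genetic algorithm becomes attractive once the space is too large to enumerate within the control interval.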
The use of big data and AI, as well as the combination of software and hardware, has allowed Huawei to set a new benchmark for green data centers.
Software includes the teamwork control system and the data center infrastructure management (DCIM) system. The teamwork control system of a data center typically uses a programmable logic controller (PLC) or direct digital control (DDC) and has active and standby servers. The system has a regular control mode and an energy-saving control mode.
Regular control mode: The teamwork control system automatically executes all control logic, including adding or removing equipment, adjusting the rotational speed, switching the cooling mode, bypassing, and charging/discharging chilled water. The DCIM system monitors status information.
Energy-saving control mode: The teamwork control system is subject to the control of energy-saving algorithms. It executes the instructions issued by the algorithms, including adjusting the number of operating units; adjusting target values of control loops such as rotational speed, power, temperature, and pressure difference; and switching the cooling mode. When no control instructions are issued, the teamwork control system controls operations itself.
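The relationship between the two modes can be sketched as a simple dispatch rule: apply an algorithm instruction when one is available in energy-saving mode, and otherwise let the teamwork control system use its own logic. The class, field names, setpoints, and the 500 kW chiller capacity below are all hypothetical; real PLC or DDC logic is far more involved.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Setpoints:
    chillers_on: int
    chilled_water_supply_c: float
    pump_speed_pct: int


@dataclass
class TeamworkController:
    """Toy model of the two control modes; all names are hypothetical."""
    energy_saving_mode: bool = False
    log: list = field(default_factory=list)

    def local_logic(self, it_load_kw: float) -> Setpoints:
        # Placeholder for the built-in control logic (assumed 500 kW per chiller).
        chillers = max(1, math.ceil(it_load_kw / 500.0))
        return Setpoints(chillers, 12.0, 100)

    def step(self, it_load_kw: float, instruction: Optional[Setpoints] = None) -> Setpoints:
        if self.energy_saving_mode and instruction is not None:
            self.log.append("applied algorithm instruction")
            return instruction
        # Regular mode, or no instruction this cycle: use the local logic.
        self.log.append("applied local control logic")
        return self.local_logic(it_load_kw)


controller = TeamworkController(energy_saving_mode=True)
with_instruction = controller.step(900.0, Setpoints(2, 16.0, 80))
without_instruction = controller.step(900.0)  # no instruction issued
print(with_instruction, without_instruction, controller.log)
```

Keeping the local logic as the default path means cooling continues to be controlled even if the algorithm side stops issuing instructions.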
As the centralized management system of a data center, the DCIM system manages all the links within the cooling system. The energy-saving optimization instructions are generated by the AI algorithm and then sent to the teamwork control system, which then conducts final execution.
Hardware includes different sensors such as smart meters, pressure/differential pressure sensors, water temperature sensors, flow sensors, and outdoor dry/wet bulb thermometers.
To ensure the best optimization, variable-frequency components should be used for chillers, water pumps, indoor units of air conditioners, and cooling towers. The entire cooling system can be automatically controlled.
Data is collected every five minutes to maximize quality. The number of collection points depends on the size of the data center. For the initial collection, at least three months of operating data must be recorded. After that, data is uploaded once daily. Data reaches the DCIM system along two paths: refrigeration station data is uploaded through the Building Management System (BMS), and IT load data is uploaded through the cabinet information collection system.
Efficient data governance includes identifying and deleting abnormal data based on a Gaussian distribution; unifying the timelines of all parameters; normalizing geographical locations; deleting data irrelevant to PUE (such as alarms and maintenance information); and supplementing missing data based on data center O&M experience and the operating parameters of equipment such as chillers.
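Three of these governance steps lend themselves to a short sketch: dropping outliers with a Gaussian (z-score) rule, snapping timestamps to a shared five-minute grid, and filling gaps. The function names, the 3-sigma threshold, and the carry-forward gap filling are assumptions chosen for illustration; the source does not specify the exact rules used.

```python
from statistics import mean, pstdev


def drop_gaussian_outliers(samples, z_max=3.0):
    """Keep (timestamp, value) pairs within z_max standard deviations of the mean."""
    values = [v for _, v in samples]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(samples)
    return [(t, v) for t, v in samples if abs(v - mu) / sigma <= z_max]


def align_to_grid(samples, step_s=300):
    """Snap timestamps (seconds) to a shared 5-minute grid, averaging duplicates."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(t - t % step_s, []).append(v)
    return {t: mean(vs) for t, vs in sorted(buckets.items())}


def fill_gaps(series, start, end, step_s=300):
    """Fill missing grid points with the last known value (a simple stand-in
    for experience-based supplementation)."""
    filled, last = {}, None
    for t in range(start, end + 1, step_s):
        last = series.get(t, last)
        if last is not None:
            filled[t] = last
    return filled


# Hypothetical chilled water supply temperatures, sampled slightly off-grid.
raw = [(i * 300 + 7, 12.0 + 0.1 * (i % 3)) for i in range(20)]
raw[5] = (raw[5][0], 45.0)  # simulated sensor glitch
del raw[10]                 # simulated dropped sample

clean = fill_gaps(align_to_grid(drop_gaussian_outliers(raw)), 0, 19 * 300)
print(len(clean), max(clean.values()))
```

A single-pass z-score filter is sensitive to the outliers it is trying to find, so a production pipeline would likely use a more robust rule or per-equipment operating limits.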
To prepare for subsequent model training, mathematical tools such as the chi-square test are used to identify the key parameters that affect PUE. Common parameters of a data center include five types of control parameters (for example, the number of operating units, temperature of water supplied by chillers, temperature difference between supplied and returned chilled water, approach of the cooling tower, and temperature difference between supplied and returned cooling water); 14 types of process parameters (such as water flow, pressure difference, and equipment power consumption); and two types of environment parameters (outdoor temperature/humidity and IT load rate).
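As a rough illustration of chi-square screening, the sketch below splits a candidate parameter and the PUE at their medians, builds a 2x2 contingency table, and compares the statistic with 3.841, the critical value for one degree of freedom at the 5 percent level. The median binning, the synthetic data, and the parameter names are assumptions; the source does not describe how the test is applied.

```python
import random
from statistics import median


def chi_square_2x2(feature, target):
    """Chi-square statistic for independence after a median split of both series."""
    f_med, t_med = median(feature), median(target)
    observed = [[0, 0], [0, 0]]
    for f, t in zip(feature, target):
        observed[int(f > f_med)][int(t > t_med)] += 1
    n = len(feature)
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Synthetic data: PUE rises with outdoor temperature; alarm counts are unrelated.
rng = random.Random(0)
outdoor_c = [rng.uniform(5, 35) for _ in range(200)]
alarm_count = [rng.randint(0, 5) for _ in range(200)]
pue = [1.2 + 0.01 * t + rng.gauss(0, 0.02) for t in outdoor_c]

features = {"outdoor_c": outdoor_c, "alarm_count": alarm_count}
scores = {name: chi_square_2x2(values, pue) for name, values in features.items()}

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value, 1 degree of freedom
selected = [name for name, s in scores.items() if s > CRITICAL_DF1_ALPHA_05]
print(scores, selected)
```

A median split discards a lot of information, so finer binning (with correspondingly more degrees of freedom) or a complementary measure may be preferable on real data.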
The biggest challenge with data-center O&M is determining which parameters in the system to change and finding the perfect combination after one parameter has been adjusted. There is no formula or algorithm to reference in the current O&M practices.
To address this issue, copious amounts of historical data are used to train an AI neural network. AI uses machine learning algorithms to analyze the relationships between the PUE and the data generated by data center components. These algorithms reveal how different pieces of equipment and system parameters affect the overall system. Dynamic model training, inference, and decision-making are the key to this process.
A neural network has an input layer, an output layer, and multiple hidden layers. An input feature vector is transformed at the hidden layers before it reaches the output layer, where the results are generated (classification results or, in this case, a predicted PUE value). AI-powered PUE optimization uses a deep neural network with five hidden layers.
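The layer structure can be made concrete with a forward pass through a small fully connected network with five hidden layers. This is only a structural sketch: the layer width, ReLU activation, random untrained weights, and single linear output are assumptions, and the input values are made-up normalized features.

```python
import random


def relu(x):
    return [max(0.0, v) for v in x]


def dense(x, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_features, hidden_units=8, hidden_layers=5, seed=0):
    rng = random.Random(seed)
    sizes = [n_features] + [hidden_units] * hidden_layers + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, features):
    x = list(features)
    for weights, biases in network[:-1]:
        x = relu(dense(x, weights, biases))  # hidden layers
    weights, biases = network[-1]
    return dense(x, weights, biases)[0]      # linear output: one predicted value


# Hypothetical normalized inputs: outdoor temperature, IT load rate,
# chilled water supply temperature, pump speed.
network = build_network(n_features=4)
prediction = forward(network, [0.6, 0.4, 0.7, 0.5])
print(len(network) - 1, "hidden layers; untrained output:", prediction)
```

Because the weights are random, the output is meaningless until the network is trained on governed historical data; the sketch only shows how an input vector flows through the layers.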
All data that has undergone governance and feature engineering is randomly divided into three parts: 10 percent is used for preliminary training, 80 percent for in-depth training, and 10 percent for final verification. A data center's PUE model is generated after training and verification.
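A random 10/80/10 split of this kind might look as follows. The function name and the fixed seed are illustrative assumptions; the seed simply makes the shuffle repeatable.

```python
import random


def split_10_80_10(records, seed=42):
    """Shuffle records and split them into 10% / 80% / 10% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_pre = n // 10
    n_verify = n // 10
    preliminary = shuffled[:n_pre]
    in_depth = shuffled[n_pre:n - n_verify]
    verification = shuffled[n - n_verify:]
    return preliminary, in_depth, verification


preliminary, in_depth, verification = split_10_80_10(range(1000))
print(len(preliminary), len(in_depth), len(verification))
```

Keeping the verification part untouched during both training stages is what makes the final check a fair estimate of how the model generalizes.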
Finally, the prediction model (PUE model) is sent to the inference platform. Using the platform's inference and computing capabilities, genetic algorithms traverse and simulate the possible cooling policies. Within one minute, the AI energy-saving algorithm can identify the optimal parameter combination for the current outdoor conditions and IT load from 1.4 million combinations, filter the candidates in multiple layers based on the O&M requirements of the data center, work out an optimal set of instructions, issue them, and report the results.
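A genetic search over setpoints can be sketched as below. Everything here is an assumption for illustration: the setpoint grid (about 11,000 combinations, far fewer than the 1.4 million cited), the feasibility rules standing in for O&M filtering, the toy cost formula standing in for the PUE model, and the selection, crossover, and mutation settings.

```python
import random

# Hypothetical discrete setpoint ranges (not from the source).
GRID = {
    "chillers_on": [1, 2, 3, 4],
    "chilled_water_supply_c": [10.0 + 0.5 * i for i in range(17)],  # 10.0 to 18.0
    "pump_speed_pct": list(range(50, 101, 5)),
    "tower_fan_pct": list(range(30, 101, 5)),
}
KEYS = list(GRID)


def feasible(p, it_load_kw):
    """Made-up O&M rules: enough chiller capacity, and more flow for warmer water."""
    enough_capacity = p["chillers_on"] * 500.0 >= it_load_kw
    enough_flow = p["pump_speed_pct"] >= 50 + 5 * (p["chilled_water_supply_c"] - 10.0)
    return enough_capacity and enough_flow


def cost(p, it_load_kw, outdoor_c):
    """Toy stand-in for the PUE model: predicted cooling power in kW."""
    if not feasible(p, it_load_kw):
        return float("inf")
    lift = max(outdoor_c - p["chilled_water_supply_c"], 1.0)
    chiller = it_load_kw * 0.01 * lift * (1.2 - 0.004 * p["tower_fan_pct"])
    pumps = 40.0 * (p["pump_speed_pct"] / 100.0) ** 3
    fans = 30.0 * (p["tower_fan_pct"] / 100.0) ** 3
    return chiller + pumps + fans + 15.0 * p["chillers_on"]


def random_policy(rng):
    return {k: rng.choice(GRID[k]) for k in KEYS}


def crossover(a, b, rng):
    return {k: (a if rng.random() < 0.5 else b)[k] for k in KEYS}


def mutate(p, rng, rate=0.2):
    return {k: rng.choice(GRID[k]) if rng.random() < rate else p[k] for k in KEYS}


def optimize(current, it_load_kw, outdoor_c, generations=40, pop_size=30, seed=0):
    rng = random.Random(seed)

    def score(p):
        return cost(p, it_load_kw, outdoor_c)

    # Seed the population with the current setpoints so the result is never worse.
    population = [dict(current)] + [random_policy(rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 3]  # selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children        # elitism keeps the best policies
    return min(population, key=score)


current = {"chillers_on": 3, "chilled_water_supply_c": 10.0,
           "pump_speed_pct": 100, "tower_fan_pct": 100}
best = optimize(current, it_load_kw=900.0, outdoor_c=28.0)
print(best, cost(best, 900.0, 28.0), "vs current", cost(current, 900.0, 28.0))
```

Unlike exhaustive traversal, a genetic search does not guarantee the global optimum; it trades that guarantee for a bounded search time, which matters when instructions must be issued within a fixed interval.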
iCooling@AI technology has been commercially deployed to provide smart cooling for multiple large data centers. Field tests show that the PUE of these data centers can be improved by 8 to 15 percent. As the iCooling@AI solution and AI technologies are widely used in data center operations and management, concepts such as intelligent O&M and unattended operations are no longer just buzzwords – they are becoming reality.