Digital Twin for Energy-Aware High-Performance Computing

Alexander Kammeyer | Sep 30, 2025

Abstract

High-Performance Computing (HPC) has traditionally focused on maximizing performance. Over time, HPC clusters have become larger and require more energy. The widespread adoption of machine learning and the integration of accelerators into HPC clusters have accelerated this trend. Simultaneously, the energy market is shifting towards renewable but more volatile energy sources. Regulations are also pushing data centres towards more conscious energy usage. As a result, the HPC community is increasingly prioritizing energy-aware scheduling and system management alongside performance.

Digital Twins offer considerable advantages over conventional simulation methods and enable energy-aware operation of HPC clusters. This thesis presents a novel Digital Twin for HPC clusters which integrates data sources from the data centre and the outside world to create a representation of an HPC system in the virtual domain. Using these data sources, the Digital Twin can predict and optimize system behaviour through scheduling simulation. The Digital Twin contributes to energy-aware HPC in several ways: it calculates the Power Usage Effectiveness accurately, allows operation under a cluster-wide power cap, reduces energy costs and emissions, and allows new hardware to be tested through simulation.
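Power Usage Effectiveness (PUE) is conventionally defined as the total energy drawn by the facility divided by the energy delivered to the IT equipment. The following is a minimal sketch of that ratio computed from sampled power readings. It assumes both series are sampled at the same, evenly spaced timestamps; the function name, variable names and sample values are hypothetical and are not taken from the thesis.

```python
def power_usage_effectiveness(facility_kw, it_kw):
    """Return PUE = total facility energy / IT equipment energy.

    Both arguments are power samples in kW taken at the same, evenly
    spaced timestamps (an assumption of this sketch), so the sampling
    interval cancels out of the ratio and plain sums are sufficient.
    """
    if not facility_kw or len(facility_kw) != len(it_kw):
        raise ValueError("need two non-empty series of equal length")
    it_total = sum(it_kw)
    if it_total <= 0:
        raise ValueError("total IT power must be positive")
    return sum(facility_kw) / it_total


# Illustrative readings only, not measurement data from the thesis.
facility = [150.0, 160.0, 155.0, 145.0]  # whole data centre, kW
it_load = [100.0, 110.0, 105.0, 95.0]    # IT equipment only, kW
pue = power_usage_effectiveness(facility, it_load)
```

A value close to 1 means that almost all energy reaches the IT equipment. Finer-grained samples make the ratio track short-term changes in cooling and infrastructure overhead more closely.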

The HPC Digital Twin calculates the Power Usage Effectiveness from high-resolution measurement data. A scheduling algorithm is introduced that handles power shortages through power capping. Through a Slurm plugin, the Digital Twin can enforce the power cap and reduce job wait times by up to 40 %. A second scheduling algorithm for price- and emission-aware job scheduling leverages Digital Twin data and delays job starts within a set window to optimize energy price, carbon emissions and renewable energy use. Evaluated over various simulation timespans and load scenarios, the algorithm achieved price savings of 4 to 34 %, an emissions reduction of up to 7 % and up to 20 % more renewable energy usage. Batteries for storing renewable energy are becoming more common. This thesis demonstrates how a Digital Twin can evaluate such new technology through simulation, show the limitations of the hardware and analyse the trade-offs.
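The idea of delaying a job start within a set window can be illustrated with a small sketch. It is not the algorithm from the thesis: it assumes an hourly price forecast, a job with a known runtime in whole hours and a constant power draw, and it simply picks the start hour with the lowest summed price. All names and numbers are hypothetical.

```python
def best_start_offset(prices, runtime_h, max_delay_h):
    """Return (offset, cost) of the cheapest start within the delay window.

    prices:      forecast price per hour, index 0 is the current hour
    runtime_h:   job runtime in whole hours (assumed known in advance)
    max_delay_h: latest permitted start, in hours from now
    Cost is the sum of hourly prices while the job runs, which is
    proportional to the energy bill under a constant power draw.
    """
    if runtime_h <= 0 or max_delay_h < 0:
        raise ValueError("runtime must be positive and delay non-negative")
    last_offset = min(max_delay_h, len(prices) - runtime_h)
    if last_offset < 0:
        raise ValueError("forecast is shorter than the job runtime")
    best = None
    for offset in range(last_offset + 1):
        cost = sum(prices[offset:offset + runtime_h])
        # Strict comparison keeps the earliest start when costs tie.
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best


# Illustrative forecast only, not market data from the thesis.
forecast = [90, 80, 40, 30, 35, 70, 95, 100]
offset, cost = best_start_offset(forecast, runtime_h=3, max_delay_h=4)
```

With this forecast, the job is delayed by two hours so that it runs during the three cheapest consecutive hours inside the window. Passing a carbon-intensity forecast instead of prices would apply the same selection to emissions.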

The Digital Twin makes it easy to test scheduling algorithms and find optimal parameters, and it helps achieve energy-aware HPC operation. The integration with the Slurm scheduler demonstrates that the Digital Twin can control HPC clusters effectively.

DOI: 10.17169/refubium-51620