How Amplify Renewables Train Weather Models with Intel Gaudi

Weather forecasting is difficult, yet solar and wind energy depend on accurate forecasts. Amplify Renewables, an Intel Liftoff for AI Startups Catalyst Track participant, enhanced their energy forecasting to dispatch renewable power to the grid more accurately.

Working with the Intel Liftoff team and Intel Tiber AI Cloud, Intel's public cloud option for AI startups and organisations, Amplify Renewables trained large-scale weather models on Intel Gaudi 2 HPUs, improving the speed and accuracy of their grid forecasts.

Machine learning is changing weather forecasting. Amplify Renewables took the next step by processing terabytes of public and private weather data to train their global weather model.

The workload required distributed data-parallel training across eight Gaudi 2 cards on a bare-metal system, high-volume data processing, and 90GB+ of VRAM per Intel Gaudi 2 HPU. The system's 1TB of RAM and fast NVMe SSDs allowed it to handle large datasets using a filesystem-based storage solution.
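As a rough illustration of what data-parallel training across eight cards means for the data itself, the sketch below splits a dataset's sample indices across workers so that each one sees a different, equally sized slice per epoch. It is similar in spirit to the strided sharding that PyTorch's DistributedSampler performs without shuffling. The function name shard_indices and the sample counts are hypothetical and are not taken from Amplify Renewables' pipeline.

```python
import math
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices that one worker processes in one epoch."""
    if num_samples <= 0 or world_size <= 0:
        raise ValueError("num_samples and world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Round up so every worker receives the same number of samples.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    # Stride through the index range; wrap around to pad the last slots.
    return [i % num_samples for i in range(rank, total, world_size)]


# Hypothetical example: 20 samples shared by 8 workers (one per card).
shards = [shard_indices(20, 8, rank) for rank in range(8)]
```

Each worker then computes gradients on its own shard, and the gradients are averaged across workers before the weights are updated, which is what keeps the model replicas in sync.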

Alongside Lazy and Eager modes, Intel Gaudi supports torch.compile. This mode was introduced in PyTorch 2.0 and combines the immediate execution of Eager mode with the optimisation capabilities of graph execution, wrapping parts of the model into a graph for greater performance.

Through the Habana SynapseAI SDK, Intel Gaudi accelerators integrate with well-known AI frameworks, including PyTorch and TensorFlow.

Amplify Renewables discovered several significant benefits:

Running PyTorch models on Habana HPUs required only minor adjustments.

Lazy and Eager modes ran without a hitch during training, and all three modes (Lazy, Eager, and torch.compile) performed well during inference.

Near-linear scalability for distributed training and linear performance for inference.
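To give a sense of what "minor adjustments" can look like in practice, here is a minimal sketch of a single PyTorch training step that targets the "hpu" device when the Intel Gaudi PyTorch bridge (the habana_frameworks package) is installed and falls back to CPU otherwise. The toy linear model, the random data, and the helper names select_device and train_step are illustrative assumptions, not Amplify Renewables' code.

```python
import importlib.util


def select_device() -> str:
    """Return "hpu" if the Intel Gaudi PyTorch bridge is installed, else "cpu"."""
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    return "cpu"


def train_step(device: str) -> float:
    """Run one optimisation step of a toy regressor and return the loss."""
    import torch  # imported here so select_device() works without PyTorch

    htcore = None
    if device == "hpu":
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core as htcore

    torch.manual_seed(0)
    model = torch.nn.Linear(4, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    features = torch.randn(16, 4, device=device)
    targets = torch.randn(16, 1, device=device)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(features), targets)
    loss.backward()
    if htcore is not None:
        htcore.mark_step()  # Lazy mode: trigger execution of the recorded graph
    optimizer.step()
    if htcore is not None:
        htcore.mark_step()
    return loss.item()
```

Compared with a CPU or GPU script, the visible changes in this sketch are the device string, the bridge import, and the mark_step() calls, which matter when running in Lazy mode.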

The extra time spent compiling graphs may seem like a downside, but the faster execution is often worth it. When scaling to 8 HPUs, Lazy mode outperforms Eager mode, and scaling stays close to linear, with only slight sublinearity.
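One common way to put a number on "close to linear" is scaling efficiency: measured multi-device throughput divided by the ideal throughput of a single device multiplied by the device count. The sketch below shows the arithmetic. The throughput figures are hypothetical placeholders chosen only to demonstrate the calculation and are not measurements from Amplify Renewables.

```python
def scaling_efficiency(
    single_device_throughput: float,
    multi_device_throughput: float,
    num_devices: int,
) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if single_device_throughput <= 0 or num_devices <= 0:
        raise ValueError("throughput and device count must be positive")
    ideal = single_device_throughput * num_devices
    return multi_device_throughput / ideal


# Hypothetical figures: 100 samples/s on 1 HPU, 760 samples/s on 8 HPUs.
efficiency = scaling_efficiency(100.0, 760.0, 8)
```

An efficiency of 1.0 would be perfectly linear scaling; values slightly below 1.0 correspond to the slight sublinearity described above.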

Lazy mode accumulates operations into a computational graph, then compiles and optimises that graph before executing it in bulk, which reduces execution time. Eager mode, by contrast, executes each operation immediately, which adds overhead to every operation.
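The difference between the two modes can be illustrated with a toy model in plain Python. It is a conceptual sketch only and does not use or reproduce the Gaudi software stack: the classes EagerExecutor and LazyExecutor are hypothetical, and the "dispatches" counter merely stands in for per-operation launch overhead.

```python
from typing import Callable, List

Op = Callable[[float], float]


class EagerExecutor:
    """Runs every op as soon as it is issued: one dispatch per op."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.dispatches = 0

    def apply(self, op: Op) -> None:
        self.value = op(self.value)
        self.dispatches += 1


class LazyExecutor:
    """Records ops and runs them together when mark_step() is called."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.pending: List[Op] = []
        self.dispatches = 0

    def apply(self, op: Op) -> None:
        self.pending.append(op)  # nothing executes yet

    def mark_step(self) -> None:
        if not self.pending:
            return
        for op in self.pending:  # the recorded graph runs as one unit
            self.value = op(self.value)
        self.pending.clear()
        self.dispatches += 1


ops: List[Op] = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
eager = EagerExecutor(5.0)
lazy = LazyExecutor(5.0)
for op in ops:
    eager.apply(op)
    lazy.apply(op)
lazy.mark_step()
```

Both executors arrive at the same result, but the eager one pays the dispatch cost three times and the lazy one pays it once. A real lazy backend would also spend time compiling and optimising the recorded graph before running it, which is the trade-off the article describes.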

Amplify Renewables can now compare their projections against public and private forecasts. More accurate estimates of solar and wind power output improve grid projections, which renewable energy depends on.

The next steps are testing a wider variety of models and trying novel pre-training techniques.