Building a data-driven Earth system model at ECMWF

10 March 2026

In this blog, we showcase how machine learning (ML) models were developed for several Earth system components in the framework of the Destination Earth (DestinE) initiative of the European Commission. This marks a significant step towards a machine learning Earth system model, which explicitly represents multiple components.

Written by Rachel Furner, Sara Hahner, Ewan Pinnington, Nina Raoult, Mario Santa-Cruz, Maria Luisa Taccari, Kenza Tazi, Lorenzo Zampieri. This work was co-authored, with equal contributions by all contributors. Authors are listed in alphabetical order.

Predicting the weather does not only require accurate modelling of atmospheric processes. Ocean temperature, currents and waves shape storm tracks, soil moisture influences heatwaves and regulates the land surface-atmosphere exchanges, sea ice modulates weather in polar regions and beyond, and river water levels are affected by rainfall, to name a few examples. Capturing these interactions is essential for realistic Earth system prediction, both for conventional and machine learning models.

After developing the Artificial Intelligence Forecasting System (AIFS) and bringing it into operations to produce daily 15-day weather forecasts, ECMWF is expanding the scope of data-driven modelling beyond the atmosphere to include key Earth system components.

Specifically, we highlight technical developments and scientific challenges for the following Earth system components:

Sea ice, where we represent the thermodynamic evolution of this medium for an extensive set of variables (sea ice concentration, volume, albedo, and velocity) in response to atmospheric and oceanic forcings.

Waves, where we learn ocean surface wave dynamics, enabling accurate prediction of potentially hazardous marine conditions.

Oceans, where we model the evolution of the ocean throughout depth, including the response to atmospheric forcing. Global spatial fields for physical variables such as temperature, salinity, currents, and sea surface height anomaly are predicted.

Land, where we represent the land-surface response to atmospheric forcing, currently focusing on soil moisture and temperature, snow cover, skin temperature, and near-surface variables such as 2-metre temperature and dew point.

Hydrology, where we model streamflow using meteorological forcings and catchment properties.

Each of these data-driven model components represents a major advance in its domain. Each builds on decades of expertise in the physical modelling of its conventional counterpart and lays the groundwork for a full Earth system model which captures these individual components and the interactions between them.

Why more Earth system components matter

While the atmosphere is central to weather prediction, its dynamics are tightly coupled with processes in the ocean, land surface, sea ice, ocean waves and rivers. These interactions influence weather and climate across timescales ranging from hours to decades, and across spatial scales from small river catchments (2 km²) to the world’s oceans (10⁷ km²). As such, modelling these components not only increases the physical realism of weather predictions, but is also expected to improve their accuracy, particularly for medium- to long-range predictions due to the long memory of land and ocean. Expanding the complexity of Earth system modelling has been a strategic priority within the AIFS ecosystem. By capturing more of the interactions that shape the planet’s behaviour, we move closer to a fully coupled ML prediction model.

**Fig. 1:** Building a data-driven Earth system model: new components for the land, ocean, sea-ice, waves, and hydrology are being developed in DestinE to complement the already successful atmospheric component – the AIFS.

What are the challenges for Earth system components

Significant progress has been made in recent years in applying ML to atmospheric modelling for weather forecasting. Extending these advances to other Earth system components, such as the ocean or land surface, required overcoming several shared challenges. Observational coverage and data quality vary substantially across the different Earth system components, limiting the capacity of ML models to leverage observational, or observationally constrained, data. While ERA5 provides the backbone for atmospheric ML training, other components must rely on a mix of reanalyses, model output, and observations, which need to be carefully harmonised to ensure consistency. In addition, most ocean and land surface processes evolve more slowly and exhibit longer memory than the atmosphere. Training datasets spanning only a few decades might not be sufficient to fully capture their dynamics, and training strategies that focus on a limited number of relatively short time steps might not be suitable for learning them. Missing values are also common, for example gaps in time and space in hydrological observations, or land points in ocean and wave fields, and therefore require explicit masking or gap filling. Furthermore, machine learning approaches and architectures that work well for the atmosphere may not be suitable for components with different spatial structures, temporal scales, or governing processes.

Beyond these shared challenges, each Earth system component has unique characteristics that require tailored approaches. Ocean surface waves are highly sensitive to local wind forcing, which generates waves, and are also shaped by the propagation of swell generated by distant winds. Sea ice exhibits strong seasonal and interannual variability, particularly at ice edges. The 3D ocean operates on multiple temporal and spatial scales, with air-sea exchange impacting surface dynamics over short timescales, and deep ocean mixing influencing long-term climate. A particular challenge for land surface processes is the limited understanding of some of the physics, e.g. the soil-plant-atmosphere interactions. This leads to uncertainties in the physical model whose output forms the basis of the training data. Finally, many land surface observations are sparse and heterogeneous, requiring model-based inference, for instance when estimating catchment-scale streamflow from limited data in hydrology. These differences highlight why the development of ML models beyond the atmosphere is non-trivial and why each Earth system component must be approached with domain-specific considerations.

What have been the technical developments and innovations

Building machine-learnt components has required expanding the Digital Twin Engine with machine learning pipelines that build on the open-source Anemoi ecosystem, developed by ECMWF together with several National Meteorological Services across Europe. Anemoi provides software solutions for producing and handling AI-ready datasets, for end-to-end training, and for running weather and climate ML models.

Developing the Earth system components also required the creation of bespoke datasets for each component, whose sizes and sources are visualised in Fig. 2. For the land and waves components, these were built from synthetic model runs and specially produced hindcasts; for the hydrology component, which focuses on streamflow, observational datasets were used; and for the ocean components (sea ice, surface and 3D ocean), datasets have been derived from ORAS6, the state-of-the-art ECMWF ocean reanalysis product.

Addressing many of the challenges discussed above relied on a range of technical innovations in the Anemoi ecosystem. Originally developed for atmospheric ML modelling, the Anemoi ecosystem has since been adapted and extended to support the new Earth system components through a close collaboration between machine learning specialists, software engineers and physical modellers.

To support diverse data sources, anemoi-datasets was, for example, extended to support different data formats and to create additional variables where needed. This includes expressing cyclic wave direction in a continuous way using cosine and sine components, as well as handling not only time-varying fields but also static variables (such as ocean bathymetry) and climatological fields (such as leaf area index for land-surface processes).
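The cosine and sine representation of wave direction can be illustrated with a minimal sketch. It uses only the Python standard library and is not the anemoi-datasets implementation; the function names `encode_direction`, `decode_direction` and `mean_direction` are hypothetical and chosen purely for illustration.

```python
import math

def encode_direction(deg):
    """Map a direction in degrees to continuous (cos, sin) components."""
    rad = math.radians(deg)
    return math.cos(rad), math.sin(rad)

def decode_direction(cos_c, sin_c):
    """Recover a direction between 0 and 360 degrees from its components."""
    return math.degrees(math.atan2(sin_c, cos_c)) % 360.0

def mean_direction(degs):
    """Average directions via their components, avoiding the 0/360 jump."""
    pairs = [encode_direction(d) for d in degs]
    mean_cos = sum(c for c, _ in pairs) / len(pairs)
    mean_sin = sum(s for _, s in pairs) / len(pairs)
    return decode_direction(mean_cos, mean_sin)

# A naive arithmetic mean of 350 and 10 degrees gives 180 (due south),
# whereas averaging the components gives a direction close to north.
naive = (350.0 + 10.0) / 2.0
circular = mean_direction([350.0, 10.0])
```

The point of the encoding is that directions of 359° and 1° are numerically far apart as raw angles but close together as component pairs, which is the behaviour a model needs to learn smooth changes in direction.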

While anemoi-datasets is naturally optimised for gridded data, a different approach was required for hydrology, where streamflow is often modelled using catchment boundaries that do not fit neatly onto a regular grid.

In this case, earthkit-hydro—a hydrology-focused toolkit within ECMWF’s earthkit ecosystem—was used to efficiently perform catchment-based spatial aggregation and preprocessing. This approach enables the integration of real-time observations and provides hydrologically meaningful inputs for machine learning models.
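As a rough illustration of what catchment-based spatial aggregation involves, the sketch below computes an area-weighted mean of a gridded field for each catchment. It is written in plain Python and does not use or reproduce the earthkit-hydro API; the function `catchment_means` and its flat, one-entry-per-cell inputs are assumptions made for this example.

```python
from collections import defaultdict

def catchment_means(values, cell_areas, catchment_ids):
    """Area-weighted mean of a gridded field per catchment.

    All three arguments are flat, equal-length sequences with one entry per
    grid cell. A catchment id of None marks a cell outside every catchment.
    Cell areas are assumed to be strictly positive.
    """
    weighted_sum = defaultdict(float)
    total_area = defaultdict(float)
    for value, area, cid in zip(values, cell_areas, catchment_ids):
        if cid is None:
            continue  # cell does not drain into any catchment of interest
        weighted_sum[cid] += value * area
        total_area[cid] += area
    return {cid: weighted_sum[cid] / total_area[cid] for cid in total_area}

# Hypothetical example: four grid cells of precipitation, two catchments.
precip = [2.0, 4.0, 10.0, 99.0]
areas = [1.0, 3.0, 2.0, 5.0]
ids = ["a", "a", "b", None]
means = catchment_means(precip, areas, ids)
```

In practice, catchment membership follows the river network rather than a simple label per cell, so this should be read only as a sketch of the aggregation step.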

**Fig. 2:** Size and provenance of the datasets used for training different Earth system components.

The development of land and marine components means we now encounter missing data over the ocean and land, respectively. This required robust handling of missing values in anemoi-core and anemoi-transforms, since most machine learning models cannot natively process them. During training, missing values in the input fields are masked to prevent numerical instability or biased learning. The model is then allowed to generate predictions over the full domain, but values in masked regions are not considered in the training loss calculation. If a field is not part of the model input, its mask can be inferred from related variables—for example, snow cover, which is only part of the model output, uses the missing-value mask of snow depth, as illustrated in Fig. 3.

**Fig. 3:** Illustration of how missing values are handled in *anemoi*. While the model is allowed to predict snow cover over the oceans (left), these regions are masked and filled with missing values in the output (right; blue indicates masked values).
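The missing-value handling described above can be sketched in a few lines. This is a simplified illustration using NaN as the missing-value marker and plain Python lists as fields; it is not the anemoi-core or anemoi-transforms code, and the names `fill_missing`, `masked_mse` and `apply_output_mask` are hypothetical.

```python
import math

def fill_missing(field, fill_value=0.0):
    """Replace NaNs in an input field so the model receives finite numbers."""
    return [fill_value if math.isnan(x) else x for x in field]

def masked_mse(pred, target):
    """Mean squared error over points where the target is not missing."""
    sq_errors = [(p - t) ** 2 for p, t in zip(pred, target) if not math.isnan(t)]
    if not sq_errors:
        raise ValueError("target contains no valid points")
    return sum(sq_errors) / len(sq_errors)

def apply_output_mask(pred, reference):
    """Set predictions to NaN wherever a related reference field is missing."""
    return [math.nan if math.isnan(r) else p for p, r in zip(pred, reference)]

# Hypothetical three-point domain: two land points and one ocean point.
snow_depth = [0.4, 0.0, math.nan]      # input field, missing over the ocean
snow_cover_pred = [0.8, 0.1, 0.9]      # the model predicts everywhere
snow_cover_true = [1.0, 0.1, math.nan]

model_input = fill_missing(snow_depth)
loss = masked_mse(snow_cover_pred, snow_cover_true)  # ocean point ignored
output = apply_output_mask(snow_cover_pred, snow_depth)
```

Here the snow cover output borrows the mask of snow depth, mirroring the example in the text where an output-only field takes its mask from a related variable.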

Further innovations include the development of component-specific machine learning model architectures and exploration of attention mechanisms tailored to the characteristics of individual Earth system components (illustrated in Fig. 4). For example, land-surface processes are modelled using multilayer perceptrons (MLPs), reflecting their largely local and column-based nature, while hydrological processes along river networks are represented using long short-term memory (LSTM) based models to capture directed flow and temporal memory. For ocean components, the core encoder–processor–decoder architecture of the AIFS is retained, with alternative attention mechanisms used to better represent localised ocean dynamics.

A key development to enable skilful Earth system component predictions has been the careful scaling of variables and weighting of loss terms to account for differences in magnitude and characteristic timescales. This helps to ensure that both fast processes (e.g. near-surface fluxes) and slow processes (e.g. soil moisture evolution and ocean heat uptake) are learned effectively. Component-specific normalisation is applied to reflect differences in temporal and spatial scales, including tendency-based scaling for the land and ocean components, and loss normalisation by catchment size in the hydrology model.
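The source does not give the exact formulations, so the following is only one plausible reading of tendency-based scaling and of loss normalisation by catchment size. The functions `scaled_tendencies` and `area_normalised_mse` are hypothetical and use only the standard library.

```python
import statistics

def scaled_tendencies(series):
    """Step-to-step changes divided by their own standard deviation.

    Assumption: scaling by the spread of the tendencies, rather than of the
    state itself, puts slowly and quickly varying variables on a similar footing.
    """
    tendencies = [b - a for a, b in zip(series, series[1:])]
    std = statistics.pstdev(tendencies)
    if std == 0.0:
        raise ValueError("tendencies are constant; cannot scale")
    return [t / std for t in tendencies]

def area_normalised_mse(pred, obs, areas):
    """Squared streamflow errors divided by catchment area before averaging.

    Assumption: dividing by area stops large rivers dominating the loss.
    """
    errors = [((p - o) / a) ** 2 for p, o, a in zip(pred, obs, areas)]
    return sum(errors) / len(errors)

# A fast and a slow variable with the same shape but different magnitudes
# end up with near-identical scaled tendencies.
fast = scaled_tendencies([0.0, 10.0, 0.0, 10.0])
slow = scaled_tendencies([0.0, 0.1, 0.0, 0.1])

# A 10 % error on a large and on a small catchment contributes equally.
loss = area_normalised_mse([110.0, 1.1], [100.0, 1.0], [1000.0, 10.0])
```

The design idea in both cases is the same: remove differences in raw magnitude so that the optimiser does not favour whichever variable or catchment happens to have the largest numbers.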

Ensuring physical consistency across components remains a central challenge and has been addressed through a combination of architectural choices and post-processing constraints. One example of this can be found in the sea ice model, where sea ice concentration constrains other sea-ice variables such as ice velocities, as in the physical model.
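A post-processing constraint of this kind might look like the sketch below. The specific rule (clipping concentration to the range 0 to 1 and zeroing ice velocity where no ice is present) and the `min_concentration` threshold are assumptions for illustration; the source does not specify how the constraint is implemented in the sea ice model.

```python
def constrain_sea_ice(concentration, velocity, min_concentration=0.0):
    """Apply simple consistency rules to predicted sea-ice fields.

    Concentration is clipped to the physical range [0, 1]. Ice velocity is
    set to zero wherever the clipped concentration does not exceed the
    (hypothetical) threshold, since ice cannot move where there is no ice.
    """
    clipped = [min(max(c, 0.0), 1.0) for c in concentration]
    constrained_velocity = [
        v if c > min_concentration else 0.0
        for c, v in zip(clipped, velocity)
    ]
    return clipped, constrained_velocity

# Hypothetical raw model output at three grid points.
conc, vel = constrain_sea_ice([-0.1, 0.5, 1.2], [0.3, 0.2, 0.1])
```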

A key additional capability is the ability to run Earth system components as standalone machine learning models forced by atmospheric output from AIFS. This modular approach enables targeted experimentation and development of individual components informed by physical domain expertise. It also represents a first step toward progressively integrating ML components into a fully coupled Earth-system modelling framework.
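Running a component in standalone mode amounts to an autoregressive loop in which the component's own state is fed back in at every step while the atmospheric forcing is supplied externally. The sketch below shows that loop with a toy stand-in for the learned model; `rollout` and `toy_soil_step` are hypothetical names, and the toy dynamics are not taken from any ECMWF model.

```python
def rollout(step_fn, initial_state, forcings):
    """Advance a component forward in time, one forcing per step.

    step_fn(state, forcing) returns the next state. In a real setup it would
    be the trained ML component, and the forcings would come from the
    atmospheric model output.
    """
    states = [initial_state]
    for forcing in forcings:
        states.append(step_fn(states[-1], forcing))
    return states

def toy_soil_step(soil_moisture, precipitation, retention=0.9):
    """Toy stand-in for a learned land model: the soil dries, rain refills it."""
    return retention * soil_moisture + precipitation

# Two steps: a dry step followed by a step with rainfall.
trajectory = rollout(toy_soil_step, 1.0, [0.0, 1.0])
```

Because the forcing is just an argument, the same loop can be driven by reanalysis, by forecasts, or by perturbed inputs for sensitivity experiments, which is what makes the modular setup convenient for development.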

**Fig. 4:** Diversity of model architectures across Earth system components. The figure illustrates how information can propagate in the different models, highlighting the locations/data which can be used when making predictions for a given point. During training, the model learns how much attention to give to each of the potential inputs, i.e. the attention mechanism.

What’s coming next

As part of Phase 2 of Destination Earth, successful prototypes have been developed for various Earth system components (waves, ocean, land, sea ice, hydrology). These are further described in a number of dedicated DestinE blog posts and videos. The models were built by ECMWF, leveraging in-house domain-specific knowledge and making use of EuroHPC supercomputing resources. These prototypes will continue to evolve, co-developed by domain and machine learning experts to ensure efficiency, scientific robustness, physical consistency, and relevance for operational Earth system prediction.

Looking ahead, two complementary pathways are being pursued for advancing data-driven Earth system modelling. In the framework of DestinE, the Earth system components will be coupled into a full Earth system model, similar to the way physics-based weather forecasting and climate models are coupled. 

In parallel, the AIFS Single and ensemble (ENS) configurations used operationally at ECMWF were enhanced by including fields representing the land surface and ocean waves in AIFS v2.0, and variables representing sea-ice and surface ocean will be added to future releases.

Together, these efforts lay the foundations for comprehensive, fully data-driven Earth system prediction capabilities that combine atmospheric ML with specialised ML representations of the ocean and land, covering processes such as sea ice, ocean waves and currents, land surface interactions, and streamflow, all within the Destination Earth initiative.

Destination Earth is a European Union funded initiative launched in 2022, with the aim of building a digital replica of the Earth system by 2030. The initiative is being jointly implemented by three entrusted entities: the European Centre for Medium-Range Weather Forecasts (ECMWF), responsible for the creation of the first two ‘digital twins’ and the ‘Digital Twin Engine’; the European Space Agency (ESA), responsible for building the ‘Core Service Platform’; and the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT), responsible for the creation of the ‘Data Lake’.

We acknowledge the EuroHPC Joint Undertaking for awarding this project strategic access to the EuroHPC supercomputers LUMI, hosted by CSC (Finland) and the LUMI consortium; MareNostrum 5, hosted by BSC (Spain); Leonardo, hosted by Cineca (Italy); and MeluXina, hosted by LuxProvide (Luxembourg), through a EuroHPC Special Access call.

More information about Destination Earth is on the Destination Earth website and the EU Commission website.

For more information about ECMWF’s role visit ecmwf.int/DestinE

