We conduct an in-depth exploration to identify and understand distribution shifts for machine learning force fields (MLFFs). On example chemical datasets, we find that state-of-the-art models struggle with common distribution shifts. We demonstrate that this failure has multiple causes, including challenges associated with poorly connected graphs and unregularized learned representations, evidenced by jagged predicted potential energy surfaces for out-of-distribution systems.
Building on these observations, we propose two test-time refinement strategies as initial steps toward mitigating distribution shifts for MLFFs. Extensive experiments show that these strategies are effective, and they establish clear benchmarks that highlight ambitious but important generalization goals for the next generation of MLFFs. The success of test-time refinement also sheds light on why MLFFs are susceptible to distribution shifts: overfitting and poorly regularized representations, rather than only poor data or weak architectures. This suggests that while MLFFs are expressive enough to model diverse chemical spaces, they are not being effectively trained to do so.
We first formalize three criteria for identifying distribution shifts based on the features, labels, and graph structures in chemical datasets. This categorization provides a framework for understanding the types of distribution shifts an MLFF may encounter: feature shifts (e.g., elements unseen during training), label shifts (e.g., force norms outside the training distribution), and connectivity shifts (test graphs whose structure differs from the training graphs).
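The first two criteria can be checked with simple dataset statistics. The sketch below is a hypothetical illustration (the thresholds and summary statistics are our own choices, not the paper's exact criteria): a feature shift is flagged when a test system contains elements never seen during training, and a label shift when its per-atom force norms fall outside the bulk of the training distribution. Connectivity shifts require the graph structure and are handled separately.

```python
import numpy as np

def detect_shifts(train_elements, test_elements,
                  train_force_norms, test_force_norms,
                  quantile=0.99):
    """Flag feature and label distribution shifts for a test system.

    Illustrative sketch: "feature" shift = unseen chemical elements,
    "label" shift = per-atom force norms outside the training range.
    """
    shifts = {}
    # Feature shift: elements never observed during training.
    shifts["feature"] = bool(set(test_elements) - set(train_elements))
    # Label shift: force norms beyond the bulk of the training distribution.
    hi = np.quantile(train_force_norms, quantile)
    lo = np.quantile(train_force_norms, 1 - quantile)
    norms = np.asarray(test_force_norms)
    shifts["label"] = bool(((norms > hi) | (norms < lo)).any())
    return shifts
```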
First, we study how MLFFs tend to overfit to the (typically regular and well connected) graph structures encountered during training by looking at the graph Laplacian spectrum of chemical graphs. At test time, we can then identify when an MLFF encounters a graph with a Laplacian eigenvalue distribution that significantly differs from the training graphs. Since the edges in chemical graphs are typically generated by a radius graph, we refine the radius cutoff at test time to update testing graphs to more closely resemble the training structures, mitigating the connectivity distribution shift. We call this test-time radius refinement (RR). This refinement method addresses the source of connectivity distribution shifts and serves as a simple and effective initial strategy for handling new connectivities.
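A toy version of radius refinement (RR) can be sketched as follows. The spectral summary here (the mean Laplacian eigenvalue, which equals the mean node degree) is a deliberately crude stand-in for the paper's actual comparison of eigenvalue distributions; the idea is only to show the mechanism of sweeping the radius cutoff until the test graph's spectrum matches the training graphs.

```python
import numpy as np

def laplacian_eigenvalues(positions, cutoff):
    """Eigenvalues of the graph Laplacian of a radius graph."""
    dists = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    adj = ((dists < cutoff) & (dists > 0)).astype(float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))

def refine_radius(positions, train_spectrum, cutoffs):
    """Pick the cutoff whose Laplacian spectrum best matches training.

    Crude spectral summary (mean eigenvalue) for illustration only;
    returns the candidate cutoff minimizing the mismatch.
    """
    target = np.mean(train_spectrum)
    return min(
        cutoffs,
        key=lambda c: abs(np.mean(laplacian_eigenvalues(positions, c)) - target),
    )
```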
We further hypothesize that the current supervised training procedure for MLFFs can lead to overfitting, producing poor representations for out-of-distribution systems and jagged potential energy landscape predictions. To address this, we propose introducing inductive biases through improved training and inference strategies. We represent these inductive biases as cheap priors, such as classical force fields or simple ML models. Following previous test-time training (TTT) works, we pre-train (or jointly train) our model to learn features from the prior. At test time, we can then take gradient steps on the prior targets, since the prior is cheap to evaluate. These gradient steps incorporate inductive biases about the out-of-distribution samples into the model, regularizing the energy landscape and helping the model generalize.
We conduct experiments on chemical datasets to both identify the presence of distribution shifts and evaluate the effectiveness of our test-time refinement strategies in mitigating them. We first explore the MACE-OFF model in more detail, investigating distribution shifts from the SPICE dataset to the SPICEv2 dataset. We observe that despite its scale, MACE-OFF experiences force norm, connectivity, and element distribution shifts when evaluated on 10k new molecules from SPICEv2. Any deviation from the training distribution predictably results in an increase in force error.
We also evaluate our test-time refinement strategies against these distribution shifts by implementing RR on top of MACE-OFF and applying TTT to a GemNet-T model we trained. TTT reduces errors for GemNet-T on out-of-distribution force norms and connectivities, and also helps on new systems closer to the training distribution. RR applied to MACE-OFF effectively mitigates connectivity errors at minimal computational cost. Together, these improvements bring the errors of hundreds of molecular systems well below 25 meV/Å.
We establish an extreme distribution shift benchmark to evaluate the generalization ability of MLFFs on the MD17 dataset. We train a GemNet-T model on 3 molecules from MD17 (benzene, aspirin, and uracil), and we evaluate whether it can simulate two new molecules (naphthalene and toluene) that were unseen during training. Without TTT, the model is unable to stably simulate the new molecules and poorly reproduces observables. TTT enables stable simulations of the unseen molecules that accurately reproduce the distribution of interatomic distances h(r). Furthermore, TTT provides a better starting point for fine-tuning, decreasing the amount of data needed to reach in-distribution performance by more than 20x.
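The observable h(r) used above is just a histogram over interatomic distances. A minimal sketch for a single frame (no periodic boundary conditions assumed; bin count and range are arbitrary choices):

```python
import numpy as np

def h_of_r(positions, bins=50, r_max=5.0):
    """Distribution of interatomic distances h(r) for one frame.

    Collects all unique pairwise distances and bins them into a
    normalized histogram; returns (histogram, bin centers).
    """
    n = len(positions)
    i, j = np.triu_indices(n, k=1)              # unique atom pairs
    dists = np.linalg.norm(positions[i] - positions[j], axis=-1)
    hist, edges = np.histogram(dists, bins=bins, range=(0.0, r_max),
                               density=True)
    return hist, 0.5 * (edges[:-1] + edges[1:])
```

Comparing h(r) between a model-driven trajectory and the reference trajectory gives the stability metric reported above: an unstable simulation distorts or collapses this distribution.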
Our results reveal new insights into the current state of MLFFs: despite being expressive enough to model diverse chemical spaces, current models overfit to their training distributions, and test-time refinement offers a practical path toward better generalization.
For more details, check out our paper!
@misc{kreiman2025understandingmitigatingdistributionshifts,
      title={Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields},
      author={Tobias Kreiman and Aditi S. Krishnapriyan},
      year={2025},
      eprint={2503.08674},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.08674},
}