We conduct an in-depth exploration to identify and understand distribution shifts for machine learning force fields (MLFFs). On example chemical datasets, we find that state-of-the-art models struggle with common distribution shifts. We demonstrate that this failure has multiple causes, including challenges associated with poorly connected graphs and unregularized learned representations, evidenced by jagged predicted potential energy surfaces for out-of-distribution systems.
Building on these observations, we propose two test-time refinement strategies as initial steps toward mitigating distribution shifts for MLFFs. Extensive experiments show that these strategies are effective, and they establish clear benchmarks that highlight ambitious but important generalization goals for the next generation of MLFFs. The success of test-time refinement also sheds light on why MLFFs are susceptible to distribution shifts: overfitting and poorly regularized representations, rather than only poor data or weak architectures. This suggests that while MLFFs are expressive enough to model diverse chemical spaces, they are not being effectively trained to do so.
We first formalize three criteria for identifying distribution shifts based on the features, labels, and graph structures in chemical datasets. This categorization provides a framework for understanding the types of distribution shifts an MLFF may encounter: feature shifts (e.g., elements unseen during training), label shifts (e.g., force norms outside the training distribution), and connectivity shifts (test graphs whose structure differs from the training graphs).
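The first two criteria can be checked with simple dataset statistics. The sketch below is a hypothetical illustration (the thresholds and summary statistics are our own choices, not the paper's exact criteria): a feature shift is flagged when a test system contains elements never seen during training, and a label shift when its per-atom force norms fall outside the bulk of the training distribution. Connectivity shifts require the graph structure and are handled separately.

```python
import numpy as np

def detect_shifts(train_elements, test_elements,
                  train_force_norms, test_force_norms,
                  quantile=0.99):
    """Flag feature and label distribution shifts for a test system.

    Illustrative sketch: "feature" shift = unseen chemical elements,
    "label" shift = per-atom force norms outside the training range.
    """
    shifts = {}
    # Feature shift: elements never observed during training.
    shifts["feature"] = bool(set(test_elements) - set(train_elements))
    # Label shift: force norms beyond the bulk of the training distribution.
    hi = np.quantile(train_force_norms, quantile)
    lo = np.quantile(train_force_norms, 1 - quantile)
    norms = np.asarray(test_force_norms)
    shifts["label"] = bool(((norms > hi) | (norms < lo)).any())
    return shifts
```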
First, we study how MLFFs tend to overfit to the (typically regular and well connected) graph structures encountered during training by looking at the graph Laplacian spectrum of chemical graphs. At test time, we can then identify when an MLFF encounters a graph with a Laplacian eigenvalue distribution that significantly differs from the training graphs. Since the edges in chemical graphs are typically generated by a radius graph, we refine the radius cutoff at test time to update testing graphs to more closely resemble the training structures, mitigating the connectivity distribution shift. We call this test-time radius refinement (RR). This refinement method addresses the source of connectivity distribution shifts and serves as a simple and effective initial strategy for handling new connectivities.
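A toy version of radius refinement (RR) can be sketched as follows. The spectral summary here (the mean Laplacian eigenvalue, which equals the mean node degree) is a deliberately crude stand-in for the paper's actual comparison of eigenvalue distributions; the idea is only to show the mechanism of sweeping the radius cutoff until the test graph's spectrum matches the training graphs.

```python
import numpy as np

def laplacian_eigenvalues(positions, cutoff):
    """Eigenvalues of the graph Laplacian of a radius graph."""
    dists = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    adj = ((dists < cutoff) & (dists > 0)).astype(float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))

def refine_radius(positions, train_spectrum, cutoffs):
    """Pick the cutoff whose Laplacian spectrum best matches training.

    Crude spectral summary (mean eigenvalue) for illustration only;
    returns the candidate cutoff minimizing the mismatch.
    """
    target = np.mean(train_spectrum)
    return min(
        cutoffs,
        key=lambda c: abs(np.mean(laplacian_eigenvalues(positions, c)) - target),
    )
```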
We further hypothesize that the current supervised training procedure for MLFFs can lead to overfitting, producing poor representations for out-of-distribution systems and jagged potential energy landscape predictions. To address this, we propose introducing inductive biases through improved training and inference strategies. We represent these inductive biases as cheap priors, such as classical force fields or simple ML models. Following previous test-time training (TTT) works, we pre-train (or jointly train) our model to learn features from the prior. At test time, we can then take gradient steps on the prior targets, since the prior is cheap to evaluate. These gradient steps incorporate inductive biases about the out-of-distribution samples into the model, regularizing the energy landscape and helping the model generalize.
We conduct experiments on chemical datasets to both identify the presence of distribution shifts and evaluate the effectiveness of our test-time refinement strategies in mitigating them. We first explore the MACE-OFF model in more detail, investigating distribution shifts from the SPICE dataset to the SPICEv2 dataset. We observe that despite its scale, MACE-OFF experiences force norm, connectivity, and element distribution shifts when evaluated on 10k new molecules from SPICEv2. Any deviation from the training distribution predictably results in an increase in force error.
We also evaluate our test-time refinement strategies against these distribution shifts by implementing RR on top of MACE-OFF and applying TTT to a GemNet-T model we trained. TTT reduces errors for GemNet-T on out-of-distribution force norms and connectivities, and also helps on new systems closer to the training distribution. RR applied to MACE-OFF effectively mitigates connectivity errors at minimal computational cost. Together, these improvements bring the errors of hundreds of molecular systems well below 25 meV/Å.
We establish an extreme distribution shift benchmark to evaluate the generalization ability of MLFFs on the MD17 dataset. We train a GemNet-T model on 3 molecules from MD17 (benzene, aspirin, and uracil), and we evaluate whether it can simulate two new molecules (naphthalene and toluene) that were unseen during training. Without TTT, the model is unable to stably simulate the new molecules and poorly reproduces observables. TTT enables stable simulations of the unseen molecules that accurately reproduce the distribution of interatomic distances h(r). Furthermore, TTT provides a better starting point for fine-tuning, decreasing the amount of data needed to reach in-distribution performance by more than 20x.
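The observable h(r) used above is just a histogram over interatomic distances. A minimal sketch for a single frame (no periodic boundary conditions assumed; bin count and range are arbitrary choices):

```python
import numpy as np

def h_of_r(positions, bins=50, r_max=5.0):
    """Distribution of interatomic distances h(r) for one frame.

    Collects all unique pairwise distances and bins them into a
    normalized histogram; returns (histogram, bin centers).
    """
    n = len(positions)
    i, j = np.triu_indices(n, k=1)              # unique atom pairs
    dists = np.linalg.norm(positions[i] - positions[j], axis=-1)
    hist, edges = np.histogram(dists, bins=bins, range=(0.0, r_max),
                               density=True)
    return hist, 0.5 * (edges[:-1] + edges[1:])
```

Comparing h(r) between a model-driven trajectory and the reference trajectory gives the stability metric reported above: an unstable simulation distorts or collapses this distribution.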
Our results reveal new insights into the current state of MLFFs: despite being expressive enough to model diverse chemical spaces, current models overfit to their training distributions, and test-time refinement offers a practical path toward better generalization.
For more details, check out our paper!
@misc{kreiman2025understandingmitigatingdistributionshifts,
      title={Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields},
      author={Tobias Kreiman and Aditi S. Krishnapriyan},
      year={2025},
      eprint={2503.08674},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.08674},
}