We visualize the attention scores from the 5th layer of our Transformer for 8 molecules from the OMol25 validation dataset. Click on an atom to see the learned attention scores for that atom. Darker purple indicates more attention.
The Transformer naturally learns attention scores that are inversely proportional to interatomic distance.
Because it has no explicit, hard-coded graph priors, the Transformer learns adaptive attention patterns, such as an effective radius cutoff that widens as atoms move farther apart. In contrast, GNNs rely on a fixed graph-construction algorithm that does not adapt to atomic environments.
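For readers who want to probe this relationship themselves, here is a minimal sketch (not code from the paper) of how per-atom attention could be compared against inverse interatomic distance, assuming access to a molecule's Cartesian positions and a head-averaged attention matrix:

```python
import numpy as np

def attention_vs_distance(attn: np.ndarray, positions: np.ndarray):
    """Correlate attention scores (n_atoms, n_atoms) with inverse pairwise distance."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)               # pairwise distances (Å)
    mask = ~np.eye(len(positions), dtype=bool)         # ignore self-attention entries
    inv_dist = 1.0 / dist[mask]
    corr = np.corrcoef(attn[mask], inv_dist)[0, 1]     # Pearson correlation
    return dist[mask], attn[mask], corr
```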
We compare the energy and force mean absolute errors (MAEs) of a graph-free Transformer to a state-of-the-art equivariant GNN on the OMol25 dataset. Under an equal training compute budget, the Transformer achieves accuracy competitive with the GNN, making it a good starting point for analyzing the interatomic interactions learned without graph priors.
| Model | FLOPs | Energy MAE (meV) | Forces MAE (meV/Å) |
|---|---|---|---|
| eSEN-sm-d 6M | O(10²⁰) | 129.77 | 13.01 |
| Transformer 1B (Ours) | 8.5×10¹⁹ | 117.99 | 18.35 |
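For clarity, the metrics in the table are standard mean absolute errors. A short sketch under one common convention (per-molecule energies in eV, per-atom, per-component forces in eV/Å, both converted to meV); the function name and array shapes are illustrative, not the paper's evaluation code:

```python
import numpy as np

def energy_force_mae(e_pred, e_true, f_pred, f_true):
    """MAEs in meV and meV/Å, assuming inputs are in eV and eV/Å."""
    e_pred, e_true = np.asarray(e_pred), np.asarray(e_true)   # (n_molecules,)
    f_pred, f_true = np.asarray(f_pred), np.asarray(f_true)   # (n_atoms, 3), concatenated over molecules
    energy_mae = 1000.0 * np.abs(e_pred - e_true).mean()
    forces_mae = 1000.0 * np.abs(f_pred - f_true).mean()      # averaged over atoms and components
    return energy_mae, forces_mae
```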
Interestingly, running a 1B-parameter Transformer is faster than running a 6M-parameter equivariant GNN, since the Transformer leverages modern hardware and software frameworks!
| Model | Forward Latency (ms) | Training Speed (atoms/sec) |
|---|---|---|
| eSEN-sm-d 6M | 26.3 | 32k+ |
| Transformer 1B (Ours) | 17.2 | 42k+ |
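The latency numbers above naturally depend on hardware and batch composition. Here is a generic, minimal sketch of how forward latency could be measured on a GPU; the `model` and `batch` objects (a dict of tensors) are placeholders, and this is not the paper's benchmarking script:

```python
import time
import torch

@torch.no_grad()
def forward_latency_ms(model, batch, n_warmup=10, n_iters=100):
    """Median forward-pass latency in milliseconds for one batch on a CUDA device."""
    model.eval()
    for _ in range(n_warmup):            # warm up kernels and any autotuning
        model(**batch)
    torch.cuda.synchronize()
    times = []
    for _ in range(n_iters):
        start = time.perf_counter()
        model(**batch)
        torch.cuda.synchronize()         # wait for GPU work before stopping the clock
        times.append((time.perf_counter() - start) * 1e3)
    return sorted(times)[len(times) // 2]
```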
Using an unmodified Transformer allows us to leverage mature software and hardware frameworks, enabling us to train models with up to 1B parameters. We find that the Transformer exhibits consistent, predictable gains as training resources scale, in line with empirical scaling laws observed in other domains.
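Such scaling behavior is usually summarized by fitting a power law, loss ≈ a · C^(-b), to measured (compute, validation loss) pairs. A minimal sketch of that fit in log-log space; the input arrays are whatever measurements the reader has, nothing here is taken from the paper:

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)   # returns (a, b)
```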
The graph-free Transformer naturally learns physically consistent attention patterns from data, such as attention scores that decay inversely with interatomic distance. The absence of hard-coded graph priors allows the Transformer to learn adaptive attention patterns, such as an effective radius cutoff that increases when atoms become farther apart.
We simply train a standard Transformer on the OMol25 dataset, using the following discretization scheme:
Our only architectural modification is an additional continuous input stream alongside the discrete tokens; the attention mechanism itself is completely unchanged. This lets us leverage existing Transformer software and hardware to train our models efficiently.
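As an illustration of this design (a simplified sketch, not the paper's exact implementation; the layer sizes, vocabulary size, and feature dimensions are placeholders), discrete token embeddings and a linear projection of continuous per-atom features can simply be summed before an off-the-shelf Transformer encoder:

```python
import torch
import torch.nn as nn

class MolecularTransformer(nn.Module):
    """Sketch: a vanilla Transformer encoder with an added continuous input stream."""

    def __init__(self, vocab_size=256, d_model=512, n_heads=8, n_layers=12, d_cont=3):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # discrete token stream
        self.cont_proj = nn.Linear(d_cont, d_model)             # continuous stream (e.g., 3D coordinates)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # attention mechanism left unchanged
        self.lm_head = nn.Linear(d_model, vocab_size)            # next-token prediction head

    def forward(self, cont_feats, tokens=None, attn_mask=None):
        x = self.cont_proj(cont_feats)
        if tokens is not None:
            x = x + self.token_embed(tokens)                     # sum the two input streams
        h = self.encoder(x, mask=attn_mask)
        return self.lm_head(h), h                                # token logits and hidden states
```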
During pre-training, the model learns to autoregressively predict the discrete tokens. We then replace the causal attention mask with a bi-directional attention mask and fine-tune the model on the energy and force targets using only continuous inputs and outputs.
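Continuing the sketch above, the two stages might look roughly like this, where the causal mask is dropped for fine-tuning and small regression heads (e.g., `nn.Linear(d_model, 1)` for energy and `nn.Linear(d_model, 3)` for forces) are attached. The loss weighting and the per-atom energy sum are illustrative choices, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, tokens, cont_feats):
    """Autoregressive pre-training: predict the next discrete token under a causal mask."""
    m = tokens.size(1) - 1
    causal = torch.triu(torch.ones(m, m, dtype=torch.bool, device=tokens.device), diagonal=1)
    logits, _ = model(cont_feats[:, :-1], tokens=tokens[:, :-1], attn_mask=causal)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(model, heads, optimizer, cont_feats, energy, forces, force_weight=10.0):
    """Fine-tuning: bi-directional attention (no mask), continuous inputs, regression targets."""
    _, h = model(cont_feats)                            # no mask -> full bi-directional attention
    e_pred = heads["energy"](h).squeeze(-1).sum(dim=1)  # illustrative: sum per-atom energy contributions
    f_pred = heads["forces"](h)                         # per-atom 3-vector force prediction
    loss = F.l1_loss(e_pred, energy) + force_weight * F.l1_loss(f_pred, forces)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```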
Our findings suggest that Transformers can learn many of the graph-based inductive biases typically built into current ML models for chemistry—while doing so more flexibly. We hope these findings point towards a standardized, widely applicable architecture for molecular modeling that draws on insights from the broader deep learning community.
For more details, check out our paper!
@article{kreiman2025transformers,
  title={Transformers Discover Molecular Structure Without Graph Priors},
  author={Kreiman, Tobias and Bai, Yutong and Atieh, Fadi and Weaver, Elizabeth and Qu, Eric and Krishnapriyan, Aditi S},
  journal={arXiv preprint arXiv:2510.02259},
  year={2025}
}