Transformers Discover Molecular Structure Without Graph Priors

UC Berkeley
Problem: Can a model without graphs or physical priors learn molecular energies and forces and adaptively discover molecular interactions from data?

Why is this important: Graph Neural Networks (GNNs) dominate molecular ML but rely on predefined graphs that limit their flexibility and slow down inference.

Approach: We show that a graph-free Transformer matches a state-of-the-art equivariant GNN under equal compute, adaptively learns distance-aware patterns, and improves predictably with scale.

What Intermolecular Structure Do Graph-Free Transformers Discover?


We visualize the attention scores from the 5th layer of our Transformer for 8 molecules from the OMol25 validation dataset. Click on an atom to see the learned attention scores for that atom. Darker purple indicates more attention.


The Transformer naturally learns attention patterns that are inversely proportional to interatomic distance.


Since the Transformer has no explicit hard-coded graph priors, it learns adaptive attention patterns, such as an effective radius cutoff that increases when atoms become farther apart. In contrast, GNNs use a fixed graph-construction algorithm that does not adapt to atomic environments.
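
As a concrete, hypothetical illustration of what "effective radius" means here, the sketch below computes, for each atom, the smallest distance that captures a chosen fraction of that atom's attention mass. The attention matrix and coordinates are assumed to have been extracted elsewhere, and the 90% threshold is arbitrary.

```python
import numpy as np

def effective_attention_radius(attn, coords, mass_fraction=0.9):
    """Per-atom radius containing `mass_fraction` of that atom's attention mass.

    attn:   (n_atoms, n_atoms) row-normalized attention weights (one layer/head)
    coords: (n_atoms, 3) Cartesian coordinates in Angstrom
    """
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    radii = np.empty(len(coords))
    for i in range(len(coords)):
        order = np.argsort(dists[i])                 # neighbors sorted by distance
        cum = np.cumsum(attn[i, order])              # accumulated attention mass
        k = np.searchsorted(cum, mass_fraction)      # first index reaching the threshold
        radii[i] = dists[i, order[min(k, len(order) - 1)]]
    return radii
```

Plotting these radii for molecules whose atoms are pulled farther apart is one way to see the adaptive cutoff behavior described above.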

What Can Graph-Free Transformers Learn?

➔ Comparison to an Equivariant GNN on OMol25

We compare the energy and force mean absolute errors (MAEs) of a graph-free Transformer to a state-of-the-art equivariant GNN on the OMol25 dataset. Under an equal training compute budget, the Transformer achieves accuracy competitive with the GNN, making it a good starting point for analyzing the interatomic interactions learned without graph priors.

| Model | FLOPs | Energy MAE (meV) | Forces MAE (meV/Å) |
| --- | --- | --- | --- |
| eSEN-sm-d 6M | O(10²⁰) | 129.77 | 13.01 |
| Transformer 1B (Ours) | 8.5×10¹⁹ | 117.99 | 18.35 |
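
For clarity on the metrics, both columns are plain mean absolute errors reported in meV and meV/Å. A minimal sketch of the computation, assuming NumPy arrays of predictions and references in eV and eV/Å (the per-component force convention here is an assumption):

```python
import numpy as np

def energy_force_mae(e_pred, e_ref, f_pred, f_ref):
    """Return (energy MAE in meV, force MAE in meV/Angstrom).

    e_pred, e_ref: (n_structures,) energies in eV
    f_pred, f_ref: (total_atoms, 3) forces in eV/Angstrom
    """
    energy_mae = 1000.0 * np.abs(e_pred - e_ref).mean()   # eV -> meV
    force_mae = 1000.0 * np.abs(f_pred - f_ref).mean()    # per-component MAE, eV/A -> meV/A
    return energy_mae, force_mae
```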

Interestingly, the 1B-parameter Transformer is faster than the 6M-parameter equivariant GNN, since it leverages modern hardware and software frameworks!

| Model | Forward Latency (ms) | Training Speed (atoms/sec) |
| --- | --- | --- |
| eSEN-sm-d 6M | 26.3 | 32k+ |
| Transformer 1B (Ours) | 17.2 | 42k+ |
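
The forward-latency column can be approximated with standard PyTorch wall-clock timing. The sketch below is a rough, generic benchmark: the model, batch, and atom count are placeholders, and the training-speed column additionally includes the backward pass and optimizer step, which this sketch does not time.

```python
import time
import torch

@torch.no_grad()
def forward_benchmark(model, batch, n_atoms_in_batch, warmup=10, iters=100):
    """Rough forward latency (ms) and inference throughput (atoms/sec)."""
    device = next(model.parameters()).device
    for _ in range(warmup):                  # warm up kernels / compilation caches
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    atoms_per_sec = n_atoms_in_batch * iters / elapsed
    return latency_ms, atoms_per_sec
```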

➔ Scaling Laws

Using an unmodified Transformer allows us to leverage mature software and hardware frameworks, enabling us to train models of up to 1B parameters. We find that the Transformer exhibits consistent, predictable gains as training resources are scaled, in line with empirical scaling laws observed in other domains.


Figure: model scaling on OMol25 as a function of FT FLOPs.
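
Trends like this are commonly summarized by fitting a power law, loss ~ a * C^(-b), in log-log space against training compute C. A minimal sketch with placeholder numbers (not our measured results):

```python
import numpy as np

def fit_power_law(flops, loss):
    """Least-squares fit of loss ~ a * flops**(-b) in log-log space."""
    slope, intercept = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return np.exp(intercept), -slope

# Hypothetical, illustrative values only.
flops = np.array([1e18, 1e19, 1e20])
loss = np.array([0.30, 0.17, 0.09])
a, b = fit_power_law(flops, loss)
print(f"loss ~ {a:.3g} * C^(-{b:.2f})")
```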

➔ Attention Patterns

The graph-free Transformer naturally learns physically consistent attention patterns from data, such as attention scores that decay inversely with interatomic distance. The absence of hard-coded graph priors allows it to learn adaptive attention patterns, including an effective radius cutoff that increases when atoms move farther apart.


Figures: attention vs. distance across layers; effective attention radius.
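
One way to produce an attention-versus-distance curve like the one summarized above is to bin attention weights by interatomic distance and average within each bin. A minimal sketch, assuming the attention matrix and coordinates for a molecule are already in hand (the bin count and distance range are arbitrary):

```python
import numpy as np

def attention_vs_distance(attn, coords, n_bins=20, r_max=12.0):
    """Mean attention weight as a function of interatomic distance.

    attn:   (n_atoms, n_atoms) attention weights for one layer/head
    coords: (n_atoms, 3) Cartesian coordinates in Angstrom
    Returns bin centers and mean attention per bin (off-diagonal pairs only).
    """
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    mask = ~np.eye(len(coords), dtype=bool)          # ignore self-attention
    edges = np.linspace(0.0, r_max, n_bins + 1)
    idx = np.digitize(dists[mask], edges) - 1
    mean_attn = np.array([
        attn[mask][idx == b].mean() if np.any(idx == b) else np.nan
        for b in range(n_bins)
    ])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, mean_attn
```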

How did we do it?

We simply train a standard Transformer on the OMol25 dataset, using the following discretization scheme:

Figure: discretization scheme.

Our only modification to the architecture is the addition of a continuous input stream alongside the discrete tokens. The attention mechanism remains completely unchanged. We leverage existing Transformer software and hardware to train our models efficiently.

Figure: architecture overview with discrete tokens and a continuous input stream.
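
As a rough, hypothetical sketch of what such a hybrid input layer could look like, the module below embeds element tokens as usual and adds a linear projection of the Cartesian coordinates. The class and argument names are illustrative and do not come from our codebase.

```python
import torch
import torch.nn as nn

class HybridAtomEmbedding(nn.Module):
    """Hypothetical input layer: discrete element tokens + continuous coordinates."""

    def __init__(self, n_element_types: int, d_model: int):
        super().__init__()
        self.token_embed = nn.Embedding(n_element_types, d_model)  # discrete stream
        self.coord_proj = nn.Linear(3, d_model)                    # continuous stream

    def forward(self, elements: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        """elements: (batch, n_atoms) integer element indices
           coords:   (batch, n_atoms, 3) Cartesian coordinates
           Returns per-atom embeddings fed to an unmodified Transformer."""
        return self.token_embed(elements) + self.coord_proj(coords)
```

Because the output is just a sequence of per-atom vectors, everything downstream can remain a stock Transformer.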

During pre-training, the model learns to autoregressively predict the discrete tokens. We then replace the causal attention mask with a bi-directional attention mask and fine-tune the model on the energy and force targets using only continuous inputs and outputs.
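
In code terms, that switch amounts to swapping the attention mask and the training objective. The sketch below is a simplified illustration; the L1 losses and force weighting are assumptions rather than our exact recipe.

```python
import torch
import torch.nn as nn

def attention_mask(n_atoms: int, causal: bool, device=None) -> torch.Tensor:
    """Causal additive mask for pre-training, all-zeros (bidirectional) mask for fine-tuning."""
    if causal:
        # -inf above the diagonal blocks attention to future tokens
        return torch.triu(
            torch.full((n_atoms, n_atoms), float("-inf"), device=device), diagonal=1
        )
    return torch.zeros(n_atoms, n_atoms, device=device)

def finetune_loss(energy_pred, energy_ref, forces_pred, forces_ref, force_weight=10.0):
    """Joint energy/force regression loss used after switching to bidirectional attention."""
    e_loss = nn.functional.l1_loss(energy_pred, energy_ref)
    f_loss = nn.functional.l1_loss(forces_pred, forces_ref)
    return e_loss + force_weight * f_loss
```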

What does this mean for the future of molecular machine learning?

Our findings suggest that Transformers can learn many of the graph-based inductive biases typically built into current ML models for chemistry—while doing so more flexibly. We hope these findings point towards a standardized, widely applicable architecture for molecular modeling that draws on insights from the broader deep learning community.

For more details, check out our paper!

BibTeX

@article{kreiman2025transformers,
  title={Transformers Discover Molecular Structure Without Graph Priors},
  author={Kreiman, Tobias and Bai, Yutong and Atieh, Fadi and Weaver, Elizabeth and Qu, Eric and Krishnapriyan, Aditi S},
  journal={arXiv preprint arXiv:2510.02259},
  year={2025}
}