arxiv:2504.20752

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Published on Apr 29
Submitted by fsteinbauer on May 6
#1 Paper of the day

Abstract

Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio phi_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing phi_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
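
To make the ratio concrete, here is a minimal sketch (not the paper's code) of how phi_r could be computed for a toy knowledge graph of (head, relation, tail) triples, treating two-hop compositions as the inferred facts, and how synthetic facts chained onto existing entities can raise it. All entity names, relations, and numbers below are illustrative assumptions.

```python
import random
from collections import defaultdict

# Toy illustration of phi_r = |inferred facts| / |atomic facts| as described in
# the abstract. Entities, relations, and the augmentation heuristic are invented
# for this sketch; they are not taken from the paper.

def two_hop_inferred(atomic):
    """All facts (h, r1/r2, t) obtainable by composing two atomic facts."""
    out_edges = defaultdict(list)
    for h, r, t in atomic:
        out_edges[h].append((r, t))
    inferred = set()
    for h, r1, mid in atomic:
        for r2, t in out_edges[mid]:
            inferred.add((h, f"{r1}/{r2}", t))
    return inferred

def phi_r(atomic):
    return len(two_hop_inferred(atomic)) / max(1, len(atomic))

def augment(atomic, n_synthetic, relations, seed=0):
    """Chain synthetic facts onto existing tail entities so every added atomic
    fact also creates new two-hop (inferred) facts. The synthetic facts may be
    factually wrong; only the relational structure matters here."""
    rng = random.Random(seed)
    atomic = set(atomic)
    tails = sorted({t for _, _, t in atomic})
    for i in range(n_synthetic):
        head = rng.choice(tails)
        atomic.add((head, rng.choice(relations), f"synthetic_entity_{i}"))
    return atomic

# Tiny example: phi_r rises after augmentation.
kg = {("Alice", "born_in", "Graz"), ("Graz", "located_in", "Austria")}
print(phi_r(kg))                                   # 0.5 (1 inferred / 2 atomic)
kg_aug = augment(kg, n_synthetic=20, relations=["located_in", "part_of"])
print(round(phi_r(kg_aug), 2))                     # higher ratio after augmentation
```

The paper's augmentation is designed more carefully so that phi_r clears the grokking threshold; the toy example only shows why adding even incorrect facts can increase the inferred-to-atomic ratio.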

Community

Paper author · Paper submitter

Hey, we made grokking work for transformers and real-world data.
[Attached figure: introduction_accuracy.png]
Of course, there are some caveats, and it's not straightforward to apply. But we think the results so far are actually quite promising! For more details, check out our paper here.

Could you put the code reference on GitHub? I'm interested in checking whether grokking GPT-2 into multi-hop reasoning is viable. Do you think it is? If so, what would you suggest doing?

Paper author

I think it is definitely possible, but we haven't tried it. I assume it might take much longer and/or need more inferred facts (i.e., a higher ratio phi_r), as the generalization circuit that needs to form within the layers is more complex and deeper. Actually, we are currently working on an approach to enable grokking for more complex (and deeper) reasoning subcircuits, to tackle more unstructured/messy datasets.
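
A rough sketch (not from the paper, and untested) of how one could probe this with a small GPT-2: train far past the point where the atomic and inferred training facts are memorized, under strong weight decay, and keep evaluating held-out multi-hop queries for a delayed accuracy jump. The model size, hyperparameters, and data-loader interface below are assumptions.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative only: config, hyperparameters, and the data-loader format
# (batches of (input_ids, answer_ids)) are assumptions, not the paper's setup.
config = GPT2Config(n_layer=4, n_head=4, n_embd=256, vocab_size=5000)
model = GPT2LMHeadModel(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def answer_accuracy(loader):
    """Exact-match accuracy on the next token after each query prompt."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for input_ids, answer_ids in loader:
            logits = model(input_ids).logits[:, -1, :]
            correct += (logits.argmax(-1) == answer_ids).sum().item()
            total += answer_ids.numel()
    model.train()
    return correct / total

def train(train_loader, memorization_loader, multihop_loader, max_steps=200_000):
    """Keep training long after the training facts are memorized and watch the
    held-out multi-hop accuracy for the delayed (grokking-style) jump."""
    model.train()
    step = 0
    while step < max_steps:
        for input_ids, _ in train_loader:
            loss = model(input_ids, labels=input_ids).loss  # standard LM loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % 5_000 == 0:
                print(step,
                      "memorization:", answer_accuracy(memorization_loader),
                      "multi-hop:", answer_accuracy(multihop_loader))
            if step >= max_steps:
                break
```

Starting from the pretrained gpt2 checkpoint instead of a fresh config is the more interesting variant of the question above; the loop stays the same, but, as the reply suggests, it would likely need a higher phi_r and many more steps.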

Regarding the code: the training data is not the only thing that is severely unstructured 😅
@monsetrum If you have the time, it would be really nice to clean up the code and put it online, as people seem to be interested (also on Papers With Code, if we are on it).

An audio overview: https://youtu.be/ov0Pxy8otjk

Paper author

Not certain whether I should report you or thank you for what you did with our paper there. 😅
IMO, the text-to-speech is technically well executed, though! (The "OD" is pronounced weirdly...)

