Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
Abstract
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio phi_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing phi_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
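To make the ratio phi_r concrete, here is a minimal toy sketch in Python. It rests on assumptions not taken from the paper: facts are (head, relation, tail) triples, an inferred fact is any two-hop composition of two atomic facts, and phi is a single global ratio of inferred to atomic facts. The function names and the example entities are hypothetical, and the authors' actual definition and augmentation procedure may differ.

```python
from collections import defaultdict


def two_hop_facts(atomic):
    """Compose (h, r1, b) and (b, r2, t) into inferred facts (h, (r1, r2), t).

    Illustrative assumption: an "inferred fact" is any two-hop composition.
    """
    by_head = defaultdict(list)
    for h, r, t in atomic:
        by_head[h].append((r, t))

    inferred = set()
    for h, r1, bridge in atomic:
        # .get avoids creating empty entries for entities with no outgoing facts
        for r2, t in by_head.get(bridge, []):
            inferred.add((h, (r1, r2), t))
    return inferred


def inferred_to_atomic_ratio(atomic):
    """Global ratio of inferred to atomic facts (a simplified stand-in for phi_r)."""
    return len(two_hop_facts(atomic)) / len(atomic)


# Hypothetical toy knowledge graph.
atomic = {
    ("Alice", "born_in", "Paris"),
    ("Bob", "born_in", "Lyon"),
    ("Paris", "located_in", "France"),
    ("Lyon", "located_in", "France"),
}
phi_before = inferred_to_atomic_ratio(atomic)

# One synthetic atomic fact that extends existing chains yields two new inferred facts.
augmented = atomic | {("France", "part_of", "Europe")}
phi_after = inferred_to_atomic_ratio(augmented)

print(f"phi before augmentation: {phi_before:.2f}")
print(f"phi after augmentation:  {phi_after:.2f}")
```

In this toy graph a single synthetic fact that connects to existing chains raises the ratio, which is the quantity the abstract says must exceed a threshold for grokking to occur.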
Community
Can you put the reference code on GitHub? I'm interested in checking whether grokking GPT-2 into multi-hop reasoning is viable. Do you think it is? If so, what would you suggest doing?
I think it is definitely possible, but we haven't tried. I assume it might take much longer and/or need more inferred facts (i.e., a higher ratio phi_r), as the generalization circuit that needs to form within the layers is more complex and deeper. Actually, we are currently working on an approach to enable grokking for more complex (and deeper) reasoning subcircuits to tackle more unstructured / messy datasets.
Regarding Code: The training data is not the only thing that is severely unstructured
@monsetrum
If you have the time, it would really be nice to clean the code and put it online as people seem to be interested (also on Papers With Code if we are on it).
Not certain whether I should report you or thank you for what you did with our paper there.
Imo, the text-to-speech is technically nicely executed, though! (The "OD" is pronounced weirdly...)