---
library_name: transformers
tags: []
pipeline_tag: text-generation
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---
**Repository for:**
**ThinkEdit-deepseek-qwen-32b**
(We also release ThinkEdit-deepseek-qwen-1.5b, ThinkEdit-deepseek-llama3-8b, and ThinkEdit-deepseek-qwen-14b.)
**Authors**: Chung-En Sun, Ge Yan, Tsui-Wei Weng
**Paper**: [ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models](https://arxiv.org/abs/2503.22048)
**GitHub**: https://github.com/Trustworthy-ML-Lab/ThinkEdit
---
## Introduction
Reasoning-augmented models sometimes fail by generating **overly short**, abstract chain-of-thought (CoT) reasoning, which hurts their accuracy.
**ThinkEdit** is a lightweight weight-editing method that:
- Identifies ~4% of "short reasoning" attention heads
- Edits only ~0.2% of total parameters
- Removes the "short reasoning" direction from their output
- Boosts performance, especially on cases with short reasoning traces
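The core weight edit above can be sketched in a few lines: project the "short reasoning" direction out of the output projection of an identified head. The dimensions, variable names, and random weights below are illustrative only, not the paper's exact extraction procedure:

```python
import torch

torch.manual_seed(0)

d_model, d_head = 512, 64  # illustrative sizes, not the real model's

# W_o: the slice of the attention output projection belonging to one
# identified "short reasoning" head (random here for illustration)
W_o = torch.randn(d_model, d_head)

# v: unit-norm "short reasoning" direction in the residual stream
v = torch.randn(d_model)
v = v / v.norm()

# Remove the direction from the head's contribution: W_o' = (I - v v^T) W_o
W_o_edited = W_o - torch.outer(v, v) @ W_o

# The edited head can no longer write along v: v^T W_o' ≈ 0
print(float((v @ W_o_edited).abs().max()))  # near zero
```

Because only the `W_o` slices of ~4% of heads are touched, the edit stays at roughly 0.2% of total parameters.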
---
## Full Performance Results
### 1. Overall Accuracy
| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---------------------------------|---------------------|----------------------|---------------------|---------------------|---------------------|
| deepseek-qwen-32b | 92.97 ± 0.39 | 95.93 ± 0.83 | **96.41 ± 0.45** | 91.27 ± 0.53 | **91.62 ± 0.58** |
| **ThinkEdit-deepseek-qwen-32b** | **95.25 ± 0.25** | **98.02 ± 0.31** | 96.02 ± 0.42 | **91.31 ± 0.50** | 91.60 ± 0.65 |
| deepseek-qwen-14b | 90.80 ± 0.36 | 95.08 ± 0.65 | 96.32 ± 0.35 | 90.25 ± 0.72 | 91.48 ± 0.55 |
| **ThinkEdit-deepseek-qwen-14b** | **93.78 ± 0.50** | **96.56 ± 0.84** | **96.38 ± 0.52** | **91.03 ± 0.44** | **91.92 ± 0.63** |
| deepseek-llama3-8b | 82.26 ± 0.91 | 96.01 ± 0.62 | 93.46 ± 0.84 | 85.49 ± 0.83 | 87.26 ± 1.16 |
| **ThinkEdit-deepseek-llama3-8b**| **89.44 ± 0.55** | **96.19 ± 0.73** | **94.44 ± 0.31** | **86.49 ± 0.54** | **88.06 ± 1.09** |
| deepseek-qwen-1.5b | 79.15 ± 1.08 | 68.52 ± 1.56 | 93.00 ± 0.33 | **75.48 ± 0.90** | 82.22 ± 1.29 |
| **ThinkEdit-deepseek-qwen-1.5b**| **84.56 ± 0.79** | **90.66 ± 0.97** | **93.66 ± 0.62** | 75.05 ± 0.82 | **82.24 ± 0.89** |
---
### 2. Accuracy on Short Reasoning Cases (shortest 5% / 10% / 20% of responses)
| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---------------------------------|---------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|
| deepseek-qwen-32b | 98.31 / 97.18 / 96.20 | 97.78 / 97.03 / 95.87 | 100.00 / 100.00 / **98.97** | 93.03 / 96.36 / 97.35 | 86.40 / 92.00 / 94.00 |
| **ThinkEdit-deepseek-qwen-32b** | **98.92** / **97.71** / **97.83** | **97.78** / **97.57** / **97.20** | **100.00** / **100.00** / 98.74 | **98.03** / **98.64** / **97.99** | **92.00** / **94.40** / **95.80** |
| deepseek-qwen-14b | **96.31** / 95.65 / 92.93 | 93.89 / **96.22** / 95.60 | 99.52 / **99.30** / 97.70 | 89.39 / 94.32 / 96.25 | 86.40 / 91.40 / 93.50 |
| **ThinkEdit-deepseek-qwen-14b** | **96.31** / **96.18** / **96.77** | **97.78** / 95.14 / **96.53** | **99.53** / 98.62 / **98.67** | **96.67** / **97.88** / **98.11** | **91.20** / **93.20** / **95.00** |
| deepseek-llama3-8b | 88.92 / 87.18 / 85.82 | 97.22 / 96.49 / 96.80 | 97.14 / 94.88 / 94.83 | 78.64 / 88.79 / 93.41 | 82.00 / 81.40 / 88.30 |
| **ThinkEdit-deepseek-llama3-8b**| **97.08** / **95.27** / **93.95** | **97.78** / **98.65** / **97.87** | **100.00** / **99.30** / **98.62** | **95.61** / **96.89** / **97.12** | **92.80** / **93.60** / **94.40** |
| deepseek-qwen-1.5b | 88.46 / 87.48 / 85.02 | 62.78 / 62.16 / 60.53 | **97.62** / 95.12 / 93.91 | 91.52 / 95.00 / 95.72 | 82.40 / 89.80 / 93.40 |
| **ThinkEdit-deepseek-qwen-1.5b**| **92.62** / **92.90** / **92.32** | **87.78** / **88.11** / **88.67** | 95.71 / **95.58** / **96.44** | **95.15** / **96.59** / **97.27** | **90.80** / **92.00** / **94.20** |
---
## Usage
ThinkEdit models are drop-in replacements: they are used exactly like the original DeepSeek-R1 distilled models.
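For example, with the standard `transformers` text-generation recipe (the model id below is a placeholder for this repo's full Hub path, and the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ThinkEdit-deepseek-qwen-32b"  # placeholder: use this repo's full Hub path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Solve: 12 * 34 = ?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```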
## Citation
```bibtex
@misc{sun2025thinkedit,
  title={ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models},
  author={Chung-En Sun and Ge Yan and Tsui-Wei Weng},
  year={2025},
  eprint={2503.22048},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.22048},
}
```