---
library_name: transformers
tags: []
pipeline_tag: text-generation
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

**Repository for: ThinkEdit-deepseek-qwen-32b**

(We also release ThinkEdit-deepseek-qwen-1.5b, ThinkEdit-deepseek-llama3-8b, and ThinkEdit-deepseek-qwen-14b.)

**Authors**: Chung-En Sun, Ge Yan, Tsui-Wei Weng  
**Paper**: [ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models](https://arxiv.org/abs/2503.22048)

**GitHub**: https://github.com/Trustworthy-ML-Lab/ThinkEdit

---

## Introduction

Reasoning-augmented models sometimes fail by generating **overly short**, abstract chain-of-thought (CoT) reasoning, hurting their accuracy.

**ThinkEdit** is a lightweight weight-editing method that:

- Identifies the ~4% of attention heads responsible for "short reasoning"  
- Removes the "short reasoning" direction from their output projections  
- Edits only ~0.2% of total model parameters  
- Boosts accuracy, especially on problems where the model would otherwise produce short reasoning traces  
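
The core edit can be pictured as an orthogonal projection: the output-projection weights of the selected heads are projected onto the hyperplane orthogonal to the "short reasoning" direction, so those heads can no longer write along it. Below is a minimal NumPy sketch with toy shapes and a random stand-in for the extracted direction `d` (illustrative only; it is not the paper's actual head-selection or direction-extraction procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8  # toy dimensions for illustration

# Toy output projection of one attention head, and a unit-norm
# stand-in for the extracted "short reasoning" direction d.
W_o = rng.standard_normal((d_head, d_model))
d = rng.standard_normal(d_model)
d /= np.linalg.norm(d)

# Project each row of W_o onto the hyperplane orthogonal to d:
#   W_o' = W_o (I - d d^T)
W_edited = W_o - np.outer(W_o @ d, d)

# After the edit, the head has no component along d.
assert np.allclose(W_edited @ d, 0.0)
```

Because the edit touches only the output projections of a small set of heads, the overall parameter change stays at roughly 0.2% of the model.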

---

## Full Performance Results

### 1. Overall Accuracy

| Model                            | GSM8K               | MMLU Elementary Math | MATH-Level1         | MATH-Level5         | MATH-500            |
|---------------------------------|---------------------|----------------------|---------------------|---------------------|---------------------|
| deepseek-qwen-32b               | 92.97 ± 0.39        | 95.93 ± 0.83         | **96.41 ± 0.45**        | 91.27 ± 0.53        | **91.62 ± 0.58**        |
| **ThinkEdit-deepseek-qwen-32b** | **95.25 ± 0.25**    | **98.02 ± 0.31**     | 96.02 ± 0.42    | **91.31 ± 0.50**    | 91.60 ± 0.65    |
| deepseek-qwen-14b               | 90.80 ± 0.36        | 95.08 ± 0.65         | 96.32 ± 0.35        | 90.25 ± 0.72        | 91.48 ± 0.55        |
| **ThinkEdit-deepseek-qwen-14b** | **93.78 ± 0.50**    | **96.56 ± 0.84**     | **96.38 ± 0.52**    | **91.03 ± 0.44**    | **91.92 ± 0.63**    |
| deepseek-llama3-8b              | 82.26 ± 0.91        | 96.01 ± 0.62         | 93.46 ± 0.84        | 85.49 ± 0.83        | 87.26 ± 1.16        |
| **ThinkEdit-deepseek-llama3-8b**| **89.44 ± 0.55**    | **96.19 ± 0.73**     | **94.44 ± 0.31**    | **86.49 ± 0.54**    | **88.06 ± 1.09**    |
| deepseek-qwen-1.5b              | 79.15 ± 1.08        | 68.52 ± 1.56         | 93.00 ± 0.33        | **75.48 ± 0.90**        | 82.22 ± 1.29        |
| **ThinkEdit-deepseek-qwen-1.5b**| **84.56 ± 0.79**    | **90.66 ± 0.97**     | **93.66 ± 0.62**    | 75.05 ± 0.82    | **82.24 ± 0.89**    |

---

### 2. Accuracy on Short Reasoning Cases (shortest 5% / 10% / 20% of responses)

| Model                            | GSM8K                            | MMLU Elementary Math              | MATH-Level1                      | MATH-Level5                      | MATH-500                         |
|---------------------------------|---------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|
| deepseek-qwen-32b               | 98.31 / 97.18 / 96.20           | 97.78 / 97.03 / 95.87            | 100.00 / 100.00 / **98.97**      | 93.03 / 96.36 / 97.35            | 86.40 / 92.00 / 94.00            |
| **ThinkEdit-deepseek-qwen-32b** | **98.92** / **97.71** / **97.83** | **97.78** / **97.57** / **97.20** | **100.00** / **100.00** / 98.74   | **98.03** / **98.64** / **97.99** | **92.00** / **94.40** / **95.80** |
| deepseek-qwen-14b               | **96.31** / 95.65 / 92.93           | 93.89 / **96.22** / 95.60         | 99.52 / **99.30** / 97.70         | 89.39 / 94.32 / 96.25            | 86.40 / 91.40 / 93.50            |
| **ThinkEdit-deepseek-qwen-14b** | **96.31** / **96.18** / **96.77** | **97.78** / 95.14 / **96.53**     | **99.53** / 98.62 / **98.67**     | **96.67** / **97.88** / **98.11** | **91.20** / **93.20** / **95.00** |
| deepseek-llama3-8b              | 88.92 / 87.18 / 85.82           | 97.22 / 96.49 / 96.80            | 97.14 / 94.88 / 94.83            | 78.64 / 88.79 / 93.41            | 82.00 / 81.40 / 88.30            |
| **ThinkEdit-deepseek-llama3-8b**| **97.08** / **95.27** / **93.95** | **97.78** / **98.65** / **97.87** | **100.00** / **99.30** / **98.62** | **95.61** / **96.89** / **97.12** | **92.80** / **93.60** / **94.40** |
| deepseek-qwen-1.5b              | 88.46 / 87.48 / 85.02           | 62.78 / 62.16 / 60.53            | **97.62** / 95.12 / 93.91         | 91.52 / 95.00 / 95.72            | 82.40 / 89.80 / 93.40            |
| **ThinkEdit-deepseek-qwen-1.5b**| **92.62** / **92.90** / **92.32** | **87.78** / **88.11** / **88.67** | 95.71 / **95.58** / **96.44**     | **95.15** / **96.59** / **97.27** | **90.80** / **92.00** / **94.20** |

---

## Usage

ThinkEdit models are drop-in replacements: load and prompt them exactly as you would the original deepseek-distilled models.
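
For example, a standard `transformers` generation loop works unchanged. This is a generic sketch; the repository id below is a placeholder (replace it with this model's actual Hub path), and `max_new_tokens` should be generous since these models emit long chains of thought:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute this model's actual Hub path.
MODEL_ID = "Trustworthy-ML-Lab/ThinkEdit-deepseek-qwen-32b"

def generate(prompt: str, model_id: str = MODEL_ID, max_new_tokens: int = 4096) -> str:
    """Generate a response, exactly as with the original distilled checkpoints."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

No custom code path is needed; the weight edits are baked into the checkpoint.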

## Citation

```bibtex
@misc{sun2025thinkedit,
      title={ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models}, 
      author={Chung-En Sun and Ge Yan and Tsui-Wei Weng},
      year={2025},
      eprint={2503.22048},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.22048}, 
}
```