prhegde committed · verified · Commit ccf1ee9 · 1 Parent(s): d5555a4

Update README.md

Files changed (1): README.md (+4 -4)
README.md CHANGED
@@ -5,14 +5,14 @@ license: apache-2.0
 
 # Align with Correction Optimization
 
-Recent advancements in Large Language Models (LLMs) have focused on aligning these models with human preferences. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) achieve this alignment by using reinforcement learning to maximize rewards that are optimized for human preferences.
+Recent advancements in Large Language Models (LLMs) have focused on aligning these models with human preferences. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) achieve this alignment by optimizing on human-preference data: RLHF learns an explicit reward model and uses RL to optimize the policy, while DPO optimizes the policy directly, without a reward model.
 
-An alternative method to align LLM responses involves utilizing correction mechanisms. Corrections offer more explicit feedback compared to simple rewards. Instead of directly optimizing for human preferences, our approach aims to minimize the required amount of correction. This correction is determined by comparing the model's response with a corrected version provided by a teacher, which can be either a human or a more accurate model.
+An alternative method to align LLM responses involves utilizing correction mechanisms. Corrections offer more explicit feedback than scalar rewards. Instead of directly optimizing for human preferences, our approach aims to minimize the required amount of correction. This correction is determined by comparing the model's response with a corrected version provided by a teacher, which can be either a human or a more accurate model.
 
-This model is output of a POC experiment using above hypothesis.
+The correction mechanism can serve as a form of regularization, preventing excessive alterations to the model, whereas preference-optimization algorithms can significantly alter the model's behavior based on the training samples in the preference data.
 
+This model is the output of a POC experiment based on the above hypothesis.
 In this experiment, the base model is initialized from a merged model - https://huggingface.co/prhegde/merge-aanaphi-phi2-orage-3b. Mistral-7b-instruct is used as a teacher model.
-
 These models are selected with consideration for the compute resource constraints.
 
 ## Proposed Approach
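
The paragraphs added in this commit describe the correction signal only at a high level, so below is a minimal, hypothetical Python sketch of one way the "amount of correction" between a student response and a teacher-corrected response could be quantified and folded into a training loss. The names (`correction_ratio`, `correction_weighted_nll`), the edit-similarity measure, and the ratio-weighted cross-entropy are illustrative assumptions for this sketch, not the repository's actual objective; the real formulation is the one given under the Proposed Approach section.

```python
# Hypothetical sketch (not the repository's code): quantify how much a teacher
# had to correct a student response, and use that amount to weight a
# cross-entropy loss on the teacher-corrected response.
import difflib

import torch
import torch.nn.functional as F


def correction_ratio(student_tokens: list[str], corrected_tokens: list[str]) -> float:
    """Fraction of the student response the teacher had to change
    (1.0 means fully rewritten, 0.0 means no correction needed)."""
    matcher = difflib.SequenceMatcher(a=student_tokens, b=corrected_tokens)
    return 1.0 - matcher.ratio()


def correction_weighted_nll(logits: torch.Tensor,
                            corrected_ids: torch.Tensor,
                            ratio: float) -> torch.Tensor:
    """Cross-entropy of the student on the teacher-corrected tokens, scaled by
    the correction ratio: responses that needed little correction contribute
    little gradient, so the update stays close to the base model."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), corrected_ids.view(-1))
    return ratio * nll


# Toy usage with whitespace tokens and random logits standing in for the student model.
student = "Paris is capital of the France".split()
corrected = "Paris is the capital of France".split()
ratio = correction_ratio(student, corrected)

vocab_size, seq_len = 32, len(corrected)
logits = torch.randn(seq_len, vocab_size)                 # student logits over the corrected positions
corrected_ids = torch.randint(0, vocab_size, (seq_len,))  # token ids of the corrected response
loss = correction_weighted_nll(logits, corrected_ids, ratio)
print(f"correction ratio: {ratio:.2f}, weighted NLL: {loss.item():.3f}")
```

In the experiment described above, the teacher role would be played by Mistral-7b-instruct and the student by the merged base model linked in the README; the sketch only illustrates how a correction signal can act as both supervision and a soft regularizer.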