Update README.md
README.md
@@ -30,6 +30,20 @@ The model is constructed with several primary components:
5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes based on the policy logits from the Prediction Network and simulates the reward by backpropagating value estimates.
6. **PPO Agent**: Uses policy and value estimates to calculate PPO loss, which updates the policy while maintaining the constraint on the KL divergence between old and new policies.
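
One common way to keep the new policy close to the old one is the clipped surrogate objective, with a sample-based KL estimate monitored alongside it. The sketch below assumes that variant; the README does not say whether the agent clips, penalizes, or stops early on KL. The names `ppo_clipped_loss` and `clip_eps`, and the plain-list inputs, are hypothetical and chosen only for illustration.

```python
import math


def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Return (loss, approx_kl) for a batch of sampled actions.

    new_logp / old_logp: log-probabilities of the sampled actions under
    the new and old policies. advantages: advantage estimates.
    Assumption: the standard clipped PPO objective, not necessarily the
    exact loss used in this repository.
    """
    if not new_logp:
        raise ValueError("inputs must be non-empty")
    if not (len(new_logp) == len(old_logp) == len(advantages)):
        raise ValueError("inputs must have equal length")

    surrogate_terms = []
    kl_terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        surrogate_terms.append(min(ratio * adv, clipped * adv))
        # Simple sample estimate of KL(old || new).
        kl_terms.append(lp_old - lp_new)

    n = len(surrogate_terms)
    loss = -sum(surrogate_terms) / n  # negate: optimizers minimize
    approx_kl = sum(kl_terms) / n
    return loss, approx_kl
```

A training loop could, for example, skip further updates on a batch once `approx_kl` exceeds a chosen threshold; that threshold would be another assumption, not something stated here.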

The transformer uses beam search and multi-token prediction to enrich the encoding from the representation network. What actions does the model take? An action is a generated sequence of tokens. If t_i denotes a single token, then an action is:

a_1 = {t_1, ..., t_N}

A policy is a sequence of actions:

P_1 = {a_1, ..., a_N}

MCTS and OOPS explore what we define as 'thoughts', where a thought is a set of policies:

thought_1 = {P_1, ..., P_N}

The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each level of granularity.
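
The nesting of tokens, actions, policies, and thoughts described above can be sketched as plain Python types. This is only an illustration: the alias names and the `granularity_counts` helper are hypothetical, and ordered lists are assumed at every level, including for a thought, which the text calls a set.

```python
from typing import Dict, List

# Hypothetical aliases mirroring the definitions in the text.
Token = str
Action = List[Token]    # a = {t_1, ..., t_N}
Policy = List[Action]   # P = {a_1, ..., a_N}
Thought = List[Policy]  # thought = {P_1, ..., P_N}


def granularity_counts(thought: Thought) -> Dict[str, int]:
    """Count the units present at each level of one thought."""
    actions = [action for policy in thought for action in policy]
    tokens = [token for action in actions for token in action]
    return {
        "policies": len(thought),
        "actions": len(actions),
        "tokens": len(tokens),
    }


# Usage: one thought holding two policies.
policy_a: Policy = [["The", "answer"], ["is", "42"]]
policy_b: Policy = [["Let", "us", "check"]]
thought: Thought = [policy_a, policy_b]
counts = granularity_counts(thought)
```

Each level is just a container of the level below it, which is what lets search and learning operate on a whole thought, a single policy, a single action, or a single token.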
## Training Details
The model is trained with the following components and techniques: