Update README.md
README.md
@@ -30,7 +30,9 @@ The model is constructed with several primary components:
5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes using the policy logits from the Prediction Network and estimates rewards by backpropagating value estimates up the search path (a minimal sketch follows this list).
6. **PPO Agent**: Uses the policy and value estimates to compute the PPO loss, which updates the policy while constraining the KL divergence between the old and new policies (a PPO-loss sketch also follows below).
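A minimal sketch of that MCTS loop, assuming a PUCT-style selection rule. The `Node` class and `policy_value_fn` callable are illustrative stand-ins, not this repo's API; in the real model the Prediction Network would supply the policy priors and the value estimate.

```python
import math
import random

class Node:
    """Illustrative search-tree node; not the repo's actual class."""
    def __init__(self, prior):
        self.prior = prior        # prior probability from the policy logits
        self.visit_count = 0
        self.value_sum = 0.0      # sum of backed-up value estimates
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent, child, c_puct=1.5):
    # PUCT: exploit the child's mean value, explore via prior and visit counts.
    explore = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + explore

def run_mcts(root, policy_value_fn, num_simulations=50):
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Select: walk down to a leaf, always taking the best PUCT score.
        while node.children:
            _, node = max(node.children.items(),
                          key=lambda kv: puct_score(path[-1], kv[1]))
            path.append(node)
        # 2. Expand: create children from the policy priors over actions.
        priors, value = policy_value_fn(node)
        for action, p in priors.items():
            node.children[action] = Node(prior=p)
        # 3. Backpropagate: push the value estimate up the visited path.
        for visited in path:
            visited.visit_count += 1
            visited.value_sum += value
    return root

# Toy stand-in for the Prediction Network: uniform priors, random value.
def dummy_policy_value(node):
    return {a: 1.0 / 3 for a in range(3)}, random.uniform(-1.0, 1.0)

root = run_mcts(Node(prior=1.0), dummy_policy_value)
print(root.visit_count, root.value())
```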
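And a sketch of the PPO objective from item 6, assuming PyTorch tensors of per-sample log-probabilities from the old and new policies. The hyperparameters (`clip_eps`, `kl_coef`, `vf_coef`) are illustrative defaults, not values from this repo; the clipped ratio bounds each policy update, and the explicit KL penalty keeps the new policy near the old one.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, returns, values,
             clip_eps=0.2, kl_coef=0.1, vf_coef=0.5):
    # Probability ratio pi_new(a|s) / pi_old(a|s), recovered from log-probs.
    ratio = torch.exp(new_logp - old_logp)
    # Clipped surrogate objective: take the pessimistic (min) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Sample-based approximation of KL(old || new), added as a penalty term.
    approx_kl = (old_logp - new_logp).mean()
    # Value head regressed onto empirical returns.
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss + kl_coef * approx_kl + vf_coef * value_loss

# Toy usage with random tensors in place of real rollout data.
n = 8
loss = ppo_loss(torch.randn(n), torch.randn(n), torch.randn(n),
                torch.randn(n), torch.randn(n))
print(loss.item())
```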
The transformer strategically utilises beam search as well as multi-token prediction to enrich the encoding from the representation network (a toy beam-search sketch appears after the action definition below).
A generated sequence of tokens is an action. For example, if a token is t, then an action is:

a_1 = {t_1, ..., t_N}
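Here is that toy beam-search sketch: each surviving beam is a candidate multi-token action. The `next_token_logprobs` function is a hypothetical stand-in for the transformer's prediction head.

```python
import math

# Hypothetical stand-in for the model's next-token distribution; the real
# transformer would condition on the prefix and its learned vocabulary.
def next_token_logprobs(prefix, vocab=("a", "b", "c")):
    return {tok: math.log(1.0 / len(vocab)) for tok in vocab}

def beam_search(beam_width=2, steps=3):
    beams = [((), 0.0)]  # (token prefix, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # Keep only the highest-scoring prefixes at each step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Each surviving beam is a candidate action a = {t_1, ..., t_N}.
    return beams

print(beam_search())
```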
@@ -42,7 +44,7 @@ The MCTS and OOPS explore what we are defining as 'thoughts', where a thought is a set of policies:
thought_1 = {P_1, ..., P_N}
The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each level of granularity.
## Training Details