RobbiePasquale committed
Commit 7ba0a0a · verified · 1 Parent(s): 61ea053

Update README.md

  1. README.md +4 -2
README.md CHANGED
@@ -30,7 +30,9 @@ The model is constructed with several primary components:
  5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes based on the policy logits from the Prediction Network and simulates the reward by backpropagating value estimates.
  6. **PPO Agent**: Uses policy and value estimates to calculate PPO loss, which updates the policy while maintaining the constraint on the KL divergence between old and new policies.
 
- The transformer strategically utilises beam search as well as multi token prediction, in order to enrich the encoding from the representation network. And the question you may be asking is, what are the actions that the model will be taking? Well, a generated sequence of tokens is an action, for example if a token is t, then an action is:
 
  a_1= {t1,...,tN}
 
@@ -42,7 +44,7 @@ The MCTS and OOPS explores what we are defining as 'thoughts', where a thought i
 
  thought_1 = {P1, ... , PN}
 
- SThe model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each step of granularity.
 
  ## Training Details
 
  5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes based on the policy logits from the Prediction Network and simulates the reward by backpropagating value estimates.
  6. **PPO Agent**: Uses policy and value estimates to calculate PPO loss, which updates the policy while maintaining the constraint on the KL divergence between old and new policies.
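
The PPO update described in item 6 can be illustrated with a minimal sketch. The README does not say which PPO variant the repository uses, so the clipped surrogate objective below is an assumption, and the names `ppo_clip_loss`, `approx_kl`, and `clip_eps` are hypothetical. Plain Python lists stand in for tensors to keep the example self-contained.

```python
import math

def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (a quantity to minimise), averaged over samples.

    new_logps / old_logps: log-probabilities of the taken actions under the
    new and old policies. advantages: advantage estimates for those actions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the clipped and unclipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

def approx_kl(new_logps, old_logps):
    """Sample estimate of KL(old || new) using actions drawn from the old policy."""
    return sum(o - n for n, o in zip(new_logps, old_logps)) / len(old_logps)
```

Clipping the probability ratio discourages updates that move the new policy far from the old one, and an estimate such as `approx_kl` is one common way to monitor how far it has actually moved.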
 
+ The transformer strategically utilises beam search as well as multi token prediction, in order to enrich the encoding from the representation network.
+
+ A generated sequence of tokens is an action, for example if a token is t, then an action is:
 
  a_1= {t1,...,tN}
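
To make the idea of a token sequence as an action concrete, here is a minimal beam search sketch in which each surviving beam is one candidate action. The scoring function `toy_next_logprobs`, the three-token vocabulary, and the parameter names are hypothetical stand-ins for illustration, not the repository's transformer.

```python
import math

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary

def toy_next_logprobs(prefix):
    """Hypothetical stand-in for a model: next-token log-probabilities.

    Favours repeating the last token; uniform when the prefix is empty.
    """
    if prefix:
        weights = [3.0 if tok == prefix[-1] else 1.0 for tok in VOCAB]
    else:
        weights = [1.0] * len(VOCAB)
    z = sum(weights)
    return {tok: math.log(w / z) for tok, w in zip(VOCAB, weights)}

def beam_search(next_logprobs, beam_width=2, length=3):
    """Return the `beam_width` highest-scoring token sequences and their scores."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the best `beam_width` extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Each returned tuple of tokens plays the role of one action a = {t1, ..., tN}.
actions = beam_search(toy_next_logprobs)
```

Here `length` corresponds to N, the number of tokens in an action, and `beam_width` controls how many candidate actions are kept at each step.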
 
 
 
  thought_1 = {P1, ... , PN}
 
+ The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each step of granularity.
 
  ## Training Details