Update README.md
README.md
@@ -30,7 +30,9 @@ The model is constructed with several primary components:
5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes using the policy logits from the Prediction Network and estimates rewards by backpropagating value estimates up the search path (a minimal sketch follows this list).
6. **PPO Agent**: Uses the policy and value estimates to compute the PPO loss, which updates the policy while constraining the KL divergence between the old and new policies (a PPO-loss sketch also follows below).
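A minimal sketch of that MCTS loop, assuming a PUCT-style selection rule. The `Node` class and `policy_value_fn` callable are illustrative stand-ins, not this repo's API; in the real model the Prediction Network would supply the policy priors and the value estimate.

```python
import math
import random

class Node:
    """Illustrative search-tree node; not the repo's actual class."""
    def __init__(self, prior):
        self.prior = prior        # prior probability from the policy logits
        self.visit_count = 0
        self.value_sum = 0.0      # sum of backed-up value estimates
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent, child, c_puct=1.5):
    # PUCT: exploit the child's mean value, explore via prior and visit counts.
    explore = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + explore

def run_mcts(root, policy_value_fn, num_simulations=50):
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Select: walk down to a leaf, always taking the best PUCT score.
        while node.children:
            _, node = max(node.children.items(),
                          key=lambda kv: puct_score(path[-1], kv[1]))
            path.append(node)
        # 2. Expand: create children from the policy priors over actions.
        priors, value = policy_value_fn(node)
        for action, p in priors.items():
            node.children[action] = Node(prior=p)
        # 3. Backpropagate: push the value estimate up the visited path.
        for visited in path:
            visited.visit_count += 1
            visited.value_sum += value
    return root

# Toy stand-in for the Prediction Network: uniform priors, random value.
def dummy_policy_value(node):
    return {a: 1.0 / 3 for a in range(3)}, random.uniform(-1.0, 1.0)

root = run_mcts(Node(prior=1.0), dummy_policy_value)
print(root.visit_count, root.value())
```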
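And a sketch of the PPO objective from item 6, assuming PyTorch tensors of per-sample log-probabilities from the old and new policies. The hyperparameters (`clip_eps`, `kl_coef`, `vf_coef`) are illustrative defaults, not values from this repo; the clipped ratio bounds each policy update, and the explicit KL penalty keeps the new policy near the old one.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, returns, values,
             clip_eps=0.2, kl_coef=0.1, vf_coef=0.5):
    # Probability ratio pi_new(a|s) / pi_old(a|s), recovered from log-probs.
    ratio = torch.exp(new_logp - old_logp)
    # Clipped surrogate objective: take the pessimistic (min) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Sample-based approximation of KL(old || new), added as a penalty term.
    approx_kl = (old_logp - new_logp).mean()
    # Value head regressed onto empirical returns.
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss + kl_coef * approx_kl + vf_coef * value_loss

# Toy usage with random tensors in place of real rollout data.
n = 8
loss = ppo_loss(torch.randn(n), torch.randn(n), torch.randn(n),
                torch.randn(n), torch.randn(n))
print(loss.item())
```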
The transformer strategically utilises beam search as well as multi-token prediction to enrich the encoding from the representation network (a toy beam-search sketch appears after the action definition below).
A generated sequence of tokens is an action. For example, if a token is t, then an action is:

a_1 = {t_1, ..., t_N}
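Here is that toy beam-search sketch: each surviving beam is a candidate multi-token action. The `next_token_logprobs` function is a hypothetical stand-in for the transformer's prediction head.

```python
import math

# Hypothetical stand-in for the model's next-token distribution; the real
# transformer would condition on the prefix and its learned vocabulary.
def next_token_logprobs(prefix, vocab=("a", "b", "c")):
    return {tok: math.log(1.0 / len(vocab)) for tok in vocab}

def beam_search(beam_width=2, steps=3):
    beams = [((), 0.0)]  # (token prefix, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # Keep only the highest-scoring prefixes at each step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Each surviving beam is a candidate action a = {t_1, ..., t_N}.
    return beams

print(beam_search())
```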
@@ -42,7 +44,7 @@ The MCTS and OOPS explore what we are defining as 'thoughts', where a thought is a set of policies:
thought_1 = {P_1, ..., P_N}
The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each level of granularity.
## Training Details