RobbiePasquale/lightbulb

RobbiePasquale committed on Oct 9, 2024

Commit 61ea053 · verified · 1 Parent(s): 94aa156

Update README.md

Files changed (1)

README.md +14 -0

@@ -30,6 +30,20 @@ The model is constructed with several primary components:
 5. **MCTS**: This module performs Monte Carlo Tree Search to evaluate the quality of actions over multiple iterations. It expands nodes based on the policy logits from the Prediction Network and simulates the reward by backpropagating value estimates.
 6. **PPO Agent**: Uses policy and value estimates to calculate PPO loss, which updates the policy while maintaining the constraint on the KL divergence between old and new policies.
+The transformer uses beam search and multi-token prediction to enrich the encoding produced by the representation network. The actions the model takes are generated token sequences. If $t_i$ denotes a token, then an action is:
+$$a_1 = \{t_1, \dots, t_N\}$$
+A policy is a sequence of actions:
+$$P_1 = \{a_1, \dots, a_N\}$$
+MCTS and OOPS explore what we define as 'thoughts', where a thought is a set of policies:
+$$\text{thought}_1 = \{P_1, \dots, P_N\}$$
+The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each level of granularity.
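The hierarchy described above (tokens grouped into actions, actions into policies, policies into thoughts) can be sketched as nested containers. This is only an illustration under stated assumptions, not the repository's implementation: the class names `Action`, `Policy`, and `Thought` and the `flatten` helper are hypothetical, and tokens are assumed to be integer vocabulary ids.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a token is an integer vocabulary id.
Token = int


@dataclass
class Action:
    """An action is a generated sequence of tokens."""
    tokens: List[Token]


@dataclass
class Policy:
    """A policy is a sequence of actions."""
    actions: List[Action]


@dataclass
class Thought:
    """A thought is a collection of policies."""
    policies: List[Policy]

    def flatten(self) -> List[Token]:
        # Walk every level of granularity and return the tokens in order.
        return [
            token
            for policy in self.policies
            for action in policy.actions
            for token in action.tokens
        ]


# Hypothetical example: two policies, three actions, six tokens.
a1 = Action(tokens=[1, 2, 3])
a2 = Action(tokens=[4, 5])
a3 = Action(tokens=[6])
thought = Thought(policies=[Policy(actions=[a1, a2]), Policy(actions=[a3])])
all_tokens = thought.flatten()
```

Keeping each level as its own type makes it possible to attach a score or value estimate at any level later, which matches the idea that learning happens per token, action, policy, and thought.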
 ## Training Details
 The model is trained with the following components and techniques: