b) AlphaGo Zero employs a neural network to guide its decision-making process, selecting the next branches in the tree based on probabilities assigned by the network.
7. [Imitation learning] Which of the following IS NOT an advantage of behavior cloning compared to reinforcement learning (multiple answers may apply)?
b) One of the more challenging use cases for the Decision Transformer is when the return-to-go becomes negative (particularly when still far from the end of the trajectory).
a) The use of contextual bandits by the action-elimination network (AEN) is important because it enables the model to better infer the utility (i.e., “value”) of multiple actions at once.
b) Decoupling the action elimination (by the AEN) from the action selection (by the DRL agent) reduces the risk of the model exploring only a small subset of actions.