DRL

Value iteration is guaranteed to converge if the discount factor satisfies 0 < γ < 1

true
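A minimal sketch of why the statement holds, assuming a finite MDP with bounded rewards: for 0 < γ < 1 the Bellman optimality backup is a γ-contraction in the sup norm, so the change between successive sweeps shrinks by at least a factor of γ. The two-state MDP, the table `P`, and the function names below are hypothetical and serve only as an illustration.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}


def bellman_backup(V, gamma):
    """Apply one synchronous Bellman optimality backup to V."""
    return [
        max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in sorted(P)
    ]


def value_iteration(gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat backups until the sup-norm change drops below tol."""
    V = [0.0 for _ in P]
    deltas = []  # sup-norm change recorded after every sweep
    for _ in range(max_sweeps):
        V_new = bellman_backup(V, gamma)
        deltas.append(max(abs(x - y) for x, y in zip(V_new, V)))
        V = V_new
        if deltas[-1] < tol:
            break
    return V, deltas


V, deltas = value_iteration(gamma=0.9)
```

Each entry of `deltas` should be at most γ times the previous one, which is the contraction property behind the convergence guarantee. With γ = 1 this bound no longer forces the changes to shrink, so the same argument does not apply.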

For Q-learning to converge, which of the following conditions must hold?

a. Curriculum learning is only required when attempting to train the model to perform well on long trajectories

true

13. You’re a new data scientist hired by Netflix. Your first assignment is to improve the following app: a user is presented with a screen containing a film poster. The user can then either watch the film (a reward of +1) or choose “next screen” (a reward of -1), which presents another film. Each film is represented by a fixed-length vector. You also have a fixed number of poster types for each film (the same number for all films). Your algorithm needs to select both the film and the poster to show at each time step. To train your model, you are provided with a previously collected dataset of 100,000 user sessions. Which of the following DRL algorithms should you use?

16. Which of the following statements is correct with regard to experience replay (multiple answers may apply):

In contextual bandits, the reward produced by each arm is dependent on the context

true
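To make the dependence on context concrete, here is a small sketch of an epsilon-greedy contextual bandit with one value table per context. The reward probabilities in `REWARD_PROB`, the function `run_bandit`, and all parameter values are hypothetical choices for illustration, not part of the original question.

```python
import random

# Hypothetical Bernoulli reward probabilities: REWARD_PROB[context][arm].
# The best arm differs between the two contexts.
REWARD_PROB = {0: [0.9, 0.1], 1: [0.1, 0.9]}


def run_bandit(steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner keeping a separate value estimate per context."""
    rng = random.Random(seed)
    n_arms = 2
    counts = {c: [0] * n_arms for c in REWARD_PROB}
    values = {c: [0.0] * n_arms for c in REWARD_PROB}
    for _ in range(steps):
        context = rng.choice(list(REWARD_PROB))  # context is observed first
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[context][a])  # exploit
        reward = 1.0 if rng.random() < REWARD_PROB[context][arm] else 0.0
        counts[context][arm] += 1
        # Incremental sample-mean update of the estimate for (context, arm).
        values[context][arm] += (reward - values[context][arm]) / counts[context][arm]
    return values


values = run_bandit()
greedy = {c: max(range(2), key=lambda a: values[c][a]) for c in values}
```

A context-free bandit would keep a single value per arm and could not prefer arm 0 in one context and arm 1 in the other, which is what `greedy` is expected to do here.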

a. Both the REINFORCE with a baseline and Double-DQN algorithms are similar in the sense that both use unbiased estimators

true

1. [Imitation Learning] Which of the following statements is correct regarding DAgger with coaching (multiple answers may apply):

4. [Model-based learning] Which of the following statements is not true regarding local dynamics models?

b) One of the main challenges in meta-learning is determining which past experiences/datasets are most relevant to the current state

true
