Q1) Which two of the following describe bias-variance trade-off between MC and TD?
A) The MC algorithm reduces variance by sampling until the terminal state, leading to higher bias.
B) The MC algorithm reduces bias by sampling until the terminal state, leading to higher variance.
C) The TD algorithm reduces variance by sampling a small number of time steps, leading to higher bias.
D) The TD algorithm reduces bias by sampling a small number of time steps, leading to higher variance.
Question 2) What is the difference between on-policy and off-policy learning?
A) On-policy learning learns by evaluating the results of a behavior policy to perform policy improvement on a target policy, whereas off-policy learning learns from experience by evaluating a target policy and performing policy improvement on the target policy.
B) On-policy learning learns from experience by evaluating a target policy and performing policy improvement on the target policy, whereas off-policy learning learns by evaluating the results of a behavior policy to perform policy improvement on a target policy.
C) On-policy learning learns from experience by evaluating a target policy and performing policy improvement on the target policy, whereas off-policy learning learns by evaluating the target policy to perform policy improvement on a behavior policy.
D) On-policy learning learns from experience by evaluating a behavior policy and performing policy improvement on the target policy, whereas off-policy learning learns by evaluating the results of a behavior policy to perform policy improvement on the behavior policy.
Question 3) Which two statements describe eligibility traces?
A) Eligibility traces down-weight the contribution of rarely visited states when computing the average V(s) or Q(s,a).
B) Eligibility traces encourage further exploration of the state space.
C) Eligibility traces assign credit to actions.
D) Eligibility traces assign credit to both the most frequently visited and last visited states.
Q1) Which two of the following describe bias-variance trade-off between MC and TD?
B) The MC algorithm reduces bias by sampling until the terminal state, leading to higher variance.
AND
C) The TD algorithm reduces variance by sampling a small number of time steps, leading to higher bias.
Description:
TD can learn before knowing the final outcome (it learns online after every step); MC must wait until the end of the episode, when the return is known.
TD can learn without the final outcome (it learns from incomplete sequences); MC can only learn from complete sequences. TD works in continuing environments, while MC only works in episodic (terminating) environments.
MC has high variance but zero bias, which gives it good convergence properties; it is not very sensitive to initial values and is very simple to understand and use.
TD has low variance but some bias, which makes it more efficient than MC; TD(0) converges to v_π(s), but it is more sensitive to initial values.
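The two update rules above can be contrasted directly. The following is a minimal sketch (not part of the quiz) on a made-up two-state chain s0 → s1 → terminal with a reward of 1 on the final step: MC waits for the full return G before updating, while TD(0) bootstraps from the next state's current estimate after every step.

```python
GAMMA, ALPHA = 1.0, 0.1

def run_episode():
    # Episode as (state, reward) pairs: s0 yields 0, s1 yields 1, then terminal.
    return [("s0", 0.0), ("s1", 1.0)]

V_mc = {"s0": 0.0, "s1": 0.0}
V_td = {"s0": 0.0, "s1": 0.0}

for _ in range(1000):
    episode = run_episode()

    # MC: wait until the episode ends, compute the return G backwards,
    # then update every visited state toward its observed return.
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + GAMMA * G
        V_mc[state] += ALPHA * (G - V_mc[state])

    # TD(0): after each step, bootstrap from the current estimate of
    # the next state's value (0 at the terminal state).
    for t, (state, reward) in enumerate(episode):
        next_v = V_td[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
        V_td[state] += ALPHA * (reward + GAMMA * next_v - V_td[state])

print(V_mc, V_td)  # both approach V(s0) = V(s1) = 1.0
```

With this deterministic toy chain both estimates converge to the same values; the bias-variance difference only shows up once rewards or transitions are stochastic.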
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Question 2) What is the difference between on-policy and off-policy learning?
b) On-policy learning learns from experience by evaluating a target policy and performing policy improvement on the target policy, whereas off-policy learning learns by evaluating the results of a behavior policy to perform policy improvement on a target policy.
Description:
1. On-policy learning: it evaluates and improves the same policy it is following. In other words, it directly learns the policy that gives you decisions about which action to take in some state.
2. Off-policy learning: it evaluates one policy (the target policy) while following another policy (the behavior policy),
just like we learn to do something while observing others doing the same thing.
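The distinction is easiest to see in the canonical pair SARSA (on-policy) and Q-learning (off-policy). Below is a sketch of the two bootstrap targets on a single hypothetical transition; the state/action names and Q-values are made up for illustration.

```python
GAMMA = 0.9

# Hypothetical action-value table.
Q = {("s", "left"): 0.0, ("s", "right"): 0.0,
     ("s2", "left"): 2.0, ("s2", "right"): 5.0}

# Transition: took "left" in "s", got reward 1, landed in "s2";
# the (exploring) behavior policy then happens to pick "left" in "s2".
s, a, r, s2, a2 = "s", "left", 1.0, "s2", "left"

# SARSA (on-policy): bootstrap from the action the behavior policy
# actually takes next -- the target policy IS the behavior policy.
sarsa_target = r + GAMMA * Q[(s2, a2)]  # uses Q(s2, left) = 2.0

# Q-learning (off-policy): bootstrap from the greedy target policy,
# regardless of what the behavior policy does next.
q_target = r + GAMMA * max(Q[(s2, b)] for b in ("left", "right"))  # uses 5.0

print(sarsa_target, q_target)  # 2.8 vs 5.5
```

Both algorithms would then move Q(s, left) toward their respective targets; only the off-policy target ignores the exploratory action actually taken.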
----------------------------------------------------------------------------------------------------------------------------------------------------------
Question 3) Which two statements describe eligibility traces?
A) Eligibility traces down-weight the contribution of rarely visited states when computing the average V(s) or Q(s,a).
D) Eligibility traces assign credit to both the most frequently visited and last visited states.
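Answers A and D both follow from how accumulating traces combine a frequency heuristic and a recency heuristic. Below is a sketch of backward-view TD(λ) on a made-up episode; the states, rewards, and constants are illustrative only.

```python
GAMMA, LAMBDA, ALPHA = 0.9, 0.8, 0.1

V = {s: 0.0 for s in ("a", "b", "c")}
E = {s: 0.0 for s in V}  # eligibility trace per state

# (state, reward, next_state); None marks the terminal state.
# "a" is visited twice and recently; "b" once and longer ago.
episode = [("a", 0.0, "b"), ("b", 0.0, "a"), ("a", 0.0, "c"), ("c", 1.0, None)]

for s, r, s_next in episode:
    E[s] += 1.0  # accumulate on visit: the frequency component
    delta = r + GAMMA * (V[s_next] if s_next else 0.0) - V[s]
    for state in V:
        V[state] += ALPHA * delta * E[state]  # credit proportional to trace
        E[state] *= GAMMA * LAMBDA            # decay: the recency component

# The final TD error (reward at "c") is credited back: "a" ends up with
# more credit than "b" because it was visited more often and more recently.
print(V)
```

Rarely visited states thus carry small traces when a TD error arrives, which is exactly the down-weighting described in answer A.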