Consider a problem of a taxi driver, who serves six cities 0, 1, 2, 3, 4 and 5 which are located on a circular highway. The taxi driver can choose one of the following actions.
1. Cruise the streets looking for a passenger.
2. Go to the nearest taxi stand and wait in line. 3. Head back to headquarters (city 0).
Any passenger will only go to a preceding or succeeding city and the driver never cancels a ride. For a given city and any of the first two actions, there is a probability that the driver gets a passenger and goes to the succeeding or preceding city and there is also a probability that the driver doesn’t get any passenger and stays in the same city. Refer Table for the transition probabilities pkij which are represented as pkc where c = j − i.
The rewards for outcomes 1 and -1 for actions 1 and 2 are shown in Figure. When the driver doesn’t get a passenger in a city (c=0), then the reward for the driver is 0. Also, if the driver decides to go to headquarters, the reward is 0.
Suppose 1 − γ is the probability that the taxi will breakdown before the next trip. The driver’s goal is to maximize the total reward until his taxi breakdown
Implement the following: (2+1.5 marks)
- Find an optimal policy using policy iteration starting with a policy that will always cruise independent of the town, and a zero value vector. Let γ = 0.9.
- Run policy iteration for discount factors γ ranging from 0 to 0.95 with intervals of 0.05 and display the results.
Answer the following (based on the data given above): (1+0.5 mark)
- How is different values of γ affecting the policy iteration? Explain your findings.
- Give alternate transition probabilities for action 2(if exists) such that optimal policy consists of action 2. Explain your answer.
You will be writing your solutions & making a submission through a notebook. You can follow the instructions in the starter notebook.
Resources section you will find data files that contains parameters for the environment for this problem.
- Submissions will be made through a notebook following the instructions in the starter notebook.
- Each Team can make 5 successful submissions and 5 failed submissions in a day. Once the limit of failed submission is reached, the submission will be counted in the successful submission.
- The submission limit will reset at 5:30 AM IST every day.
- At the end of the challenge, you will have to select 1 submission as the final one. You can select that here.
- RL TAs
siddharthaOver 1 year ago