RL Assignment 2 - Taxi
Starter notebook
Use this notebook to run your experiments and make a submission
What is the notebook about?¶
Problem - Taxi Environment Algorithms¶
This problem deals with a taxi environment and stochastic actions. The tasks you have to do are:
- Implement Policy Iteration
- Visualize the results
- Explain the results
How to use this notebook? 📝¶
This is a shared template, and any edits you make here will not be saved. You should make a copy in your own drive: click the "File" menu (top-left), then "Save a Copy in Drive". You can then work in your copy however you like.
Update the config parameters. You can define the common variables here
Variable | Description |
---|---|
`AICROWD_DATASET_PATH` | Path to the file containing test data. This should be an absolute path. |
`AICROWD_RESULTS_DIR` | Path to write the output to. |
`AICROWD_ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation. |
`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me |
- Installing packages. Please use the Install packages 🗃 section to install any packages you need.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -U aicrowd-cli > /dev/null
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters.
import os
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/407e4f07-d470-427d-8758-4f888ae03fe6_data.zip")
AICROWD_RESULTS_DIR = os.getenv("OUTPUTS_DIR", "results")
API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me (ctrl + click the link)
!aicrowd login --api-key $API_KEY
!aicrowd dataset download -c rl-assignment-2-taxi
DATASET_DIR = 'data/'
!unzip $AICROWD_DATASET_PATH
Install packages 🗃¶
Please add all package installations in this section
Import packages 💻¶
import numpy as np
import matplotlib.pyplot as plt
import os
# ADD ANY IMPORTS YOU WANT HERE
Prediction Phase¶
Taxi Environment¶
Read the environment to understand the functions, but do not edit anything
import numpy as np

class TaxiEnv_HW2:
    def __init__(self, states, actions, probabilities, rewards, initial_policy):
        probabilities, rewards = self._build_prob_mapping(states, actions, probabilities, rewards)
        self.possible_states = states
        self._possible_actions = {st: actions for st in states}
        self._ride_probabilities = {st: pr for st, pr in zip(states, probabilities)}
        self._ride_rewards = {st: rw for st, rw in zip(states, rewards)}
        self.initial_policy = initial_policy
        self._verify()

    def _build_prob_mapping(self, states, actions, probabilities, rewards):
        n_cities = len(states)
        n_actions = len(actions)
        probs = np.zeros((n_cities, n_actions, n_cities))
        rewards[0] = [0, 0, 0, 0, 0, 0]
        rews = np.zeros((n_cities, n_actions, n_cities))
        for src in range(n_cities):
            for action in ('1', '2'):
                for c, prob in probabilities[action].items():
                    dst = (src + c) % n_cities
                    probs[src][actions.index(action)][dst] = prob
                    rews[src][actions.index(action)][dst] = rewards[c][src]
            action = '3'
            action = actions.index(action)
            probs[src][action][0] = 1
        return probs, rews

    def _check_state(self, state):
        assert state in self.possible_states, "State %s is not a valid state" % state

    def _verify(self):
        """
        Verify that data conditions are met:
        Number of actions matches shape of next states and actions
        Every probability distribution adds up to 1
        """
        ns = len(self.possible_states)
        for state in self.possible_states:
            ac = self._possible_actions[state]
            na = len(ac)
            rp = self._ride_probabilities[state]
            assert np.all(rp.shape == (na, ns)), "Probabilities shape mismatch"
            rr = self._ride_rewards[state]
            assert np.all(rr.shape == (na, ns)), "Rewards shape mismatch"
            assert np.allclose(rp.sum(axis=1), 1), "Probabilities don't add up to 1"

    def possible_actions(self, state):
        """ Return all possible actions from a given state """
        self._check_state(state)
        return self._possible_actions[state]

    def ride_probabilities(self, state, action):
        """
        Returns all possible ride probabilities from a state for a given action
        For a given action, a list is returned with values in the same order as self.possible_states
        """
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_probabilities[state][ac_idx]

    def ride_rewards(self, state, action):
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_rewards[state][ac_idx]
Example of Environment usage¶
import numpy as np

def check_taxienv():
    # These are the values as used in the assignment document, but they may be changed during submission, so do not hardcode anything
    states = [0, 1, 2, 3, 4, 5]
    actions = ['1', '2', '3']

    probs = {}
    probs['1'] = {-1: 1/2, 0: 1/4, 1: 1/4}
    probs['2'] = {-1: 1/16, 0: 3/4, 1: 3/16}

    rewards = {}
    rewards[-1] = [8, 7, 3, 2, 1, 2]
    rewards[1] = [8, 8, 5, 1, 3, 9]

    initial_policy = {0: '1', 1: '1', 2: '1', 3: '1', 4: '1', 5: '1'}
    ##################################

    env = TaxiEnv_HW2(states, actions, probs, rewards, initial_policy)
    print("All possible states", env.possible_states)
    print("All possible actions from state B", env.possible_actions(1))
    print("Ride probabilities from state C with action 2", env.ride_probabilities(2, '2'))
    print("Ride rewards from state E with action 1", env.ride_rewards(4, '1'))

    base_kwargs = {"states": states, "actions": actions,
                   "probabilities": probs, "rewards": rewards,
                   "initial_policy": initial_policy}
    return base_kwargs

base_kwargs = check_taxienv()
Task - Policy Iteration¶
Run policy iteration on the environment and generate the policy and expected reward
# 1.1 Policy Iteration
def policy_iteration(taxienv, gamma):
    # A list of all the states
    states = taxienv.possible_states
    # Initial values
    values = {s: 0 for s in states}
    # This is a dictionary of states to policies -> e.g. {'A': '1', 'B': '2', 'C': '1'}
    policy = taxienv.initial_policy.copy()

    ## Begin code here
    # Hints -
    # Do not hardcode anything
    # Only the final result is required for the results
    # Put any extra data in the "extra_info" dictionary for any plots etc.
    # Use the helper functions taxienv.ride_rewards, taxienv.ride_probabilities, taxienv.possible_actions
    # For the terminating condition, use the condition exactly as mentioned in the pdf

    # Put your extra information needed for plots etc. in this dictionary
    extra_info = {}

    ## Do not edit below this line

    # Final results
    return {"Expected Reward": values, "Policy": policy}, extra_info
Policy Iteration with different values of gamma
# 1.2 Policy Iteration with different values of gamma
def run_policy_iteration(env):
    gamma_values = np.arange(5, 100, 5) / 100
    results, extra_info = {}, {}
    for gamma in gamma_values:
        results[gamma], extra_info[gamma] = policy_iteration(env, gamma)
    return results, extra_info
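The assignment also asks you to visualize the results. As one possible starting point (not a required format), the sketch below plots each state's converged expected reward against gamma using the dictionary returned by run_policy_iteration; the function name, marker style, and labels are arbitrary choices.

# Illustrative sketch only -- plots how the expected reward of each state
# varies with gamma, using the dictionary returned by run_policy_iteration.
def plot_reward_vs_gamma(results):
    gammas = sorted(results.keys())
    states = sorted(next(iter(results.values()))["Expected Reward"].keys())
    for s in states:
        rewards = [results[g]["Expected Reward"][s] for g in gammas]
        plt.plot(gammas, rewards, marker='o', label=f"state {s}")
    plt.xlabel("gamma")
    plt.ylabel("Expected reward")
    plt.legend()
    plt.show()

# Example usage (once policy_iteration is implemented):
# results, extra_info = run_policy_iteration(TaxiEnv_HW2(**base_kwargs))
# plot_reward_vs_gamma(results)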
# Do not edit this cell
def get_results(kwargs):
    taxienv = TaxiEnv_HW2(**kwargs)
    policy_iteration_results = run_policy_iteration(taxienv)[0]
    final_results = {}
    final_results["policy_iteration"] = policy_iteration_results
    return final_results

get_results(base_kwargs)
if not os.path.exists(AICROWD_RESULTS_DIR):
    os.mkdir(AICROWD_RESULTS_DIR)

if not os.path.exists(DATASET_DIR + '/inputs'):
    os.mkdir(DATASET_DIR + '/inputs')

# Do not edit this cell, generate results with it as is
input_dir = os.path.join(DATASET_DIR, 'inputs')

if not os.path.exists(AICROWD_RESULTS_DIR):
    os.mkdir(AICROWD_RESULTS_DIR)

for params_file in os.listdir(input_dir):
    kwargs = np.load(os.path.join(input_dir, params_file), allow_pickle=True).item()
    print(kwargs)
    results = get_results(kwargs)
    idx = params_file.split('_')[-1][:-4]
    np.save(os.path.join(AICROWD_RESULTS_DIR, 'results_' + idx), results)
# Check your score on the given test cases (There are more private test cases not provided)
result_folder = AICROWD_RESULTS_DIR
target_folder = os.path.join(DATASET_DIR, 'targets')
def check_algo_match(results, targets):
    param_matches = []
    for k in results:
        param_results = results[k]
        param_targets = targets[k]
        policy_match = param_results['Policy'] == param_targets['Policy']
        rv = [v for k, v in param_results['Expected Reward'].items()]
        tv = [v for k, v in param_targets['Expected Reward'].items()]
        rewards_match = np.allclose(rv, tv, atol=1e-1)
        equal = rewards_match and policy_match
        param_matches.append(equal)
    return np.mean(param_matches)
def check_score(target_folder, result_folder):
    match = []
    for out_file in os.listdir(result_folder):
        res_file = os.path.join(result_folder, out_file)
        results = np.load(res_file, allow_pickle=True).item()
        idx = out_file.split('_')[-1][:-4]  # Extract the file number
        target_file = os.path.join(target_folder, f"targets_{idx}.npy")
        targets = np.load(target_file, allow_pickle=True).item()
        algo_match = []
        for k in targets:
            algo_results = results[k]
            algo_targets = targets[k]
            algo_match.append(check_algo_match(algo_results, algo_targets))
        match.append(np.mean(algo_match))
    return np.mean(match)
if os.path.exists(target_folder):
    print("Shared data Score (normalized to 1):", check_score(target_folder, result_folder))
Answer the following¶
- How do different values of γ affect the policy iteration results from 1.2? Explain your findings.
Your Answer:
- Give alternate transition probabilities for action 2 (if they exist) such that the optimal policy includes action 2. Explain your answer.
Your Answer:
Submit to AIcrowd 🚀¶
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
-c rl-assignment-2-taxi -a assets