Implementation
Implemented Q-learning from scratch to train an agent that navigates the Taxi-v3 grid environment. The core challenge was balancing exploration against exploitation: the agent starts out acting completely at random and gradually converges toward an optimal policy over 2,000 episodes using an epsilon-greedy strategy. Hyperparameters such as the learning rate, discount factor, and decay rate are all configurable via CLI arguments.
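def update_q_table(state, action, reward, next_state):
    # Blend the old estimate with the bootstrapped target, weighted by ALPHA.
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])  # best value available from next_state
    q_table[state, action] = (
        (1 - ALPHA) * old_value
        + ALPHA * (reward + GAMMA * next_max)
    )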
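
The epsilon-greedy side of the agent is not shown above, so here is a minimal sketch of how action selection and epsilon decay could look. The names `epsilon_for`, `choose_action`, `EPSILON_START`, `EPSILON_MIN`, and `DECAY_RATE`, the exponential schedule, and all numeric values are assumptions made for illustration and are not taken from the project. Only the table shape (500 states, 6 actions) comes from Taxi-v3 itself.

```python
import numpy as np

# Taxi-v3 has 500 discrete states and 6 discrete actions.
N_STATES, N_ACTIONS = 500, 6

# Hypothetical schedule values, chosen only for this sketch.
EPSILON_START = 1.0   # fully random at the start of training
EPSILON_MIN = 0.01    # floor so the agent never stops exploring entirely
DECAY_RATE = 0.005    # assumed exponential decay rate per episode

rng = np.random.default_rng(seed=0)


def epsilon_for(episode):
    """Decay epsilon exponentially from EPSILON_START toward EPSILON_MIN."""
    return EPSILON_MIN + (EPSILON_START - EPSILON_MIN) * np.exp(-DECAY_RATE * episode)


def choose_action(q_table, state, epsilon):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))   # uniform random action
    return int(np.argmax(q_table[state]))     # action with the highest Q-value
```

In a training loop, `epsilon_for(episode)` would be evaluated once per episode and passed to `choose_action` at every step. With the assumed values, epsilon sits close to its floor by episode 2,000, so late episodes are almost entirely greedy.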
