I am doing Q-learning with the Super Mario Bros gym environment. I am trying to retrieve the best action with np.argmax, which should return a value between 1 and 12. Instead it sometimes returns values like 224440, and as the program runs it seems to do this more and more often.
I have tried logging the shape of the action to see if I made a mistake elsewhere, and I have tried printing almost every value to check whether something was set incorrectly, but I can't find anything.
For now I am catching these incorrect actions so they don't crash the program, and substituting a random action instead. That is obviously not a solution; it is only for debugging.
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import COMPLEX_MOVEMENT
from collections import defaultdict
#imports
import random
import numpy as np
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env,COMPLEX_MOVEMENT)
Q = np.zeros((240 * 256 * 3,env.action_space.n)) # state size is based on 3 dimensional values of the screen
# hyper-parameters
epsilon = 0.1
alpha = 0.5 # Learning rate
gamma = 0.5 # Decay
# number of GAMES
episodes = 500000000000
for episode in range(1, episodes):
    print("Starting episode: " + str(episode))
    state = env.reset()
    finished = False
    # number of steps
    while not finished:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        ## FIX THIS!
        if action > 12 or action < 0:
            #print("Random: " + str(np.argmax(Q[state,:])))
            print(action)
            print(Q.shape)
            action = env.action_space.sample()
        new_state, reward, done, info = env.step(action)
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])
        state = new_state
        env.render()
        if done:
            finished = True
env.close()
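One thing I noticed while debugging, which may or may not be the cause: `env.reset()` returns the raw screen frame (a NumPy array), not an integer state index. When a NumPy array is used to index `Q`, it performs fancy indexing and selects many rows at once, and `np.argmax` with no `axis` argument then flattens the whole result, so the returned index can be far larger than the number of actions. A minimal sketch of that NumPy behavior (the small shapes here are stand-ins, not the real environment):

```python
import numpy as np

# Tiny Q-table: 10 states, 12 actions
Q = np.zeros((10, 12))
Q[3, 5] = 1.0

# Indexing with a scalar state gives one 1-D row; argmax returns 0..11
print(np.argmax(Q[3]))        # prints 5

# Indexing with an ARRAY (like a screen frame) selects a row per element.
# A 2x2 "frame" of state ids yields Q[frame] with shape (2, 2, 12),
# and argmax without axis= flattens it, so the result can exceed 11.
frame = np.zeros((2, 2), dtype=int)
frame[1, 1] = 3
print(Q[frame].shape)         # prints (2, 2, 12)
print(np.argmax(Q[frame]))    # prints 41 -- a flattened index, not an action
```

With a real (240, 256, 3) frame the flattened array has 240 * 256 * 3 * 12 entries, which would explain indices like 224440.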
Since I am still learning and experimenting with these concepts, it is quite possible I have misunderstood something here. Any input or help would be greatly appreciated.