Contents

Chapter 1 Introduction
  1.1 Research Background and Significance
  1.2 Current State of Research
  1.3 Main Work and Organization of the Thesis
Chapter 2 Background
  2.1 Markov Decision Processes
  2.2 Model-Based Dynamic Programming Methods
  2.3 Theory of Monte Carlo Methods
  2.4 Temporal-Difference Reinforcement Learning Methods
    2.4.1 SARSA Learning
    2.4.2 Q-learning
Chapter 3 Q-learning and Its Improved Algorithms
  3.1 The Q-learning Algorithm
    3.1.1 Introduction to Q-learning
    3.1.2 Convergence of Q-learning
    3.1.3 Analysis of Q-learning
    3.1.4 Proof That a Single Estimator Causes Overestimation
  3.2 Double Q-learning
    3.2.1 The Proposal of Double Q-learning
    3.2.2 Algorithm Analysis
  3.3 Weighted Q-learning
    3.3.1 Introduction
...