                    An Opponent Modeling and Strategy Integration Framework for Texas Hold'em

                    Zhang Meng, Li Kai, Wu Zhe, Zang Yi-Fan, Xu Hang, Xing Jun-Liang

                    Citation: Zhang Meng, Li Kai, Wu Zhe, Zang Yi-Fan, Xu Hang, Xing Jun-Liang. An opponent modeling and strategy integration framework for Texas Hold'em. Acta Automatica Sinica, 2021, x(x): 1–15. doi: 10.16383/j.aas.c210127


                    doi: 10.16383/j.aas.c210127


                    Funds: Supported by National Natural Science Foundation of China (62076238 and 61902402), National Key Research and Development Program of China (2020AAA0103401), Strategic Priority Research Program of Chinese Academy of Sciences (XDA27000000), and CCF-Tencent Open Research Fund (RAGR20200104)
                    More Information
                      Author Bio:

                      ZHANG Meng  Master's degree candidate in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in engineering from Jilin University in 2018. His research interests include computer game playing and reinforcement learning. E-mail: zhangmeng2018@ia.ac.cn

                      LI Kai  Associate professor at the Institute of Automation, Chinese Academy of Sciences. He received his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2018. His main research interests are large-scale imperfect-information games and deep multi-agent reinforcement learning. E-mail: kai.li@ia.ac.cn

                      WU Zhe  Master's degree candidate in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in engineering from Shandong University in 2019. His research interests include computer game playing and reinforcement learning. E-mail: wuzhe2019@ia.ac.cn

                      ZANG Yi-Fan  Ph.D. candidate in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in science from Jilin University in 2019. His research interests include multi-agent systems and reinforcement learning. E-mail: zangyifan2019@ia.ac.cn

                      XU Hang  Master's degree candidate in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in engineering from Wuhan University in 2020. His research interests include computer game playing and reinforcement learning. E-mail: xuhang2020@ia.ac.cn

                      XING Jun-Liang  Professor at the Institute of Automation, Chinese Academy of Sciences, and teaching professor at the University of Chinese Academy of Sciences. He received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, in 2012. His main research interest is computer game playing. He is the corresponding author of this paper. E-mail: jlxing@nlpr.ia.ac.cn

                    • Abstract: Large-scale imperfect-information games, typified by Texas Hold'em, are a common type of game in the real world. Mainstream Texas Hold'em algorithms, which aim to compute Nash equilibrium strategies, rely on game-tree models, consume large amounts of computation, and produce overly conservative strategies, so the resulting agents cannot maximize their payoff against different opponents. To address these problems, this paper proposes a lightweight and efficient framework for solving imperfect-information games that can quickly adapt to changes in the opponent's strategy and thereby exploit the opponent. The framework consists of two stages: offline agent training and online play. In the first stage, agents are trained by evolutionary learning, yielding policy neural networks that can exploit opponents with different play styles. In the online stage, the agent models and adapts to opponents of unknown style and maximizes exploitation through population strategy integration. Experimental results in heads-up no-limit Texas Hold'em show that, against dynamic opponent strategies, the framework substantially improves game performance compared with existing methods.
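
                    As a rough illustration of the two-stage design described in the abstract (offline evolutionary training of a population of exploitative policies, then online opponent modeling and population strategy integration), the Python sketch below shows the control flow only. The toy fitness function, the parameter shapes, and all names (exploit_value, offline_training, online_play) are hypothetical stand-ins, not the paper's implementation.

# Illustrative sketch only: a generic "evolve offline, adapt online" loop in the
# spirit of the abstract. Names, numbers and the toy fitness function are
# hypothetical stand-ins, not the paper's implementation.
import math
import random

STYLE_DIM = 2  # e.g. (hand looseness, aggressiveness), cf. Table 1


def exploit_value(policy, opponent_style):
    # Toy stand-in for "average winnings of this policy against this opponent".
    return -sum((p - s) ** 2 for p, s in zip(policy, opponent_style))


def mutate(policy, strength=0.1):
    return [p + random.gauss(0.0, strength) for p in policy]


def offline_training(opponent_pool, generations=50, pop_size=20, survival_rate=0.25):
    """Phase 1: evolutionary learning of a population of exploitative policies."""
    population = [[random.uniform(0.0, 1.0) for _ in range(STYLE_DIM)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda p: sum(exploit_value(p, o) for o in opponent_pool),
                        reverse=True)
        survivors = ranked[: max(1, int(pop_size * survival_rate))]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return population


def online_play(population, observe_style, hands=200):
    """Phase 2: estimate the opponent's style online, then integrate the
    population's strategies, weighting members by how well they exploit it."""
    estimate = [0.5] * STYLE_DIM
    for i in range(1, hands + 1):
        obs = observe_style()  # noisy per-hand observation of the opponent
        estimate = [(e * (i - 1) + o) / i for e, o in zip(estimate, obs)]
    weights = [math.exp(exploit_value(p, estimate)) for p in population]
    total = sum(weights)
    return [sum(w * p[d] for w, p in zip(weights, population)) / total
            for d in range(STYLE_DIM)]  # blended ("integrated") policy


if __name__ == "__main__":
    pool = [[0.7, 0.1], [0.7, 0.9], [0.1, 0.1], [0.1, 0.9]]  # loose/tight x passive/aggressive
    population = offline_training(pool)
    hidden_style = [0.3, 0.8]
    print(online_play(population, lambda: [s + random.gauss(0.0, 0.2) for s in hidden_style]))
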
                    • Fig. 1  Texas Hold'em hand-ranking rules

                      Fig. 2  Overall workflow of the imperfect-information game solving framework

                      Fig. 3  Opponent-pool strategy space and definition of play styles

                      Fig. 4  Offline training procedure and agent architecture

                      Fig. 5  Crossover and mutation of agent genes

                      Fig. 6  Opponent play-style measurement module

                      Fig. 7  Population strategy integration module

                      Fig. 8  Effect of the number of hidden-layer neurons in the policy output network on average population fitness

                      Fig. 9  Effect of population survival rate on average population fitness

                      Fig. 10  Effect of different training strategies on average population fitness

                      Fig. 11  Change in game performance during head-to-head evaluation

                      Table 1  The opponents' play styles and definitions

                      Name     Type               Hand looseness   Aggressiveness
                      $O_1$    Loose-passive      70 %             Extremely conservative
                      $O_2$    Loose-aggressive   70 %             Extremely aggressive
                      $O_3$    Tight-passive      10 %             Extremely conservative
                      $O_4$    Tight-aggressive   10 %             Extremely aggressive
                      $O_5$    Loose-passive      50 %             Relatively conservative
                      $O_6$    Loose-aggressive   50 %             Relatively aggressive
                      $O_7$    Tight-passive      30 %             Relatively conservative
                      $O_8$    Tight-aggressive   30 %             Relatively aggressive
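
                    One possible reading of Table 1, sketched below in Python, treats "hand looseness" as the fraction of starting hands an opponent plays and "aggressiveness" as its probability of betting or raising rather than checking or calling. The function make_opponent and the numeric aggressiveness probabilities are assumptions for illustration only, not the paper's opponent implementation.

# Hypothetical illustration of how the opponents in Table 1 could be parameterized:
# "hand looseness" as the fraction of starting hands the opponent plays, and
# "aggressiveness" as its probability of betting or raising rather than calling.
# An assumed reading of the table, not code from the paper.
import random


def make_opponent(looseness, aggressiveness):
    """Return a toy policy mapping a hand strength in [0, 1] to an action string."""
    def act(hand_strength):
        if hand_strength < 1.0 - looseness:  # hand is outside the top `looseness` fraction
            return "fold"
        return "raise" if random.random() < aggressiveness else "call"
    return act


# Table 1 styles, e.g. O1 (loose-passive) and O4 (tight-aggressive); the numeric
# aggressiveness probabilities below are illustrative, not from the paper.
o1 = make_opponent(looseness=0.70, aggressiveness=0.05)  # plays 70 % of hands, rarely raises
o4 = make_opponent(looseness=0.10, aggressiveness=0.95)  # plays 10 % of hands, raises almost always

print(o1(0.80), o4(0.95), o4(0.50))  # e.g. call raise fold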

                      Table 2  Policy network structure and training hyper-parameters

                      Parameter                                                            Value
                      Number of LSTM blocks in the opponent feature network                5
                      Number of LSTM time steps in the opponent feature network            5
                      Output dimension of the opponent feature network                     200
                      Number of LSTM blocks in the game feature network                    5
                      Number of LSTM time steps in the game feature network                5
                      Output dimension of the game feature network                         300
                      Number of input-layer neurons in the policy output network           500
                      Number of hidden layers in the policy output network                 2
                      Number of neurons per hidden layer in the policy output network      300
                      Number of output-layer neurons in the policy output network          10
                      Number of evolution generations                                      300
                      Population size                                                      100
                      Population survival rate                                             0.25
                      Gene mutation rate (initial / final)                                 0.25 / 0.05
                      Gene mutation strength (initial / final)                             0.5 / 0.1
                      Number of training hands against each single opponent                10000
                      Number of hands played to build the opponent feature library         100000
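
                    The sketch below shows one way the architecture in Table 2 could be assembled, written here in PyTorch as an assumption (the paper does not specify a framework). The raw input feature sizes opp_in and game_in are placeholders not given in the table, and "LSTM block" is read here as one stacked LSTM layer.

# One possible assembly of the Table 2 architecture, sketched in PyTorch (the paper
# does not specify a framework). The raw input sizes opp_in and game_in are
# placeholders not given in the table; "LSTM block" is read as one stacked layer.
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    def __init__(self, opp_in=32, game_in=64):
        super().__init__()
        # Opponent feature network: 5 LSTM layers, 200-dim output (Table 2).
        self.opp_lstm = nn.LSTM(opp_in, 200, num_layers=5, batch_first=True)
        # Game feature network: 5 LSTM layers, 300-dim output.
        self.game_lstm = nn.LSTM(game_in, 300, num_layers=5, batch_first=True)
        # Policy output network: 500 inputs, two hidden layers of 300 neurons, 10 outputs.
        self.head = nn.Sequential(
            nn.Linear(500, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 10),
        )

    def forward(self, opp_seq, game_seq):
        # Both sequences use 5 time steps (Table 2); take the last hidden state.
        opp_feat, _ = self.opp_lstm(opp_seq)
        game_feat, _ = self.game_lstm(game_seq)
        joint = torch.cat([opp_feat[:, -1], game_feat[:, -1]], dim=-1)  # 200 + 300 = 500
        return self.head(joint)  # 10 outputs, as in Table 2


net = PolicyNet()
out = net(torch.randn(1, 5, 32), torch.randn(1, 5, 64))
print(out.shape)  # torch.Size([1, 10])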

                      Table 3  Ablation study results in mbb/h (milli big blinds per hand); each value is the row agent's average winnings against the column opponent

                                    $O_1$      $O_2$      $O_3$      $O_4$      $O_5$      $O_6$      $O_7$      $O_8$      $O_{random}$
                      Slumbot       702.53     12761      4942.58    14983      652.73     2623.14    484.29     2449.08    3387.13
                      $A_{tar}$     999.92     29232      1494.92    27474      1391.04    12746      1371.10    34546      –
                      $A_{base}$    1000.00    22611      1205.05    20380      1109.84    9892.43    793.42     14568      5105.38
                      $A_{ave}$     999.91     78.46      34.06      −5537.19   927.84     92.36      −631.55    −4461.82   −1068.44
                      $A_{int}$     999.92     29964      1305.04    27314      1316.21    12874      1380.88    18330      2738.98
                      $A^*$         1000.00    24888      1310.34    27526      1286.08    11253      1020.38    16514      6359.36

                      Table 4  Performance comparison results (mbb/h); each cell is the row agent's winnings against the column agent

                                           $A^*$       ASHE        Slumbot     DeepStack   NFSP        Knowledge-based AI   $O_{random}$
                      $A^*$                –           675.68      −48.49      −896.76     32255       229.64               6359.36
                      ASHE                 −675.68     –           −153.35     −1552.64    11904       −13.00               3177.68
                      Slumbot              48.49       153.35      –           −103.44     8623.18     52.43                3387.13
                      DeepStack            896.76      1552.64     103.44      –           4084.27     139.41               1791.27
                      NFSP                 −32255      −11904      −8623.18    −4084.27    –           −3257.75             −18819
                      Knowledge-based AI   −229.64     13.00       −52.43      −139.41     3257.75     –                    −91.92
                      $O_{random}$         −6859.36    −3177.68    −3387.13    −1791.27    18819       91.92                –

                      Table 5  Light-weight comparison

                                      Training stage                          Evaluation stage
                                      Storage (GB)   Compute (h)              Storage (GB)   Compute (h)            Response time (s)
                      $A^*$           ~30            ~2×10³ CPU               <0.5           <0.1 CPU               <0.1
                      ASHE            ~30            ~10³ CPU                 ~30            <0.1 CPU               <0.1
                      Slumbot         >500           >10⁵ CPU                 >500           >10 CPU                ~1
                      DeepStack       >500           >10⁶ CPU, >10³ GPU       >10            ~10³ CPU, ~10³ GPU     ~30
                      NFSP            >50            ~10⁴ CPU, ~10² GPU       ~1             <1 CPU, <1 GPU         <1
                      Human player    –              –                        –              –                      ~15
                    Publication History
                    • Received Date: 2021-02-06
                    • Accepted Date: 2021-05-31
                    • Available Online: 2021-07-14
