
The State of the Art and Prospects of Lip Reading

Chen Xiao-Ding, Sheng Chang-Chong, Kuang Gang-Yao, Liu Li

Citation: Chen Xiao-Ding, Sheng Chang-Chong, Kuang Gang-Yao, Liu Li. The state of the art and prospects of lip reading. Acta Automatica Sinica, 2020, 46(11): 2275−2301 doi: 10.16383/j.aas.c190531


                    doi: 10.16383/j.aas.c190531
Funds: Supported by the National Natural Science Foundation of China (61872379)

About the authors:

Chen Xiao-Ding  Master student at the College of Systems Engineering, National University of Defense Technology. His research interests include computer vision and pattern recognition. E-mail: chenxiaoding14@nudt.edu.cn

Sheng Chang-Chong  Ph.D. candidate at the College of Electronic Science and Technology, National University of Defense Technology. His research interests include computer vision and pattern recognition. E-mail: sheng_cc@nudt.edu.cn

Kuang Gang-Yao  Professor at the College of Electronic Science and Technology, National University of Defense Technology. His research interests include remote sensing image processing and target recognition. E-mail: kuanggangyao@nudt.edu.cn

Liu Li  Associate professor at the College of Systems Engineering, National University of Defense Technology. Her research interests include image understanding, computer vision, and pattern recognition. Corresponding author of this paper. E-mail: liuli_nudt@nudt.edu.cn

• Abstract: Lip reading, also known as visual speech recognition, aims to decode the text being spoken from the visual information of a speaker's lip movements. It is an important problem in computer vision and pattern recognition, with broad application value in public security, healthcare, national defense, and film and entertainment. In recent years, deep learning techniques have greatly advanced lip reading research. This paper first describes the scope and significance of lip reading research and analyzes in depth its difficulties and challenges. It then reviews the current state and level of development of lip reading research, organizing, categorizing, and commenting on recent mainstream lip reading methods, including traditional methods and recent deep learning based methods. Finally, it discusses open problems and possible research directions, with the aim of drawing attention and interest to the lip reading problem and promoting progress on related research.
• Fig. 1  Illustration of the lip reading task

Fig. 2  Challenging examples in lip reading. (a) The upper row is an instance of the word place and the lower row an instance of the word please; the lip motions are hard to distinguish (frames from the GRID dataset). (b) The upper and lower rows are the two pronunciations /wind/ and /waind/ of the word wind in different contexts, with clearly different lip motions. (c) The upper and lower rows are two speakers saying the same word after, with noticeable differences in lip motion (frames from the LRS3-TED dataset). (d) An example of a speaker's head pose changing continuously during speech. All of the above comparison examples use the same video duration and sampling interval.

Fig. 3  The general pipeline of lip reading methods

Fig. 4  Representative methods in the development of lip reading research. Traditional feature extraction methods: active shape models (ASM) [51], active appearance models (AAM) [39], HiLDA [38], LBP-TOP [52], local discriminant graph (LDG) [40], graph embedding [53], random forest manifold alignment (RFMA) [41], latent variable methods [54]. Deep learning based methods: DBN/CNN+HMM hybrid models [42-48], SyncNet [55], LipNet [49], WLAS [10], Transformer [50], LCANet [56], V2P [15].

Fig. 5  The general workflow of linear transformation feature extraction methods

Fig. 6  Curve mapping of consecutive frames

Fig. 7  The ${\rm LBP}_{8,1}$ operator
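For reference, assuming the standard local binary pattern definition of [82-83], the code computed by the operator in Fig. 7 for a center pixel with gray value $g_c$ and $P$ neighbors $g_0, \dots, g_{P-1}$ sampled on a circle of radius $R$ is

$${\rm LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p}, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0, \end{cases}$$

so ${\rm LBP}_{8,1}$ thresholds the $P = 8$ neighbors at radius $R = 1$ against the center pixel and packs the comparison results into an 8-bit code.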

Fig. 8  Block-based LBP-TOP feature extraction

Fig. 9  Articulatory features of speech production

Fig. 10  ASM model of the lip contour

Fig. 11  A typical CNN structure

Fig. 12  The structures of RNN, LSTM and GRU

Fig. 13  The basic CNN-RNN framework

Fig. 14  The network architecture of LipNet
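As a rough illustration of the pipeline in Fig. 14, the following is a minimal PyTorch sketch of a LipNet-style model as it is characterized in this survey (a 3D-CNN spatiotemporal front-end, a bidirectional GRU back-end, and per-frame character logits trained with CTC, cf. [49] and Table 4). The class name, the number of convolutional blocks, and all layer sizes are illustrative assumptions rather than the published configuration.

# Minimal LipNet-style sketch: 3D-CNN front-end + BiGRU + per-frame character
# logits for CTC decoding. Sizes are illustrative, not the published ones.
import torch
import torch.nn as nn

class LipNetLike(nn.Module):
    def __init__(self, num_chars=27, hidden=256):
        super().__init__()
        # Spatiotemporal front-end over (time, height, width).
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Temporal back-end: 2-layer bidirectional GRU over per-frame features.
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Per-frame logits over the character set plus the CTC blank symbol.
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)

    def forward(self, video):              # video: (B, 3, T, H, W)
        feats = self.frontend(video)       # (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))     # spatial average pooling -> (B, 64, T)
        feats = feats.transpose(1, 2)      # (B, T, 64)
        out, _ = self.gru(feats)           # (B, T, 2 * hidden)
        return self.classifier(out)        # per-frame logits: (B, T, num_chars + 1)

# Training against the CTC objective (blank index = num_chars):
# log_probs = model(video).log_softmax(-1).transpose(0, 1)   # (T, B, C)
# loss = nn.CTCLoss(blank=num_chars)(log_probs, targets, input_lens, target_lens)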

Fig. 15  The network architecture of WAS

Fig. 16  Three lip reading network models

Fig. 17  The trends of different types of datasets

Fig. 18  Examples from the different datasets

Table 1  A summary of the advantages and disadvantages of traditional spatiotemporal feature extraction methods

Method type | Representative methods | Advantages | Disadvantages
Appearance-based | Global image linear transforms [38,57,60-63], graph embedding and manifolds [40-41, 53-54, 65], LBP-TOP [52,66], HOG [67], optical flow [29, 68], etc. | 1) Fast feature extraction; 2) No complex manual modeling required. | 1) Requires highly accurate extraction of the lip region; 2) Sensitive to environment changes, pose changes and noise; 3) Poor generalization across speakers.
Shape-based | Contour descriptions [69-72], articulatory features (AFs) [73], shape models [74-75], etc. | 1) Good interpretability; 2) Better generalization across speakers; 3) Effectively removes redundant information. | 1) Loses part of the useful information; 2) Requires a large amount of manual annotation; 3) Very sensitive to pose changes.
Shape-appearance fusion | Concatenation of shape and appearance features [76-77], shape-appearance models [39], etc. | 1) Strong feature representation ability; 2) Better generalization across speakers. | 1) Complex models with heavy computation; 2) Requires a large amount of manual annotation.

Table 3  Word, phrase and sentence lip reading datasets, where (s) denotes the number of distinct sentences. Download links: MIRACL-VC [171], LRW [172], LRW-1000 [173], GRID [174], OuluVS [175], VIDTIMIT [176], LILiR [177], MOBIO [178], TCD-TIMIT [179], LRS [180], VLRF [181]

Dataset | Language | Task | Vocabulary | Utterances | Speakers | Pose (°) | Resolution | Google citations | Year
IBMViaVoice | English | Sentences | 10 500 | 24 325 | 290 | 0 | 704 × 480, 30 fps | 299 | 2000
VIDTIMIT | English | Sentences | 346 (s) | 430 | 43 | 0 | 512 × 384, 25 fps | 51 | 2002
AVICAR | English | Sentences | 1 317 | 10 000 | 100 | −15 ~ 15 | 720 × 480, 30 fps | 170 | 2004
AV-TIMIT | English | Sentences | 450 (s) | 4 660 | 233 | 0 | 720 × 480, 30 fps | 127 | 2004
GRID | English | Phrases | 51 | 34 000 | 34 | 0 | 720 × 576, 25 fps | 700 | 2006
IV2 | French | Sentences | 15 (s) | 4 500 | 300 | 0, 90 | 780 × 576, 25 fps | 19 | 2008
UWB-07-ICAV | Czech | Sentences | 7 550 (s) | 10 000 | 50 | 0 | 720 × 576, 50 fps | 16 | 2008
OuluVS | English | Phrases | 10 (s) | 1 000 | 20 | 0 | 720 × 576, 25 fps | 211 | 2009
WAPUSK20 | English | Phrases | 52 | 2 000 | 20 | 0 | 640 × 480, 32 fps | 16 | 2010
LILiR | English | Sentences | 1 000 | 2 400 | 12 | 0, 30, 45, 60, 90 | 720 × 576, 25 fps | 67 | 2010
BL | French | Sentences | 238 (s) | 4 046 | 17 | 0, 90 | 720 × 576, 25 fps | 12 | 2011
UNMC-VIER | English | Sentences | 11 (s) | 4 551 | 123 | 0, 90 | 708 × 640, 25 fps | 8 | 2011
MOBIO | English | Sentences | | 30 186 | 152 | 0 | 640 × 480, 16 fps | 175 | 2012
MIRACL-VC | English | Words | 10 | 1 500 | 15 | 0 | 640 × 480, 15 fps | 22 | 2014
MIRACL-VC | English | Phrases | 10 (s) | 1 500 | 15 | 0 | 640 × 480, 15 fps | 22 | 2014
Austalk | English | Words | 966 | 966 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
Austalk | English | Sentences | 59 (s) | 59 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
MODALITY | English | Words | 182 (s) | 231 | 35 | 0 | 1 920 × 1 080, 100 fps | 23 | 2015
RM-3000 | English | Sentences | 1 000 | 3 000 | 1 | 0 | 360 × 640, 60 fps | 7 | 2015
IBM AV-ASR | English | Sentences | | 10 400 | 262 | 0 | 704 × 480, 30 fps | 103 | 2015
TCD-TIMIT | English | Sentences | 5 954 (s) | 6 913 | 62 | 0, 30 | 1 920 × 1 080, 30 fps | 59 | 2015
OuluVS2 | English | Phrases | 10 | 1 590 | 53 | 0, 30, 45, 60, 90 | 1 920 × 1 080, 30 fps | 46 | 2015
OuluVS2 | English | Sentences | 530 (s) | 530 | 53 | 0, 30, 45, 60, 90 | 1 920 × 1 080, 30 fps | 46 | 2015
LRW | English | Words | 500 | 550 000 | 1 000+ | 0 ~ 30 | 256 × 256, 25 fps | 115 | 2016
HAVRUS | Russian | Sentences | 1 530 (s) | 4 000 | 20 | 0 | 640 × 480, 200 fps | 13 | 2016
LRS2-BBC | English | Sentences | 62 769 | 144 482 | 1 000+ | 0 ~ 30 | 160 × 160, 25 fps | 172 | 2017
VLRF | Spanish | Sentences | 1 374 | 10 200 | 24 | 0 | 1 280 × 720, 50 fps | 6 | 2017
LRS3-TED | English | Sentences | 70 000 | 151 819 | 1 000+ | −90 ~ 90 | 224 × 224, 25 fps | 2 | 2018
LRW-1000 | Chinese | Words | 1 000 | 745 187 | 2 000+ | −90 ~ 90 | 1 920 × 1 080, 25 fps | 0 | 2018
LSVSR | English | Sentences | 127 055 | 2 934 899 | 1 000+ | −30 ~ 30 | 128 × 128, 23 ~ 30 fps | 16 | 2018

Table 2  Alphabet and digit lip reading datasets. Download links: AVLetters [152], AVICAR [153], XM2VTS [154], BANCA [155], CUAVE [156], VALID [157], CENSREC-1-AV [158], Austalk [159], OuluVS2 [160]

Dataset | Language | Task | Number of classes | Utterances | Speakers | Pose (°) | Resolution | Google citations | Year
AVLetters | English | Letters | 26 | 780 | 10 | 0 | 376 × 288, 25 fps | 507 | 1998
XM2VTS | English | Digits | 10 | 885 | 295 | 0 | 720 × 576, 25 fps | 1 617 | 1999
BANCA | Multilingual | Digits | 10 | 29 952 | 208 | 0 | 720 × 576, 25 fps | 530 | 2003
AVICAR | English | Letters | 26 | 26 000 | 100 | −15 ~ 15 | 720 × 480, 30 fps | 170 | 2004
AVICAR | English | Digits | 13 | 23 000 | 100 | −15 ~ 15 | 720 × 480, 30 fps | 170 | 2004
CUAVE | English | Digits | 10 | 7 000+ | 36 | −90, 0, 90 | 720 × 480, 30 fps | 292 | 2002
VALID | English | Digits | 10 | 530 | 106 | 0 | 720 × 576, 25 fps | 38 | 2005
AVLetters2 | English | Letters | 26 | 910 | 5 | 0 | 1 920 × 1 080, 50 fps | 62 | 2008
IBMSR | English | Digits | 10 | 1 661 | 38 | −90, 0, 90 | 368 × 240, 30 fps | 17 | 2008
CENSREC-1-AV | Japanese | Digits | 10 | 5 197 | 93 | 0 | 720 × 480, 30 fps | 25 | 2010
QuLips | English | Digits | 10 | 3 600 | 2 | −90 ~ 90 | 720 × 576, 25 fps | 21 | 2010
Austalk | English | Digits | 10 | 24 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
OuluVS2 | English | Digits | 10 | 159 | 53 | 0 ~ 90 | 1 920 × 1 080, 30 fps | 46 | 2015

Table 4  Comparison of representative methods on different datasets

Dataset | Task | Reference | Front-end feature extraction | Back-end classifier | Recognition unit | Accuracy
AVLetters | Letters | [41] | RFMA | | Letter | 69.60 %
AVLetters | Letters | [48] | RTMRBM | SVM | Letter | 66.00 %
AVLetters | Letters | [42] | ST-PCA | Autoencoder | Letter | 64.40 %
AVLetters | Letters | [52] | LBP-TOP | SVM | Letter | 62.80 % / 43.50 %
AVLetters | Letters | [193] | DBNF+DCT | LSTM | Letter | 58.10 %
CUAVE | Digits | [102] | AAM | HMM | Digit | 83.00 %
CUAVE | Digits | [91] | HOG+MBH | SVM | Digit | 70.10 % / 90.00 %
CUAVE | Digits | [194] | DBNF | DNN-HMM | Phoneme | 64.90 %
CUAVE | Digits | [60] | DCT | HMM | Digit | 60.40 %
LRW | Words | [128] | 3D-CNN+ResNet | BiLSTM | Word | 83.00 %
LRW | Words | [131] | 3D-CNN+ResNet | BiGRU | Word | 82.00 % / 98.00 %
LRW | Words | [10] | CNN | LSTM+Attention | Word | 76.20 %
LRW | Words | [9] | CNN | | Word | 61.10 %
GRID | Phrases | [56] | 3D-CNN+highway | BiGRU+Attention | Character | 97.10 %
GRID | Phrases | [10] | CNN | LSTM+Attention | Word | 97.00 %
GRID | Phrases | [134] | Feed-forward | LSTM | Word | 84.70 % / 95.90 %
GRID | Phrases | [49] | 3D-CNN | BiGRU | Character | 93.40 %
GRID | Phrases | [126] | HOG | SVM | Word | 71.20 %
LRS3-TED | Sentences | [151] | 3D-CNN+ResNet | Transformer+seq2seq | Character | 41.10 %
LRS3-TED | Sentences | [151] | 3D-CNN+ResNet | Transformer+CTC | Character | 33.70 %
LRS3-TED | Sentences | [15] | 3D-CNN | BiLSTM+CTC | Phoneme | 44.90 %
                    • [1] McGurk H, MacDonald J. Hearing lips and seeing voices. Nature, 1976, 264(5588): 746?748 doi: 10.1038/264746a0
                      [2] Potamianos G, Neti C, Gravier G, Garg A, Senior A W. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 2003, 91(9): 1306?1326 doi: 10.1109/JPROC.2003.817150
                      [3] Calvert G A, Bullmore E T, Brammer M J, Campbell R, Williams S C R, McGuire P K, et al. Activation of auditory cortex during silent lipreading. Science, 1997, 276(5312): 593?596 doi: 10.1126/science.276.5312.593
                      [4] Deafness and hearing loss [online] available:https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, July 1, 2019
                      [5] Tye-Murray N, Sommers M S, Spehar B. Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear and Hearing, 2007, 28(5): 656?668 doi: 10.1097/AUD.0b013e31812f7185
                      [6] Akhtar Z, Micheloni C, Foresti G L. Biometric liveness detection: Challenges and research opportunities. IEEE Security and Privacy, 2015, 13(5): 63?72 doi: 10.1109/MSP.2015.116
                      [7] Rekik A, Ben-Hamadou A, Mahdi W. Human machine interaction via visual speech spotting. In: Proceedings of the 2015 International Conference on Advanced Concepts for Intelligent Vision Systems. Catania, Italy: Springer, 2015. 566?574
                      [8] Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing obama: Learning lip sync from audio. ACM Transactions on Graphics, 2017, 36(4): Article No.95
                      [9] Chung J S, Zisserman A. Lip reading in the wild. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 87?103
                      [10] Chung J S, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 3444?3453
[11] Chen L, Li Z, Maddox R K, Duan Z, Xu C. Lip movements generation at a glance. In: Proceedings of the 2018 European Conference on Computer Vision. Munich, Germany: Springer, 2018. 538−553
                      [12] Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017
[13] Huang Ya-Ting, Shi Jing, Xu Jia-Ming, Xu Bo. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 234−251 (in Chinese)
                      [14] Akbari H, Arora H, Cao L L, Mesgarani N. Lip2AudSpec: Speech reconstruction from silent lip movements video. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2516?2520
                      [15] Shillingford B, Assael Y, Hoffman M W, Paine T, Hughes C, Prabhu U, et al. Large-scale visual speech recognition. arXiv preprint arXiv: 1807.05162, 2018
                      [16] Mandarin Audio-Visual Speech Recognition Challenge [online] available: http://vipl.ict.ac.cn/homepage/mavsr/index.html, July 1, 2019
                      [17] Potamianos G, Neti C, Luettin J, Matthews I. Audio-visual automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing. Cambridge: MIT Press, 2004. 1−30
                      [18] Zhou Z H, Zhao G Y, Hong X P, Pietikainen M. A review of recent advances in visual speech decoding. Image and Vision Computing, 2014, 32(9): 590?605 doi: 10.1016/j.imavis.2014.06.004
                      [19] Fernandez-Lopez A, Sukno F M. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 2018, 78: 53?72 doi: 10.1016/j.imavis.2018.07.002
[20] Yao Hong-Xun, Gao Wen, Wang Rui, Lang Xian-Bo. A survey of lipreading-one of visual languages. Acta Electronica Sinica, 2001, 29(2): 239−246 doi: 10.3321/j.issn:0372-2112.2001.02.025 (in Chinese)
                      [21] Cox S J, Harvey R W, Lan Y, et al. The challenge of multispeaker lip-reading. In: Proceedings of AVSP. 2008: 179?184
                      [22] Messer K, Matas J, Kittler J, et al. XM2VTSDB: The extended M2VTS database. In: Proceedings of the Second International Conference on Audio and Video-based Biometric Person Authentication. 1999, 964: 965?966
                      [23] Bailly-Bailliére E, Bengio S, Bimbot F, Hamouz M, Kittler J, Mariéthoz J, et al. The BANCA database and evaluation protocol. In: Proceedings of the 2003 International Conference on Audio- and Video-based Biometric Person Authentication. Guildford, United Kingdom: Springer, 2003. 625?638
                      [24] Ortega A, Sukno F, Lleida E, Frangi A F, Miguel A, Buera L, et al. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In: Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal: European Language Resources Association, 2004. 763−766
                      [25] Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, et al. AVICAR: Audio-visual speech corpus in a car environment. In: Proceedings of the 8th International Conference on Spoken Language Processing. Jeju Island, South Korea: International Speech Communication Association, 2004. 2489−2492
                      [26] Twaddell W F. On defining the phoneme. Language, 1935, 11(1): 5?62
                      [27] Woodward M F, Barber C G. Phoneme perception in lipreading. Journal of Speech and Hearing Research, 1960, 3(3): 212?222 doi: 10.1044/jshr.0303.212
                      [28] Fisher C G. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 1968, 11(4): 796?804 doi: 10.1044/jshr.1104.796
                      [29] Cappelletta L, Harte N. Viseme definitions comparison for visual-only speech recognition. In: Proceedings of the 19th European Signal Processing Conference. Barcelona, Spain: IEEE, 2011. 2109−2113
                      [30] Wu Y, Ji Q. Facial landmark detection: A literature survey. International Journal of Computer Vision, 2019, 127(2): 115?142 doi: 10.1007/s11263-018-1097-z
                      [31] Chrysos G G, Antonakos E, Snape P, Asthana A, Zafeiriou S. A comprehensive performance evaluation of deformable face tracking "in-the-wild". International Journal of Computer Vision, 2018, 126(2-4): 198?232 doi: 10.1007/s11263-017-0999-5
                      [32] Koumparoulis A, Potamianos G, Mroueh Y, et al. Exploring ROI size in deep learning based lipreading. In: Proceedings of AVSP. 2017: 64?69
                      [33] Deller J R Jr, Hansen J H L, Proakis J G. Discrete-Time Processing of Speech Signals. New York: Macmillan Pub. Co, 1993.
                      [34] Rabiner L R, Juang B H. Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall, 1993.
                      [35] Young S, Evermann G, Gales M J F, Hain T, Kershaw D, Liu X Y, et al. The HTK Book. Cambridge: Cambridge University Engineering Department, 2002.
                      [36] Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, et al. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE, 2011.
                      [37] Matthews I, Cootes T F, Bangham J A, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(2): 198?213 doi: 10.1109/34.982900
                      [38] Potamianos G, Graf H P, Cosatto E. An image transform approach for HMM based automatic lipreading. In: Proceedings of 1998 International Conference on Image Processing. Chicago, USA: IEEE, 1998. 173−177
                      [39] Cootes T F, Edwards G J, Taylor C J. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6): 681?685 doi: 10.1109/34.927467
                      [40] Fu Y, Zhou X, Liu M, Hasegawa-Johnson M, Huang T S. Lipreading by locality discriminant graph. In: Proceedings of 2007 IEEE International Conference on Image Processing. San Antonio, USA: IEEE, 2007. III−325−III−328
                      [41] Pei Y R, Kim T K, Zha H B. Unsupervised random forest manifold alignment for lipreading. In: Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 129−136
                      [42] Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng A Y. Multimodal deep learning. In: Proceeding of the 28th International Conference on Machine Learning. Washington, USA: ACM, 2011. 689−696
                      [43] Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 791−798
                      [44] Huang J, Kingsbury B. Audio-visual deep learning for noise robust speech recognition. In: Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 7596−7599
                      [45] Ninomiya H, Kitaoka N, Tamura S, et al. Integration of deep bottleneck features for audio-visual speech recognition. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015.
                      [46] Sui C, Bennamoun M, Togneri R. Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 154−162
                      [47] Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Audio-visual speech recognition using deep learning. Applied Intelligence, 2015, 42(4): 722?737 doi: 10.1007/s10489-014-0629-7
                      [48] Hu D, Li X L, Lu X Q. Temporal multimodal learning in audiovisual speech recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 3574−3582
                      [49] Assael Y M, Shillingford B, Whiteson S, De Freitas N. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016
                      [50] Afouras T, Chung J S, Zisserman A. Deep lip reading: A comparison of models and an online application. arXiv preprint arXiv:1806.06053, 2018
                      [51] Luettin J, Thacker N A. Speechreading using probabilistic models. Computer Vision and Image Understanding, 1997, 65(2): 163?178 doi: 10.1006/cviu.1996.0570
[52] Zhao G Y, Barnard M, Pietikäinen M. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 2009, 11(7): 1254−1265 doi: 10.1109/TMM.2009.2030637
[53] Zhou Z H, Zhao G Y, Pietikäinen M. Towards a practical lipreading system. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2011. 137−144
[54] Zhou Z H, Hong X P, Zhao G Y, Pietikäinen M. A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(1): 1
                      [55] Chung J S, Zisserman A. Out of time: Automated lip sync in the wild. In: Proceedings of Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 251−263
[56] Xu K, Li D W, Cassimatis N, Wang X L. LCANet: End-to-end lipreading with cascaded attention-CTC. In: Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition. Xi'an, China: IEEE, 2018. 548−555
                      [57] Lucey P J, Potamianos G, Sridharan S. A unified approach to multi-pose audio-visual ASR. In: Proceedings of the 8th Annual Conference of the International Speech Communication Association. Antwerp, Belgium: Causal Productions Pty Ltd., 2007. 650−653
                      [58] Almajai I, Cox S, Harvey R, Lan Y X. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In: Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2722−2726
                      [59] Seymour R, Stewart D, Ming J. Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP Journal on Image and Video Processing, 2007, 2008(1): Article No.810362
                      [60] Estellers V, Gurban M, Thiran J P. On dynamic stream weighting for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1145?1157 doi: 10.1109/TASL.2011.2172427
                      [61] Potamianos G, Neti C, Iyengar G, Senior A W, Verma A. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 2001, 4(3?4): 193?208
                      [62] Lucey P J, Sridharan S, Dean D B. Continuous pose-invariant lipreading. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008) incorporating the 12th Australasian International Conference on Speech Science and Technology (SST 2008). Brisbane Australia: International Speech Communication Association, 2008. 2679−2682
                      [63] Lucey P J, Potamianos G, Sridharan S. Patch-based analysis of visual speech from multiple views. In: Proceedings of the International Conference on Auditory-Visual Speech Processing 2008. Moreton Island, Australia: AVISA, 2008. 69−74
[64] Sheerman-Chase T, Ong E J, Bowden R. Cultural factors in the regression of non-verbal communication perception. In: Proceedings of the Workshop on Human Interaction in Computer Vision. Barcelona, Spain, 2011
[65] Zhou Z H, Zhao G Y, Pietikäinen M. Lipreading: A graph embedding approach. In: Proceedings of the 20th International Conference on Pattern Recognition. Istanbul, Turkey: IEEE, 2010. 523−526
[66] Zhao G Y, Pietikäinen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915−928 doi: 10.1109/TPAMI.2007.1110
                      [67] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005. 886−893
                      [68] Mase K, Pentland A. Automatic lipreading by optical-flow analysis. Systems and Computers in Japan, 1991, 22(6): 67?76 doi: 10.1002/scj.4690220607
                      [69] Aleksic P S, Williams J J, Wu Z L, Katsaggelos A K. Audio-visual speech recognition using MPEG-4 compliant visual features. EURASIP Journal on Advances in Signal Processing, 2002, 2002(1): Article No. 150948
                      [70] Brooke N M. Using the visual component in automatic speech recognition. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 1656−1659
                      [71] Cetingul H E, Yemez Y, Erzin E, Tekalp A M. Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Transactions on Image Processing, 2006, 15(10): 2879?2891 doi: 10.1109/TIP.2006.877528
                      [72] Nefian A V, Liang L H, Pi X B, Liu X X, Murphy K. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): Article No.783042 doi: 10.1155/S1110865702206083
[73] Kirchhoff K. Robust Speech Recognition Using Articulatory Information [Ph.D. dissertation], University of Bielefeld, Germany, 1999
                      [74] Cootes T F, Taylor C J, Cooper D H, Graham J. Active shape models-their training and application. Computer Vision and Image Understanding, 1995, 61(1): 38?59 doi: 10.1006/cviu.1995.1004
                      [75] Luettin J, Thacker N A, Beet S W. Speechreading using shape and intensity information. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 58−61
                      [76] Dupont S, Luettin J. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2000, 2(3): 141?151 doi: 10.1109/6046.865479
                      [77] Chan M T. HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features. In: Proceedings of the 4th Workshop on Multimedia Signal Processing. Cannes, France: IEEE, 2001. 9−14
                      [78] Roweis S T, Sau L K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323?2326 doi: 10.1126/science.290.5500.2323
                      [79] Tenenbaum J B, de Silva V, Langford J C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500): 2319?2323 doi: 10.1126/science.290.5500.2319
                      [80] Yan S C, Xu D, Zhang B Y, Zhang H J, Yang Q, Lin S. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(1): 40?51 doi: 10.1109/TPAMI.2007.250598
                      [81] Fu Y, Yan S C, Huang T S. Classification and feature extraction by simplexization. IEEE Transactions on Information Forensics and Security, 2008, 3(1): 91?100 doi: 10.1109/TIFS.2007.916280
[82] Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 1996, 29(1): 51−59 doi: 10.1016/0031-3203(95)00067-4
[83] Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 971−987 doi: 10.1109/TPAMI.2002.1017623
[84] Liu Li, Zhao Ling-Jun, Guo Cheng-Yu, Wang Liang, Tang Jun. Texture classification: State-of-the-art methods and prospects. Acta Automatica Sinica, 2018, 44(4): 584−607 (in Chinese)
[85] Pietikäinen M, Hadid A, Zhao G, Ahonen T. Computer Vision Using Local Binary Patterns. London: Springer, 2011.
[86] Liu L, Chen J, Fieguth P, Zhao G Y, Chellappa R, Pietikäinen M. From BoW to CNN: Two decades of texture representation for texture classification. International Journal of Computer Vision, 2019, 127(1): 74−109 doi: 10.1007/s11263-018-1125-z
[87] Liu Li, Xie Yu-Xiang, Wei Ying-Mei, Lao Song-Yang. Survey of local binary pattern method. Journal of Image and Graphics, 2014, 19(12): 1696−1720 doi: 10.11834/jig.20141202 (in Chinese)
                      [88] Horn B K P, Schunck B G. Determining optical flow. Artificial Intelligence, 1981, 17(1-3): 185?203 doi: 10.1016/0004-3702(81)90024-2
                      [89] Bouguet J Y. Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm. Intel Corporation, 2001, 5: 1?9
                      [90] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence. San Francisco, CA, United States: Morgan Kaufmann Publishers Inc., 1981. 674−679
                      [91] Rekik A, Ben-Hamadou A, Mahdi W. An adaptive approach for lip-reading using image and depth data. Multimedia Tools and Applications, 2016, 75(14): 8609?8636 doi: 10.1007/s11042-015-2774-3
                      [92] Shaikh A A, Kumar D K, Yau W C, Azemin M Z C, Gubbi J. Lip reading using optical flow and support vector machines. In: Proceedings of the 3rd International Congress on Image and Signal Processing. Yantai, China: IEEE, 2010. 327−330
                      [93] Goldschen A J, Garcia O N, Petajan E. Continuous optical automatic speech recognition by lipreading. In: Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA, USA: IEEE, 1994. 572−577
                      [94] King S, Frankel J, Livescu K, McDermott E, Richmond K, Wester M. Speech production knowledge in automatic speech recognition. The Journal of the Acoustical Society of America, 2007, 121(2): 723?742 doi: 10.1121/1.2404622
                      [95] Kirchhoff K, Fink G A, Sagerer G. Combining acoustic and articulatory feature information for robust speech recognition. Speech Communication, 2002, 37(3?4): 303?319 doi: 10.1016/S0167-6393(01)00020-6
                      [96] Livescu K, Cetin O, Hasegawa-Johnson M, King S, Bartels C, Borges N, et al. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop. In: Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu, USA: IEEE. 2007. IV?621?IV?624
                      [97] Saenko K, Livescu K, Glass J, Darrell T. Production domain modeling of pronunciation for visual speech recognition. In: Proceeding of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia, USA: IEEE. 2005. v/473?v/476
                      [98] Saenko K, Livescu K, Glass J, Darrell T. Multistream articulatory feature-based models for visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(9): 1700?1707 doi: 10.1109/TPAMI.2008.303
                      [99] Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T. Visual speech recognition with loosely synchronized feature streams. In: Proceeding of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE. 2005. 1424−1431
                      [100] Papcun G, Hochberg J, Thomas T R, Laroche F, Zacks J, Levy S. Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data. The Journal of the Acoustical Society of America, 1992, 92(2): 688?700 doi: 10.1121/1.403994
                      [101] Matthews I, Potamianos G, Neti C, Luettin J. A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of the 2001 IEEE International Conference on Multimedia and Expo. Tokyo, Japan: IEEE, 2001. 825?828
                      [102] Papandreou G, Katsamanis A, Pitsikalis V, Maragos P. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(3): 423?435 doi: 10.1109/TASL.2008.2011515
                      [103] Hilder S, Harvey R W, Theobald B J. Comparison of human and machine-based lip-reading. In: Proceedings of the 2009 AVSP. 2009: 86?89
                      [104] Lan Y X, Theobald B J, Harvey R. View independent computer lip-reading. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo. Melbourne, Australia: IEEE, 2012. 432?437
                      [105] Lan Y X, Harvey R, Theobald B J. Insights into machine lip reading. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing. Kyoto, Japan: IEEE, 2012. 4825?4828
                      [106] Bear H L, Harvey R. Decoding visemes: Improving machine lip-reading. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2009?2013
                      [107] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436?444 doi: 10.1038/nature14539
                      [108] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504?507 doi: 10.1126/science.1127647
                      [109] Hong X P, Yao H X, Wan Y Q, Chen R. A PCA based visual DCT feature extraction method for lip-reading. In: Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. Pasadena, USA: IEEE, 2006. 321?326
                      [110] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2012. 1097−1105
                      [111] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv: 1409.1556, 2014
                      [112] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 1?9
                      [113] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770?778
                      [114] Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 2261?2269
                      [115] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE, 2018. 7132?7141
                      [116] Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W, et al. Deep learning for generic object detection: A survey. arXiv preprint arXiv: 1809.02165, 2018
                      [117] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3431?3440
                      [118] Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE international Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 6645?6649
                      [119] Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Lipreading using convolutional neural network. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association. Singapore: ISCA, 2014. 1149−1153
                      [120] Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221?231 doi: 10.1109/TPAMI.2012.59
                      [121] Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4?21 doi: 10.1016/j.imavis.2017.01.010
                      [122] Mroueh Y, Marcheret E, Goel V. Deep multimodal learning for audio-visual speech recognition. In: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Queensland, Australia: IEEE, 2015. 2130?2134
                      [123] Thangthai K, Harvey R W, Cox S J, et al. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In: Proceedings of the 2015 AVSP. 2015: 127?131.
                      [124] Gers F A, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000, 12(10): 2451?2471 doi: 10.1162/089976600300015015
                      [125] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555, 2014
                      [126] Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 6115?6119
                      [127] Garg A, Noyola J, Bagadia S. Lip reading using CNN and LSTM, Technical Report, CS231n Project Report, Stanford University, USA, 2016.
                      [128] Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv: 1703.04105, 2017
                      [129] Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006. 369−376
                      [130] Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding. Arizona, USA: IEEE, 2015. 167?174
                      [131] Petridis S, Stafylakis T, Ma P, Cai F P, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6548?6552
                      [132] Fung I, Mak B. End-to-end low-resource lip-reading with Maxout Cnn and Lstm. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2511?2515
                      [133] Wand M, Schmidhuber J. Improving speaker-independent lipreading with domain-adversarial training. arXiv preprint arXiv: 1708.01565, 2017
                      [134] Wand M, Schmidhuber J, Vu N T. Investigations on end-to-end audiovisual fusion. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 3041?3045
                      [135] Srivastava R K, Greff K, Schmidhuber J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA United States: MIT Press, 2015. 2377−2385
                      [136] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014. 3104−3112
                      [137] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv: 1409.0473, 2014
                      [138] Chaudhari S, Polatkan G, Ramanath R, Mithal V. An attentive survey of attention models. arXiv preprint arXiv: 1904.02874, 2019
                      [139] Wang F, Tax D M J. Survey on the attention based RNN model and its applications in computer vision. arXiv preprint arXiv: 1601.06823, 2016
                      [140] Chung J S, Zisserman A. Lip reading in profile. In: Proceedings of the British Machine Vision Conference. Guildford: BMVA Press, 2017. 155.1−155.11
                      [141] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211?252 doi: 10.1007/s11263-015-0816-y
[142] Saitoh T, Zhou Z H, Zhao G Y, Pietikäinen M. Concatenated frame image based CNN for visual speech recognition. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 277−289
                      [143] Lin M, Chen Q, Yan S C. Network in network. arXiv preprint arXiv: 1312.4400, 2013
                      [144] Petridis S, Li Z W, Pantic M. End-to-end visual speech recognition with LSTMs. In: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans, USA: IEEE, 2017. 2592?2596
                      [145] Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end audiovisual fusion with LSTMS. arXiv preprint arXiv: 1709.04343, 2017
                      [146] Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end multi-view lipreading. arXiv preprint arXiv: 1709.00443, 2017
                      [147] Petridis S, Shen J, Cetin D, Pantic M. Visual-only recognition of normal, whispered and silent speech. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6219?6223
                      [148] Moon S, Kim S, Wang H H. Multimodal transfer deep learning with applications in audio-visual recognition. arXiv preprint arXiv: 1412.3121, 2014
                      [149] Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017. 1800?1807
                      [150] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, United States: Curran Associates Inc., 2017. 6000−6010
[151] Afouras T, Chung J S, Senior A, et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. doi: 10.1109/TPAMI.2018.2889052
                      [152] AV Letters Database [Online], available: http://www2.cmp.uea.ac.uk/~bjt/avletters/, October 27, 2020
                      [153] AVICAR Project: Audio-Visual Speech Recognition in a Car [Online], available: http://www.isle.illinois.edu/sst/AVICAR/#information, October 27, 2020
                      [154] The Extended M2VTS Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/, October 27, 2020
                      [155] The BANCA Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/banca/, October 27, 2020
                      [156] CUAVE Group Set [Online], available: http://people.csail.mit.edu/siracusa/avdata/, October 27, 2020
                      [157] VALID: Visual quality Assessment for Light field Images Dataset [Online], available: https://www.epfl.ch/labs/mmspg/downloads/valid/, October 27, 2020
                      [158] Speech Resources Consortium [Online], available: http://research.nii.ac.jp/src/en/data.html, October 27, 2020
                      [159] AusTalk [Online], available: https://austalk.edu.au/about/corpus/, October 27, 2020
                      [160] OULUVS2: A MULTI-VIEW AUDIOVISUAL DATABASE [Online], available: http://www.ee.oulu.fi/research/imag/OuluVS2/, October 27, 2020
                      [161] Patterson E K, Gurbuz S, Tufekci Z, Gowdy J N. CUAVE: A new audio-visual database for multimodal human-computer interface research. In: Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, Florida, USA: IEEE, 2002. II?2017?II?2020
                      [162] Fox N A, O'Mullane B A, Reilly R B. VALID: A new practical audio-visual database, and comparative results. In: Proceedings of the 2005 International Conference on Audio-and Video-Based Biometric Person Authentication. Berlin, Germany: Springer, 2005. 777?786
[163] Anina I, Zhou Z H, Zhao G Y, Pietikäinen M. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In: Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Ljubljana, Slovenia: IEEE, 2015. 1−5
[164] Estival D, Cassidy S, Cox F, et al. AusTalk: An audio-visual corpus of Australian English. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland, 2014
[165] Tamura S, Miyajima C, Kitaoka N, et al. CENSREC-1-AV: An audio-visual corpus for noisy bimodal speech recognition. In: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2010). 2010
                      [166] Pass A, Zhang J G, Stewart D. An investigation into features for multi-view lipreading. In: Proceedings of the 2010 IEEE International Conference on Image Processing. Hong Kong, China: IEEE, 2010. 2417?2420
                      [167] Neti C, Potamianos G, Luettin J, et al. Audio visual speech recognition. IDIAP, 2000.
                      [168] Sanderson C. The vidtimit database. IDIAP, 2002.
                      [169] Jankowski C, Kalyanswamy A, Basson S, Spitz J. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In: Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing. Albuquerque, New Mexico, USA: IEEE, 1990. 109?112
                      [170] Hazen T J, Saenko K, La C H, Glass J R. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In: Proceedings of the 6th International Conference on Multimodal Interfaces. State College, PA, USA: ACM, 2004. 235−242
                      [171] MIRACL-VC1 [Online], available: https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1, October 27, 2020
                      [172] The Oxford-BBC Lip Reading in the Wild (LRW) Dataset [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html, October 27, 2020
                      [173] LRW-1000: Lip Reading database [Online], available: http://vipl.ict.ac.cn/view_database.php?id=14, October 27, 2020
                      [174] The GRID audiovisual sentence corpus [Online], available: http://spandh.dcs.shef.ac.uk/gridcorpus/, October 27, 2020
                      [175] OuluVS database [Online], available: https://www.oulu.fi/cmvs/node/41315, October 27, 2020
                      [176] VidTIMIT Audio-Video Dataset [Online], available: http://conradsanderson.id.au/vidtimit/#downloads, October 27, 2020
                      [177] LiLiR [Online], available: http://www.ee.surrey.ac.uk/Projects/LILiR/datasets.html, October 27, 2020
                      [178] MOBIO [Online], available: https://www.idiap.ch/dataset/mobio, October 27, 2020
                      [179] TCD-TIMIT [Online], available: https://sigmedia.tcd.ie/TCDTIMIT/, October 27, 2020
                      [180] Lip Reading Datasets [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/, October 27, 2020
                      [181] Visual Lip Reading Feasibility (VRLF) [Online], available: https://datasets.bifrost.ai/info/845, October 27, 2020
                      [182] Rekik A, Ben-Hamadou A, Mahdi W. A new visual speech recognition approach for RGB-D cameras. In: Proceedings of the 2014 International Conference Image Analysis and Recognition. Vilamoura, Portugal: Springer, 2014. 21?28
[183] McCool C, Marcel S, Hadid A, Pietikäinen M, Matejka P, Černocký J, et al. Bi-modal person recognition on a mobile phone: Using mobile phone data. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops. Melbourne, Australia: IEEE, 2012. 635−640
                      [184] Howell D. Confusion Modelling for Lip-Reading [Ph. D. dissertation], University of East Anglia, Norwich, 2015
                      [185] Harte N, Gillen E. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 2015, 17(5): 603?615 doi: 10.1109/TMM.2015.2407694
                      [186] Verkhodanova V, Ronzhin A, Kipyatkova I, Ivanko D, Karpov A, Zelezny M. HAVRUS corpus: High-speed recordings of audio-visual Russian speech. In: Proceedings of the 2016 International Conference on Speech and Computer. Budapest, Hungary: Springer, 2016. 338?345
                      [187] Fernandez-Lopez A, Martinez O, Sukno F M. Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In: Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition. Washington, USA: IEEE, 2017. 208?215
                      [188] Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006, 120(5): 2421?2424 doi: 10.1121/1.2229005
                      [189] Vorwerk A, Wang X, Kolossa D, et al. WAPUSK20-A Database for Robust Audiovisual Speech Recognition. In: Proceedings of the 2010 LREC. 2010.
                      [190] Czyzewski A, Kostek B, Bratoszewski P, Kotus J, Szykulski M. An audio-visual corpus for multimodal automatic speech recognition. Journal of Intelligent Information Systems, 2017, 49(2): 167?192 doi: 10.1007/s10844-016-0438-z
                      [191] Afouras T, Chung J S, Zisserman A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv: 1809.00496, 2018
                      [192] Yang S, Zhang Y H, Feng D L, Yang M M, Wang C H, Xiao J Y, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In: Proceedings of the 14th IEEE International Conference on Automatic Face and Gesture Recognition. Lille, France: IEEE, 2019. 1−8
                      [193] Petridis S, Pantic M. Deep complementary bottleneck features for visual speech recognition. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2304?2308
                      [194] Rahmani M H, Almasganj F. Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In: Proceedings of the 3rd International Conference on Pattern Recognition and Image Analysis. Shahrekord, Iran: IEEE, 2017. 195−199
[195] Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, et al. FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 2758−2766
                      [196] Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 1647?1655
                      [197] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2014. 568−576
                      [198] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 1933?1941
                      [199] Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2015. 2017−2025
                      [200] Bhagavatula C, Zhu C C, Luu K, Savvides M. Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 4000?4009
[201] Baltrušaitis T, Ahuja C, Morency L P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423−443 doi: 10.1109/TPAMI.2018.2798607
                      [202] Loizou P C. Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2013.
                      [203] Hou J C, Wang S S, Lai Y H, Tsao Y, Chang H W, Wang H M. Audio-visual speech enhancement based on multimodal deep convolutional neural network. arXiv preprint arXiv: 1703.10893, 2017
                      [204] Ephrat A, Halperin T, Peleg S. Improved speech reconstruction from silent video. In: Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE, 2017. 455?462
                      [205] Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017.
Publication history
• Received: 2019-07-16
• Accepted: 2019-11-16
• Available online: 2019-12-19
• Issue date: 2020-11-24
