Automated Lip Reading to Predict Visemes using Multimodal Convolutional Neural Network with Audio-Visual Features


  • Khalid Mahboob, Institute of Business Management, Karachi, Pakistan
  • Umm-e-Laila, Institute of Business Management, Karachi, Pakistan
  • Sana Alam, Institute of Business Management, Karachi, Pakistan
  • Muhammad Abbas, Institute of Business Management, Karachi, Pakistan
  • Muhammad Asghar Khan, Institute of Business Management, Karachi, Pakistan
  • Sidra Fatima, Sir Syed University of Engineering & Technology, Karachi, Pakistan


Lip reading, model, visemes, accuracy, convolutional neural network


Lip reading is the process of interpreting sentences from the movements of a speaker's lips. Conventional methods approach the task in two stages: first generating or learning audio-visual features, then making predictions. Contemporary deep lip-reading techniques benefit from end-to-end training, but much of the existing research on these models concentrates on word classification rather than on predicting sequences at the sentence level. Studies have shown that humans can lip-read long sentences, which underlines the value of temporal context, particularly when the communication channel is unclear. This paper presents a lip-reading system for viseme prediction. The system combines a Convolutional Neural Network (CNN) with spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss, and it is trained end to end to map a variable-length sequence of video frames to text. The CNN architecture evaluates both visual and audio features. On the GRID corpus, with the test set split at the sentence level, the CNN model outperforms trained human lip readers and achieves accuracies of 72.8% CER and 80.8% WER for unseen speakers with audio, and 46.2% CER and 56.6% WER for unseen speakers without audio, which are reasonable accuracies.
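
The abstract reports results as CER and WER but does not say how they were computed. As a minimal sketch, assuming the standard edit-distance definitions of character error rate and word error rate, the two metrics could be computed as follows. The function names and the GRID-style example sentence are illustrative assumptions and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if len(ref_tokens) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are single characters (spaces included)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical reference and prediction, written in the GRID sentence style.
    ref = "bin blue at f two now"
    hyp = "bin blue by f two now"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Under these definitions lower values are better, and the corresponding accuracy is usually taken as one minus the error rate. Whether the percentages in the abstract are error rates or accuracies is not stated there.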

