A CNN-BASED VOICE COMMAND SYSTEM FOR SLIDE CONTROL TO ENHANCE ACCESSIBILITY FOR USERS WITH DISABILITIES

Authors

  • Carli Apriansyah Hutagalung, Universitas Media Nusantara Citra
  • Adi Fitrianto, Universitas Media Nusantara Citra

Keywords

CNN, Voice Recognition, Disabilities

Abstract

This study presents a robust CNN-based model for real-time voice command recognition, specifically designed to recognize “right” and “left” commands. The dataset, derived from the Speech Commands Dataset, includes audio samples augmented with additional noise, yielding hundreds of thousands of data points to enhance model performance under noisy conditions. Each audio sample, approximately 1 second in length, is transformed into a spectrogram to facilitate pattern recognition by the CNN. The model was trained over 20 epochs, achieving a training accuracy of 96.5% and a validation accuracy of 97.6%, indicating strong generalization without overfitting. Testing on real-world noisy audio further demonstrated the model’s effectiveness, recording an overall accuracy of 97.7% and an AUC of 1.0 for both classes. The results underscore the model’s potential for reliable deployment in noisy environments, with low false positives and rapid response times, as indicated by CPU and memory performance metrics. These findings contribute valuable insights into designing voice-controlled systems for real-world applications, especially for users in challenging auditory environments.
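The preprocessing step summarized above, converting each roughly 1-second clip into a spectrogram for the CNN, can be sketched as follows. This is a minimal illustration assuming 16 kHz mono audio (the Speech Commands sample rate) and SciPy's `signal.spectrogram`; it is not the authors' exact pipeline, and the window length is a hypothetical choice:

```python
import numpy as np
from scipy import signal

def audio_to_spectrogram(waveform, sample_rate=16000, nperseg=255):
    """Convert a mono waveform to a log-magnitude spectrogram.

    Returns a 2-D array (frequency bins x time frames) suitable as
    single-channel CNN input after normalization.
    """
    _freqs, _times, sxx = signal.spectrogram(
        waveform, fs=sample_rate, nperseg=nperseg
    )
    # Log compression keeps quiet spectral detail visible to the network.
    return np.log(sxx + 1e-10)

# Synthetic 1-second clip standing in for a "left"/"right" sample.
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000).astype(np.float32)
spec = audio_to_spectrogram(clip)
print(spec.shape)  # (frequency bins, time frames)
```

Noise augmentation, as described in the abstract, would typically be applied to the raw waveform (e.g. mixing in background recordings at a random gain) before this transformation.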

References

Agnes Z. Yonatan. (2023, October 1). Menilik Distribusi Sektor Pekerja Disabilitas Indonesia [Examining the sector distribution of Indonesian workers with disabilities].

Alim, M. A., Setumin, S., Rosli, A. D., & Ani, A. I. C. (2021). Development of a Voice-controlled Intelligent Wheelchair System using Raspberry Pi. In 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE) (pp. 274–278). IEEE. doi:10.1109/ISCAIE51753.2021.9431815

Asad, C. M., Wali, R., Rehman, M. Z., Ahmed, S., Rehman, A., Wadood, A., … Bhatti, H. M. F. (2018). Removing Disabilities: Controlling Personal Computer Through Head Movements and Voice Command. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1–4). IEEE. doi:10.1109/ICAICT.2018.8747123

Boulal, H., Hamidi, M., Abarkan, M., & Barkani, J. (2024). Amazigh CNN speech recognition system based on Mel spectrogram feature extraction method. International Journal of Speech Technology, 27(1), 287–296. doi:10.1007/s10772-024-10100-0

Cindy Mutia Annur. (2023, August 8). Mayoritas Pekerja Disabilitas di Indonesia Berstatus Wirausaha [The majority of workers with disabilities in Indonesia are self-employed].

De J. Velásquez-Martínez, E., Becerra-Sánchez, A., De La Rosa-Vargas, J. I., González-Ramírez, E., Rodarte-Rodríguez, A., Zepeda-Valles, G., … Olvera-González, J. E. (2023). Combining Deep Learning with Domain Adaptation and Filtering Techniques for Speech Recognition in Noisy Environments. In 2023 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC) (pp. 1–6). IEEE. doi:10.1109/ROPEC58757.2023.10409492

Kumar, L. A., Renuka, D. K., & Priya, M. C. S. (2023). Towards Robust Speech Recognition Model Using Deep Learning. In 2023 International Conference on Intelligent Systems for Communication, IoT and Security (ICISCoIS) (pp. 253–256). IEEE. doi:10.1109/ICISCoIS56541.2023.10100390

Lakshmi, K. L., Muthulakshmi, P., Nithya, A. A., Jeyavathana, R. B., Usharani, R., Das, N. S., & Devi, G. N. R. (2023). Recognition of emotions in speech using deep CNN and RESNET. Soft Computing. doi:10.1007/s00500-023-07969-5

Lv, Z., Poiesi, F., Dong, Q., Lloret, J., & Song, H. (2022). Deep Learning for Intelligent Human–Computer Interaction. Applied Sciences, 12(22), 11457. doi:10.3390/app122211457

Peng, N., Chen, A., Zhou, G., Chen, W., Zhang, W., Liu, J., & Ding, F. (2020). Environment Sound Classification Based on Visual Multi-Feature Fusion and GRU-AWS. IEEE Access, 8, 191100–191114. doi:10.1109/ACCESS.2020.3032226

Qian, Y., Bi, M., Tan, T., & Yu, K. (2016). Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(12), 2263–2276. doi:10.1109/TASLP.2016.2602884

Telmem, M., Laaidi, N., Ghanou, Y., Hamiane, S., & Satori, H. (2024). Comparative study of CNN, LSTM and hybrid CNN-LSTM model in amazigh speech recognition using spectrogram feature extraction and different gender and age dataset. International Journal of Speech Technology. doi:10.1007/s10772-024-10154-0

Xu, X., Zhang, X., Bao, Z., Yu, X., Yin, Y., Yang, X., & Niu, Q. (2023). Training-Free Acoustic-Based Hand Gesture Tracking on Smart Speakers. Applied Sciences, 13(21), 11954. doi:10.3390/app132111954

Published

2024-12-28

How to Cite

Hutagalung, C. A., & Fitrianto, A. (2024). A CNN-BASED VOICE COMMAND SYSTEM FOR SLIDE CONTROL TO ENHANCE ACCESSIBILITY FOR USERS WITH DISABILITIES. Proceeding International Conference on Information Technology, Multimedia, Architecture, Design, and E-Business, 3, 14–20. Retrieved from https://eprosiding.idbbali.ac.id/index.php/imade/article/view/837