Problem of Polish word recognition using classical speech recognition models with state allocation based on phonetic word structure
Adrian Albrecht
Institute of Computer Science, University of Warmia and Mazury in Olsztyn
Abstract
Automatic speech recognition systems rely on statistical or neural models capable of modelling temporal dependencies present in acoustic signals. Among classical approaches, Hidden Markov Models (HMM) remain an important component of many speech recognition systems, particularly in tasks involving limited datasets or domain-specific vocabularies. One of the key design decisions in HMM-based systems concerns the representation of phonetic context and the number of states used to model acoustic sequences. This study investigates the impact of different phonetic representations and state allocation strategies in HMM models for the task of isolated Polish word recognition. The analysis considers three types of phonetic decomposition: phonemes, diphones and triphones. Additionally, three strategies of assigning the number of hidden states are evaluated: a constant number of states for all models, a dynamically adjusted number of states depending on the number of phonetic units in a word, and the classical speech recognition topology assuming three states per phonetic unit. Experiments were conducted on a custom dataset consisting of 3,600 recordings of 20 Polish command words spoken by nine speakers. Acoustic features were represented using MFCC coefficients and modelled with Gaussian Mixture Hidden Markov Models trained using the Baum–Welch algorithm. The obtained results indicate that dynamically assigning the number of states proportional to the number of phonemes (three states per phoneme) achieves the highest recognition accuracy. At the same time, increasing the phonetic context from phonemes to diphones and triphones did not improve performance on the analysed dataset, likely due to the increased model complexity and the limited size of the training corpus. The analysis of confusion matrices further reveals that HMM models capture phonetic similarities between words, which can lead to systematic recognition errors in phonetically similar commands.
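The three phonetic decompositions and three state allocation strategies described above can be illustrated with a minimal sketch. It is not the study's implementation. The function names, the sliding-window definition of diphones and triphones, the default of 5 states for the constant strategy, the one-state-per-unit rule for the dynamic strategy, and the example transcription are all assumptions made for illustration only.

```python
from typing import List


def decompose(phonemes: List[str], context: int) -> List[str]:
    """Split a phoneme sequence into overlapping units of `context` phonemes.

    context=1 yields phonemes, 2 yields diphones, 3 yields triphones.
    Assumption: units are plain sliding windows without silence padding,
    and a word shorter than `context` is kept as one single unit.
    """
    if context < 1:
        raise ValueError("context must be >= 1")
    if not phonemes:
        raise ValueError("phoneme sequence must not be empty")
    if len(phonemes) <= context:
        return ["-".join(phonemes)]
    return ["-".join(phonemes[i:i + context])
            for i in range(len(phonemes) - context + 1)]


def n_states(units: List[str], strategy: str, constant: int = 5) -> int:
    """Return the number of hidden states for one word model.

    "constant":  the same fixed number for every word (value is assumed).
    "dynamic":   one state per phonetic unit (assumed rule).
    "classical": three states per phonetic unit.
    """
    if strategy == "constant":
        return constant
    if strategy == "dynamic":
        return len(units)
    if strategy == "classical":
        return 3 * len(units)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical phoneme transcription of the command word "start".
word = ["s", "t", "a", "r", "t"]

for name, context in [("phonemes", 1), ("diphones", 2), ("triphones", 3)]:
    units = decompose(word, context)
    counts = {s: n_states(units, s) for s in ("constant", "dynamic", "classical")}
    print(name, units, counts)
```

Under these assumptions a five-phoneme word receives 15 states with the classical topology on phonemes, while the same word split into triphones yields only three units and therefore 9 states. Each of those units, however, comes from a much larger inventory of possible units, which is one way to see why wider context needs more training data.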
Keywords:
Hidden Markov Model, phonemes, diphones, triphones, Automatic Speech Recognition, word recognition