Problem of Polish word recognition using classical speech recognition models with state allocation based on phonetic word structure

Adrian Albrecht

doi:10.31648/ts.12502

Institute of Computer Science, University of Warmia and Mazury in Olsztyn

Abstract

Automatic speech recognition systems rely on statistical or neural models that capture the temporal dependencies in acoustic signals. Among classical approaches, Hidden Markov Models (HMMs) remain an important component of many speech recognition systems, particularly in tasks with limited datasets or domain-specific vocabularies. A key design decision in HMM-based systems is how to represent phonetic context and how many states to use when modelling an acoustic sequence. This study investigates the impact of different phonetic representations and state allocation strategies in HMMs for isolated Polish word recognition. Three types of phonetic decomposition are considered: phonemes, diphones and triphones. Three strategies for assigning the number of hidden states are also evaluated: a constant number of states for all models, a number of states adjusted dynamically to the number of phonetic units in a word, and the classical speech recognition topology of three states per phonetic unit. Experiments were conducted on a custom dataset of 3,600 recordings of 20 Polish command words spoken by nine speakers. Acoustic features were represented as Mel-frequency cepstral coefficients (MFCCs) and modelled with Gaussian mixture HMMs trained using the Baum–Welch algorithm. The results indicate that allocating states in proportion to the number of phonemes (three states per phoneme) achieves the highest recognition accuracy. Extending the phonetic context from phonemes to diphones and triphones did not improve performance on this dataset, likely because of the increased model complexity and the limited size of the training corpus. The confusion matrices further show that the HMMs capture phonetic similarities between words, which can lead to systematic recognition errors for phonetically similar commands.
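
The three decompositions and the three state allocation strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract gives neither the exact unit definitions nor the allocation rules, so the following are assumptions made only for illustration: diphones are taken as adjacent phoneme pairs, triphones as a phoneme with its left and right neighbour (padded with a hypothetical `sil` symbol), the dynamic strategy as one state per unit, the constant strategy as five states, and the word "stop" with the transcription `s t o p` as the example.

```python
from typing import List


def diphones(phonemes: List[str]) -> List[str]:
    """Adjacent phoneme pairs (assumed definition of a diphone)."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]


def triphones(phonemes: List[str], pad: str = "sil") -> List[str]:
    """Each phoneme with left and right context (assumed definition).

    The word is padded with a hypothetical silence symbol so that the
    first and last phonemes also receive a context on both sides.
    """
    padded = [pad] + phonemes + [pad]
    return [
        f"{left}-{centre}+{right}"
        for left, centre, right in zip(padded, padded[1:], padded[2:])
    ]


def n_states(units: List[str], strategy: str,
             constant: int = 5, per_unit: int = 3) -> int:
    """Number of hidden states for one word model.

    'constant'  : the same number of states for every word (value assumed).
    'dynamic'   : grows with the number of units (one state per unit, assumed).
    'classical' : three states per phonetic unit.
    """
    if strategy == "constant":
        return constant
    if strategy == "dynamic":
        return len(units)
    if strategy == "classical":
        return per_unit * len(units)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    word = ["s", "t", "o", "p"]  # hypothetical transcription of "stop"
    for name, units in [("phonemes", word),
                        ("diphones", diphones(word)),
                        ("triphones", triphones(word))]:
        counts = {s: n_states(units, s)
                  for s in ("constant", "dynamic", "classical")}
        print(name, units, counts)
```

Under these assumptions the four-phoneme word receives 12 states with the classical topology applied to phonemes, which is the configuration the abstract reports as most accurate. The state count would then be passed to whichever HMM toolkit is used to build the word model.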

Keywords:

Hidden Markov Model, phonemes, diphones, triphones, Automatic Speech Recognition, word recognition

Published

2026-05-25

Cited by

Albrecht, A. (2026). Problem of Polish word recognition using classical speech recognition models with state allocation based on phonetic word structure. Technical Sciences. https://doi.org/10.31648/ts.12502