Audio Stream Analysis for Deep Fake Threat Identification

Karol Jędrasiak

doi:10.31648/cetl.9684

Published: 2024-04-021

Vol. 41 No. 1 (2024)

Civitas et Lex

Section: Security Sciences

https://doi.org/10.31648/cetl.9684

Abstract

This article introduces a novel approach to identifying deep fake threats in audio streams, specifically targeting the detection of synthetic speech generated by text-to-speech (TTS) algorithms. The system has two core components: the Vocal Emotion Analysis (VEA) Network, which captures the emotional nuances expressed in speech, and the Supervised Classifier for Deepfake Detection, which uses the emotional features extracted by the VEA Network to distinguish authentic audio tracks from fabricated ones. The system exploits the limited ability of deepfake algorithms to replicate the emotional complexity inherent in human speech, providing a semantic layer of analysis that enhances the detection process. The robustness of the proposed methodology was evaluated across a variety of datasets, so that its efficacy is not confined to controlled conditions but extends to realistic and challenging environments. This was achieved through data augmentation techniques, including additive white noise, which mimics the variability encountered in real-world audio processing. The results show that the system's performance is consistent across different datasets and that it maintains high accuracy in the presence of background noise, particularly when trained on noise-augmented datasets. By leveraging emotional content as a distinctive feature and applying machine learning techniques, the proposed system offers a robust framework for safeguarding against the manipulation of audio content. This methodological contribution is intended to enhance the integrity of digital communications in an era in which synthetic media is proliferating rapidly.
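The additive white noise augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the noise distribution or the noise levels used, so the Gaussian noise model, the signal-to-noise ratio (SNR) parameterization, the function name `add_white_noise`, and the 20 dB value below are all assumptions made for illustration, not the article's actual procedure.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Return `signal` with zero-mean Gaussian white noise added at `snr_db` dB.

    Hypothetical helper: the Gaussian model and SNR-based scaling are
    assumptions, not details taken from the article.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    # Average power of the clean signal.
    signal_power = np.mean(signal ** 2)
    # Solve SNR_dB = 10 * log10(P_signal / P_noise) for the noise power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # White noise: independent samples with variance equal to noise_power.
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Usage: a one-second 220 Hz tone at 16 kHz stands in for a speech clip.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
noisy = add_white_noise(clean, snr_db=20.0, rng=np.random.default_rng(0))
```

Applying such a function to training clips at several SNR values is one common way to build a noise-augmented training set of the kind the abstract refers to.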

How to cite

Jędrasiak, K. (2024). Audio Stream Analysis for Deep Fake Threat Identification. Civitas et Lex, 41(1), 21–35. https://doi.org/10.31648/cetl.9684

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
