Audio to Phone Transcription — Illustration with Mandarin – UROP Spring Symposium 2021

Yingjie Qu

Pronouns: she/her/hers

Research Mentor(s): San Duanmu, Professor
Research Mentor School/College/Department: Linguistics, College of Literature, Science, and the Arts
Presentation Date: Thursday, April 22, 2021
Session: Session 4 (2pm-2:50pm)
Breakout Room: Room 20
Presenter: 1


Generally, linguists’ first step in analyzing any language is to transcribe audio recordings into a written representation. This is an extremely time-consuming process: for an experienced linguist, the ratio of transcription time to audio length is about 100:1, which slows language analysis tremendously. Moreover, despite many advances in language technology, no tool yet exists that can automatically convert an audio file in an arbitrary language into transcribed phones. Our research project focuses on developing a method to convert audio recordings of speech into transcribed phones (consonants and vowels) without prior knowledge of the target language.

Our main approach is to find cues in the sound wave of an audio file and use those cues to predict boundaries between consonants and vowels. We begin by placing such boundaries manually and use them as the standard for later comparison. From sound files and TextGrid files produced in the audio-processing software Praat (Boersma & Weenink 2020), we examine acoustic signals — intensity, pitch, zero crossings, turning points, amplitude, pulses, and formants — to understand why boundaries fall at certain locations. Once we understand the relations between these signals and the boundaries, we use the resulting rules to predict boundary locations. All data are imported into and analyzed in Excel.

By the end of the first semester, I had used the zero-crossing rate and the rate of change of turning points to predict boundary locations. Specifically, I analyzed a Chinese audio file 48.614 seconds long and hand-labelled 364 boundaries in total. Using zero crossings, I predicted 382 boundaries; after checking against a 20-millisecond precision criterion, 201 of these were correct, a success rate of about 52.62%.
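The zero-crossing computation mentioned above can be sketched as follows. This is an illustrative reconstruction in Python, not the authors' actual procedure: the frame and hop sizes, and the use of per-frame rates over raw samples, are assumptions.

```python
# Sketch of a per-frame zero-crossing-rate (ZCR) computation over a mono
# signal. Illustrative only: frame_len and hop_len are assumed parameters,
# not values from the study.

def zero_crossing_rate(samples, frame_len, hop_len):
    """Return the zero-crossing rate of each frame of a mono signal."""
    rates = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        crossings = sum(
            1 for a, b in zip(frame, frame[1:])
            if (a >= 0) != (b >= 0)  # sign change between adjacent samples
        )
        rates.append(crossings / (frame_len - 1))
    return rates

# A rapidly alternating signal has a high ZCR; a constant-sign signal has ZCR 0.
print(zero_crossing_rate([1, -1, 1, -1, 1, -1, 1, -1], 4, 4))  # → [1.0, 1.0]
print(zero_crossing_rate([1, 2, 3, 4, 5, 6, 7, 8], 4, 4))      # → [0.0, 0.0]
```

Sudden changes in the frame-level rate (e.g. between a voiceless fricative, which crosses zero often, and a vowel, which crosses far less often) are the kind of cue a boundary-prediction rule could act on.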
Using turning points, I predicted 388 boundaries in total; under the same 20-millisecond precision criterion, 206 of these were correct, a success rate of about 53.09%. The main goal of the second semester is to improve the accuracy of the boundary predictions. Possible improvements include revising the hand-labelled boundaries and refining the rules for boundary prediction. The success of this project would be of great help in documenting endangered languages (those expected to become extinct by the end of this century), which comprise between 50% and 90% of the 7,000 languages spoken in the world today (Austin & Sallabank 2011).

References:
Austin, Peter K., and Julia Sallabank. 2011. Introduction. In The Cambridge handbook of endangered languages, ed. Peter K. Austin and Julia Sallabank, 1-24. Cambridge: Cambridge University Press.
Boersma, Paul, and David Weenink. 2020. Praat: doing phonetics by computer [Computer program]. Version 6.1.16, retrieved 6 June 2020 from
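The 20-millisecond precision check described above can likewise be sketched. The abstract does not specify how predicted boundaries are matched to hand-labelled ones, so the greedy one-to-one matching here is an assumption.

```python
# Sketch of the 20 ms precision check: a predicted boundary counts as correct
# if it lies within 20 ms of some hand-labelled (gold) boundary. The greedy
# one-to-one matching is an assumed procedure, not the authors' stated method.

TOLERANCE = 0.020  # 20 milliseconds, in seconds

def precision(predicted, gold, tol=TOLERANCE):
    """Fraction of predicted boundaries within `tol` of an unmatched gold boundary."""
    unused = sorted(gold)
    correct = 0
    for p in sorted(predicted):
        # Find the nearest gold boundary not yet matched to a prediction.
        best = min(unused, key=lambda g: abs(g - p), default=None)
        if best is not None and abs(best - p) <= tol:
            correct += 1
            unused.remove(best)  # one-to-one: each gold boundary matched once
    return correct / len(predicted) if predicted else 0.0

# Hypothetical example: gold boundaries at 0.10 s, 0.25 s, 0.40 s;
# predictions at 0.11 s and 0.26 s fall within 20 ms, 0.55 s does not.
print(round(precision([0.11, 0.26, 0.55], [0.10, 0.25, 0.40]), 3))  # → 0.667
```

Applied to the figures reported above, 201 correct out of 382 predictions gives 201/382 ≈ 52.62%, and 206 out of 388 gives 206/388 ≈ 53.09%, matching the stated success rates.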

Authors: Yingjie Qu, San Duanmu
Research Method: Qualitative Study
