Investigating LLM’s Knowledge about English G2P Rules and Pronunciation with Pseudo-words

Yao, Sheng

dc.contributor.author	Yao, Sheng
dc.date.accessioned	2026-05-21T13:57:26Z
dc.date.available	2026-05-21T13:57:26Z
dc.date.issued	2026-05-21
dc.date.submitted	2026-05-14
dc.description.abstract	For large language models (LLMs), the real world is mapped onto a world made of text strings, and one current research direction in NLP is to examine how much knowledge LLMs acquire inside this text world. Prior studies have shown that LLMs have knowledge of English grapheme-to-phoneme conversion (G2P) and pronunciation, but only to a moderate degree. These studies mainly use tasks such as rhyme detection, syllable counting, and G2P on words inside the vocabulary of a language, in most cases English. While the results are convincing, we believe that the acid test of such knowledge should involve pseudo-words, that is, made-up orthographic words. For example, state-of-the-art LLMs such as GPT5 and Gemini3-Pro have no problem providing the pronunciation of an in-vocabulary English word, because they can retrieve it from their training data. When given a pseudo-word, however, their predictions can sound unnatural from a human perspective. On the other hand, if human participants all agree on a certain pronunciation for a given pseudo-word, they must have used some common (implicit) knowledge about G2P and pronunciation when making their predictions. Therefore, we examine the degree of similarity between human participants and LLMs when they predict the sound of pseudo-words, as an indicator of whether LLMs have learned about G2P and pronunciation in their text world. It turns out that LLMs’ knowledge has a remarkable degree of human-likeness, not only because 80% of LLMs’ predictions are the same as humans’ when there is zero inter-human divergence, but also because LLMs’ bewilderment (measured by how much their predictions vary across runs) correlates with humans’. That is, models and humans handle the G2P task in more or less the same way. However, we also see a substantial number of cases where LLMs’ predictions are far from humans’ predictions.
When we took a closer look at such cases, we found several tendencies and further validated them using real English words. More than half of these tendencies suggest that LLMs oversimplify G2P, sticking to the most common mappings in the English vocabulary. These tendencies can be useful for improving the performance of LLMs, and even text-to-speech models, on pronunciation tasks. We also included the text-to-speech component of SpeechT5 in order to compare text-only models with bi-modal ones. We find that, while there seems to be a bottleneck for LLMs, in the sense that the most powerful model, GPT5.4, does not significantly outperform the much weaker Llama3 on the G2P task, SpeechT5 is clearly more human-like than all LLMs on several metrics. It seems that bi-modal learning gives text-to-speech models such as SpeechT5 an advantage on a sound-related task, despite the fact that SpeechT5 is much smaller than the LLMs.
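The two human-likeness measures named in the abstract (agreement on items with zero inter-human divergence, and the correlation between model and human "bewilderment") can be illustrated with a minimal sketch. The abstract does not state the thesis's exact formulas, so everything here is an assumption: divergence is taken to be normalized Shannon entropy over the predicted pronunciations, the correlation is Spearman's rank correlation, the model's answer is its majority prediction across runs, and the pseudo-words and ARPAbet-style transcriptions are invented toy data, not results from the thesis.

```python
from collections import Counter
from math import log2


def normalized_entropy(predictions):
    """Divergence of a list of predictions: 0.0 if all agree, 1.0 if all differ.

    Assumed stand-in for "bewilderment"; the thesis may define it differently.
    """
    n = len(predictions)
    counts = Counter(predictions)
    if n <= 1 or len(counts) == 1:
        return 0.0
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return entropy / log2(n)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def agreement_on_unanimous(human, model):
    """Share of unanimous-human items where the model's majority answer matches."""
    hits = total = 0
    for word, answers in human.items():
        if len(set(answers)) != 1:
            continue  # skip items on which humans disagree
        total += 1
        majority = Counter(model[word]).most_common(1)[0][0]
        hits += majority == answers[0]
    return hits / total if total else float("nan")


# Invented toy data: pseudo-word -> predicted pronunciations.
# Human lists are one answer per participant; model lists are one per run.
human = {
    "blick": ["B L IH K"] * 4,
    "zoath": ["Z OW TH", "Z OW TH", "Z OW DH", "Z AO TH"],
    "frane": ["F R EY N"] * 4,
    "snough": ["S N AH F", "S N OW", "S N AW", "S N UW"],
}
model = {
    "blick": ["B L IH K"] * 4,
    "zoath": ["Z OW TH", "Z OW TH", "Z OW TH", "Z OW DH"],
    "frane": ["F R AE N", "F R AE N", "F R EY N", "F R AE N"],
    "snough": ["S N AH F", "S N OW", "S N AH F", "S N UW"],
}

words = sorted(human)
agreement = agreement_on_unanimous(human, model)
rho = spearman(
    [normalized_entropy(human[w]) for w in words],
    [normalized_entropy(model[w]) for w in words],
)
print(f"agreement on unanimous items: {agreement:.2f}")
print(f"Spearman correlation of divergence: {rho:.2f}")
```

Normalizing the entropy by log2 of the number of predictions keeps the score in [0, 1], but it is only directly comparable between humans and models when both provide the same number of answers per item, as in this toy setup.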
dc.identifier.uri	https://hdl.handle.net/10012/23365
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.subject	LLM
dc.subject	large language models
dc.subject	pseudo words
dc.subject	TTS models
dc.subject	text-to-speech models
dc.subject	phonology
dc.subject	phonetics
dc.subject	knowledge
dc.subject	alignment
dc.subject	pronunciation
dc.title	Investigating LLM’s Knowledge about English G2P Rules and Pronunciation with Pseudo-words
dc.type	Master Thesis
uws-etd.degree	Master of Mathematics
uws-etd.degree.department	David R. Cheriton School of Computer Science
uws-etd.degree.discipline	Computer Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.comment.hidden	All issues mentioned in the previous email have been fixed: 1) Title Page - insert a copyright line below the location line 2) Table of Contents - add a 'List of Figures' title to the Table of Contents list 3) Table of Contents - add a 'List of Tables' title to the Table of Contents list (UPDATED on May 20) pp.29-35 & 30-40 are set to landscape to display the graphs, just so you know. Thanks.
uws.contributor.advisor	Shi, Haoyue Freda
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Yao_Sheng.pdf
Size: 4.79 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses