Investigating LLM’s Knowledge about English G2P Rules and Pronunciation with Pseudo-words

dc.contributor.author: Yao, Sheng
dc.date.accessioned: 2026-05-21T13:57:26Z
dc.date.available: 2026-05-21T13:57:26Z
dc.date.issued: 2026-05-21
dc.date.submitted: 2026-05-14
dc.description.abstract: For large language models (LLMs), the real world is mapped onto a world made of text strings, and one current research direction in NLP is to examine how much knowledge LLMs acquire inside this text world. Studies have shown that LLMs have knowledge of English grapheme-to-phoneme conversion (G2P) and pronunciation, but only to a moderate degree. These studies mainly use tasks such as rhyme detection, syllable counting, and G2P involving words inside the vocabulary of a language, in most cases English. While the results are convincing, we believe that the acid test of such knowledge should involve pseudo-words, that is, made-up orthographic words. For example, state-of-the-art LLMs such as GPT5 and Gemini3-Pro have no problem providing the pronunciation of an in-vocabulary English word, as they can simply retrieve it from their training data. When given a pseudo-word, however, their predictions can sound unnatural from a human perspective. On the other hand, if human participants all agree on a certain pronunciation for a given pseudo-word, they must have used some common (implicit) knowledge about G2P and pronunciation when making their predictions. Therefore, we examine the degree of similarity between human participants and LLMs when they predict the sound of pseudo-words, as an indicator of whether LLMs have learned about G2P and pronunciation in their text world. It turns out that LLMs’ knowledge does have a remarkable degree of human-likeness, not only because 80% of LLMs’ predictions are the same as humans’ when there is zero inter-human divergence, but also because LLMs’ bewilderment (measured by how LLMs’ predictions vary across runs) correlates with humans’. That is, models and humans deal with the G2P task in more or less the same way. However, we also see a substantial number of cases where LLMs’ predictions are far from humans’ predictions.
When we took a closer look at such cases, we found several tendencies and further validated the findings using real English words. More than half of these tendencies suggest that LLMs oversimplify G2P, sticking to the most common mappings in the English vocabulary. These tendencies can be useful when we try to improve the performance of LLMs, and even text-to-speech models, on pronunciation tasks. We also included the text-to-speech component of SpeechT5 in order to compare the performance of text-only models and bi-modal ones. We find that, while there seems to be a bottleneck for LLMs in the sense that the most powerful model, GPT5.4, does not significantly outperform a much weaker Llama3 on the G2P task, SpeechT5 is clearly more human-like than all LLMs on several metrics. It seems that bi-modal learning does give text-to-speech models such as SpeechT5 an advantage on a sound-related task, despite the fact that SpeechT5 is much smaller than the LLMs.
dc.identifier.uri: https://hdl.handle.net/10012/23365
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo (en)
dc.subject: LLM
dc.subject: large language models
dc.subject: pseudo words
dc.subject: TTS models
dc.subject: text-to-speech models
dc.subject: phonology
dc.subject: phonetics
dc.subject: knowledge
dc.subject: alignment
dc.subject: pronunciation
dc.title: Investigating LLM’s Knowledge about English G2P Rules and Pronunciation with Pseudo-words
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo (en)
uws-etd.embargo.terms: 0
uws.comment.hidden: All issues mentioned in the previous email have been fixed: 1) Title Page - insert a copyright line below the location line 2) Table of Contents - add a 'List of Figures' title to the Table of Contents list 3) Table of Contents - add a 'List of Tables' title to the Table of Contents list (UPDATED on May 20) pp. 29-35 & 30-40 are set to landscape to display the graphs, just so you know. Thanks.
uws.contributor.advisor: Shi, Haoyue Freda
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed (en)
uws.published.city: Waterloo (en)
uws.published.country: Canada (en)
uws.published.province: Ontario (en)
uws.scholarLevel: Graduate (en)
uws.typeOfResource: Text (en)

Files

Original bundle

Name: Yao_Sheng.pdf
Size: 4.79 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission