Improving Knowledge Distillation by Training Teachers to Maximize Their Conditional Mutual Information

dc.contributor.author: Ye, Linfeng
dc.date.accessioned: 2024-09-03T13:22:39Z
dc.date.available: 2024-09-03T13:22:39Z
dc.date.issued: 2024-09-03
dc.date.submitted: 2024-08-29
dc.description.abstract: Knowledge distillation (KD) and its variants, as effective model compression methods, have attracted tremendous attention from both academia and industry. These methods typically use a pretrained teacher model's outputs together with the ground-truth labels as supervision signals to train a lightweight student model, thereby improving the student's accuracy. One aspect of KD that has rarely been explored in the literature is how the behavior of the teacher model affects the student's performance. In most existing KD frameworks, teacher models are trained to optimize their own accuracy. However, recent studies have shown that a teacher with higher accuracy does not always lead to a student with higher accuracy \cite{cho2019efficacy, stanton2021does}. To explain this counter-intuitive observation and advance the understanding of the role of teacher models in KD, the following research problem naturally arises: \textit{How can a teacher model be trained to further improve the student's accuracy in scenarios where the teacher is willing to allow its knowledge to be transferred to the student in whatever form?} In this thesis, we assert that the role of the teacher model is to provide contextual information to the student model during the KD process. To increase the contextual information captured by the teacher model, this thesis proposes a novel regularization term called Maximum Conditional Mutual Information (MCMI). Specifically, when a teacher model is trained with the conventional cross-entropy (CE) loss plus MCMI, its log-likelihood and conditional mutual information (CMI) are maximized simultaneously. A new Class Activation Mapping (CAM) algorithm further verifies that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that by employing a teacher trained with CE plus MCMI rather than one trained with CE alone in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with a gain of up to 3.32\%. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, when only 5\% of the training samples are available to the student (few-shot), the student's accuracy increases by up to 5.72\%, and for an omitted class (zero-shot), it increases from 0\% to as high as 84\%. (An illustrative sketch of the CE plus MCMI training objective is given below, following this record.)
dc.identifier.uri: https://hdl.handle.net/10012/20949
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.title: Improving Knowledge Distillation by Training Teachers to Maximize Their Conditional Mutual Information
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Yang, En-hui
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
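
For readers who want a concrete picture of the training objective described in the abstract, the sketch below shows one plausible way to combine the cross-entropy loss with a batch-level estimate of the teacher's conditional mutual information. It is a minimal illustration only, not the thesis's actual implementation: the function name mcmi_loss, the weight lambda_cmi, and the use of per-class softmax centroids as the conditioning distribution are assumptions made for this example.

import torch
import torch.nn.functional as F

def mcmi_loss(logits: torch.Tensor, labels: torch.Tensor, lambda_cmi: float = 1.0) -> torch.Tensor:
    """Hypothetical CE-plus-MCMI objective: cross-entropy minus a weighted CMI estimate.

    The CMI term I(X; Yhat | Y) is approximated, per batch, by the average KL
    divergence between each sample's softmax output and the centroid of the softmax
    outputs sharing its label. Subtracting this term from the loss means that
    minimizing the loss maximizes both the log-likelihood and the CMI estimate.
    """
    ce = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=1)            # teacher's predictive distributions
    eps = 1e-8                                  # numerical guard against log(0)
    cmi = logits.new_zeros(())
    classes = labels.unique()
    for c in classes:
        p_c = probs[labels == c]                # predictions for samples of class c
        q_c = p_c.mean(dim=0, keepdim=True)     # empirical class centroid
        kl = (p_c * (torch.log(p_c + eps) - torch.log(q_c + eps))).sum(dim=1).mean()
        cmi = cmi + kl
    cmi = cmi / classes.numel()

    return ce - lambda_cmi * cmi

In a teacher-training loop, this loss would simply replace the plain cross-entropy loss. How the CMI term is actually estimated (for instance, with a separately maintained variational distribution rather than batch centroids) and how the weight lambda_cmi is chosen are details governed by the thesis itself.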

Files

Original bundle
Name: Ye_Linfeng.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission