Improving Knowledge Distillation by Training Teachers to Maximize Their Conditional Mutual Information

dc.contributor.author: Ye, Linfeng
dc.date.accessioned: 2024-09-03T13:22:39Z
dc.date.available: 2024-09-03T13:22:39Z
dc.date.issued: 2024-09-03
dc.date.submitted: 2024-08-29
dc.description.abstract: Knowledge distillation (KD) and its variants, as effective model compression methods, have attracted tremendous attention from both academia and industry. These methods typically use a pretrained teacher model's outputs together with the ground-truth labels as supervision signals to train a lightweight student model, thereby improving the student's accuracy. One aspect of KD that has rarely been explored in the literature is how the behavior of the teacher model affects the student's performance. In most existing KD frameworks, teacher models are trained to optimize their own accuracy. However, recent studies have shown that a teacher with higher accuracy does not always lead to a student with higher accuracy \cite{cho2019efficacy, stanton2021does}. To explain this counter-intuitive observation and advance the understanding of the role of teacher models in KD, the following research problem naturally arises: \textit{How can a teacher model be trained to further improve the student's accuracy in scenarios where the teacher is willing to allow its knowledge to be transferred to the student in whatever form?} In this thesis, we assert that the role of the teacher model is to provide contextual information to the student model during the KD process. To increase the contextual information captured by the teacher model, this thesis proposes a novel regularization term called Maximum Conditional Mutual Information (MCMI). Specifically, when a teacher model is trained with the conventional cross-entropy (CE) loss plus MCMI, its log-likelihood and conditional mutual information (CMI) are maximized simultaneously. A new Class Activation Mapping (CAM) algorithm further verifies that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that by employing a teacher trained with CE plus MCMI rather than one trained with CE alone in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with a gain of up to 3.32\%. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, when only 5\% of the training samples are available to the student (few-shot), the student's accuracy increases by up to 5.72\%, and for an omitted class (zero-shot), it increases from 0\% to as high as 84\%. (An illustrative sketch of the CE plus MCMI training objective is given below, following this record.)
dc.identifier.uri: https://hdl.handle.net/10012/20949
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.title: Improving Knowledge Distillation by Training Teachers to Maximize Their Conditional Mutual Information
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Yang, En-hui
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
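
For readers who want a concrete picture of the training objective described in the abstract, the sketch below shows one plausible way to combine the cross-entropy loss with a batch-level estimate of the teacher's conditional mutual information. It is a minimal illustration only, not the thesis's actual implementation: the function name mcmi_loss, the weight lambda_cmi, and the use of per-class softmax centroids as the conditioning distribution are assumptions made for this example.

import torch
import torch.nn.functional as F

def mcmi_loss(logits: torch.Tensor, labels: torch.Tensor, lambda_cmi: float = 1.0) -> torch.Tensor:
    """Hypothetical CE-plus-MCMI objective: cross-entropy minus a weighted CMI estimate.

    The CMI term I(X; Yhat | Y) is approximated, per batch, by the average KL
    divergence between each sample's softmax output and the centroid of the softmax
    outputs sharing its label. Subtracting this term from the loss means that
    minimizing the loss maximizes both the log-likelihood and the CMI estimate.
    """
    ce = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=1)            # teacher's predictive distributions
    eps = 1e-8                                  # numerical guard against log(0)
    cmi = logits.new_zeros(())
    classes = labels.unique()
    for c in classes:
        p_c = probs[labels == c]                # predictions for samples of class c
        q_c = p_c.mean(dim=0, keepdim=True)     # empirical class centroid
        kl = (p_c * (torch.log(p_c + eps) - torch.log(q_c + eps))).sum(dim=1).mean()
        cmi = cmi + kl
    cmi = cmi / classes.numel()

    return ce - lambda_cmi * cmi

In a teacher-training loop, this loss would simply replace the plain cross-entropy loss. How the CMI term is actually estimated (for instance, with a separately maintained variational distribution rather than batch centroids) and how the weight lambda_cmi is chosen are details governed by the thesis itself.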

Files

Original bundle
Name: Ye_Linfeng.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission