Improving Knowledge Distillation by Training Teachers to Maximize Their Conditional Mutual Information

Date

2024-09-03

Publisher

University of Waterloo

Abstract

Knowledge distillation (KD) and its variants, as effective model compression methods, have attracted tremendous attention from both academia and industry. These methods typically use the pretrained teacher model's outputs together with the ground-truth labels as supervision signals to train a lightweight student model, thereby improving the student's accuracy. One aspect of KD that has rarely been explored in the literature is how the behavior of the teacher model affects the student's performance. Specifically, in most existing KD pipelines, teacher models are trained to optimize their own accuracy. However, recent studies have shown that a teacher with higher accuracy does not always yield a student with higher accuracy \cite{cho2019efficacy, stanton2021does}. To explain this counterintuitive observation and advance the understanding of the teacher's role in KD, the following research problem naturally arises: \textit{How can a teacher model be trained to further improve the student's accuracy in scenarios where the teacher is willing to allow its knowledge to be transferred to the student in whatever form?} This thesis asserts that the role of the teacher model is to provide contextual information to the student model during the KD process. To increase the contextual information captured by the teacher, the thesis proposes a novel regularization term called Maximum Conditional Mutual Information (MCMI). Specifically, when a teacher model is trained with the conventional cross-entropy (CE) loss plus MCMI, its log-likelihood and its conditional mutual information (CMI) are maximized simultaneously. A newly developed Class Activation Mapping (CAM) algorithm further verifies that maximizing the teacher's CMI value allows it to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that employing a teacher trained with CE plus MCMI, rather than one trained with CE alone, consistently increases the student's classification accuracy across various state-of-the-art KD frameworks, with gains of up to 3.32\%. In addition, we show that these improvements are even more drastic in zero-shot and few-shot settings: when only 5\% of the training samples are available to the student (few-shot), the student's accuracy increases by up to 5.72\%, and for a class omitted from the student's training data (zero-shot), accuracy increases from 0\% to as high as 84\%.
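As a rough sketch of the objective described above (the notation here is assumed for illustration and is not taken from the abstract; the precise formulation appears in the thesis body), training a teacher with CE plus MCMI amounts to

\[
\min_{\theta} \; \mathcal{L}_{\mathrm{CE}}(\theta) \;-\; \lambda \, I_{\theta}\!\left(X;\, \hat{Y} \mid Y\right), \qquad \lambda \ge 0,
\]

where $\theta$ denotes the teacher's parameters, $X$ the input image, $Y$ the ground-truth label, $\hat{Y}$ the label predicted by the teacher, $I_{\theta}(X; \hat{Y} \mid Y)$ the conditional mutual information the regularizer maximizes, and $\lambda$ an assumed trade-off weight; minimizing $\mathcal{L}_{\mathrm{CE}}$ maximizes the log-likelihood while the second term maximizes the CMI.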
