Evaluation of Information Access Systems in the Generative Era

Date

2025-06-20

Advisor

Clarke, Charles

Publisher

University of Waterloo

Abstract

The rapid advancement of information access technologies, including neural retrieval models and generative information-seeking systems, has outpaced traditional evaluation methodologies, exposing fundamental gaps in how their effectiveness is assessed. Existing evaluation frameworks struggle to adapt, particularly in the presence of sparse relevance labels, limiting their ability to compare retrieval-based and generation-based systems fairly and comprehensively. The emergence of large language models (LLMs) further complicates evaluation: they challenge conventional assessment paradigms while also offering new opportunities for automated evaluation. Addressing these issues requires first identifying the flaws in current evaluation methodologies and then developing more robust, efficient, and adaptable assessment strategies.

This thesis begins by demonstrating that evaluation based on sparse labeling introduces substantial biases and inconsistencies into system rankings, often failing to recognize genuine improvements in retrieval effectiveness. We show that, on traditional IR benchmarks, stronger models may retrieve highly relevant but unjudged documents, leading to underestimation of their performance. To mitigate this, we propose an alternative evaluation approach that compares the distribution of retrieved results with the distribution of labeled data using Fréchet Distance. This method not only improves robustness in the presence of sparse labels but also enables direct comparison of retrieval-based and generative models on a common evaluation scale.

We then investigate how LLMs can be leveraged to evaluate IR systems, distinguishing between their use for evaluating retrieval-based methods and for evaluating generative IR systems. A key focus of this work is the role of LLMs in automated relevance judgments. We systematically compare different LLM-based relevance assessment methodologies, highlighting the lack of standardization in how these approaches are themselves evaluated. To address this gap, we propose a structured framework that assesses relevance judgment methods by their alignment with human labels and their impact on system rankings. Furthermore, we examine the effect of prompt formulation on LLM-based evaluation, demonstrating how prompt variations can significantly influence the consistency and reliability of assessment outcomes.

Finally, we extend our study beyond retrieval-based evaluation to the assessment of generated content across multiple applications. We explore retrieval-assisted methods for evaluating generative textual content, IR-inspired approaches for assessing text-to-image generation models, and a broader framework for evaluating LLM-powered applications. Together, these contributions lay the foundation for a new generation of evaluation methodologies that keep pace with evolving information access technologies, ensuring that improvements in retrieval and generative AI systems can be accurately and meaningfully assessed.
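
To make the distribution-based idea concrete, the following is a minimal sketch rather than the thesis implementation: assuming dense document embeddings for a system's retrieved results and for the judged relevant documents, it computes the Fréchet distance between Gaussian approximations of the two embedding sets. The function name and the use of numpy/scipy are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis implementation): compare the
# embedding distribution of a system's retrieved documents against the
# distribution of judged relevant documents via the Fréchet distance
# between two fitted Gaussians.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(emb_retrieved: np.ndarray, emb_judged: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs are (n_docs, dim) arrays of document embeddings; the choice
    of embedding model and document sets is assumed for illustration.
    """
    mu1, mu2 = emb_retrieved.mean(axis=0), emb_judged.mean(axis=0)
    cov1 = np.cov(emb_retrieved, rowvar=False)
    cov2 = np.cov(emb_judged, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; drop small imaginary
    # components introduced by numerical error.
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```

A lower distance indicates that the retrieved (or generated) results are distributed more like the labeled relevant material, which is the intuition behind scoring systems on a common scale without exhaustive relevance judgments.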
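The meta-evaluation of relevance judgment methods can likewise be sketched, under assumptions, as two simple scores: label-level agreement with human assessors and agreement between the system rankings induced by human versus LLM labels. The helper names, the metric choices (Cohen's kappa and Kendall's tau), and the use of scipy/scikit-learn are illustrative, not the framework proposed in the thesis.

```python
# Minimal sketch (illustrative assumptions, not the thesis framework):
# score an automatic (e.g., LLM-based) relevance-judgment method on
# (1) agreement with human labels and (2) impact on system rankings.
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score


def label_agreement(human_labels, llm_labels):
    """Cohen's kappa between human and LLM relevance labels (same qrel order)."""
    return cohen_kappa_score(human_labels, llm_labels)


def ranking_agreement(scores_with_human_qrels, scores_with_llm_qrels):
    """Kendall's tau between system rankings induced by the two label sets.

    Each argument maps system name -> effectiveness score (e.g., nDCG@10)
    computed with the corresponding qrels.
    """
    systems = sorted(scores_with_human_qrels)
    human = [scores_with_human_qrels[s] for s in systems]
    llm = [scores_with_llm_qrels[s] for s in systems]
    tau, _ = kendalltau(human, llm)
    return tau
```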

Keywords

Information Retrieval, Large Language Models, Evaluation, Retrieval Augmented Generation, Generative Information Retrieval
