BugLLM: Explainable Bug Localization through LLMs
Date
2024-09-24
Authors
Advisor
Meiyappan Nagappan
Publisher
University of Waterloo
Abstract
Bug localization is the process of identifying the files in a codebase that contain a bug based on a bug report. This thesis presents BugLLM, a novel zero-shot bug localization method leveraging Large Language Models (LLMs) and semantic search techniques. BugLLM comprises two main phases: ingestion and inference.
In the ingestion phase, the codebase is chunked using an Abstract Syntax Tree (AST) parser, embedded using OpenAI's Ada V2 model, and indexed in a Milvus vector database for efficient querying.
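The AST-based chunking step can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes Python source and Python's built-in `ast` module purely for illustration (BugLLM targets Java projects), and the function name `chunk_source` and the chunk fields are hypothetical. The embedding and Milvus indexing steps are omitted here.

```python
import ast


def chunk_source(source: str, path: str) -> list[dict]:
    """Split a module into one chunk per top-level function or class.

    Illustrative only: the chunk granularity and the field names are
    assumptions, not taken from the thesis.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the node, ready to be embedded.
                "text": ast.get_source_segment(source, node),
            })
    return chunks


example = '''
def add(a, b):
    return a + b

class Stack:
    def push(self, x):
        self.items.append(x)
'''

chunks = chunk_source(example, "example.py")
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Chunking along syntactic boundaries keeps each unit (a function or a class) intact, so an embedding describes one coherent piece of code rather than an arbitrary window of lines.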
In the inference phase, an LLM builds a query from the bug report, filtering out non-technical details. This refined query is then used to search the vector database, retrieving semantically similar code chunks. A second LLM query then assesses each chunk's relevance to the bug, so that only the most pertinent chunks are considered.
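The flow of the inference phase (refine the query, retrieve similar chunks, filter them for relevance) might be sketched as follows. Everything here is a stand-in chosen so that the example runs without external services: `embed` uses token counts instead of an embedding model, a plain list replaces the vector database, and `refine_query` and `is_relevant` are hypothetical placeholders for the two LLM calls.

```python
import math
import re
from collections import Counter

# Hypothetical filler words; in BugLLM an LLM performs this filtering.
NOISE = {"hi", "team", "thanks", "for", "the", "great", "tool"}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def refine_query(bug_report: str) -> str:
    """Placeholder for the LLM step that drops non-technical details."""
    tokens = re.findall(r"[a-z]+", bug_report.lower())
    return " ".join(t for t in tokens if t not in NOISE)


def is_relevant(score: float, min_score: float) -> bool:
    """Placeholder for the second LLM call that judges chunk relevance."""
    return score >= min_score


def localize(bug_report: str, chunks: list[dict], top_k: int = 5,
             min_score: float = 0.1) -> list[str]:
    """Return candidate files, ranked by similarity of their best chunk."""
    query_vec = embed(refine_query(bug_report))
    scored = sorted(
        ((cosine(query_vec, embed(c["text"])), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    files = []
    for score, chunk in scored:
        if is_relevant(score, min_score) and chunk["path"] not in files:
            files.append(chunk["path"])
    return files


code_chunks = [
    {"path": "src/Parser.java",
     "text": "public Token parse(String input) { if (input == null) "
             "throw new NullPointerException(); return tokenize(input); }"},
    {"path": "src/Logger.java",
     "text": "public void log(String message) { writer.write(message); }"},
    {"path": "src/Cache.java",
     "text": "public Object get(String key) { return map.get(key); }"},
]
report = ("Hi team, thanks for the great tool! "
          "The parser crashes when parse is called with null input.")

print(localize(report, code_chunks))
```

Separating retrieval from the relevance check mirrors the two-stage design described above: a cheap similarity search narrows the candidates, and a more expensive judgment is applied only to the few chunks that survive.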
We evaluated our method on a dataset of bugs from six large Java projects. On top-5 accuracy, one of the evaluation metrics used, BugLLM scored between 44.7% and 61.1%. This performance was competitive with traditional methods and often surpassed them, while requiring no training.
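For readers unfamiliar with the metric, the sketch below shows one common way to compute top-k accuracy for bug localization. It assumes a bug counts as localized when at least one of its truly buggy files appears among the first k ranked files; the abstract does not spell out the exact definition, and the function name, data layout, and sample values are hypothetical.

```python
def top_k_accuracy(predictions: dict[str, list[str]],
                   ground_truth: dict[str, set[str]],
                   k: int = 5) -> float:
    """Fraction of bugs with at least one buggy file in the top-k ranking.

    predictions:  bug id -> ranked list of file paths (best first)
    ground_truth: bug id -> set of files actually changed by the fix
    """
    if not ground_truth:
        return 0.0
    hits = 0
    for bug_id, buggy_files in ground_truth.items():
        ranked = predictions.get(bug_id, [])[:k]
        if any(path in buggy_files for path in ranked):
            hits += 1
    return hits / len(ground_truth)


# Made-up data, for illustration only.
sample_predictions = {
    "BUG-1": ["A.java", "B.java", "C.java"],
    "BUG-2": ["D.java", "E.java"],
    "BUG-3": ["F.java", "G.java", "H.java", "I.java", "J.java", "K.java"],
}
sample_truth = {
    "BUG-1": {"B.java"},            # hit at rank 2
    "BUG-2": {"Z.java"},            # miss
    "BUG-3": {"K.java", "X.java"},  # K.java is at rank 6, outside the top 5
}

print(top_k_accuracy(sample_predictions, sample_truth, k=5))
```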
To further aid developers, BugLLM also generates explanations for why specific files are relevant to a bug. The motivation behind this is twofold: helping developers understand why a file is important to fixing a bug and increasing transparency about how our tool works. Our methodology employs Chain-of-Thought prompting to generate detailed explanations from LLMs. These explanations are evaluated based on technical accuracy, groundedness, and informativeness. We find that the explanations generated by BugLLM are largely accurate and grounded in the actual content and context of the code, with minimal hallucination. The explanations were also found to be informative, providing valuable insights to developers. The mean scores (out of 5) for technical accuracy, groundedness, and informativeness were 3.9, 4.5, and 4.3, respectively, across different prompting techniques.