Post-Training Large Language Models as Software Engineering Agents


Advisor

Chen, Wenhu

Publisher

University of Waterloo

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code understanding and generation, yet a significant gap remains between static code generation and interactive software engineering. This thesis investigates the post-training of LLMs as software engineering agents, focusing on three interconnected challenges: infrastructure, data, and training methodology.

First, we contribute to VerlTool, a unified framework for agentic reinforcement learning with tool integration (ARLT). The author's contributions center on the training orchestration layer (the stateful environment protocol, environment server architecture, and SWE agent post-training pipeline), which makes tool-augmented RL training practical and accessible for researchers.

Second, we address the critical bottleneck of training data and evaluation infrastructure. SWE-Next provides a scalable, Ray-native pipeline for synthesizing verifiable software engineering tasks from open-source repositories (ongoing work with intermediate results reported). For SWE-QA-Pro, a representative benchmark for code question answering, the author contributes the data sourcing and synthesis pipeline.

Third, we investigate the post-training design space for software engineering agents, spanning supervised fine-tuning (SFT), rejection fine-tuning (RFT), RL from AI feedback (RLAIF), and RL with verifiable rewards (RLVR). Through three complementary case studies, covering code question answering (SFT + RLAIF), web-based information retrieval (SFT + RFT), and repository-level bug fixing (RLVR), we demonstrate that the optimal training recipe depends on task characteristics such as reward verifiability, exploration complexity, and data availability. Our experiments show that task-specific post-training of smaller open-weight models can be competitive with larger proprietary models, and that matching the training method to the task structure is more important than uniformly applying all stages.
