Abstract
This article explores the benefits and challenges of using custom AI agents with large language models (LLMs) in the context of large codebases. We examine a case study involving OpenStudyBuilder, an open-source project with an extensive codebase, and Gemini Pro, an LLM with a one-million-token context window.
Custom AI agents can be instrumental throughout the entire software development lifecycle. This case study focuses on how such agents facilitate common tasks such as ramping up on the project and contributing to it by developing new features, enhancing tests, validating code, and improving documentation. While Gemini Pro’s large context window is not sufficient to analyse the entire OpenStudyBuilder codebase directly, it is effective for analysing substantial modules and filtered versions of the codebase.
This approach enables faster familiarisation with the project through summaries and analyses, providing a flexible, powerful tool that complements or even replaces AI pair programming platforms like GitHub Copilot.
Addressing a Common Problem: Comprehending a Large Codebase
Last week, I began ramping up on OpenStudyBuilder, an open-source solution designed to streamline clinical study design and collaboration by integrating CDISC standards.
A significant challenge with large codebases is the time required to comprehend them. Professionals such as software architects, developers, and testers typically spend weeks familiarising themselves with a project before making active contributions. Although documentation is available, contributing effectively demands a deeper understanding of the codebase. AI code assistants and generative AI agents can expedite this process.
Google, a leader in this field, highlights the capabilities of full codebase awareness in their forthcoming Code Assist products:
“Google’s Gemini Code Assist offers large-scale codebase modifications from a single prompt, including feature additions, cross-file dependency updates, version upgrades, and comprehensive code reviews, powered by the Gemini 1.5 Pro model with a one-million-token context window.”
Large Context Window Approach for Full Codebase Analysis
Google provides excellent examples of leveraging their model, such as this notebook for getting started as a developer. The large context window approach is appealing for its simplicity: rather than organising your codebase for a Retrieval-Augmented Generation (RAG) search, you provide the entire codebase alongside your queries. Unlike a RAG search, this method loses no information at retrieval time (and making a RAG search efficient can itself be time-consuming). The notebook demonstrates a straightforward setup: specify your codebase location (Git or local folder) and request analyses in natural language; a minimal sketch follows the list below. Examples include:
- Summarising codebases
- Generating developer documentation
- Uncovering critical bugs and fixes
- Implementing new features
- Understanding changes between Git commits
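For illustration, a minimal sketch of this setup in Python might look like the following. This is not the notebook's exact code: it assumes the google-generativeai SDK, an API key, and a hypothetical local checkout path; the summarisation question is a placeholder.

```python
# Minimal sketch of the large-context approach (not the notebook's exact code).
# Assumptions: google-generativeai SDK installed, API key available,
# and a local checkout at ./openstudybuilder (hypothetical path).
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def load_codebase(root: str) -> str:
    """Concatenate every Python file under root into a single prompt string."""
    parts = []
    for path in sorted(Path(root).rglob("*.py")):
        parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
    return "\n".join(parts)

codebase = load_codebase("./openstudybuilder")
prompt = f"{codebase}\n\nSummarise this codebase for a new developer."
print(model.generate_content(prompt).text)
```

The appeal is exactly this directness: one prompt, no index, no retrieval pipeline to tune.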
Customising Your AI Agent: Overcoming Limitations and Workflow Integration
Applying this approach to OpenStudyBuilder initially resulted in a timeout error because the codebase exceeded the one-million-token context window. However, simple Python customisations (sketched after the list below) can work around this limitation by:
- Checking prompt size to ensure it doesn’t exceed the limit
- Filtering out large, irrelevant files (e.g., test data, images)
- Selecting specific modules for analysis instead of the entire codebase
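A hedged sketch of those three workarounds, building on the snippet above. The skipped file suffixes, the token limit handling, and the module path are illustrative assumptions, not OpenStudyBuilder specifics.

```python
# Sketch of the workarounds: filter out large, irrelevant files, select a
# single module, and check the token count before sending the prompt.
from pathlib import Path

import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")
TOKEN_LIMIT = 1_000_000  # the one-million-token context window

SKIP_SUFFIXES = {".png", ".jpg", ".csv", ".json"}  # assumed test data / images

def load_module(root: str, module: str) -> str:
    """Collect one module's text files, skipping bulky non-code formats."""
    parts = []
    for path in sorted((Path(root) / module).rglob("*")):
        if path.is_file() and path.suffix not in SKIP_SUFFIXES:
            parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
    return "\n".join(parts)

prompt = load_module("./openstudybuilder", "data-import")  # hypothetical module
if model.count_tokens(prompt).total_tokens > TOKEN_LIMIT:
    raise ValueError("Prompt exceeds the context window; narrow the selection.")
response = model.generate_content(prompt + "\n\nSummarise this module.")
```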
Additional benefits of customisation include:
- Providing More Context: You can incorporate company-specific policies and guidelines to refine outcomes.
- Adapting and Deploying in Your Workflow: Custom AI agents in Python can generate structured outputs, minimising downstream integration effort. They can also be integrated into continuous integration/continuous deployment (CI/CD) pipelines, enhancing review or release preparation processes; a minimal sketch follows below.
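As an illustration of the structured-output point, here is a minimal sketch. It assumes Gemini 1.5 Pro's JSON output mode is available; the requested schema and the module file path are hypothetical.

```python
# Hedged sketch: request machine-readable output that a CI step can consume.
# Assumes JSON output mode is available for the model; schema is illustrative.
import json
from pathlib import Path

import google.generativeai as genai

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={"response_mime_type": "application/json"},
)

module_source = Path("data_import.py").read_text()  # hypothetical module file
prompt = (
    "List unhandled failure modes in the following module as a JSON array of "
    '{"scenario": ..., "impact": ...} objects.\n\n' + module_source
)

report = json.loads(model.generate_content(prompt).text)
for item in report:
    print(item["scenario"])  # e.g. feed into a review checklist or a CI gate
```

Because the output is parseable JSON rather than free text, a pipeline step can fail a build, file a ticket, or annotate a review without any manual copy-and-paste.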
Custom AI Agent in Action
Here are examples of the capabilities demonstrated by analysing a key module of OpenStudyBuilder, the import function. The AI agent provided:
- A module summary:
"This codebase is a Python script for populating a Neo4j database with medical dictionaries and study data for a software called StudyBuilder. It includes a variety of functions for importing data from different sources and formats, including CSV and JSON files."
- Insights into data validation approaches and potential improvements, aiding in discussions with the module developers:
"This codebase does not contain any explicit data validation routines. However, it does include several implicit validation approaches based on data lookups and checks for existing data. Data Validation Approaches: Lookup Tables, Existence Checks, Data Type Checking, Conditional Logic'"
- Identification of unhandled failure modes:
"scenario": "Invalid format of input CSV files", "impact": "The various import scripts read CSV files to get data for the dictionaries and codelists. These CSV files are assumed to be properly formatted, and there are no checks to validate the input. Invalid input may cause the scripts to fail with an exception or import invalid data. Thus it's important that the CSV files are reviewed and validated before using them for import."
Conclusion
Using large context windows to build AI agents for full codebase analysis is a powerful and straightforward approach. Although the one-million-token context window can be a bottleneck, this can be mitigated by analysing modules and pre-filtering content in custom AI agents. This enables significant use cases such as accelerated ramp-up in new codebases, quality and risk assessment, and support for new feature development.
When used correctly, custom AI agents can significantly enhance the software development process, not only during code generation but also during ramp-up, review, and assessment phases. They can complement or even replace tools like GitHub Copilot, providing a flexible, powerful alternative.