Chatting with 100k+ Files: A Codebase Indexing Project

Backstory

At the time, I was deep into studying operating systems — especially memory paging — and I wanted to understand how it all actually worked in practice. Naturally, I turned to the Linux kernel source code.

However, the Linux kernel repo is massive, containing over 50,000 files. Manually exploring and jumping between files to follow execution paths was nearly impossible. I wanted a better way to search for relevant code across the entire codebase.

This was around the time GPT had just been released, and with tools like LangChain, Pinecone, and Modal starting to mature, I thought it would be helpful to build a custom RAG (Retrieval-Augmented Generation) pipeline tailored to codebases.

How It Was Built

Indexing the Repository

The pipeline starts by taking in a Git repo URL. The backend:

  1. Clones the repo
  2. Streams all raw file contents to worker nodes
  3. Splits the code into chunks by function, class, or logical page
  4. Embeds each chunk using OpenAI (or any supported embedding model)
  5. Stores embeddings in Pinecone with relevant metadata (file path, function name, etc.)

This chunking strategy keeps semantic units (like a C function or a kernel module) intact in the vector index, which improves retrieval quality.
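
To make steps 3–5 concrete, here's a minimal sketch of the chunk-embed-upsert loop using the current OpenAI and Pinecone Python clients. This isn't the project's actual code: the index name, helper names, and embedding model are placeholders, cloning and the Modal worker fan-out are omitted, and a simple fixed-window line splitter stands in for the function/class-aware chunker.

    from pathlib import Path

    from openai import OpenAI
    from pinecone import Pinecone

    openai_client = OpenAI()                  # reads OPENAI_API_KEY from the environment
    index = Pinecone().Index("linux-kernel")  # reads PINECONE_API_KEY; placeholder index name

    def chunk_file(path: Path, max_lines: int = 60, overlap: int = 10):
        """Fixed-window stand-in for the function/class-aware splitter."""
        lines = path.read_text(errors="ignore").splitlines()
        for start in range(0, len(lines), max_lines - overlap):
            chunk = "\n".join(lines[start:start + max_lines])
            if chunk.strip():
                yield start + 1, chunk        # 1-based start line

    def index_repo(repo_root: str) -> None:
        for path in Path(repo_root).rglob("*.c"):
            vectors = []
            for start_line, chunk in chunk_file(path):
                embedding = openai_client.embeddings.create(
                    model="text-embedding-3-small",  # any supported embedding model
                    input=chunk,
                ).data[0].embedding
                vectors.append({
                    "id": f"{path}:{start_line}",
                    "values": embedding,
                    "metadata": {
                        "file_path": str(path),
                        "start_line": start_line,
                        "text": chunk,        # stored so retrieval can return the code itself
                    },
                })
            if vectors:
                index.upsert(vectors=vectors)  # one batch per file keeps requests small

In the real pipeline, the splitter walks function and class boundaries rather than fixed line windows, and other source extensions (headers, Makefiles, etc.) would be indexed alongside the .c files.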

Inference: Chatting with the Codebase

Once the repo is indexed, you can ask natural language questions like:

  • “Where does Linux handle page faults?”
  • “How does copy-on-write work in memory management?”

Here’s what happens under the hood:

  1. Your query is embedded and used to run a cosine similarity search in Pinecone
  2. The top-K matching code chunks are returned
  3. LangChain wraps them with a chain-of-thought prompt, then feeds them into GPT to synthesize an answer

This lets you reason about the codebase the way you would with a knowledgeable dev mentor, without manually tracing hundreds of source files.
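
For illustration, here's a rough sketch of that query path against the index built above. It calls the OpenAI client directly instead of going through LangChain's prompt wrappers, and the chat model name is a placeholder, not necessarily what the original pipeline used.

    from openai import OpenAI
    from pinecone import Pinecone

    openai_client = OpenAI()
    index = Pinecone().Index("linux-kernel")  # same placeholder index as above

    def ask(question: str, top_k: int = 5) -> str:
        # 1. Embed the query with the same model used at indexing time
        query_embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=question,
        ).data[0].embedding

        # 2. Similarity search (assumes the index was created with metric="cosine")
        results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)

        # 3. Stuff the retrieved chunks into a prompt and let GPT synthesize an answer
        context = "\n\n".join(
            f"// {m.metadata['file_path']}:{m.metadata['start_line']}\n{m.metadata['text']}"
            for m in results.matches
        )
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder chat model
            messages=[
                {"role": "system", "content": (
                    "You answer questions about a codebase using only the provided "
                    "code excerpts. Reason step by step and cite file paths."
                )},
                {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content

    print(ask("Where does Linux handle page faults?"))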

Final Thoughts

This project was one of my earliest experiences combining LLMs with infrastructure tooling like Modal and Pinecone. If I were to continue the project, I would add code-analysis features (AST parsing, etc.) so that richer structural context could be sent to GPT.

If you’re working on AI tooling for devs, especially around RAG + code search, feel free to reach out or DM me. Always down to chat.