Chatting with 100k+ Files: A Codebase Indexing Project

Backstory
At the time, I was deep into studying operating systems — especially memory paging — and I wanted to understand how it all actually worked in practice. Naturally, I turned to the Linux kernel source code.
However, the Linux repo is massive, containing over 50,000 files. Manually exploring and jumping between files to follow execution paths was nearly impossible. I wanted a better way to search for relevant code across the entire codebase.
This was around the time GPT had just been released, and with tools like LangChain, Pinecone, and Modal starting to mature, I thought it would be worthwhile to build a custom RAG (Retrieval-Augmented Generation) pipeline tailored to codebases.
How It Was Built
Indexing the Repository
The pipeline starts by taking in a Git repo URL. The backend:
- Clones the repo
- Streams all raw file contents to worker nodes
- Splits the code into chunks by function, class, or logical page
- Embeds each chunk using OpenAI (or any supported embedding model)
- Stores embeddings in Pinecone with relevant metadata (file path, function name, etc.)
This chunking strategy ensures that semantic units (like a C function or kernel module) stay together in the vector index — improving search results.
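To make the chunking step concrete, here is a minimal sketch of splitting C source into function-level chunks. This is an illustration, not the actual pipeline code: it uses a simple regex-plus-brace-counting heuristic (a production version would want a real parser), and the function and metadata names are my own for the example.

```python
import re

def chunk_c_source(source: str, path: str) -> list[dict]:
    """Split C source into function-level chunks via a brace-matching heuristic.

    Each chunk keeps a whole function body together and carries the metadata
    (file path, function name) that gets stored alongside the embedding.
    """
    chunks = []
    # Match a function signature: return type, name, params, then an opening brace.
    sig = re.compile(r"^\w[\w\s\*]*?(\w+)\s*\([^;{]*\)\s*\{", re.MULTILINE)
    for m in sig.finditer(source):
        depth, i = 0, m.end() - 1  # start scanning at the opening '{'
        while i < len(source):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    break  # matched the function's closing brace
            i += 1
        chunks.append({
            "text": source[m.start(): i + 1],
            "metadata": {"file_path": path, "function_name": m.group(1)},
        })
    return chunks

example = """
static int do_page_fault(struct pt_regs *regs)
{
    if (!regs) {
        return -1;
    }
    return 0;
}
"""
chunks = chunk_c_source(example, "arch/x86/mm/fault.c")
```

Each resulting chunk is then embedded and upserted to Pinecone with its metadata attached, so a search hit can be traced straight back to a specific function in a specific file.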
Inference: Chatting with the Codebase
Once the repo is indexed, you can ask natural language questions like:
- “Where does Linux handle page faults?”
- “How does copy-on-write work in memory management?”
Here’s what happens under the hood:
- Your query is embedded and used to run a cosine similarity search in Pinecone
- The top-K matching code chunks are returned
- LangChain wraps them with a chain-of-thought prompt, then feeds them into GPT to synthesize an answer
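The retrieval step boils down to nearest-neighbor search over embeddings. Here is a toy sketch of what Pinecone is doing on our behalf, with a hand-rolled cosine similarity and made-up 3-d vectors standing in for real embeddings (OpenAI's are on the order of 1,536 dimensions); the chunk IDs below are illustrative, not real index entries.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy index: (chunk_id, embedding) pairs standing in for the Pinecone index.
index = [
    ("mm/memory.c::do_wp_page", [0.9, 0.1, 0.0]),
    ("fs/read_write.c::vfs_read", [0.0, 0.2, 0.9]),
    ("mm/memory.c::handle_mm_fault", [0.8, 0.3, 0.1]),
]

# A query embedding close to the memory-management chunks retrieves them first.
hits = top_k([1.0, 0.2, 0.0], index, k=2)
# hits == ["mm/memory.c::do_wp_page", "mm/memory.c::handle_mm_fault"]
```

The top-K chunks (with their file-path metadata) then become the context that LangChain packs into the prompt before the GPT call.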
This lets you reason about the codebase more like you would with a knowledgeable dev mentor — without manually tracing hundreds of source files.
Final Thoughts
This project was one of my earliest experiences combining LLMs with infrastructure tooling like Modal and Pinecone. If I were to continue this project, I'd add code-analysis features (AST parsing, for instance) so structural information could be sent to GPT alongside the raw code.
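As a taste of what that AST enrichment could look like, here is a sketch using Python's stdlib `ast` module to extract, for each function, the calls it makes. This is purely illustrative: analyzing actual Linux C code would need a C-aware parser (clang or tree-sitter, say), and the sample source and names below are invented.

```python
import ast

def summarize(source: str) -> dict[str, list[str]]:
    """Map each function name to the sorted set of functions it calls.

    This kind of structural summary could be sent to the LLM alongside
    the raw code chunks to help it follow execution paths.
    """
    tree = ast.parse(source)
    summary = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = sorted({
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
            summary[node.name] = calls
    return summary

sample = """
def handle_fault(addr):
    page = lookup_page(addr)
    if page is None:
        allocate_page(addr)

def lookup_page(addr):
    return None
"""
info = summarize(sample)
# info == {"handle_fault": ["allocate_page", "lookup_page"], "lookup_page": []}
```

Feeding the model a call graph like this, rather than raw text alone, would let it answer "what calls what" questions without having to infer structure from code snippets.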
If you’re working on AI tooling for devs, especially around RAG + code search, feel free to reach out or DM me. Always down to chat.