I built an AI sandbox so I could actually learn this stuff
Instead of reading about RAG and agents and embeddings in isolation, I built a place where they're all connected and I can break things on purpose.
I learn by doing. I can read about vector similarity all day, but until I see two sentences produce nearly identical bar charts and a third one look completely different, it doesn’t stick. Same with agents — reading about tool use is not the same as watching an agent ignore your carefully written prompt and answer from its own knowledge anyway.
So I built a sandbox. A place where I can actually run these things, see the numbers, watch the decisions, and break stuff without consequence.
What I Built
Six pieces, all connected to the same Postgres database with pgvector:
Embeddings page. I paste text and get back a vector. There’s a bar chart showing the first 50 dimensions. The whole point is comparison — embedding “I love cats” next to “I adore felines” next to “the stock market crashed” and seeing the first two look almost identical while the third is different. That’s what makes embeddings stop being abstract.
RAG pipeline. I upload documents. They get chunked, embedded, and stored. When I search, I see which chunks matched and their cosine distance scores. When I ask, I get both the matched chunks and a generated answer side by side. This makes it immediately obvious when retrieval works (right chunks, good answer) and when it doesn’t (wrong chunks, the model makes stuff up).
Agent runners. Two frameworks. LangGraph for single agents with tools — specifically the option to give an agent access to my knowledge base and watch it decide whether to use it. CrewAI for multi-agent work — I set up a researcher and a writer and watch tasks flow between them. Both log every step to the database.
MCP client and server. The server exposes my RAG pipeline as a tool any MCP client can discover and use. The client connects to external MCP servers and lets me call their tools from the UI. This is how the sandbox stops being isolated and starts composing with other systems.
Chat. Persistent conversations stored in the database. The real use is comparing a plain chat answer (no context) against a RAG answer (with context) on the same question. The difference is retrieval working or not working.
Vector database. pgvector extension on Postgres. One database for documents, embeddings, conversations, and agent logs. No separate vector service, no sync problems.
Why This Setup
pgvector because I already need Postgres. Running a separate vector database alongside it would be one more thing to manage and one more place for data to get out of sync. The Neon free tier includes pgvector and 0.5 GB of storage, which is enough for a sandbox.
OpenRouter because managing separate API keys for OpenAI, Anthropic, and Google is annoying. One key, 100+ models. I swap between GPT-4o, Claude, Gemini, and Llama by changing one config line. The backend uses LangChain’s OpenAI-compatible interface which works with OpenRouter directly.
FastAPI because it’s async, generates its own API docs, and stays out of the way. LangGraph and CrewAI because they represent two approaches to agent design and I want to understand both. No abstraction layers on top — if I want to change chunking, I change the chunking function. If I want a new agent, I write a new agent file.
The Purpose
This isn’t a product. It’s a workbench. The goal is to make the things I’ve been learning about tangible:
- What does cosine distance 0.05 vs 0.45 actually look like on my own documents?
- When does an agent choose to call a tool vs answer from memory?
- How much does chunk size actually matter? (A lot, it turns out.)
- Does giving an agent RAG access actually produce better answers, or does it just add latency?
- Can multiple agents coordinate without losing context between them?
None of these questions answer themselves from reading. I need to run them and see what happens.
How I Plan to Learn With This
First, get the numbers in my head. Upload my own notes and documents. Search them. Stare at cosine distances until they feel intuitive instead of abstract. Embed pairs of similar and different sentences and compare the output. This is about building raw intuition for what vectors actually capture.
Second, make RAG work well. Ingest content from multiple sources with metadata. Ask cross-source questions. Build a test set of questions I know the answers to. Rate the answers. Change chunk size and overlap. Rate again. Keep going until the hallucination rate drops. This is where I learn that retrieval quality matters more than model quality — a good model with bad chunks still makes things up.
Third, let agents decide. Run LangGraph with RAG access. See when it calls the tool and when it doesn’t. Write better prompts. See it still ignore them sometimes. Then try CrewAI’s multi-agent setup and watch information flow and break between agents. Debug the traces. This is where I learn that agent behavior is unpredictable and that logging every step is the only way to understand what happened.
Fourth, connect outside. Write a custom MCP server wrapping a real API. Let an agent compose it with the RAG tool. Build a permission layer where dangerous tool calls need human approval. See what happens when an agent tries to call something it shouldn’t. This is where the sandbox stops being self-contained and starts interacting with real systems.
Fifth, put it together. An agent that finds new content, indexes it, tests its own retrieval, and reports. Multiple agents with shared state coordinating on a task. A code assistant that searches a codebase and makes edits. These are the projects where every piece connects and the failures get interesting.
Each phase builds on the one before. Each one has an obvious failure mode — bad vectors, bad retrieval, agents ignoring tools, sloppy security, things not composing. Those failures are the point. That’s where I actually learn.