Development · 3 min read

Building an Internal Doc Search with RAG — A War Story

What happened when I tried to make 3,000 Confluence docs searchable with AI

I Heard "Check Confluence" Three Times a Day

Whenever someone asked a question on Slack, the answer was always the same: "It's on Confluence." But actually finding the right document on Confluence was like a treasure hunt. Search results surfaced 3-year-old docs first, there were five documents on the same topic, and you couldn't tell which one was current. Every single day.

"I'll just build a RAG-powered doc search chatbot." Thought it'd take two weeks. It took six.

Week 1: Smooth Sailing So Far

Crawling 3,247 documents via the Confluence API, chunking the text, and converting to embedding vectors — that part was fine. Used OpenAI's text-embedding-3-large for embeddings, stored them in Pinecone. Done in two days.
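The ingest side of that week can be sketched like this. The fixed-size splitter is the naive version described above; the function names, word-based token proxy, and the commented-out embedding/Pinecone calls are my illustrative assumptions, not the exact production code.

```python
# Week-1 ingest sketch (illustrative names, not production code).

def chunk_text(text: str, max_words: int = 500) -> list[str]:
    # Crude token proxy: split on whitespace and group ~500 "tokens" per chunk.
    # This is the mechanical splitting that causes trouble later.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# For each chunk, the pipeline then did roughly:
#   vec = client.embeddings.create(model="text-embedding-3-large", input=chunk)
#   index.upsert(vectors=[(chunk_id, vec.data[0].embedding, {"page_id": page_id})])
```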

Basic semantic search worked great. Ask "deployment process" and it'd pull up 5 relevant docs, with the LLM summarizing an answer. "Wow, this is amazing," I thought. (Should've stopped here.)
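"Pull up 5 relevant docs" is just nearest-neighbor search over the embeddings. Pinecone does this server-side at scale, but the core idea fits in a few lines (a toy sketch of the ranking, not the production query path):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], docs: dict[str, list[float]], k: int = 5) -> list[str]:
    # Rank every stored chunk by similarity to the query embedding, keep the best k.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]
```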

Week 2: "This Thing Is Lying"

The honeymoon didn't last. The QA team tried it out and sent feedback. Asked "how do I request time off?", it generated an answer based on a 2023 document describing the old process, which had since been replaced entirely. Even though the documents carried dates, the LLM didn't prioritize the newest one.

The scariest thing about RAG is the "confident lie." If it said "I don't know," at least you'd go look it up yourself. But when it confidently gives a wrong answer, people follow it. One new hire actually submitted their time-off request wrong because of this. That was on me.

Weeks 3-4: The Chunking Problem

Initially I'd mechanically split at 500 tokens, but context was getting cut off. Tables split in half, code blocks chopped in two. I ended up analyzing document structure, splitting by h2 tags into sections, and attaching document title and creation date as metadata to each chunk. That alone took a week.
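The structure-aware version looks roughly like this. Confluence serves pages as HTML ("storage format"); the regex split and the metadata keys are illustrative assumptions:

```python
import re

def split_by_h2(html: str, title: str, created: str) -> list[dict]:
    # Split *before* every <h2> so each section (heading plus its tables
    # and code blocks) stays whole, then attach the page title and
    # creation date as metadata on every chunk.
    sections = [s for s in re.split(r"(?=<h2[\s>])", html) if s.strip()]
    return [
        {"text": section, "metadata": {"title": title, "created": created}}
        for section in sections
    ]
```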

Then I added reranking. Retrieve 20 candidates via vector search, rerank with a Cross-Encoder, and feed only the top 5 to the LLM. This alone noticeably improved answer accuracy. I don't have exact numbers though — it's hard to measure precisely.
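The rerank step has a simple shape once the scoring model is injected as a function. The sentence-transformers CrossEncoder shown in the comment is one plausible choice, my assumption rather than a confirmed detail:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair with the cross-encoder,
    # keep only the top_n to pass to the LLM.
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]

# Wired up with an actual model, this might look like:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   best5 = rerank(q, top20, lambda q, p: model.predict([(q, p)])[0])
```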

Week 5: Security Called

Just when I thought it was almost done. "HR documents should only be visible to authorized people — the chatbot can't show them to everyone." Fair point. I had to mirror Confluence's page permission system in the RAG pipeline. Filtering so each user only sees documents they have access to. Another 3 days.
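The permission mirror boils down to storing each page's allowed groups in the chunk metadata at ingest time and filtering at query time. Field names here are my assumptions; a vector store like Pinecone can also apply this server-side as a metadata filter on the query itself:

```python
def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Drop any chunk the asking user's groups can't see. A chunk with no
    # restriction list is treated as visible to everyone, mirroring how
    # unrestricted Confluence pages behave.
    visible = []
    for chunk in chunks:
        allowed = chunk["metadata"].get("allowed_groups")
        if allowed is None or set(allowed) & user_groups:
            visible.append(chunk)
    return visible
```

Filtering before the LLM sees anything matters: a chunk that merely reaches the prompt can leak into the answer.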

Week 6: Finally Shipped

Deployed as a Slack bot. Ask a question with /ask and get an answer within 3 seconds, source document links included. Two weeks of post-launch stats: an average of 45 questions per day, perceived accuracy around 80%. Not perfect, but clearly better than Confluence's search bar, according to user feedback.

What I Honestly Regret

Saying "two weeks." Taking six was a prediction failure on my part. Embeddings and vector search are less than 20% of the whole thing. The other 80% is the boring work — chunking, reranking, permission management, hallucination prevention. You can build a demo-level RAG in a day. Building one that people actually use is a completely different problem.
