Browser document Q&A (Llama-3 + RAG)

I built a browser-based document assistant to see how far I could push private, session-local Q&A over PDFs and text: chunking, embeddings, vector retrieval, and a Llama-3–class model for answers—without sending document content to a third-party app beyond the inference API I chose for the experiment.

Screen capture of the prototype: upload, question, and grounded-style answers over in-session documents.

What I shipped

Ingestion pipeline — PDF/text handling with cleanup and chunking suitable for retrieval (LangChain-style orchestration, PyPDF, custom normalization).
Retrieval stack — M2-BERT embeddings with FAISS for search over chunks; prompt assembly so the LLM works from retrieved context.
Reasoning layer — Llama-3-70B via Together AI for answer generation in early iterations.
Hosting — AWS serverless-style deployment with scaling in mind for bursty use.

What I learned

End-to-end RAG is less about the headline model and more about chunk boundaries, embedding quality, and failure modes when retrieval misses.
Keeping sessions scoped and documents client/session-bound is doable, but latency and cost trade-offs are real at moderate scale.

This was a personal build for practice and portfolio; it is not a product launch. A public demo URL may follow if I revisit hosting and terms of use.