Portfolio project
RAG Document Intelligence QA System
A retrieval-augmented question-answering API over internal policy-style documents. Questions are embedded, matched against a FAISS index, and answered with an LLM using only retrieved context—returning explicit source references for each response.
Overview
Problem. Teams store procedures, SLAs, and policies in text files or PDFs. Finding a precise answer quickly—without guessing—requires search that understands meaning, not just keywords.
What this system does. It ingests chunked documents, builds dense vector embeddings, indexes them in FAISS, and exposes a REST API. Each POST /ask retrieves top-k chunks, assembles them into a prompt, and calls an OpenAI-compatible chat model. The response includes document name, page, and chunk id for each source.
Why grounding matters. Pure LLM answers can confabulate on factual details (dates, SLAs, thresholds). Conditioning on retrieved passages ties the answer to specific text and makes errors easier to audit.
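In code, grounding amounts to building the prompt exclusively from retrieved text. A minimal sketch of such context assembly follows; the source-label format, instruction wording, and field names are illustrative, not the repo's actual prompt:

```python
def build_prompt(question, chunks, max_chars=2000):
    """Format retrieved chunks with source labels and cap total prompt size."""
    parts, used = [], 0
    for c in chunks:
        block = f"[{c['document']} p.{c['page']} #{c['chunk_id']}]\n{c['text']}"
        if used + len(block) > max_chars:
            break  # stop adding context once the budget is spent
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    {"document": "refund_policy.txt", "page": 1, "chunk_id": 3,
     "text": "Refunds are processed within 14 days."},
]
print(build_prompt("What is the refund SLA?", chunks))
```

Because every passage carries its document, page, and chunk id, an auditor can trace any factual claim in the answer back to a specific span of source text.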
Scope & limitations
- Fixed corpus — Answers are only as good as the indexed documents; the API does not browse the open web.
- Retrieval can miss or rank poorly — Wrong or weak chunks still lead to weak or wrong answers; similarity is not “truth.”
- LLM behavior — The model may still misread context or refuse; an optional server-side RETRIEVAL_MIN_SCORE threshold can skip the LLM and return a fallback even when chunks exist.
- Not legal/financial advice — Demo policies are sample text for engineering illustration.
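The RETRIEVAL_MIN_SCORE fallback described above can be sketched as follows; the threshold value, field names, and fallback message here are illustrative, not the server's actual defaults:

```python
RETRIEVAL_MIN_SCORE = 0.35  # illustrative threshold; the real default may differ

def answer_or_fallback(hits, generate):
    """Skip the LLM and return a stub response when the best similarity score is too low."""
    if not hits or hits[0]["score"] < RETRIEVAL_MIN_SCORE:
        return {"answer": "No sufficiently relevant passages were found.", "sources": []}
    return generate(hits)  # only call the LLM when retrieval looks trustworthy

weak = answer_or_fallback([{"score": 0.12}], generate=lambda h: {"answer": "ok", "sources": h})
strong = answer_or_fallback([{"score": 0.80}], generate=lambda h: {"answer": "ok", "sources": h})
print(weak["answer"], "|", strong["answer"])
```

The point of the gate is cost and honesty: a low top score means the corpus likely lacks an answer, so returning an explicit fallback is better than letting the model improvise.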
Architecture & workflow
End-to-end path for a single question:
One request; offline steps (extract → chunk → embed → index) are separate from this path.
1. User question — JSON body {"question":"…"} to POST /ask.
2. Embedding & retrieval — The query is embedded with the same model used at index time (all-MiniLM-L6-v2, normalized vectors).
3. FAISS search — Inner-product search on the index returns the top-k chunk rows with similarity scores.
4. Context assembly — Chunk text is formatted with source labels (document, page) and capped for prompt size.
5. LLM answer — A chat completion generates the answer constrained to the provided context.
6. Sources in the response — The API returns answer plus sources[] (and optionally retrieved chunk payloads for debugging).
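The embedding-and-search steps can be sketched with plain NumPy, since inner-product search over L2-normalized vectors is exactly what a flat FAISS inner-product index computes. Here a toy random matrix stands in for the real embedding model and index, so all names and values are illustrative:

```python
import numpy as np

def top_k_inner_product(query_vec, index_vecs, k=3):
    """Inner-product search over L2-normalized vectors (cosine similarity)."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

# Toy "index": 4 chunk embeddings, normalized to unit length.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(4, 8))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

# Query deliberately close to chunk 2, then normalized like at index time.
query = chunks[2] + 0.05 * rng.normal(size=8)
query /= np.linalg.norm(query)

hits = top_k_inner_product(query, chunks, k=2)
print(hits)  # chunk 2 should rank first with a score near 1.0
```

Normalizing both query and chunk vectors is what makes the inner product equal cosine similarity; skipping normalization on either side silently changes the ranking.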
Features
Document-grounded answers
Answers are driven by retrieved passages, not the model’s unconstrained prior knowledge.
Source references
Each citation includes document name, page, and chunk id for traceability.
Retrieval pipeline
Separate ingestion, chunking, embedding, and indexing steps; API loads index and metadata at startup.
API-first
FastAPI with OpenAPI; easy to integrate from web clients, scripts, or other services.
Dockerized backend
Single image with pinned dependencies; artifacts baked or mounted per environment.
VPS deployment
Container on a VPS behind Caddy for HTTPS and reverse proxy to the app.
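The chunking step of the retrieval pipeline above can be sketched as fixed-size character windows with overlap; the window size, overlap, and function name are illustrative, not the repo's actual parameters:

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into fixed-size character windows that overlap,
    so content cut at one boundary still appears whole in a neighbor."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(1200))   # stand-in for extracted document text
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])
```

Overlap trades index size for recall: a fact straddling a chunk boundary would otherwise be split across two chunks and match neither well.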
Tech stack
- Runtime / API: Python 3.11, FastAPI, Uvicorn
- Vectors: sentence-transformers, NumPy, FAISS (CPU)
- Generation: OpenAI API (or OpenAI-compatible base URL)
- Data: CSV chunk metadata, pandas in the indexing pipeline
- Ops: Docker, Linux VPS, Caddy (TLS + reverse proxy)
Deployment & engineering
The service runs as a Docker container exposing the FastAPI app on an internal port. Caddy terminates TLS and proxies public HTTPS to the container. The live instance is reachable at rag.vahdetkaratas.com with GET /health reporting index and metadata presence and whether retrieval is loaded in memory.
This matches a small production-style loop: build image → run on VPS → configure DNS and reverse proxy → verify health before sending traffic to /ask.
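The "verify health before sending traffic" gate can be scripted against GET /health. A minimal sketch of the decision logic, using only the standard library; the JSON field names are illustrative, since the real /health schema is not shown here:

```python
import json

def healthy(payload: bytes) -> bool:
    """Gate traffic: require index and metadata present and retrieval loaded in memory.
    Field names are illustrative; adapt to the real /health response."""
    status = json.loads(payload)
    return all(bool(status.get(k))
               for k in ("index_present", "metadata_present", "retrieval_loaded"))

ok = healthy(b'{"index_present": true, "metadata_present": true, "retrieval_loaded": true}')
bad = healthy(b'{"index_present": true, "metadata_present": false, "retrieval_loaded": true}')
print(ok, bad)  # True False
```

In practice the payload would come from an HTTP GET to the /health endpoint; a deploy script would loop on this check until it passes or a timeout expires.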
Building the index & artifacts (local or CI) is documented in the repo: see README.md and docs/PUBLISH.md for extraction → chunking → FAISS indexing and Docker notes.
Live demo
The demo calls POST https://rag.vahdetkaratas.com/ask from your browser and shows the generated answer alongside the retrieved evidence (the passages the model was actually shown). If the request is blocked (e.g. CORS or an API key on the server), use Swagger UI or curl instead.
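The curl fallback can equally be scripted from Python's standard library. This sketch only builds the request object (call urllib.request.urlopen on it to actually send); the example question is hypothetical:

```python
import json
from urllib.request import Request

API = "https://rag.vahdetkaratas.com/ask"

def build_ask(question: str) -> Request:
    """Build the POST /ask request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(API, data=body, headers={"Content-Type": "application/json"})

req = build_ask("What is the response-time SLA?")
print(req.get_method(), req.full_url)  # POST https://rag.vahdetkaratas.com/ask
```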
Why this project
The repo demonstrates a full retrieval + generation loop with a real HTTP API and deployed endpoint—not only notebooks or local scripts. Design choices (separate indexing from inference, explicit sources, health checks) reflect how similar systems are operated in practice.