Portfolio project
RAG Document Intelligence QA System
A retrieval-augmented question-answering API over internal policy-style documents. Questions are embedded, matched against a FAISS index, and answered with an LLM using only retrieved context—returning explicit source references for each response.
Overview
Problem. Teams store procedures, SLAs, and policies in text files or PDFs. Finding a precise answer quickly—without guessing—requires search that understands meaning, not just keywords.
What this system does. It ingests chunked documents, builds dense vector embeddings, indexes them in FAISS, and exposes a REST API. Each POST /ask retrieves top-k chunks, assembles them into a prompt, and calls an OpenAI-compatible chat model. The response includes document name, page, and chunk id for each source.
Why grounding matters. Pure LLM answers can confabulate on factual details (dates, SLAs, thresholds). Conditioning on retrieved passages ties the answer to specific text and makes errors easier to audit.
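In code, grounding amounts to building the prompt exclusively from retrieved text. A minimal sketch of such context assembly follows; the source-label format, instruction wording, and field names are illustrative, not the repo's actual prompt:

```python
def build_prompt(question, chunks, max_chars=2000):
    """Format retrieved chunks with source labels and cap total prompt size."""
    parts, used = [], 0
    for c in chunks:
        block = f"[{c['document']} p.{c['page']} #{c['chunk_id']}]\n{c['text']}"
        if used + len(block) > max_chars:
            break  # stop adding context once the budget is spent
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    {"document": "refund_policy.txt", "page": 1, "chunk_id": 3,
     "text": "Refunds are processed within 14 days."},
]
print(build_prompt("What is the refund SLA?", chunks))
```

Because every passage carries its document, page, and chunk id, an auditor can trace any factual claim in the answer back to a specific span of source text.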
Scope & limitations
- Fixed corpus — Answers are only as good as the indexed documents; the API does not browse the open web.
- Retrieval can miss or rank poorly — Wrong or weak chunks still lead to weak or wrong answers; similarity is not “truth.”
- LLM behavior — The model may still misread context or refuse; an optional server-side RETRIEVAL_MIN_SCORE threshold can skip the LLM and return a fallback even when chunks exist.
- Not legal/financial advice — Demo policies are sample text for engineering illustration.
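The RETRIEVAL_MIN_SCORE fallback described above can be sketched as follows; the threshold value, field names, and fallback message here are illustrative, not the server's actual defaults:

```python
RETRIEVAL_MIN_SCORE = 0.35  # illustrative threshold; the real default may differ

def answer_or_fallback(hits, generate):
    """Skip the LLM and return a stub response when the best similarity score is too low."""
    if not hits or hits[0]["score"] < RETRIEVAL_MIN_SCORE:
        return {"answer": "No sufficiently relevant passages were found.", "sources": []}
    return generate(hits)  # only call the LLM when retrieval looks trustworthy

weak = answer_or_fallback([{"score": 0.12}], generate=lambda h: {"answer": "ok", "sources": h})
strong = answer_or_fallback([{"score": 0.80}], generate=lambda h: {"answer": "ok", "sources": h})
print(weak["answer"], "|", strong["answer"])
```

The point of the gate is cost and honesty: a low top score means the corpus likely lacks an answer, so returning an explicit fallback is better than letting the model improvise.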
Architecture & workflow
End-to-end path for a single question:
One request; offline steps (extract → chunk → embed → index) are separate from this path.
1. User question — JSON body {"question":"…"} to POST /ask.
2. Embedding & retrieval — The query is embedded with the same model used at index time (all-MiniLM-L6-v2, normalized vectors).
3. FAISS search — Inner-product search on the index returns the top-k chunk rows with similarity scores.
4. Context assembly — Chunk text is formatted with source labels (document, page) and capped for prompt size.
5. LLM answer — A chat completion generates the answer constrained to the provided context.
6. Sources in the response — The API returns answer plus sources[] (and optionally retrieved chunk payloads for debugging).
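The embedding-and-search steps can be sketched with plain NumPy, since inner-product search over L2-normalized vectors is exactly what a flat FAISS inner-product index computes. Here a toy random matrix stands in for the real embedding model and index, so all names and values are illustrative:

```python
import numpy as np

def top_k_inner_product(query_vec, index_vecs, k=3):
    """Inner-product search over L2-normalized vectors (cosine similarity)."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

# Toy "index": 4 chunk embeddings, normalized to unit length.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(4, 8))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

# Query deliberately close to chunk 2, then normalized like at index time.
query = chunks[2] + 0.05 * rng.normal(size=8)
query /= np.linalg.norm(query)

hits = top_k_inner_product(query, chunks, k=2)
print(hits)  # chunk 2 should rank first with a score near 1.0
```

Normalizing both query and chunk vectors is what makes the inner product equal cosine similarity; skipping normalization on either side silently changes the ranking.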
Features
Document-grounded answers
Answers are driven by retrieved passages, not the model’s unconstrained prior knowledge.
Source references
Each citation includes document name, page, and chunk id for traceability.
Retrieval pipeline
Separate ingestion, chunking, embedding, and indexing steps; API loads index and metadata at startup.
API-first
FastAPI with OpenAPI; easy to integrate from web clients, scripts, or other services.
Dockerized backend
Single image with pinned dependencies; artifacts baked or mounted per environment.
VPS deployment
Container on a VPS behind Caddy for HTTPS and reverse proxy to the app.
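The chunking step of the retrieval pipeline above can be sketched as fixed-size character windows with overlap; the window size, overlap, and function name are illustrative, not the repo's actual parameters:

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into fixed-size character windows that overlap,
    so content cut at one boundary still appears whole in a neighbor."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(1200))   # stand-in for extracted document text
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])
```

Overlap trades index size for recall: a fact straddling a chunk boundary would otherwise be split across two chunks and match neither well.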
Tech stack
- Runtime / API: Python 3.11, FastAPI, Uvicorn
- Vectors: sentence-transformers, NumPy, FAISS (CPU)
- Generation: OpenAI API (or OpenAI-compatible base URL)
- Data: CSV chunk metadata, pandas in the indexing pipeline
- Ops: Docker, Linux VPS, Caddy (TLS + reverse proxy)
Deployment & engineering
The service runs as a Docker container exposing the FastAPI app on an internal port. Caddy terminates TLS and proxies public HTTPS to the container. The live instance is reachable at rag.vahdetkaratas.com with GET /health reporting index and metadata presence and whether retrieval is loaded in memory.
This matches a small production-style loop: build image → run on VPS → configure DNS and reverse proxy → verify health before sending traffic to /ask.
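The "verify health before sending traffic" gate can be scripted against GET /health. A minimal sketch of the decision logic, using only the standard library; the JSON field names are illustrative, since the real /health schema is not shown here:

```python
import json

def healthy(payload: bytes) -> bool:
    """Gate traffic: require index and metadata present and retrieval loaded in memory.
    Field names are illustrative; adapt to the real /health response."""
    status = json.loads(payload)
    return all(bool(status.get(k))
               for k in ("index_present", "metadata_present", "retrieval_loaded"))

ok = healthy(b'{"index_present": true, "metadata_present": true, "retrieval_loaded": true}')
bad = healthy(b'{"index_present": true, "metadata_present": false, "retrieval_loaded": true}')
print(ok, bad)  # True False
```

In practice the payload would come from an HTTP GET to the /health endpoint; a deploy script would loop on this check until it passes or a timeout expires.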
Building the index & artifacts (local or CI) is documented in the repo: see README.md and docs/PUBLISH.md for extraction → chunking → FAISS indexing and Docker notes.
Live demo
The demo calls POST https://rag.vahdetkaratas.com/ask from your browser and shows the generated answer alongside the retrieved evidence (the passages the model was actually shown). If the request is blocked (e.g. CORS or an API key on the server), use Swagger UI or curl instead.
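The curl fallback can equally be scripted from Python's standard library. This sketch only builds the request object (call urllib.request.urlopen on it to actually send); the example question is hypothetical:

```python
import json
from urllib.request import Request

API = "https://rag.vahdetkaratas.com/ask"

def build_ask(question: str) -> Request:
    """Build the POST /ask request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(API, data=body, headers={"Content-Type": "application/json"})

req = build_ask("What is the response-time SLA?")
print(req.get_method(), req.full_url)  # POST https://rag.vahdetkaratas.com/ask
```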
Why this project
The repo demonstrates a full retrieval + generation loop with a real HTTP API and deployed endpoint—not only notebooks or local scripts. Design choices (separate indexing from inference, explicit sources, health checks) reflect how similar systems are operated in practice.