Kit includes a complete RAG (Retrieval-Augmented Generation) system that answers questions using a knowledge base instead of the LLM's training data. The system uses OpenAI embeddings and pgvector similarity search to find relevant knowledge chunks, then passes only those chunks as context to the LLM — reducing token usage by approximately 95% compared to sending the full knowledge base.
This page covers the RAG pipeline, vector search, query rewriting, API routes, and knowledge base setup. For the streaming protocol and hooks, see Chat System.
Without RAG, a typical knowledge base might use ~90K tokens per request (sending the entire FAQ). With RAG, only the 3-5 most relevant chunks are sent — typically 3-5K tokens. This is the difference between $0.072 and $0.003 per request with Claude Haiku 4.5.
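The savings can be sanity-checked with back-of-the-envelope arithmetic. The per-token rate below (~$0.80 per million input tokens) is back-derived from the dollar figures above for illustration, not quoted from a price list:

```typescript
// Rough input-token cost comparison for a single request.
// ASSUMPTION: ~$0.80 per 1M input tokens, implied by the $0.072 / $0.003
// figures above — check current provider pricing before relying on this.
const COST_PER_INPUT_TOKEN = 0.8 / 1_000_000

function requestCost(contextTokens: number): number {
  return contextTokens * COST_PER_INPUT_TOKEN
}

const fullFaqCost = requestCost(90_000) // entire FAQ sent as context
const ragCost = requestCost(4_000)      // only the 3-5 most relevant chunks

console.log(fullFaqCost.toFixed(4), ragCost.toFixed(4)) // 0.0720 0.0032
```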
RAG Pipeline
Every RAG question flows through a six-step pipeline:
User asks: "How do I configure authentication?"
|
v
Step 1: Query Enhancement
|--- If conversation history exists:
| LLM rewrites vague follow-up → self-contained query
| Example: "What else?" → "What other Clerk auth features are available?"
|--- If standalone question: use as-is
|
v
Step 2: Embedding Generation
|--- OpenAI text-embedding-3-small
|--- Query → 1536-dimensional vector
|
v
Step 3: pgvector Similarity Search
|--- Cosine similarity: 1 - (embedding <=> query_vector)
|--- Filter by similarity threshold (default: 0.35)
|--- Optional category filter
|--- Returns top 5 chunks (configurable)
|
v
Step 4: Context Assembly
|--- Format chunks: [Source 1: Category]\nQ: ...\nA: ...
|--- Estimated ~3-5K tokens (vs ~90K for full FAQ)
|
v
Step 5: LLM Generation
|--- System prompt (tier-customized)
|--- Context from relevant chunks
|--- Conversation history
|--- Answer with source references
|
v
Step 6: Response with Attribution
|--- answer: "To configure authentication, ..."
|--- sources: [{ question, category, similarity }]
|--- chunksUsed: 3
|--- tokensEstimated: 4200
|--- usage: { promptTokens, completionTokens }
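The six steps above can be sketched as a single orchestration function. This is a simplified sketch with injected, stubbed dependencies — not the actual `RAGService` implementation in `src/lib/ai/rag-service.ts`:

```typescript
interface Chunk { question: string; answer: string; category: string; similarity: number }
interface Message { role: 'user' | 'assistant'; content: string }

// Dependencies are injected so each pipeline step is swappable and testable.
interface Deps {
  rewriteQuery: (q: string, history: Message[]) => Promise<string>
  embed: (text: string) => Promise<number[]>
  search: (vector: number[], limit: number, threshold: number) => Promise<Chunk[]>
  generate: (system: string, context: string, history: Message[], q: string) => Promise<string>
}

async function answerWithRAG(question: string, history: Message[], deps: Deps) {
  // Step 1: rewrite vague follow-ups into self-contained queries
  const query = history.length > 0 ? await deps.rewriteQuery(question, history) : question
  // Step 2: embed the (possibly rewritten) query
  const vector = await deps.embed(query)
  // Step 3: pgvector similarity search with the documented defaults
  const chunks = await deps.search(vector, 5, 0.35)
  // Step 4: assemble compact context in the [Source N: Category] format
  const context = chunks
    .map((c, i) => `[Source ${i + 1}: ${c.category}]\nQ: ${c.question}\nA: ${c.answer}`)
    .join('\n\n')
  // Step 5: generate the answer from context + history
  const answer = await deps.generate('Answer using only the provided context.', context, history, question)
  // Step 6: return the answer with source attribution
  return {
    answer,
    sources: chunks.map(({ question, category, similarity }) => ({ question, category, similarity })),
    chunksUsed: chunks.length,
    tokensEstimated: Math.ceil(context.length / 4), // rough 4-chars-per-token heuristic
  }
}
```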
RAG Service
The RAGService class orchestrates the entire pipeline. The core method is answerQuestionWithRAG:

src/lib/ai/rag-service.ts — answerQuestionWithRAG
async answerQuestionWithRAG(
question: string,
options: AnswerWithRAGOptions = {}
): Promise<RAGAnswer> {
const {
userId: _userId,
userTier,
categories,
limit = 5,
similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
conversationHistory = [],
model: _model,
temperature: _temperature,
} = options
The service accepts these options:
| Option | Type | Default | Description |
|---|---|---|---|
| userId | string | — | User ID for tier customization |
| userTier | 'free' \| 'pro' \| 'enterprise' | — | Tier for system prompt customization |
| categories | string[] | All | Filter search to specific categories |
| limit | number | 5 | Number of chunks to retrieve |
| similarityThreshold | number | 0.35 | Minimum similarity score (0-1) |
| conversationHistory | Message[] | [] | Previous messages for context |
| model | string | Provider default | Override the AI model |
| temperature | number | 0.7 | Response creativity |
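The documented defaults can be applied with a small normalization helper. This is a sketch mirroring the destructuring excerpt above; the option names match the table, but the helper itself is illustrative:

```typescript
interface AnswerWithRAGOptions {
  userId?: string
  userTier?: 'free' | 'pro' | 'enterprise'
  categories?: string[]
  limit?: number
  similarityThreshold?: number
  conversationHistory?: { role: 'user' | 'assistant'; content: string }[]
  model?: string
  temperature?: number
}

// Fill in the documented defaults so downstream code never checks undefined.
// Caller-supplied values win because the spread comes last.
function withDefaults(options: AnswerWithRAGOptions = {}) {
  return {
    limit: 5,
    similarityThreshold: 0.35,
    conversationHistory: [],
    temperature: 0.7,
    ...options,
  }
}
```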
The response includes full source attribution:
typescript
interface RAGAnswer {
answer: string // Generated answer text
sources: Array<{
question: string // Original FAQ question
category: string // Knowledge base category
similarity: number // Cosine similarity score (0-1)
}>
chunksUsed: number // Number of chunks used
tokensEstimated: number // Estimated context tokens
usage?: { // Actual token usage (if available)
promptTokens: number
completionTokens: number
totalTokens: number
}
model?: string // Model used for generation
}
Vector Search
The searchRAG function performs semantic similarity search using OpenAI embeddings and pgvector:

src/lib/ai/rag-search.ts — Similarity Search
export async function searchRAG(
query: string,
options: SearchOptions = {}
): Promise<SearchResult[]> {
const {
limit = 5,
similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
categories,
includeMetadata = true,
} = options
// 1. Generate embedding for user query
const apiKey = getEmbeddingApiKey()
const embeddingModelId = getEmbeddingModel()
const openai = createOpenAI({ apiKey })
const { embedding: queryEmbedding } = await embed({
model: openai.embedding(embeddingModelId),
value: query,
})
// 2. Format embedding as PostgreSQL array string
// Prisma doesn't automatically convert arrays to pgvector format
const embeddingStr = `[${queryEmbedding.join(',')}]`
// 3. Build SQL query with pgvector similarity
// Cosine similarity: 1 - (embedding <=> query_embedding)
// Higher score = more similar (1.0 = identical, 0.0 = unrelated)
// Build category filter SQL
const categoryFilterSQL =
categories && categories.length > 0
? `AND category = ANY(ARRAY[${categories.map((c) => `'${c}'`).join(',')}])`
: ''
const metadataSQL = includeMetadata ? 'metadata,' : ''
// Use $queryRawUnsafe to avoid template literal interpolation limits
const sql = `
SELECT
id,
content,
question,
answer,
category,
${metadataSQL}
1 - (embedding <=> '${embeddingStr}'::vector) as similarity
FROM faq_chunks
WHERE 1 - (embedding <=> '${embeddingStr}'::vector) > ${similarityThreshold}
${categoryFilterSQL}
ORDER BY embedding <=> '${embeddingStr}'::vector
LIMIT ${limit}
`
const results = await prisma.$queryRawUnsafe<SearchResult[]>(sql)
return results
}
The search process:
- Generate embedding — convert the query to a 1536-dimensional vector using text-embedding-3-small
- Format for pgvector — convert the float array to PostgreSQL vector format
- Cosine similarity query — 1 - (embedding <=> query_vector) gives a similarity score from 0 to 1
- Filter and sort — apply the similarity threshold and optional category filter, then return the top N results
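In pgvector, `<=>` computes cosine distance; the similarity score the SQL produces can be reproduced in plain TypeScript for intuition (a self-contained sketch, not the production search path):

```typescript
// Cosine similarity matches pgvector's `1 - (embedding <=> query_vector)`:
// 1.0 = identical direction, 0.0 = orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

cosineSimilarity([1, 0], [2, 0]) // → 1 (same direction, magnitude ignored)
cosineSimilarity([1, 0], [0, 1]) // → 0 (orthogonal)
```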
RAG search uses OpenAI's embedding model for vector generation, even if you use Anthropic or Google for chat. Ensure OPENAI_API_KEY is set in your environment (or AI_API_KEY as a fallback).

The embedding model is configurable via AI_EMBEDDING_MODEL (default: text-embedding-3-small, 1536 dimensions). Changing the model requires re-generating all embeddings.

Similarity Thresholds
The similarity score ranges from 0 (completely unrelated) to 1 (identical). Kit uses a default threshold of 0.35 to balance recall and precision:
| Range | Interpretation | Action |
|---|---|---|
| 0.90+ | Near duplicate | Exact match found |
| 0.70 - 0.89 | Very relevant | High confidence answer |
| 0.50 - 0.69 | Good semantic match | Reliable answer with context |
| 0.35 - 0.49 | Loosely related | Include but may need qualification |
| Below 0.35 | Not relevant | Excluded (below threshold) |
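The bands in the table can be encoded as a small classifier — an illustrative helper, not part of Kit's API:

```typescript
// Map a cosine similarity score (0-1) to the interpretation band above.
function interpretSimilarity(score: number): string {
  if (score >= 0.9) return 'near duplicate'
  if (score >= 0.7) return 'very relevant'
  if (score >= 0.5) return 'good semantic match'
  if (score >= 0.35) return 'loosely related'
  return 'not relevant (below threshold)'
}
```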
If users get too many irrelevant results, raise similarityThreshold to 0.5. If they get "no results found" too often, lower it to 0.3. The default of 0.35 is tuned for cross-lingual queries (English questions against German FAQ content).

Helper Functions
The search module provides additional utilities:
| Function | Purpose |
|---|---|
| searchRAG(query, options) | Primary search — embeddings + pgvector similarity |
| getRelatedQuestions(chunkId, limit) | Find similar questions in the same category (threshold: 0.8) |
| searchInCategory(category, query, limit) | Search within a specific category (threshold: 0.4) |
| getCategories() | Get all unique category names |
| getRAGStats() | Statistics: total chunks, categories, token counts |
Query Rewriting
When users send follow-up questions like "What else?" or "Tell me more", the RAG search would fail because the query lacks context. The rewriteQueryWithContext method uses the LLM to transform vague follow-ups into self-contained queries:

src/lib/ai/rag-service.ts — Query Rewriting
/**
* Rewrite vague or follow-up questions using conversation context
* This makes RAG searches work better for multi-turn conversations
*
* @param question - Current user question (may be vague or a follow-up)
* @param history - Previous conversation messages
* @returns Enhanced, self-contained question for RAG search
*
* @example
* // Original: "Und was noch?"
* // With context about Clerk auth
* // Enhanced: "What other Clerk authentication features are available in the boilerplate?"
*/
private async rewriteQueryWithContext(
question: string,
history: Message[]
): Promise<string> {
// Take last 3 messages for context (avoid token bloat)
const recentHistory = history.slice(-3)
// Build prompt for query rewriting
const contextPrompt = `You are a query rewriting assistant for a technical FAQ system about a Next.js SaaS Boilerplate.
Rewrite the user's follow-up question to be self-contained and specific for vector database search.
Conversation History:
${recentHistory.map((m) => `${m.role}: ${m.content}`).join('\n')}
Current Question: ${question}
Instructions:
1. If the question is already clear and specific, return it unchanged
2. If it's a follow-up (e.g., "Und was noch?", "What about X?"), incorporate context from history
3. Make it technical and specific to the Next.js boilerplate
4. Keep it concise (1-2 sentences max)
5. Output ONLY the rewritten question, nothing else
Rewritten Question:`
try {
const response = await this.aiService.answerQuestion(question, {
context: contextPrompt,
systemPrompt:
'You are a query rewriting assistant. Output ONLY the rewritten question, nothing else.',
stream: false,
})
const rewrittenQuery =
response?.choices[0]?.message?.content?.trim() || question
// Log for debugging (helps tune the system)
console.log('[RAG Service] Query Rewriting:', {
original: question,
enhanced: rewrittenQuery,
historyLength: recentHistory.length,
changed: rewrittenQuery !== question,
})
return rewrittenQuery
} catch (error) {
// If rewriting fails, fall back to original question
console.warn(
'[RAG Service] Query rewriting failed, using original:',
error
)
return question
}
}
Example transformations:
| Original Question | Conversation Topic | Rewritten Query |
|---|---|---|
| "What else?" | Clerk authentication | "What other Clerk authentication features are available?" |
| "And how about that?" | Database migrations | "How do Prisma database migrations work in the boilerplate?" |
| "More details please" | Payment webhooks | "What are the details of payment webhook processing?" |
The rewriting:
- Takes the last 3 messages for context (avoids token bloat)
- Returns the original question unchanged if it is already specific
- Falls back to the original question if the LLM call fails
- Logs transformations for debugging ([RAG Service] Query Rewriting: { original, enhanced })
No-Match Fallback
When no knowledge chunks match the similarity threshold, the RAG service does not return an empty response. Instead, it generates a helpful guided response based on the boilerplate's tech stack:
No chunks found for: "How do I deploy to AWS?"
|
v
Fallback prompt injected:
|--- Acknowledge limitation: "I don't have specific FAQ content..."
|--- Provide boilerplate guidance based on tech stack
|--- Reference relevant files/folders
|--- Suggest rephrasing: "Could you ask more specifically?"
|
v
LLM generates helpful response WITHOUT hallucinating
The fallback prompt explicitly instructs the LLM to:
- Never give generic SaaS business advice
- Never make up features or pricing tiers
- Never suggest "consulting the provider" (Kit IS the provider)
- Reference actual file paths from the boilerplate
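The fallback behavior can be sketched as a prompt builder whose constraints mirror the rules above. This is a hypothetical helper — the exact prompt wording in rag-service.ts differs:

```typescript
// Build a guided fallback prompt when no chunks clear the threshold.
// Each line encodes one of the fallback rules listed above.
function buildFallbackPrompt(question: string): string {
  return [
    `No FAQ content matched the question: "${question}".`,
    'Acknowledge the limitation: you have no specific FAQ content for this topic.',
    'Provide guidance grounded in the boilerplate tech stack instead.',
    'Reference actual file paths from the boilerplate where relevant.',
    'Never give generic SaaS business advice.',
    'Never make up features or pricing tiers.',
    'Never suggest consulting the provider — Kit IS the provider.',
    'Invite the user to rephrase the question more specifically.',
  ].join('\n')
}
```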
API Routes
POST /api/ai/rag/ask
Authenticated RAG endpoint for dashboard users. Supports both streaming and non-streaming responses.
The RAG endpoint accepts two request formats for compatibility:

Object format: { question: "How do I...?", conversationHistory: [...] }

Messages format: { messages: [{ role: "user", content: "How do I...?" }] }

Both are normalized internally before processing.
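Normalizing the two body shapes can be sketched as follows. This is illustrative — the route's actual helper may differ:

```typescript
interface ChatMessage { role: 'user' | 'assistant'; content: string }

type RagRequest =
  | { question: string; conversationHistory?: ChatMessage[] }
  | { messages: ChatMessage[] }

// Extract (question, history) from either accepted body shape.
function normalizeRequest(body: RagRequest): { question: string; history: ChatMessage[] } {
  if ('question' in body) {
    return { question: body.question, history: body.conversationHistory ?? [] }
  }
  // Messages format: the last user message is the question, the rest is history.
  const last = body.messages[body.messages.length - 1]
  if (!last || last.role !== 'user') throw new Error('messages must end with a user message')
  return { question: last.content, history: body.messages.slice(0, -1) }
}
```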
Request:
json
{
"question": "How do I configure Clerk authentication?",
"conversationHistory": [
{ "role": "user", "content": "Tell me about auth" },
{ "role": "assistant", "content": "Kit uses Clerk for..." }
],
"categories": ["Authentication"],
"stream": true
}
Response (non-streaming):
json
{
"answer": "To configure Clerk authentication...",
"sources": [
{
"question": "How is Clerk integrated?",
"category": "Authentication",
"similarity": 0.87
}
],
"chunksUsed": 3,
"tokensEstimated": 4200
}
Conversation Management
Kit stores RAG conversation history for multi-turn chat:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/ai/rag/conversations | GET | List user's conversations |
| /api/ai/rag/conversations | POST | Create new conversation |
| /api/ai/rag/conversations/[id] | GET | Get conversation with messages |
| /api/ai/rag/conversations/[id] | DELETE | Delete a conversation |
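A minimal typed mapping from operation to HTTP request might look like this. The paths match the table above; the helper itself is hypothetical, ready to hand to fetch():

```typescript
type ConversationOp =
  | { kind: 'list' }
  | { kind: 'create' }
  | { kind: 'get'; id: string }
  | { kind: 'delete'; id: string }

// Map a conversation operation to method + path per the endpoint table.
function conversationRequest(op: ConversationOp): { method: string; path: string } {
  const base = '/api/ai/rag/conversations'
  switch (op.kind) {
    case 'list':
      return { method: 'GET', path: base }
    case 'create':
      return { method: 'POST', path: base }
    case 'get':
      return { method: 'GET', path: `${base}/${op.id}` }
    case 'delete':
      return { method: 'DELETE', path: `${base}/${op.id}` }
  }
}
```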
Knowledge Base Setup
The RAG system uses a faq_chunks table in PostgreSQL with pgvector for vector storage:

Database Schema
sql
CREATE TABLE faq_chunks (
id TEXT PRIMARY KEY DEFAULT gen_random_uuid(),
question TEXT NOT NULL,
answer TEXT NOT NULL,
content TEXT NOT NULL, -- Combined searchable content
category TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small
metadata JSONB, -- Tags, related topics, keywords
token_count INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- pgvector index for fast similarity search
CREATE INDEX idx_faq_chunks_embedding
ON faq_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
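When inserting rows, the embedding array must be serialized to pgvector's text literal, just as the search code does for queries (a sketch of the same formatting step):

```typescript
// pgvector accepts vectors as '[v1,v2,...]' text literals. Prisma does not
// convert number[] automatically, so format explicitly and cast with
// ::vector in SQL, e.g. INSERT ... VALUES (..., '<literal>'::vector).
function toPgvectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`
}
```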
Populating the Knowledge Base
Kit includes a seed script that generates embeddings and populates the faq_chunks table:
bash
# Generate embeddings and seed the knowledge base
cd apps/boilerplate && npx prisma db seed
# Or run the FAQ seeder directly
cd apps/boilerplate && npx tsx prisma/seed-faq.ts
The seed process:
- Reads FAQ content from apps/boilerplate/src/content/faq/ markdown files
- Splits content into chunks (question + answer pairs)
- Generates embeddings via OpenAI text-embedding-3-small
- Inserts chunks with embeddings into faq_chunks
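The chunking step can be sketched as a parser for simple Q/A markdown. The `## heading = question, body = answer` format is an assumption for illustration — the real seeder may parse its source files differently:

```typescript
interface FaqChunk { question: string; answer: string; content: string }

// Split markdown where each "## Heading" starts a question and the body
// until the next heading is its answer. `content` is the combined
// searchable text that gets embedded.
function splitFaqMarkdown(markdown: string): FaqChunk[] {
  const chunks: FaqChunk[] = []
  const sections = markdown.split(/^## /m).slice(1) // drop preamble before the first heading
  for (const section of sections) {
    const [firstLine, ...rest] = section.split('\n')
    const question = firstLine.trim()
    const answer = rest.join('\n').trim()
    if (question && answer) {
      chunks.push({ question, answer, content: `Q: ${question}\nA: ${answer}` })
    }
  }
  return chunks
}
```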
Adding Custom Content
To add your own knowledge base content:
- Create markdown files in apps/boilerplate/src/content/faq/ with category-based organization
- Give each FAQ entry a question and an answer section
- Run the seed script to generate embeddings: cd apps/boilerplate && npx tsx prisma/seed-faq.ts
- Verify with the stats endpoint: GET /api/ai/rag/stats
Key Files
| File | Purpose |
|---|---|
| apps/boilerplate/src/lib/ai/rag-service.ts | RAG pipeline orchestrator — query rewriting, search, generation |
| apps/boilerplate/src/lib/ai/rag-search.ts | pgvector similarity search, related questions, statistics |
| apps/boilerplate/src/lib/ai/ai-service.ts | AI service wrapper used by RAG for LLM calls |
| apps/boilerplate/src/app/api/ai/rag/ask/route.ts | Authenticated RAG endpoint |
| apps/boilerplate/src/app/api/ai/rag/conversations/ | Conversation CRUD endpoints |
| apps/boilerplate/src/lib/ai/sse-parser.ts | Shared SSE parser with SSEStreamError and SSELineBuffer |
| apps/boilerplate/src/hooks/use-rag-chat.ts | useRAGChat hook for RAG streaming chat |
| apps/boilerplate/src/content/faq/ | Knowledge base source content |
| apps/boilerplate/prisma/seed-faq.ts | Embedding generation and database seeding |