Kit includes a complete RAG (Retrieval-Augmented Generation) system that answers questions using a knowledge base instead of the LLM's training data. The system uses OpenAI embeddings and pgvector similarity search to find relevant knowledge chunks, then passes only those chunks as context to the LLM — reducing token usage by approximately 95% compared to sending the full knowledge base.
This page covers the RAG pipeline, vector search, query rewriting, API routes, and knowledge base setup. For the streaming protocol and hooks, see Chat System.
Without RAG, a typical knowledge base might use ~90K tokens per request (sending the entire FAQ). With RAG, only the 3-5 most relevant chunks are sent — typically 3-5K tokens. This is the difference between $0.072 and $0.003 per request with Claude Haiku 4.5.
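The savings can be sanity-checked with back-of-the-envelope arithmetic. The per-token rate below (~$0.80 per million input tokens) is back-derived from the dollar figures above for illustration, not quoted from a price list:

```typescript
// Rough input-token cost comparison for a single request.
// ASSUMPTION: ~$0.80 per 1M input tokens, implied by the $0.072 / $0.003
// figures above — check current provider pricing before relying on this.
const COST_PER_INPUT_TOKEN = 0.8 / 1_000_000

function requestCost(contextTokens: number): number {
  return contextTokens * COST_PER_INPUT_TOKEN
}

const fullFaqCost = requestCost(90_000) // entire FAQ sent as context
const ragCost = requestCost(4_000)      // only the 3-5 most relevant chunks

console.log(fullFaqCost.toFixed(4), ragCost.toFixed(4)) // 0.0720 0.0032
```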
RAG Pipeline
Every RAG question flows through a six-step pipeline:
User asks: "How do I configure authentication?"
|
v
Step 1: Query Enhancement
|--- If conversation history exists:
| LLM rewrites vague follow-up → self-contained query
| Example: "What else?" → "What other Clerk auth features are available?"
|--- If standalone question: use as-is
|
v
Step 2: Embedding Generation
|--- OpenAI text-embedding-3-small
|--- Query → 1536-dimensional vector
|
v
Step 3: pgvector Similarity Search
|--- Cosine similarity: 1 - (embedding <=> query_vector)
|--- Filter by similarity threshold (default: 0.35)
|--- Optional category filter
|--- Returns top 5 chunks (configurable)
|
v
Step 4: Context Assembly
|--- Format chunks: [Source 1: Category]\nQ: ...\nA: ...
|--- Estimated ~3-5K tokens (vs ~90K for full FAQ)
|
v
Step 5: LLM Generation
|--- System prompt (tier-customized)
|--- Context from relevant chunks
|--- Conversation history
|--- Answer with source references
|
v
Step 6: Response with Attribution
|--- answer: "To configure authentication, ..."
|--- sources: [{ question, category, similarity }]
|--- chunksUsed: 3
|--- tokensEstimated: 4200
|--- usage: { promptTokens, completionTokens }
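The six steps above can be sketched as a single orchestration function. This is a simplified sketch with injected, stubbed dependencies — not the actual `RAGService` implementation in `src/lib/ai/rag-service.ts`:

```typescript
interface Chunk { question: string; answer: string; category: string; similarity: number }
interface Message { role: 'user' | 'assistant'; content: string }

// Dependencies are injected so each pipeline step is swappable and testable.
interface Deps {
  rewriteQuery: (q: string, history: Message[]) => Promise<string>
  embed: (text: string) => Promise<number[]>
  search: (vector: number[], limit: number, threshold: number) => Promise<Chunk[]>
  generate: (system: string, context: string, history: Message[], q: string) => Promise<string>
}

async function answerWithRAG(question: string, history: Message[], deps: Deps) {
  // Step 1: rewrite vague follow-ups into self-contained queries
  const query = history.length > 0 ? await deps.rewriteQuery(question, history) : question
  // Step 2: embed the (possibly rewritten) query
  const vector = await deps.embed(query)
  // Step 3: pgvector similarity search with the documented defaults
  const chunks = await deps.search(vector, 5, 0.35)
  // Step 4: assemble compact context in the [Source N: Category] format
  const context = chunks
    .map((c, i) => `[Source ${i + 1}: ${c.category}]\nQ: ${c.question}\nA: ${c.answer}`)
    .join('\n\n')
  // Step 5: generate the answer from context + history
  const answer = await deps.generate('Answer using only the provided context.', context, history, question)
  // Step 6: return the answer with source attribution
  return {
    answer,
    sources: chunks.map(({ question, category, similarity }) => ({ question, category, similarity })),
    chunksUsed: chunks.length,
    tokensEstimated: Math.ceil(context.length / 4), // rough 4-chars-per-token heuristic
  }
}
```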
RAG Service
The RAGService class orchestrates the entire pipeline. The core method is answerQuestionWithRAG:

src/lib/ai/rag-service.ts — answerQuestionWithRAG
async answerQuestionWithRAG(
question: string,
options: AnswerWithRAGOptions = {}
): Promise<RAGAnswer> {
const {
userId: _userId,
userTier,
categories,
limit = 5,
similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
conversationHistory = [],
model: _model,
temperature: _temperature,
} = options
The service accepts these options:
| Option | Type | Default | Description |
|---|---|---|---|
| userId | string | — | User ID for tier customization |
| userTier | 'free' \| 'pro' \| 'enterprise' | — | Tier for system prompt customization |
| categories | string[] | All | Filter search to specific categories |
| limit | number | 5 | Number of chunks to retrieve |
| similarityThreshold | number | 0.35 | Minimum similarity score (0-1) |
| conversationHistory | Message[] | [] | Previous messages for context |
| model | string | Provider default | Override the AI model |
| temperature | number | 0.7 | Response creativity |
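The documented defaults can be applied with a small normalization helper. This is a sketch mirroring the destructuring excerpt above; the option names match the table, but the helper itself is illustrative:

```typescript
interface AnswerWithRAGOptions {
  userId?: string
  userTier?: 'free' | 'pro' | 'enterprise'
  categories?: string[]
  limit?: number
  similarityThreshold?: number
  conversationHistory?: { role: 'user' | 'assistant'; content: string }[]
  model?: string
  temperature?: number
}

// Fill in the documented defaults so downstream code never checks undefined.
// Caller-supplied values win because the spread comes last.
function withDefaults(options: AnswerWithRAGOptions = {}) {
  return {
    limit: 5,
    similarityThreshold: 0.35,
    conversationHistory: [],
    temperature: 0.7,
    ...options,
  }
}
```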
The response includes full source attribution:
typescript
interface RAGAnswer {
answer: string // Generated answer text
sources: Array<{
question: string // Original FAQ question
category: string // Knowledge base category
similarity: number // Cosine similarity score (0-1)
}>
chunksUsed: number // Number of chunks used
tokensEstimated: number // Estimated context tokens
usage?: { // Actual token usage (if available)
promptTokens: number
completionTokens: number
totalTokens: number
}
model?: string // Model used for generation
}
Vector Search
The searchRAG function performs semantic similarity search using OpenAI embeddings and pgvector:

src/lib/ai/rag-search.ts — Similarity Search
export async function searchRAG(
query: string,
options: SearchOptions = {}
): Promise<SearchResult[]> {
const {
limit = 5,
similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
categories,
includeMetadata = true,
} = options
// 1. Generate embedding for user query
const apiKey = getEmbeddingApiKey()
const embeddingModelId = getEmbeddingModel()
const openai = createOpenAI({ apiKey })
const { embedding: queryEmbedding } = await embed({
model: openai.embedding(embeddingModelId),
value: query,
})
// 2. Format embedding as PostgreSQL array string
// Prisma doesn't automatically convert arrays to pgvector format
const embeddingStr = `[${queryEmbedding.join(',')}]`
// 3. Build SQL query with pgvector similarity
// Cosine similarity: 1 - (embedding <=> query_embedding)
// Higher score = more similar (1.0 = identical, 0.0 = unrelated)
// Build category filter SQL
const categoryFilterSQL =
categories && categories.length > 0
? `AND category = ANY(ARRAY[${categories.map((c) => `'${c}'`).join(',')}])`
: ''
const metadataSQL = includeMetadata ? 'metadata,' : ''
// Use $queryRawUnsafe to avoid template literal interpolation limits
const sql = `
SELECT
id,
content,
question,
answer,
category,
${metadataSQL}
1 - (embedding <=> '${embeddingStr}'::vector) as similarity
FROM faq_chunks
WHERE 1 - (embedding <=> '${embeddingStr}'::vector) > ${similarityThreshold}
${categoryFilterSQL}
ORDER BY embedding <=> '${embeddingStr}'::vector
LIMIT ${limit}
`
const results = await prisma.$queryRawUnsafe<SearchResult[]>(sql)
return results
}
The search process:
- Generate embedding — convert the query to a 1536-dimensional vector using text-embedding-3-small
- Format for pgvector — convert the float array to PostgreSQL vector format
- Cosine similarity query — 1 - (embedding <=> query_vector) gives a similarity score from 0 to 1
- Filter and sort — apply the similarity threshold and optional category filter, then return the top N results
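In pgvector, `<=>` computes cosine distance; the similarity score the SQL produces can be reproduced in plain TypeScript for intuition (a self-contained sketch, not the production search path):

```typescript
// Cosine similarity matches pgvector's `1 - (embedding <=> query_vector)`:
// 1.0 = identical direction, 0.0 = orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

cosineSimilarity([1, 0], [2, 0]) // → 1 (same direction, magnitude ignored)
cosineSimilarity([1, 0], [0, 1]) // → 0 (orthogonal)
```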
RAG search uses OpenAI's embedding model for vector generation, even if you use Anthropic or Google for chat. Ensure OPENAI_API_KEY is set in your environment (or AI_API_KEY as a fallback).

The embedding model is configurable via AI_EMBEDDING_MODEL (default: text-embedding-3-small, 1536 dimensions). Changing the model requires re-generating all embeddings.

Similarity Thresholds
The similarity score ranges from 0 (completely unrelated) to 1 (identical). Kit uses a default threshold of 0.35 to balance recall and precision:
| Range | Interpretation | Action |
|---|---|---|
| 0.90+ | Near duplicate | Exact match found |
| 0.70 - 0.89 | Very relevant | High confidence answer |
| 0.50 - 0.69 | Good semantic match | Reliable answer with context |
| 0.35 - 0.49 | Loosely related | Include but may need qualification |
| Below 0.35 | Not relevant | Excluded (below threshold) |
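The bands in the table can be encoded as a small classifier — an illustrative helper, not part of Kit's API:

```typescript
// Map a cosine similarity score (0-1) to the interpretation band above.
function interpretSimilarity(score: number): string {
  if (score >= 0.9) return 'near duplicate'
  if (score >= 0.7) return 'very relevant'
  if (score >= 0.5) return 'good semantic match'
  if (score >= 0.35) return 'loosely related'
  return 'not relevant (below threshold)'
}
```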
If users get too many irrelevant results, raise similarityThreshold to 0.5. If they get "no results found" too often, lower it to 0.3. The default of 0.35 is tuned for cross-lingual queries (English questions against German FAQ content).

Helper Functions
The search module provides additional utilities:
| Function | Purpose |
|---|---|
| searchRAG(query, options) | Primary search — embeddings + pgvector similarity |
| getRelatedQuestions(chunkId, limit) | Find similar questions in the same category (threshold: 0.8) |
| searchInCategory(category, query, limit) | Search within a specific category (threshold: 0.4) |
| getCategories() | Get all unique category names |
| getRAGStats() | Statistics: total chunks, categories, token counts |
Query Rewriting
When users send follow-up questions like "What else?" or "Tell me more", the RAG search would fail because the query lacks context. The rewriteQueryWithContext method uses the LLM to transform vague follow-ups into self-contained queries:

src/lib/ai/rag-service.ts — Query Rewriting
/**
* Rewrite vague or follow-up questions using conversation context
* This makes RAG searches work better for multi-turn conversations
*
* @param question - Current user question (may be vague or a follow-up)
* @param history - Previous conversation messages
* @returns Enhanced, self-contained question for RAG search
*
* @example
* // Original: "Und was noch?"
* // With context about Clerk auth
* // Enhanced: "What other Clerk authentication features are available in the boilerplate?"
*/
private async rewriteQueryWithContext(
question: string,
history: Message[]
): Promise<string> {
// Take last 3 messages for context (avoid token bloat)
const recentHistory = history.slice(-3)
// Build prompt for query rewriting
const contextPrompt = `You are a query rewriting assistant for a technical FAQ system about a Next.js SaaS Boilerplate.
Rewrite the user's follow-up question to be self-contained and specific for vector database search.
Conversation History:
${recentHistory.map((m) => `${m.role}: ${m.content}`).join('\n')}
Current Question: ${question}
Instructions:
1. If the question is already clear and specific, return it unchanged
2. If it's a follow-up (e.g., "Und was noch?", "What about X?"), incorporate context from history
3. Make it technical and specific to the Next.js boilerplate
4. Keep it concise (1-2 sentences max)
5. Output ONLY the rewritten question, nothing else
Rewritten Question:`
try {
const response = await this.aiService.answerQuestion(question, {
context: contextPrompt,
systemPrompt:
'You are a query rewriting assistant. Output ONLY the rewritten question, nothing else.',
stream: false,
})
const rewrittenQuery =
response?.choices[0]?.message?.content?.trim() || question
// Log for debugging (helps tune the system)
console.log('[RAG Service] Query Rewriting:', {
original: question,
enhanced: rewrittenQuery,
historyLength: recentHistory.length,
changed: rewrittenQuery !== question,
})
return rewrittenQuery
} catch (error) {
// If rewriting fails, fall back to original question
console.warn(
'[RAG Service] Query rewriting failed, using original:',
error
)
return question
}
}
Example transformations:
| Original Question | Conversation Topic | Rewritten Query |
|---|---|---|
| "What else?" | Clerk authentication | "What other Clerk authentication features are available?" |
| "And how about that?" | Database migrations | "How do Prisma database migrations work in the boilerplate?" |
| "More details please" | Payment webhooks | "What are the details of payment webhook processing?" |
The rewriting:
- Takes the last 3 messages for context (avoids token bloat)
- Returns the original question unchanged if it is already specific
- Falls back to the original question if the LLM call fails
- Logs transformations for debugging ([RAG Service] Query Rewriting: { original, enhanced })
No-Match Fallback
When no knowledge chunks match the similarity threshold, the RAG service does not return an empty response. Instead, it generates a helpful guided response based on the boilerplate's tech stack:
No chunks found for: "How do I deploy to AWS?"
|
v
Fallback prompt injected:
|--- Acknowledge limitation: "I don't have specific FAQ content..."
|--- Provide boilerplate guidance based on tech stack
|--- Reference relevant files/folders
|--- Suggest rephrasing: "Could you ask more specifically?"
|
v
LLM generates helpful response WITHOUT hallucinating
The fallback prompt explicitly instructs the LLM to:
- Never give generic SaaS business advice
- Never make up features or pricing tiers
- Never suggest "consulting the provider" (Kit IS the provider)
- Reference actual file paths from the boilerplate
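The fallback behavior can be sketched as a prompt builder whose constraints mirror the rules above. This is a hypothetical helper — the exact prompt wording in rag-service.ts differs:

```typescript
// Build a guided fallback prompt when no chunks clear the threshold.
// Each line encodes one of the fallback rules listed above.
function buildFallbackPrompt(question: string): string {
  return [
    `No FAQ content matched the question: "${question}".`,
    'Acknowledge the limitation: you have no specific FAQ content for this topic.',
    'Provide guidance grounded in the boilerplate tech stack instead.',
    'Reference actual file paths from the boilerplate where relevant.',
    'Never give generic SaaS business advice.',
    'Never make up features or pricing tiers.',
    'Never suggest consulting the provider — Kit IS the provider.',
    'Invite the user to rephrase the question more specifically.',
  ].join('\n')
}
```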
API Routes
POST /api/ai/rag/ask
Authenticated RAG endpoint for dashboard users. Supports both streaming and non-streaming responses.
The RAG endpoint accepts two request formats for compatibility:

Object format: { question: "How do I...?", conversationHistory: [...] }

Messages format: { messages: [{ role: "user", content: "How do I...?" }] }

Both are normalized internally before processing.
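Normalizing the two body shapes can be sketched as follows. This is illustrative — the route's actual helper may differ:

```typescript
interface ChatMessage { role: 'user' | 'assistant'; content: string }

type RagRequest =
  | { question: string; conversationHistory?: ChatMessage[] }
  | { messages: ChatMessage[] }

// Extract (question, history) from either accepted body shape.
function normalizeRequest(body: RagRequest): { question: string; history: ChatMessage[] } {
  if ('question' in body) {
    return { question: body.question, history: body.conversationHistory ?? [] }
  }
  // Messages format: the last user message is the question, the rest is history.
  const last = body.messages[body.messages.length - 1]
  if (!last || last.role !== 'user') throw new Error('messages must end with a user message')
  return { question: last.content, history: body.messages.slice(0, -1) }
}
```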
Request:
json
{
"question": "How do I configure Clerk authentication?",
"conversationHistory": [
{ "role": "user", "content": "Tell me about auth" },
{ "role": "assistant", "content": "Kit uses Clerk for..." }
],
"categories": ["Authentication"],
"stream": true
}
Response (non-streaming):
json
{
"answer": "To configure Clerk authentication...",
"sources": [
{
"question": "How is Clerk integrated?",
"category": "Authentication",
"similarity": 0.87
}
],
"chunksUsed": 3,
"tokensEstimated": 4200
}
Conversation Management
Kit stores RAG conversation history for multi-turn chat:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/ai/rag/conversations | GET | List user's conversations |
| /api/ai/rag/conversations | POST | Create new conversation |
| /api/ai/rag/conversations/[id] | GET | Get conversation with messages |
| /api/ai/rag/conversations/[id] | DELETE | Delete a conversation |
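A minimal typed mapping from operation to HTTP request might look like this. The paths match the table above; the helper itself is hypothetical, ready to hand to fetch():

```typescript
type ConversationOp =
  | { kind: 'list' }
  | { kind: 'create' }
  | { kind: 'get'; id: string }
  | { kind: 'delete'; id: string }

// Map a conversation operation to method + path per the endpoint table.
function conversationRequest(op: ConversationOp): { method: string; path: string } {
  const base = '/api/ai/rag/conversations'
  switch (op.kind) {
    case 'list':
      return { method: 'GET', path: base }
    case 'create':
      return { method: 'POST', path: base }
    case 'get':
      return { method: 'GET', path: `${base}/${op.id}` }
    case 'delete':
      return { method: 'DELETE', path: `${base}/${op.id}` }
  }
}
```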
Knowledge Base Setup
The RAG system uses a faq_chunks table in PostgreSQL with pgvector for vector storage:

Database Schema
sql
CREATE TABLE faq_chunks (
id TEXT PRIMARY KEY DEFAULT gen_random_uuid(),
question TEXT NOT NULL,
answer TEXT NOT NULL,
content TEXT NOT NULL, -- Combined searchable content
category TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small
metadata JSONB, -- Tags, related topics, keywords
token_count INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- pgvector index for fast similarity search
CREATE INDEX idx_faq_chunks_embedding
ON faq_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
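When inserting rows, the embedding array must be serialized to pgvector's text literal, just as the search code does for queries (a sketch of the same formatting step):

```typescript
// pgvector accepts vectors as '[v1,v2,...]' text literals. Prisma does not
// convert number[] automatically, so format explicitly and cast with
// ::vector in SQL, e.g. INSERT ... VALUES (..., '<literal>'::vector).
function toPgvectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`
}
```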
Populating the Knowledge Base
Kit includes a seed script that generates embeddings and populates the faq_chunks table:
bash
# Generate embeddings and seed the knowledge base
cd apps/boilerplate && npx prisma db seed
# Or run the FAQ seeder directly
cd apps/boilerplate && npx tsx prisma/seed-faq.ts
The seed process:
- Reads FAQ content from apps/boilerplate/src/content/faq/ markdown files
- Splits content into chunks (question + answer pairs)
- Generates embeddings via OpenAI text-embedding-3-small
- Inserts chunks with embeddings into faq_chunks
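The chunking step can be sketched as a parser for simple Q/A markdown. The `## heading = question, body = answer` format is an assumption for illustration — the real seeder may parse its source files differently:

```typescript
interface FaqChunk { question: string; answer: string; content: string }

// Split markdown where each "## Heading" starts a question and the body
// until the next heading is its answer. `content` is the combined
// searchable text that gets embedded.
function splitFaqMarkdown(markdown: string): FaqChunk[] {
  const chunks: FaqChunk[] = []
  const sections = markdown.split(/^## /m).slice(1) // drop preamble before the first heading
  for (const section of sections) {
    const [firstLine, ...rest] = section.split('\n')
    const question = firstLine.trim()
    const answer = rest.join('\n').trim()
    if (question && answer) {
      chunks.push({ question, answer, content: `Q: ${question}\nA: ${answer}` })
    }
  }
  return chunks
}
```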
Adding Custom Content
To add your own knowledge base content:
- Create markdown files in apps/boilerplate/src/content/faq/ with category-based organization
- Give each FAQ entry a question and an answer section
- Run the seed script to generate embeddings: cd apps/boilerplate && npx tsx prisma/seed-faq.ts
- Verify with the stats endpoint: GET /api/ai/rag/stats
Key Files
| File | Purpose |
|---|---|
| apps/boilerplate/src/lib/ai/rag-service.ts | RAG pipeline orchestrator — query rewriting, search, generation |
| apps/boilerplate/src/lib/ai/rag-search.ts | pgvector similarity search, related questions, statistics |
| apps/boilerplate/src/lib/ai/ai-service.ts | AI service wrapper used by RAG for LLM calls |
| apps/boilerplate/src/app/api/ai/rag/ask/route.ts | Authenticated RAG endpoint |
| apps/boilerplate/src/app/api/ai/rag/conversations/ | Conversation CRUD endpoints |
| apps/boilerplate/src/lib/ai/sse-parser.ts | Shared SSE parser with SSEStreamError and SSELineBuffer |
| apps/boilerplate/src/hooks/use-rag-chat.ts | useRAGChat hook for RAG streaming chat |
| apps/boilerplate/src/content/faq/ | Knowledge base source content |
| apps/boilerplate/prisma/seed-faq.ts | Embedding generation and database seeding |