# RAG System

Knowledge base search with pgvector similarity, query rewriting, source attribution, and public access

Kit includes a complete RAG (Retrieval-Augmented Generation) system that answers questions using a knowledge base instead of the LLM's training data. The system uses OpenAI embeddings and pgvector similarity search to find relevant knowledge chunks, then passes only those chunks as context to the LLM — reducing token usage by approximately 95% compared to sending the full knowledge base.
This page covers the RAG pipeline, vector search, query rewriting, API routes, and knowledge base setup. For the streaming protocol and hooks, see Chat System.

## RAG Pipeline

Every RAG question flows through a six-step pipeline:
```
User asks: "How do I configure authentication?"
    |
    v
Step 1: Query Enhancement
    |--- If conversation history exists:
    |    LLM rewrites vague follow-up → self-contained query
    |    Example: "What else?" → "What other Clerk auth features are available?"
    |--- If standalone question: use as-is
    |
    v
Step 2: Embedding Generation
    |--- OpenAI text-embedding-3-small
    |--- Query → 1536-dimensional vector
    |
    v
Step 3: pgvector Similarity Search
    |--- Cosine similarity: 1 - (embedding <=> query_vector)
    |--- Filter by similarity threshold (default: 0.35)
    |--- Optional category filter
    |--- Returns top 5 chunks (configurable)
    |
    v
Step 4: Context Assembly
    |--- Format chunks: [Source 1: Category]\nQ: ...\nA: ...
    |--- Estimated ~3-5K tokens (vs ~90K for full FAQ)
    |
    v
Step 5: LLM Generation
    |--- System prompt (tier-customized)
    |--- Context from relevant chunks
    |--- Conversation history
    |--- Answer with source references
    |
    v
Step 6: Response with Attribution
    |--- answer: "To configure authentication, ..."
    |--- sources: [{ question, category, similarity }]
    |--- chunksUsed: 3
    |--- tokensEstimated: 4200
    |--- usage: { promptTokens, completionTokens }
```
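
Step 4 of the pipeline (context assembly) is simple enough to sketch as a pure function. The interface and function below are illustrative only, not Kit's actual implementation; the `[Source N: Category]` format follows the diagram above.

```typescript
// Illustrative sketch of Step 4 (context assembly); not Kit's actual code.
// Formats retrieved chunks using the "[Source N: Category]" convention.
interface RetrievedChunk {
  question: string
  answer: string
  category: string
}

function assembleContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map(
      (c, i) =>
        `[Source ${i + 1}: ${c.category}]\nQ: ${c.question}\nA: ${c.answer}`
    )
    .join('\n\n')
}
```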

## RAG Service

The RAGService class orchestrates the entire pipeline. The core method is answerQuestionWithRAG:
src/lib/ai/rag-service.ts — answerQuestionWithRAG
```typescript
async answerQuestionWithRAG(
    question: string,
    options: AnswerWithRAGOptions = {}
  ): Promise<RAGAnswer> {
    const {
      userId: _userId,
      userTier,
      categories,
      limit = 5,
      similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
      conversationHistory = [],
      model: _model,
      temperature: _temperature,
    } = options
```
The service accepts these options:

| Option | Type | Default | Description |
|---|---|---|---|
| `userId` | `string` | | User ID for tier customization |
| `userTier` | `'free' \| 'pro' \| 'enterprise'` | | Tier for system prompt customization |
| `categories` | `string[]` | All | Filter search to specific categories |
| `limit` | `number` | 5 | Number of chunks to retrieve |
| `similarityThreshold` | `number` | 0.35 | Minimum similarity score (0-1) |
| `conversationHistory` | `Message[]` | `[]` | Previous messages for context |
| `model` | `string` | Provider default | Override AI model |
| `temperature` | `number` | 0.7 | Response creativity |
The response includes full source attribution:

```typescript
interface RAGAnswer {
  answer: string           // Generated answer text
  sources: Array<{
    question: string       // Original FAQ question
    category: string       // Knowledge base category
    similarity: number     // Cosine similarity score (0-1)
  }>
  chunksUsed: number       // Number of chunks used
  tokensEstimated: number  // Estimated context tokens
  usage?: {                // Actual token usage (if available)
    promptTokens: number
    completionTokens: number
    totalTokens: number
  }
  model?: string           // Model used for generation
}
```
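
A consumer of `RAGAnswer` will typically render the `sources` array for attribution in the UI. The helper below is a hypothetical display formatter, not part of Kit:

```typescript
// Hypothetical display helper for RAGAnswer.sources; not part of Kit itself.
type Source = { question: string; category: string; similarity: number }

function formatSources(sources: Source[]): string {
  return sources
    .map(
      (s) =>
        `[${s.category}] "${s.question}" (similarity ${s.similarity.toFixed(2)})`
    )
    .join('\n')
}
```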
The searchRAG function performs semantic similarity search using OpenAI embeddings and pgvector:
src/lib/ai/rag-search.ts — Similarity Search
```typescript
export async function searchRAG(
  query: string,
  options: SearchOptions = {}
): Promise<SearchResult[]> {
  const {
    limit = 5,
    similarityThreshold = 0.35, // Adjusted for cross-lingual queries and semantic variation
    categories,
    includeMetadata = true,
  } = options

  // 1. Generate embedding for user query
  const apiKey = getEmbeddingApiKey()
  const embeddingModelId = getEmbeddingModel()
  const openai = createOpenAI({ apiKey })

  const { embedding: queryEmbedding } = await embed({
    model: openai.embedding(embeddingModelId),
    value: query,
  })

  // 2. Format embedding as PostgreSQL array string
  // Prisma doesn't automatically convert arrays to pgvector format
  const embeddingStr = `[${queryEmbedding.join(',')}]`

  // 3. Build SQL query with pgvector similarity
  // Cosine similarity: 1 - (embedding <=> query_embedding)
  // Higher score = more similar (1.0 = identical, 0.0 = unrelated)

  // Build category filter SQL
  const categoryFilterSQL =
    categories && categories.length > 0
      ? `AND category = ANY(ARRAY[${categories.map((c) => `'${c}'`).join(',')}])`
      : ''

  const metadataSQL = includeMetadata ? 'metadata,' : ''

  // Use $queryRawUnsafe to avoid template literal interpolation limits
  const sql = `
    SELECT
      id,
      content,
      question,
      answer,
      category,
      ${metadataSQL}
      1 - (embedding <=> '${embeddingStr}'::vector) as similarity
    FROM faq_chunks
    WHERE 1 - (embedding <=> '${embeddingStr}'::vector) > ${similarityThreshold}
    ${categoryFilterSQL}
    ORDER BY embedding <=> '${embeddingStr}'::vector
    LIMIT ${limit}
  `

  const results = await prisma.$queryRawUnsafe<SearchResult[]>(sql)

  return results
}
```
The search process:
  1. Generate embedding — Convert the query to a 1536-dimensional vector using text-embedding-3-small
  2. Format for pgvector — Convert the float array to PostgreSQL vector format
  3. Cosine similarity query — `1 - (embedding <=> query_vector)` gives a similarity score from 0 to 1
  4. Filter and sort — Apply similarity threshold and optional category filter, return top N results
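
To make the score concrete: pgvector's `<=>` operator returns cosine distance, and the SQL subtracts it from 1 to get similarity. A plain TypeScript equivalent of what the database computes (illustrative only; in production the comparison runs inside Postgres):

```typescript
// What `1 - (embedding <=> query_vector)` boils down to: cosine similarity.
// pgvector's <=> is cosine distance; subtracting it from 1 yields similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Identical vectors score 1.0; orthogonal (unrelated) vectors score 0.0, matching the interpretation in the SQL comments above.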

## Similarity Thresholds

The similarity score ranges from 0 (completely unrelated) to 1 (identical). Kit uses a default threshold of 0.35 to balance recall and precision:

| Range | Interpretation | Action |
|---|---|---|
| 0.90+ | Near duplicate | Exact match found |
| 0.70 - 0.89 | Very relevant | High confidence answer |
| 0.50 - 0.69 | Good semantic match | Reliable answer with context |
| 0.35 - 0.49 | Loosely related | Include but may need qualification |
| Below 0.35 | Not relevant | Excluded (below threshold) |
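
These bands can be expressed as a small helper. Note this function is hypothetical guidance for interpreting scores; Kit itself only enforces the single 0.35 cutoff in SQL:

```typescript
// Hypothetical helper mirroring the threshold bands above. Kit only applies
// the 0.35 cutoff in SQL; the finer bands are interpretive guidance.
function interpretSimilarity(score: number): string {
  if (score >= 0.9) return 'near duplicate'
  if (score >= 0.7) return 'very relevant'
  if (score >= 0.5) return 'good semantic match'
  if (score >= 0.35) return 'loosely related'
  return 'not relevant (excluded)'
}
```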

## Helper Functions

The search module provides additional utilities:

| Function | Purpose |
|---|---|
| `searchRAG(query, options)` | Primary search — embeddings + pgvector similarity |
| `getRelatedQuestions(chunkId, limit)` | Find similar questions in the same category (threshold: 0.8) |
| `searchInCategory(category, query, limit)` | Search within a specific category (threshold: 0.4) |
| `getCategories()` | Get all unique category names |
| `getRAGStats()` | Statistics: total chunks, categories, token counts |

## Query Rewriting

When users send follow-up questions like "What else?" or "Tell me more", the RAG search would fail because the query lacks context. The rewriteQueryWithContext method uses the LLM to transform vague follow-ups into self-contained queries:
src/lib/ai/rag-service.ts — Query Rewriting
```typescript
/**
   * Rewrite vague or follow-up questions using conversation context
   * This makes RAG searches work better for multi-turn conversations
   *
   * @param question - Current user question (may be vague or a follow-up)
   * @param history - Previous conversation messages
   * @returns Enhanced, self-contained question for RAG search
   *
   * @example
   * // Original: "Und was noch?" ("And what else?")
   * // With context about Clerk auth
   * // Enhanced: "What other Clerk authentication features are available in the boilerplate?"
   */
  private async rewriteQueryWithContext(
    question: string,
    history: Message[]
  ): Promise<string> {
    // Take last 3 messages for context (avoid token bloat)
    const recentHistory = history.slice(-3)

    // Build prompt for query rewriting
    const contextPrompt = `You are a query rewriting assistant for a technical FAQ system about a Next.js SaaS Boilerplate.

Rewrite the user's follow-up question to be self-contained and specific for vector database search.

Conversation History:
${recentHistory.map((m) => `${m.role}: ${m.content}`).join('\n')}

Current Question: ${question}

Instructions:
1. If the question is already clear and specific, return it unchanged
2. If it's a follow-up (e.g., "Und was noch?", "What about X?"), incorporate context from history
3. Make it technical and specific to the Next.js boilerplate
4. Keep it concise (1-2 sentences max)
5. Output ONLY the rewritten question, nothing else

Rewritten Question:`

    try {
      const response = await this.aiService.answerQuestion(question, {
        context: contextPrompt,
        systemPrompt:
          'You are a query rewriting assistant. Output ONLY the rewritten question, nothing else.',
        stream: false,
      })

      const rewrittenQuery =
        response?.choices[0]?.message?.content?.trim() || question

      // Log for debugging (helps tune the system)
      console.log('[RAG Service] Query Rewriting:', {
        original: question,
        enhanced: rewrittenQuery,
        historyLength: recentHistory.length,
        changed: rewrittenQuery !== question,
      })

      return rewrittenQuery
    } catch (error) {
      // If rewriting fails, fall back to original question
      console.warn(
        '[RAG Service] Query rewriting failed, using original:',
        error
      )
      return question
    }
  }
```
Example transformations:

| Original Question | Conversation Topic | Rewritten Query |
|---|---|---|
| "What else?" | Clerk authentication | "What other Clerk authentication features are available?" |
| "And how about that?" | Database migrations | "How do Prisma database migrations work in the boilerplate?" |
| "More details please" | Payment webhooks | "What are the details of payment webhook processing?" |
The rewriting:
  • Takes the last 3 messages for context (avoids token bloat)
  • Returns the original question unchanged if it is already specific
  • Falls back to the original question if the LLM call fails
  • Logs transformations for debugging ([RAG Service] Query Rewriting: { original, enhanced })
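
The history window is the easiest piece to isolate. A sketch of how the last three messages become the context section of the rewrite prompt (mirrors `history.slice(-3)` in the code above; `recentContext` is an illustrative name, not Kit's):

```typescript
interface Message {
  role: string
  content: string
}

// Sketch of the rewrite prompt's context window: only the last three
// messages are serialized, mirroring history.slice(-3) above.
function recentContext(history: Message[]): string {
  return history
    .slice(-3)
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n')
}
```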

## No-Match Fallback

When no knowledge chunks match the similarity threshold, the RAG service does not return an empty response. Instead, it generates a helpful guided response based on the boilerplate's tech stack:
```
No chunks found for: "How do I deploy to AWS?"
    |
    v
Fallback prompt injected:
    |--- Acknowledge limitation: "I don't have specific FAQ content..."
    |--- Provide boilerplate guidance based on tech stack
    |--- Reference relevant files/folders
    |--- Suggest rephrasing: "Could you ask more specifically?"
    |
    v
LLM generates helpful response WITHOUT hallucinating
```
The fallback prompt explicitly instructs the LLM to:
  • Never give generic SaaS business advice
  • Never make up features or pricing tiers
  • Never suggest "consulting the provider" (Kit IS the provider)
  • Reference actual file paths from the boilerplate
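
A sketch of how such a fallback prompt might be assembled. The wording and the `buildFallbackPrompt` name are illustrative; Kit's actual prompt lives in `rag-service.ts`:

```typescript
// Illustrative fallback prompt builder; Kit's actual wording differs.
function buildFallbackPrompt(question: string): string {
  return [
    `No FAQ chunks matched the question: "${question}".`,
    'Acknowledge that the knowledge base has no specific entry for this topic.',
    'Offer guidance grounded in the boilerplate tech stack only.',
    'Reference actual file paths from the boilerplate where relevant.',
    'Suggest that the user rephrase the question more specifically.',
    'Never give generic SaaS business advice or invent features or pricing tiers.',
  ].join('\n')
}
```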

## API Routes

### POST /api/ai/rag/ask

Authenticated RAG endpoint for dashboard users. Supports both streaming and non-streaming responses.
Request:
```json
{
  "question": "How do I configure Clerk authentication?",
  "conversationHistory": [
    { "role": "user", "content": "Tell me about auth" },
    { "role": "assistant", "content": "Kit uses Clerk for..." }
  ],
  "categories": ["Authentication"],
  "stream": true
}
```
Response (non-streaming):
```json
{
  "answer": "To configure Clerk authentication...",
  "sources": [
    {
      "question": "How is Clerk integrated?",
      "category": "Authentication",
      "similarity": 0.87
    }
  ],
  "chunksUsed": 3,
  "tokensEstimated": 4200
}
```
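
On the client, the non-streaming body can be validated before use. A minimal runtime check over the response shape shown above (`isRAGResponse` is a hypothetical helper, not part of Kit):

```typescript
interface RAGResponse {
  answer: string
  sources: { question: string; category: string; similarity: number }[]
  chunksUsed: number
  tokensEstimated: number
}

// Hypothetical runtime guard for the non-streaming response shape above.
function isRAGResponse(v: unknown): v is RAGResponse {
  if (typeof v !== 'object' || v === null) return false
  const o = v as Record<string, unknown>
  return (
    typeof o.answer === 'string' &&
    Array.isArray(o.sources) &&
    typeof o.chunksUsed === 'number' &&
    typeof o.tokensEstimated === 'number'
  )
}
```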

## Conversation Management

Kit stores RAG conversation history for multi-turn chat:
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/ai/rag/conversations` | GET | List user's conversations |
| `/api/ai/rag/conversations` | POST | Create new conversation |
| `/api/ai/rag/conversations/[id]` | GET | Get conversation with messages |
| `/api/ai/rag/conversations/[id]` | DELETE | Delete a conversation |

## Knowledge Base Setup

The RAG system uses a faq_chunks table in PostgreSQL with pgvector for vector storage:

### Database Schema

```sql
CREATE TABLE faq_chunks (
  id          TEXT PRIMARY KEY DEFAULT gen_random_uuid()::text,
  question    TEXT NOT NULL,
  answer      TEXT NOT NULL,
  content     TEXT NOT NULL,        -- Combined searchable content
  category    TEXT NOT NULL,
  embedding   vector(1536),         -- OpenAI text-embedding-3-small
  metadata    JSONB,                -- Tags, related topics, keywords
  token_count INTEGER DEFAULT 0,
  created_at  TIMESTAMP DEFAULT NOW(),
  updated_at  TIMESTAMP DEFAULT NOW()
);

-- pgvector index for fast similarity search
CREATE INDEX idx_faq_chunks_embedding
  ON faq_chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
```

### Populating the Knowledge Base

Kit includes a seed script that generates embeddings and populates the faq_chunks table:
```bash
# Generate embeddings and seed the knowledge base
cd apps/boilerplate && npx prisma db seed

# Or run the FAQ seeder directly
cd apps/boilerplate && npx tsx prisma/seed-faq.ts
```
The seed process:
  1. Reads FAQ content from apps/boilerplate/src/content/faq/ markdown files
  2. Splits content into chunks (question + answer pairs)
  3. Generates embeddings via OpenAI text-embedding-3-small
  4. Inserts chunks with embeddings into faq_chunks

### Adding Custom Content

To add your own knowledge base content:
  1. Create markdown files in apps/boilerplate/src/content/faq/ with category-based organization
  2. Each FAQ entry needs a question and answer section
  3. Run the seed script to generate embeddings: cd apps/boilerplate && npx tsx prisma/seed-faq.ts
  4. Verify with the stats endpoint: GET /api/ai/rag/stats
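
The exact file layout that `seed-faq.ts` parses is not documented on this page. As a purely hypothetical illustration of a category-organized entry with question and answer sections (check `prisma/seed-faq.ts` for the real expected format before authoring files):

```markdown
<!-- apps/boilerplate/src/content/faq/authentication.md (hypothetical layout) -->

## How do I reset a user's password?

Clerk handles password resets through its hosted flows; no custom
reset endpoint is needed in the boilerplate.
```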

## Key Files

| File | Purpose |
|---|---|
| `apps/boilerplate/src/lib/ai/rag-service.ts` | RAG pipeline orchestrator — query rewriting, search, generation |
| `apps/boilerplate/src/lib/ai/rag-search.ts` | pgvector similarity search, related questions, statistics |
| `apps/boilerplate/src/lib/ai/ai-service.ts` | AI service wrapper used by RAG for LLM calls |
| `apps/boilerplate/src/app/api/ai/rag/ask/route.ts` | Authenticated RAG endpoint |
| `apps/boilerplate/src/app/api/ai/rag/conversations/` | Conversation CRUD endpoints |
| `apps/boilerplate/src/lib/ai/sse-parser.ts` | Shared SSE parser with `SSEStreamError` and `SSELineBuffer` |
| `apps/boilerplate/src/hooks/use-rag-chat.ts` | `useRAGChat` hook for RAG streaming chat |
| `apps/boilerplate/src/content/faq/` | Knowledge base source content |
| `apps/boilerplate/prisma/seed-faq.ts` | Embedding generation and database seeding |