Cost Management

Three-layer cost control — rate limiting, credit system, and usage tracking for AI operations

Kit protects your AI budget with a three-layer cost management system: global rate limiting for burst protection, a credit-based system for per-operation billing, and usage tracking for analytics and cost monitoring. All three layers work together — every AI request must pass through each layer before reaching the provider.
This page covers the rate limiting architecture, credit costs, usage tracking, token cost calculation, and feature flag integration. For the credit system's billing integration, see Credit System.

Three-Layer Architecture

Every AI request passes through three protection layers in sequence:
```
Incoming AI Request
    |
    v
Layer 1: Global Rate Limit (Burst Protection)
    |--- Upstash Redis sliding window
    |--- 10 requests per 10 seconds (configurable)
    |--- Applies to ALL users regardless of tier
    |--- Purpose: Prevent DDoS and burst abuse
    |--- Fail: 429 "Too many requests. Please slow down."
    |
    v
Layer 2: Credit System (Per-Operation Billing)
    |--- Check credit balance in database
    |--- Verify sufficient credits for operation
    |--- Auto-reset if 30+ days elapsed (webhook backup)
    |--- Atomic deduction BEFORE processing
    |--- Fail: 402 "Insufficient credits"
    |
    v
Layer 3: Usage Tracking (Analytics)
    |--- Non-blocking database write (after response)
    |--- Records: provider, model, tokens, cost, purpose
    |--- Monthly aggregation for quota checks
    |--- Used for billing analytics and dashboards
    |
    v
Request reaches AI Provider
```

Rate Limiting

Global Burst Protection

The global rate limiter uses Upstash Redis with a sliding window algorithm. It applies to all users and prevents burst abuse:
```bash
# Configure via environment variables
AI_RATE_LIMIT_WINDOW=10       # Window in seconds (default: 10)
AI_RATE_LIMIT_MAX_REQUESTS=10 # Max requests per window (default: 10)
```
Key features:
  • Sliding window — smoother than fixed windows, no burst at window boundaries
  • Ephemeral cache — in-memory cache reduces Redis calls by 50-80%
  • Analytics — Upstash analytics enabled for monitoring
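The boundary-burst problem that sliding windows avoid can be illustrated with a small in-memory version of the algorithm. This is a teaching sketch (a sliding log), not the Upstash implementation, which uses a weighted two-window approximation.

```typescript
// Minimal in-memory sliding-window limiter: a request is allowed only if
// fewer than `max` requests fall inside the trailing `windowMs` milliseconds.
function makeSlidingWindow(max: number, windowMs: number) {
  const timestamps: number[] = []
  return function allow(now: number): boolean {
    // Drop requests that have slid out of the trailing window
    while (timestamps.length > 0 && timestamps[0] <= now - windowMs) {
      timestamps.shift()
    }
    if (timestamps.length >= max) return false
    timestamps.push(now)
    return true
  }
}
```

With a fixed window, 10 requests at t=9.9s and 10 more at t=10.1s would all pass because they land in different windows; a sliding window sees 20 requests within 0.2s and rejects the second burst.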

Tier-Based Monthly Quotas

Each subscription tier has a monthly request limit, enforced via separate Redis rate limiters:
| Tier | Monthly Limit | Window | Env Override |
| --- | --- | --- | --- |
| Free | 500 | 30 days | AI_FREE_TIER_REQUESTS |
| Basic | 1,500 | 30 days | AI_BASIC_TIER_REQUESTS |
| Pro | 5,000 | 30 days | AI_PRO_TIER_REQUESTS |
| Enterprise | 15,000 | 30 days | AI_ENTERPRISE_TIER_REQUESTS |
The user's tier is determined from their Lemon Squeezy subscription variant ID. Users without a subscription default to the Free tier.
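Resolving a tier's limit with its env override might look like the helper below. The defaults come from the table above; the function name and shape are assumptions for illustration.

```typescript
type SubscriptionTier = 'free' | 'basic' | 'pro' | 'enterprise'

// Defaults from the tier table; each can be overridden via its env variable.
const TIER_DEFAULTS: Record<SubscriptionTier, { limit: number; envVar: string }> = {
  free: { limit: 500, envVar: 'AI_FREE_TIER_REQUESTS' },
  basic: { limit: 1_500, envVar: 'AI_BASIC_TIER_REQUESTS' },
  pro: { limit: 5_000, envVar: 'AI_PRO_TIER_REQUESTS' },
  enterprise: { limit: 15_000, envVar: 'AI_ENTERPRISE_TIER_REQUESTS' },
}

// Return the override when it parses to a positive number, else the default.
function getTierLimit(
  tier: SubscriptionTier,
  env: Record<string, string | undefined>,
): number {
  const { limit, envVar } = TIER_DEFAULTS[tier]
  const override = Number(env[envVar])
  return Number.isFinite(override) && override > 0 ? override : limit
}
```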

Rate Limit Check Flow

The comprehensive checkRateLimit function orchestrates all three checks. It is called by every AI API route:
src/lib/ai/rate-limiter.ts — Comprehensive Rate Limit Check
```typescript
export async function checkRateLimit(params: {
  userId?: string
  sessionId?: string
  ip?: string
  cost?: number // Credit cost for operation (default: 1)
}): Promise<{
  success: boolean
  limit: number
  remaining: number
  reset: number
  tier: SubscriptionTier
  reason?: string
  creditSystemEnabled: boolean
}> {
  const { userId, sessionId, ip, cost = 1 } = params

  // Get identifier for burst protection
  const identifier = getIdentifier(userId, sessionId, ip)

  // STEP 1: ALWAYS check global rate limit (DDoS/Burst protection)
  const globalResult = await checkGlobalRateLimit(identifier)
  if (!globalResult.success) {
    return {
      success: false,
      limit: globalResult.limit,
      remaining: globalResult.remaining,
      reset: globalResult.reset,
      tier: 'free',
      reason: 'Too many requests. Please slow down.',
      creditSystemEnabled: isCreditSystemEnabled(),
    }
  }

  // STEP 2: Check if credit system is enabled
  if (!isCreditSystemEnabled()) {
    console.log('[Rate Limit] Credit system disabled - allowing request')
    return {
      success: true,
      limit: 999999, // Unlimited
      remaining: 999999,
      reset: Date.now() + 30 * 24 * 60 * 60 * 1000,
      tier: userId ? await getUserTier(userId) : 'free',
      creditSystemEnabled: false,
    }
  }

  // STEP 3: Credit balance check, auto-reset, and atomic deduction
  // (remainder of the function omitted here)
```
The function returns a standardized result:
```typescript
{
  success: boolean          // true if all checks passed
  limit: number             // Total credit/request limit
  remaining: number         // Remaining credits/requests
  reset: number             // Unix timestamp for reset
  tier: SubscriptionTier    // User's subscription tier
  reason?: string           // Human-readable error message
  creditSystemEnabled: boolean  // Whether credit system is active
}
```
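A route handler typically maps this result onto the HTTP errors from the architecture diagram (429 for burst protection, 402 for credit exhaustion). The helper below is a hypothetical sketch, not the actual route code.

```typescript
interface RateLimitResult {
  success: boolean
  limit: number
  remaining: number
  reset: number
  reason?: string
  creditSystemEnabled: boolean
}

// Map a checkRateLimit result to an HTTP error, or null when the request
// may proceed. (Hypothetical helper; real routes may differ.)
function toHttpError(result: RateLimitResult): { status: number; message: string } | null {
  if (result.success) return null
  // 402 when the credit system rejected the request, 429 for burst protection
  const status =
    result.creditSystemEnabled && result.reason === 'Insufficient credits' ? 402 : 429
  return { status, message: result.reason ?? 'Rate limit exceeded' }
}
```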

Credit Costs

Every AI operation has a defined credit cost. Costs are based on estimated token usage and computational complexity:
src/lib/credits/credit-costs.ts — Operation Costs
```typescript
export const CREDIT_COSTS = {
  // ============================================================================
  // FAQ Operations
  // ============================================================================

  /**
   * Simple FAQ lookup using RAG (Retrieval-Augmented Generation)
   *
   * Uses vector search with minimal context window. Suitable for
   * straightforward questions with clear answers in the knowledge base.
   *
   * **Estimated tokens**: 500-1000
   *
   * **Example use cases**:
   * - "What are your business hours?"
   * - "How do I reset my password?"
   * - "What payment methods do you accept?"
   */
  faq_simple: 5,

  /**
   * Complex FAQ query with multi-step reasoning
   *
   * Uses larger context window and may require multiple RAG retrievals
   * or chain-of-thought reasoning. Suitable for nuanced questions.
   *
   * **Estimated tokens**: 2000-4000
   *
   * **Example use cases**:
   * - "Compare your pricing plans and recommend one for my use case"
   * - "Explain the difference between your two authentication methods"
   * - "How does your refund policy work for annual subscriptions?"
   */
  faq_complex: 15,

  // ============================================================================
  // Chat Operations
  // ============================================================================

  /**
   * Standard chat message (non-streaming)
   *
   * Single message exchange with context window up to 4000 tokens.
   * Suitable for most conversational interactions.
   *
   * **Estimated tokens**: 1000-4000
   *
   * **Example use cases**:
   * - General conversation
   * - Question answering
   * - Content suggestions
   */
  chat_message: 15,

  /**
   * Streaming chat message
   *
   * Real-time token streaming with same context as standard messages.
   * Higher cost due to streaming infrastructure and perceived value.
   *
   * **Estimated tokens**: 1000-4000
   *
   * **Example use cases**:
   * - Interactive chat experiences
   * - Real-time content generation
   * - Live coding assistance
   */
  chat_streaming: 20,

  /**
   * Chat with tool/function calling
   *
   * Chat message that can invoke external tools, APIs, or functions.
   * Includes extra tokens for tool definitions and result processing.
   *
   * **Estimated tokens**: 2000-6000
   *
   * **Example use cases**:
   * - Database queries via chat
   * - API integrations
   * - Calculator or data lookups
   */
  chat_with_tools: 30,

  /**
   * Image analysis in chat (Vision)
   *
   * Multimodal chat message with one or more images for visual analysis.
   * Higher cost due to image processing tokens (images consume ~85 tokens
   * per 512x512 tile in most providers).
   *
   * **Estimated tokens**: 2000-8000 (depends on image resolution)
   *
   * **Example use cases**:
   * - "What's in this image?"
   * - Screenshot analysis and debugging
   * - Document/receipt scanning via chat
   * - Design feedback and comparison
   */
  image_analysis: 30,

  /**
   * PDF document analysis in chat
   *
   * Upload and analyze PDF documents in the LLM Chat.
   * Server-side text extraction with pdf-parse, then AI analysis.
   * Higher cost than streaming due to extraction overhead.
   *
   * **Estimated tokens**: 3000-10000 (depends on document length)
   *
   * **Example use cases**:
   * - "Summarize this contract"
   * - "What are the key terms in this PDF?"
   * - "Extract the action items from this meeting notes PDF"
   */
  pdf_analysis: 40,

  // ============================================================================
  // Advanced AI Operations
  // ============================================================================

  /**
   * Image generation from text prompt
   *
   * Text-to-image generation using models like DALL-E or Stable Diffusion.
   * Highest single-operation cost due to computational requirements.
   *
   * **Estimated tokens**: N/A (GPU-based operation)
   *
   * **Example use cases**:
   * - Marketing visual generation
   * - Product mockups
   * - Concept art creation
   */
  image_gen: 80,

  /**
   * Image editing/manipulation
   *
   * Modify existing images using text prompts or masks.
   * Includes inpainting, outpainting, and style transfer.
   *
   * **Estimated tokens**: N/A (GPU-based operation)
   *
   * **Example use cases**:
   * - Background removal
   * - Object replacement
   * - Image enhancement
   */
  image_edit: 50,

  /**
   * Code analysis and review
   *
   * Static analysis, bug detection, and code quality assessment.
   * Analyzes code structure, patterns, and potential issues.
   *
   * **Estimated tokens**: 3000-8000
   *
   * **Example use cases**:
   * - Security vulnerability scanning
   * - Performance optimization suggestions
   * - Code smell detection
   */
  code_analysis: 40,

  /**
   * Code generation from specifications
   *
   * Generate complete code files or functions from natural language
   * descriptions. Includes language-specific syntax and best practices.
   *
   * **Estimated tokens**: 4000-10000
   *
   * **Example use cases**:
   * - Component scaffolding
   * - API endpoint generation
   * - Test case creation
   */
  code_gen: 50,

  // ============================================================================
  // Embeddings and Vector Operations
  // ============================================================================

  /**
   * Single text embedding generation
   *
   * Convert text to vector representation for semantic search.
   * Typically 1536-dimensional vector (OpenAI ada-002).
   *
   * **Estimated tokens**: 100-500
   *
   * **Example use cases**:
   * - Document indexing
   * - Semantic search preparation
   * - Content similarity calculation
   */
  embedding_single: 5,

  /**
   * Batch embedding generation
   *
   * Process multiple texts in a single batch operation.
   * More efficient than individual embeddings for bulk operations.
   *
   * **Estimated tokens**: 1000-5000
   *
   * **Example use cases**:
   * - Bulk document processing
   * - Knowledge base initialization
   * - Large-scale content indexing
   */
  embedding_batch: 10,

  /**
   * Vector similarity search
   *
   * Query vector database to find semantically similar content.
   * Cost covers embedding query text and database lookup.
   *
   * **Estimated tokens**: 200-800
   *
   * **Example use cases**:
   * - Semantic document search
   * - Recommendation systems
   * - Duplicate content detection
   */
  vector_search: 5,

  // ============================================================================
  // Audio Operations
  // ============================================================================

  /**
   * Audio transcription (speech-to-text)
   *
   * Convert audio files to text using Whisper or similar models.
   * Cost per minute of audio content.
   *
   * **Estimated tokens**: N/A (audio processing)
   *
   * **Example use cases**:
   * - Meeting transcription
   * - Podcast notes generation
   * - Voice command processing
   */
  transcription: 30,

  /**
   * Speech-to-text for chat voice input
   *
   * Short audio recordings from microphone input in LLM Chat,
   * transcribed via OpenAI Whisper. Lower cost than general transcription
   * because chat recordings are typically shorter (max 120s).
   *
   * **Estimated tokens**: N/A (audio processing)
   *
   * **Example use cases**:
   * - Voice input in chat (microphone button)
   * - Quick voice messages for AI conversation
   */
  speech_to_text: 20,

  /**
   * Text-to-speech synthesis
   *
   * Generate natural-sounding audio from text input.
   * Includes voice selection and audio quality options.
   *
   * **Estimated tokens**: N/A (audio synthesis)
   *
   * **Example use cases**:
   * - Voiceover generation
   * - Accessibility features
   * - Audio content creation
   */
  tts: 20,

  // ============================================================================
  // Document Processing
  // ============================================================================

  /**
   * PDF parsing and text extraction
   *
   * Extract text, tables, and metadata from PDF documents.
   * Handles multi-page documents with layout preservation.
   *
   * **Estimated tokens**: 1000-3000
   *
   * **Example use cases**:
   * - Document digitization
   * - Invoice processing
   * - Contract analysis
   */
  pdf_parse: 15,

  /**
   * Optical Character Recognition (OCR)
   *
   * Extract text from images and scanned documents.
   * Includes text detection, recognition, and layout analysis.
   *
   * **Estimated tokens**: N/A (image processing)
   *
   * **Example use cases**:
   * - Receipt scanning
   * - Handwriting recognition
   * - Screenshot text extraction
   */
  ocr: 30,

  /**
   * Document summarization
   *
   * Generate concise summaries of long documents.
   * Uses extractive or abstractive summarization techniques.
   *
   * **Estimated tokens**: 5000-12000
   *
   * **Example use cases**:
   * - Research paper summaries
   * - Meeting notes condensation
   * - Article key points extraction
   */
  document_summary: 65,

  // ============================================================================
  // Content Generation
  // ============================================================================

  /**
   * Template-based content generation
   *
   * Generate text from templates (email, product description, blog outline,
   * social media post, marketing copy) with streaming output.
   * Cost covers template processing + text generation.
   *
   * **Estimated tokens**: 1000-4000
   *
   * **Example use cases**:
   * - Professional email drafting
   * - Product description writing
   * - Blog post outline generation
   * - Social media post creation
   * - Marketing copy generation
   */
  content_generation: 25,
}
```

Full Cost Table

| Operation | Credits | Estimated Tokens | Category |
| --- | --- | --- | --- |
| faq_simple | 5 | 500-1,000 | FAQ |
| faq_complex | 15 | 2,000-4,000 | FAQ |
| chat_message | 15 | 1,000-4,000 | Chat |
| chat_streaming | 20 | 1,000-4,000 | Chat |
| content_generation | 25 | 1,000-4,000 | Content |
| chat_with_tools | 30 | 2,000-6,000 | Chat |
| image_analysis | 30 | 2,000-8,000 | Chat |
| pdf_analysis | 40 | 3,000-10,000 | Chat |
| image_gen | 80 | N/A (GPU) | Advanced AI |
| image_edit | 50 | N/A (GPU) | Advanced AI |
| code_analysis | 40 | 3,000-8,000 | Advanced AI |
| code_gen | 50 | 4,000-10,000 | Advanced AI |
| embedding_single | 5 | 100-500 | Embeddings |
| embedding_batch | 10 | 1,000-5,000 | Embeddings |
| vector_search | 5 | 200-800 | Embeddings |
| transcription | 30 | N/A (audio) | Audio |
| tts | 20 | N/A (audio) | Audio |
| speech_to_text | 20 | N/A (audio) | Audio |
| pdf_parse | 15 | 1,000-3,000 | Document |
| ocr | 30 | N/A (image) | Document |
| document_summary | 65 | 5,000-12,000 | Document |

Credit Cost Utilities

The credit costs module provides helper functions:
| Function | Purpose |
| --- | --- |
| getCreditCost(operation) | Get cost for a single operation |
| calculateBatchCost(operation, quantity) | Calculate total cost for batch operations |
| getAllCreditCosts() | Get all costs as a plain object (for admin UI) |
| isValidOperation(string) | Type guard — check if a string is a valid operation |
| getOperationsByCategory() | Group operations by category (faq, chat, etc.) |
| estimateOperationCount(operation, credits) | How many operations can X credits afford? |
| formatCreditAmount(credits, includeUnit) | Format for display (20 → "20 credits") |
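Under stated assumptions (a trimmed-down cost map standing in for the full CREDIT_COSTS object), a few of these helpers might look like:

```typescript
// A small subset of CREDIT_COSTS, enough to illustrate the helpers.
const COSTS = { faq_simple: 5, chat_message: 15, image_gen: 80 } as const
type Operation = keyof typeof COSTS

function getCreditCost(operation: Operation): number {
  return COSTS[operation]
}

function calculateBatchCost(operation: Operation, quantity: number): number {
  return COSTS[operation] * quantity
}

// How many operations can `credits` afford?
function estimateOperationCount(operation: Operation, credits: number): number {
  return Math.floor(credits / COSTS[operation])
}

function formatCreditAmount(credits: number, includeUnit = true): string {
  return includeUnit ? `${credits} credits` : String(credits)
}
```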

Usage Tracking

Every AI request is tracked to the AIUsage database table for analytics and cost monitoring. Tracking is non-blocking — failures are logged but never prevent the AI response from being delivered.
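The non-blocking pattern amounts to a fire-and-forget wrapper around the database write. The sketch below illustrates the idea; the wrapper name and signature are hypothetical.

```typescript
// Fire-and-forget usage tracking: the write runs in the background and a
// failure is logged, never surfaced to the caller.
function trackUsageNonBlocking(
  write: () => Promise<void>,
  log: (msg: string) => void = console.error,
): void {
  void write().catch((err) => log(`[Usage Tracker] write failed: ${String(err)}`))
}
```

Because the returned promise is deliberately not awaited, the AI response streams back to the user even if the analytics database is down.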

TrackUsageParams

```typescript
interface TrackUsageParams {
  userId?: string
  sessionId?: string
  provider: string        // "openai", "anthropic", etc.
  model: string           // "gpt-5-nano", "claude-haiku", etc.
  tokens: number          // Total tokens used
  cost?: TokenCost | number  // USD cost (from provider pricing)
  purpose: 'faq' | 'chat' | 'completion' | 'stream' | 'embedding' | 'general'
  metadata?: Record<string, unknown>  // Additional context
}
```

Monthly Aggregation

Usage is aggregated monthly for quota checks and analytics:
```typescript
interface MonthlyUsage {
  totalTokens: number
  totalCost: number
  requestCount: number
  byProvider: Record<string, { tokens: number; cost: number; requests: number }>
  byPurpose: Record<string, { tokens: number; cost: number; requests: number }>
}
```
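The monthly rollup can be computed in a single pass over that month's AIUsage rows. This is a sketch assuming a minimal row shape; the real aggregation may run in SQL.

```typescript
interface UsageRow { provider: string; purpose: string; tokens: number; cost: number }
interface Bucket { tokens: number; cost: number; requests: number }

// Accumulate one row into a per-key bucket, creating the bucket on first use.
function addTo(map: Record<string, Bucket>, key: string, row: UsageRow): void {
  if (!map[key]) map[key] = { tokens: 0, cost: 0, requests: 0 }
  map[key].tokens += row.tokens
  map[key].cost += row.cost
  map[key].requests += 1
}

function aggregateMonthly(rows: UsageRow[]) {
  const result = {
    totalTokens: 0,
    totalCost: 0,
    requestCount: 0,
    byProvider: {} as Record<string, Bucket>,
    byPurpose: {} as Record<string, Bucket>,
  }
  for (const row of rows) {
    result.totalTokens += row.tokens
    result.totalCost += row.cost
    result.requestCount += 1
    addTo(result.byProvider, row.provider, row)
    addTo(result.byPurpose, row.purpose, row)
  }
  return result
}
```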

Usage API

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/ai/usage | GET | Current month's usage statistics |
The usage endpoint returns aggregated data broken down by provider and purpose, suitable for dashboard charts and usage meters.

Token Cost Calculation

Kit calculates the USD cost of each request using per-model pricing tables. The calculateCost method on BaseProvider converts token counts to dollar amounts:
src/lib/ai/providers/base-provider.ts — Cost Calculation
```typescript
calculateCost(usage: TokenUsage, model?: string): TokenCost {
  const modelInfo = this.getModelInfo(model ?? this.defaultModel)
  if (!modelInfo) {
    return {
      promptCost: 0,
      completionCost: 0,
      totalCost: 0,
      currency: 'USD',
    }
  }

  const promptCost =
    (usage.promptTokens / 1_000_000) * modelInfo.costPerMillionPromptTokens
  const completionCost =
    (usage.completionTokens / 1_000_000) *
    modelInfo.costPerMillionCompletionTokens

  return {
    promptCost,
    completionCost,
    totalCost: promptCost + completionCost,
    currency: 'USD',
  }
}
```
The calculation uses each model's costPerMillionPromptTokens and costPerMillionCompletionTokens from the model info registry. This provides accurate cost tracking across all four providers.
Example: A request using claude-haiku-4-5 with 500 prompt tokens and 200 completion tokens:
```
Prompt cost:     (500 / 1,000,000) × $0.80 = $0.000400
Completion cost: (200 / 1,000,000) × $4.00 = $0.000800
Total cost:      $0.001200
```
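The same arithmetic as a standalone function, checked against the worked example (the per-million pricing figures are the ones assumed in the example):

```typescript
// Convert token counts to USD using per-million-token pricing.
function tokenCost(
  promptTokens: number,
  completionTokens: number,
  costPerMillionPrompt: number,
  costPerMillionCompletion: number,
) {
  const promptCost = (promptTokens / 1_000_000) * costPerMillionPrompt
  const completionCost = (completionTokens / 1_000_000) * costPerMillionCompletion
  return { promptCost, completionCost, totalCost: promptCost + completionCost }
}
```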

Feature Flag Integration

The cost management system behaves differently based on whether the credit system is enabled:
| Behavior | Credit System ON | Credit System OFF |
| --- | --- | --- |
| Rate Limiting | Global burst + credit balance check | Global burst only |
| Credit Deduction | Atomic deduction before processing | Skipped |
| Usage Tracking | Full tracking with costs | Full tracking (analytics only) |
| Monthly Limit | Based on credit balance | Unlimited (999999) |
| 402 Errors | "Insufficient credits" | Never sent |
| Auto-Reset | Checks if 30+ days elapsed | Skipped |
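Given the NEXT_PUBLIC_PRICING_MODEL variable documented below, the flag itself plausibly reduces to a single env check. This is an assumption for illustration; src/lib/credits/config.ts holds the real implementation.

```typescript
// Assumed shape of the feature flag: the credit system is on when the
// pricing model is 'credit_based' (the documented default).
function isCreditSystemEnabled(env: Record<string, string | undefined>): boolean {
  return (env.NEXT_PUBLIC_PRICING_MODEL ?? 'credit_based') === 'credit_based'
}
```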
The feature flag check happens inside checkRateLimit:
```
checkRateLimit()
    |
    |--- ALWAYS: Check global rate limit
    |
    |--- isCreditSystemEnabled()?
    |    |
    |    |--- YES: Check credit balance, auto-reset, deduct
    |    |--- NO:  Allow request (log "Credit system disabled")
```

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| AI_RATE_LIMIT_WINDOW | 10 | Global rate limit window in seconds |
| AI_RATE_LIMIT_MAX_REQUESTS | 10 | Max requests per global window |
| AI_FREE_TIER_REQUESTS | 500 | Monthly limit for free tier |
| AI_BASIC_TIER_REQUESTS | 1500 | Monthly limit for basic tier |
| AI_PRO_TIER_REQUESTS | 5000 | Monthly limit for pro tier |
| AI_ENTERPRISE_TIER_REQUESTS | 15000 | Monthly limit for enterprise tier |
| UPSTASH_REDIS_REST_URL | (none) | Redis URL for rate limiting |
| UPSTASH_REDIS_REST_TOKEN | (none) | Redis token for rate limiting |
| NEXT_PUBLIC_PRICING_MODEL | credit_based | credit_based or classic_saas |

Key Files

| File | Purpose |
| --- | --- |
| apps/boilerplate/src/lib/ai/rate-limiter.ts | Global burst and tier-based rate limiting |
| apps/boilerplate/src/lib/credits/credit-costs.ts | Per-operation credit costs (21 operations) |
| apps/boilerplate/src/lib/credits/credit-manager.ts | Atomic credit deductions with SELECT FOR UPDATE |
| apps/boilerplate/src/lib/credits/config.ts | Credit system feature flag (isCreditSystemEnabled()) |
| apps/boilerplate/src/lib/ai/usage-tracker.ts | Non-blocking usage tracking to database |
| apps/boilerplate/src/lib/ai/providers/base-provider.ts | Token cost calculation per provider |
| apps/boilerplate/src/app/api/ai/usage/route.ts | Usage statistics endpoint |