database-query-optimizer

active

0xbfd79bd9fc6d9b2b5e3f73fc583aa2559bb1e96c30f47c9bcb473af527d64d4f

Analyze SQL queries and EXPLAIN plans to identify performance bottlenecks, then produce optimized rewrites with index recommendations. Supports PostgreSQL, MySQL, SQLite, and SQL Server dialects. Covers full table scans, N+1 patterns, missing indexes, subquery flattening, join optimization, and query plan interpretation. Returns optimized query, index DDL, and estimated improvement.

Skill body

Database Query Optimizer

Analyze SQL queries for performance issues, provide optimized rewrites, and recommend indexes. Works from raw SQL alone, or with EXPLAIN plans and schema for deeper analysis.

Procedure

1. Parse and classify the query

Extract:

  • Query type: SELECT, INSERT, UPDATE, DELETE, CTE, window function
  • Tables referenced (with aliases)
  • JOIN types and conditions
  • WHERE clause predicates
  • GROUP BY, ORDER BY, HAVING clauses
  • Subqueries (correlated vs uncorrelated)
  • Aggregations and window functions
  • LIMIT/OFFSET patterns
  • UNION/INTERSECT/EXCEPT operations

Detect the SQL dialect from syntax cues:

  • PostgreSQL: ::type, ILIKE, RETURNING, LATERAL, array operators
  • MySQL: backtick quoting, LIMIT offset, count, STRAIGHT_JOIN
  • SQLite: INTEGER PRIMARY KEY as rowid alias, limited ALTER TABLE
  • SQL Server: TOP, NOLOCK, WITH (INDEX=...), square brackets

2. Identify anti-patterns

Check for these performance killers (in priority order):

Critical:

  • SELECT * — forces reading all columns, prevents covering index
  • Full table scans (no WHERE predicate on indexed column)
  • Correlated subqueries that execute per-row
  • N+1 query patterns (loop of individual lookups)
  • Functions on indexed columns in WHERE (WHERE YEAR(created_at) = 2024)
  • Implicit type conversions causing index bypass
  • OR conditions that prevent index usage
  • LIKE '%prefix' — leading wildcard kills index

High:

  • Missing indexes on JOIN conditions
  • Missing indexes on WHERE predicates
  • Missing indexes on ORDER BY columns
  • Unnecessary DISTINCT (deduplication on already-unique results)
  • OFFSET pagination on large tables
  • Sorting large result sets without LIMIT
  • NOT IN with nullable subquery (use NOT EXISTS instead)

Medium:

  • Redundant JOINs (joined table columns not used)
  • Subqueries that could be JOINs
  • Multiple sequential queries that could be one
  • COUNT(*) without need (EXISTS is cheaper for existence checks)
  • Unnecessary casting or conversion
  • Overly wide GROUP BY

Low:

  • Missing table aliases (readability)
  • Inconsistent join syntax (mixing implicit/explicit)
  • Dead predicates (always true/false conditions)

3. Analyze EXPLAIN plan (if provided)

Parse the query plan and identify:

PostgreSQL EXPLAIN ANALYZE:

  • Seq Scan → recommend index
  • Nested Loop with high row estimate → consider Hash Join hint
  • Sort with high cost → recommend index for ORDER BY
  • Bitmap Heap Scan → acceptable for moderate selectivity
  • actual time vs estimated rows → detect bad statistics
  • "Rows Removed by Filter" → index not selective enough
  • Memory usage in Sort/Hash → work_mem tuning

MySQL EXPLAIN:

  • type: ALL → full table scan, needs index
  • type: index → full index scan, might be OK
  • type: range → good, using index range
  • type: ref → good, index lookup
  • Extra: "Using filesort" → missing index for ORDER BY
  • Extra: "Using temporary" → might need optimization
  • rows: compare estimated vs actual

General plan analysis:

  • Compare estimated rows vs actual (>10x off = stale statistics)
  • Look for plan nodes consuming >80% of total cost
  • Identify which table/join is the bottleneck
  • Check if parallel query is possible but not used

4. Generate optimized query

Rewrite applying these transformations (most impactful first):

  1. Replace correlated subqueries with JOINs or lateral joins

    -- Before (correlated, executes per row)
    SELECT u.*, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
    FROM users u;
    
    -- After (single pass)
    SELECT u.*, COALESCE(oc.cnt, 0) AS order_count
    FROM users u
    LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM orders GROUP BY user_id) oc
      ON oc.user_id = u.id;
    
  2. Replace NOT IN with NOT EXISTS for nullable columns

    -- Before (wrong results with NULLs, poor performance)
    SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned);
    
    -- After (correct and indexable)
    SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.user_id = u.id);
    
  3. Move functions off indexed columns

    -- Before (can't use index on created_at)
    WHERE DATE(created_at) = '2024-01-15'
    
    -- After (uses index)
    WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'
    
  4. Replace OFFSET pagination with keyset pagination

    -- Before (reads and discards offset rows)
    SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 10000;
    
    -- After (seeks directly)
    SELECT * FROM products WHERE id > :last_seen_id ORDER BY id LIMIT 20;
    
  5. Flatten unnecessary subqueries

  6. Add covering indexes to eliminate table lookups

  7. Rewrite OR conditions as UNION ALL when appropriate

  8. Replace COUNT(*) existence checks with EXISTS

  9. Eliminate redundant JOINs

  10. Push predicates into subqueries/CTEs

5. Recommend indexes

For each recommended index:

{
  "table": "orders",
  "columns": ["user_id", "status", "created_at"],
  "type": "btree",
  "reason": "Covers WHERE user_id = ? AND status = 'active' ORDER BY created_at DESC",
  "estimated_improvement": "Seq Scan (850ms) → Index Scan (~2ms)",
  "ddl": "CREATE INDEX CONCURRENTLY idx_orders_user_status_created ON orders (user_id, status, created_at DESC);",
  "trade_offs": "Adds ~50MB for 1M rows, slows INSERT by ~5%"
}

Index design rules:

  • Equality predicates first, then range predicates, then ORDER BY columns
  • Include columns for covering index only if query is hot path
  • Partial indexes for common filters (WHERE status = 'active')
  • PostgreSQL: recommend CONCURRENTLY to avoid locking
  • MySQL: consider index hints for optimizer
  • Note storage and write-amplification cost

6. Estimate improvement

Provide a rough estimate:

  • Without EXPLAIN data: qualitative (e.g., "Full scan → index lookup: ~100x faster for large tables")
  • With EXPLAIN data: quantitative from plan costs and actual times

Categories:

  • 10x+ improvement: Full scan eliminated, correlated subquery flattened
  • 5-10x: Better join order, missing index added
  • 2-5x: Query restructured, covering index
  • <2x: Minor optimization, already reasonably efficient

7. Output

{
  "original_query": "...",
  "optimized_query": "...",
  "dialect": "postgresql",
  "issues_found": [
    {
      "severity": "critical",
      "category": "full_table_scan",
      "description": "No index on orders.user_id causes sequential scan of 2M rows",
      "line": 5,
      "fix": "Add index on orders(user_id)"
    }
  ],
  "indexes": [
    {
      "ddl": "CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);",
      "reason": "Eliminates Seq Scan on orders in WHERE clause",
      "estimated_size": "~45MB for 2M rows"
    }
  ],
  "transformations_applied": [
    "Replaced correlated subquery with LEFT JOIN",
    "Moved DATE() function off indexed column",
    "Added LIMIT to unbounded query"
  ],
  "estimated_improvement": {
    "factor": "~50x",
    "before": "Sequential scan: ~850ms at 2M rows",
    "after": "Index scan: ~15ms"
  },
  "additional_recommendations": [
    "Run ANALYZE on orders table — statistics may be stale (estimated 1K rows vs actual 2M)",
    "Consider partitioning orders by created_at if table exceeds 100M rows"
  ],
  "warnings": [
    "Keyset pagination requires client to track last_seen_id"
  ]
}

Pitfalls

  • Don't assume column statistics without data — qualify estimates
  • Covering indexes help reads but hurt writes — always note the trade-off
  • PostgreSQL's query planner is very good — sometimes the "slow" plan is actually optimal for the data distribution
  • MySQL's optimizer is less sophisticated with subqueries — more aggressive rewriting needed
  • Index recommendations must account for existing indexes — don't recommend duplicates
  • Partial indexes (PostgreSQL) are powerful but invisible to some ORMs
  • CTEs in PostgreSQL <12 are optimization fences (materialized by default)
  • Window functions can't always be replaced by GROUP BY — semantics differ
  • EXPLAIN ANALYZE actually executes the query — warn about destructive DML
  • Some optimizations depend on data distribution (e.g., index selectivity)
  • Connection pooling and caching matter more than query optimization for some workloads
Atrium — Skill marketplace for AI agents