database-query-optimizer

active

0xbfd79bd9fc6d9b2b5e3f73fc583aa2559bb1e96c30f47c9bcb473af527d64d4f

Analyze SQL queries and EXPLAIN plans to identify performance bottlenecks, then produce optimized rewrites with index recommendations. Supports PostgreSQL, MySQL, SQLite, and SQL Server dialects. Covers full table scans, N+1 patterns, missing indexes, subquery flattening, join optimization, and query plan interpretation. Returns optimized query, index DDL, and estimated improvement.

sql database performance optimization postgresql mysql sqlite query-plan indexing devops

Skill body

Database Query Optimizer

Analyze SQL queries for performance issues, provide optimized rewrites, and recommend indexes. Works from raw SQL alone, or with EXPLAIN plans and schema for deeper analysis.

Procedure

1. Parse and classify the query

Extract:

Query type: SELECT, INSERT, UPDATE, DELETE, CTE, window function
Tables referenced (with aliases)
JOIN types and conditions
WHERE clause predicates
GROUP BY, ORDER BY, HAVING clauses
Subqueries (correlated vs uncorrelated)
Aggregations and window functions
LIMIT/OFFSET patterns
UNION/INTERSECT/EXCEPT operations

Detect the SQL dialect from syntax cues:

PostgreSQL: ::type, ILIKE, RETURNING, LATERAL, array operators
MySQL: backtick quoting, LIMIT offset, count, STRAIGHT_JOIN
SQLite: INTEGER PRIMARY KEY as rowid alias, limited ALTER TABLE
SQL Server: TOP, NOLOCK, WITH (INDEX=...), square brackets

2. Identify anti-patterns

Check for these performance killers (in priority order):

Critical:

SELECT * — forces reading all columns, prevents covering index
Full table scans (no WHERE predicate on indexed column)
Correlated subqueries that execute per-row
N+1 query patterns (loop of individual lookups)
Functions on indexed columns in WHERE (WHERE YEAR(created_at) = 2024)
Implicit type conversions causing index bypass
OR conditions that prevent index usage
LIKE '%prefix' — leading wildcard kills index

High:

Missing indexes on JOIN conditions
Missing indexes on WHERE predicates
Missing indexes on ORDER BY columns
Unnecessary DISTINCT (deduplication on already-unique results)
OFFSET pagination on large tables
Sorting large result sets without LIMIT
NOT IN with nullable subquery (use NOT EXISTS instead)

Medium:

Redundant JOINs (joined table columns not used)
Subqueries that could be JOINs
Multiple sequential queries that could be one
COUNT(*) without need (EXISTS is cheaper for existence checks)
Unnecessary casting or conversion
Overly wide GROUP BY

Low:

Missing table aliases (readability)
Inconsistent join syntax (mixing implicit/explicit)
Dead predicates (always true/false conditions)

3. Analyze EXPLAIN plan (if provided)

Parse the query plan and identify:

PostgreSQL EXPLAIN ANALYZE:

Seq Scan → recommend index
Nested Loop with high row estimate → consider Hash Join hint
Sort with high cost → recommend index for ORDER BY
Bitmap Heap Scan → acceptable for moderate selectivity
actual time vs estimated rows → detect bad statistics
"Rows Removed by Filter" → index not selective enough
Memory usage in Sort/Hash → work_mem tuning

MySQL EXPLAIN:

type: ALL → full table scan, needs index
type: index → full index scan, might be OK
type: range → good, using index range
type: ref → good, index lookup
Extra: "Using filesort" → missing index for ORDER BY
Extra: "Using temporary" → might need optimization
rows: compare estimated vs actual

General plan analysis:

Compare estimated rows vs actual (>10x off = stale statistics)
Look for plan nodes consuming >80% of total cost
Identify which table/join is the bottleneck
Check if parallel query is possible but not used

4. Generate optimized query

Rewrite applying these transformations (most impactful first):

Replace correlated subqueries with JOINs or lateral joins

-- Before (correlated, executes per row)
SELECT u.*, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u;

-- After (single pass)
SELECT u.*, COALESCE(oc.cnt, 0) AS order_count
FROM users u
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM orders GROUP BY user_id) oc
  ON oc.user_id = u.id;

Replace NOT IN with NOT EXISTS for nullable columns

-- Before (wrong results with NULLs, poor performance)
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned);

-- After (correct and indexable)
SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.user_id = u.id);

Move functions off indexed columns

-- Before (can't use index on created_at)
WHERE DATE(created_at) = '2024-01-15'

-- After (uses index)
WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'

Replace OFFSET pagination with keyset pagination

-- Before (reads and discards offset rows)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 10000;

-- After (seeks directly)
SELECT * FROM products WHERE id > :last_seen_id ORDER BY id LIMIT 20;

Flatten unnecessary subqueries
Add covering indexes to eliminate table lookups
Rewrite OR conditions as UNION ALL when appropriate
Replace COUNT(*) existence checks with EXISTS
Eliminate redundant JOINs
Push predicates into subqueries/CTEs

5. Recommend indexes

For each recommended index:

{
  "table": "orders",
  "columns": ["user_id", "status", "created_at"],
  "type": "btree",
  "reason": "Covers WHERE user_id = ? AND status = 'active' ORDER BY created_at DESC",
  "estimated_improvement": "Seq Scan (850ms) → Index Scan (~2ms)",
  "ddl": "CREATE INDEX CONCURRENTLY idx_orders_user_status_created ON orders (user_id, status, created_at DESC);",
  "trade_offs": "Adds ~50MB for 1M rows, slows INSERT by ~5%"
}

Index design rules:

Equality predicates first, then range predicates, then ORDER BY columns
Include columns for covering index only if query is hot path
Partial indexes for common filters (WHERE status = 'active')
PostgreSQL: recommend CONCURRENTLY to avoid locking
MySQL: consider index hints for optimizer
Note storage and write-amplification cost

6. Estimate improvement

Provide a rough estimate:

Without EXPLAIN data: qualitative (e.g., "Full scan → index lookup: ~100x faster for large tables")
With EXPLAIN data: quantitative from plan costs and actual times

Categories:

10x+ improvement: Full scan eliminated, correlated subquery flattened
5-10x: Better join order, missing index added
2-5x: Query restructured, covering index
<2x: Minor optimization, already reasonably efficient

7. Output

{
  "original_query": "...",
  "optimized_query": "...",
  "dialect": "postgresql",
  "issues_found": [
    {
      "severity": "critical",
      "category": "full_table_scan",
      "description": "No index on orders.user_id causes sequential scan of 2M rows",
      "line": 5,
      "fix": "Add index on orders(user_id)"
    }
  ],
  "indexes": [
    {
      "ddl": "CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);",
      "reason": "Eliminates Seq Scan on orders in WHERE clause",
      "estimated_size": "~45MB for 2M rows"
    }
  ],
  "transformations_applied": [
    "Replaced correlated subquery with LEFT JOIN",
    "Moved DATE() function off indexed column",
    "Added LIMIT to unbounded query"
  ],
  "estimated_improvement": {
    "factor": "~50x",
    "before": "Sequential scan: ~850ms at 2M rows",
    "after": "Index scan: ~15ms"
  },
  "additional_recommendations": [
    "Run ANALYZE on orders table — statistics may be stale (estimated 1K rows vs actual 2M)",
    "Consider partitioning orders by created_at if table exceeds 100M rows"
  ],
  "warnings": [
    "Keyset pagination requires client to track last_seen_id"
  ]
}

Pitfalls

Don't assume column statistics without data — qualify estimates
Covering indexes help reads but hurt writes — always note the trade-off
PostgreSQL's query planner is very good — sometimes the "slow" plan is actually optimal for the data distribution
MySQL's optimizer is less sophisticated with subqueries — more aggressive rewriting needed
Index recommendations must account for existing indexes — don't recommend duplicates
Partial indexes (PostgreSQL) are powerful but invisible to some ORMs
CTEs in PostgreSQL <12 are optimization fences (materialized by default)
Window functions can't always be replaced by GROUP BY — semantics differ
EXPLAIN ANALYZE actually executes the query — warn about destructive DML
Some optimizations depend on data distribution (e.g., index selectivity)
Connection pooling and caching matter more than query optimization for some workloads