by 2ndOpinion Team

How Consensus Review Catches Bugs That Single AI Misses

When Claude, Codex, and Gemini disagree about your code, that's where the most valuable insights hide.

consensus · bug-detection · case-study · security

The assumption behind most AI code review tools is simple: pick the best model, send it your diff, and trust the output. But "best" depends entirely on what kind of bug you are trying to find. In practice, the bugs that matter most are the ones your chosen model is not good at catching.

This article walks through three real patterns where consensus review — running Claude, Codex, and Gemini independently on the same diff — catches issues that any single model misses.

Case 1: SQL Injection Hidden Behind an ORM

Here is a pattern that shows up more often than you would expect. A developer adds a search feature using an ORM but falls back to a raw query for a complex filter:

app.get('/api/users/search', async (req, res) => {
  const { query, role } = req.query;

  // ORM handles basic search safely
  const users = await db.users.findMany({
    where: { name: { contains: query } },
  });

  // But this raw query for role filtering is vulnerable:
  // $queryRawUnsafe interpolates user input directly into the SQL string
  // (unlike the tagged-template $queryRaw, which parameterizes values)
  if (role) {
    const filtered = await db.$queryRawUnsafe(`
      SELECT * FROM users
      WHERE name LIKE '%${query}%'
      AND role = '${role}'
    `);
    return res.json(filtered);
  }

  return res.json(users);
});

Here is what each model flagged independently:

Claude identified the SQL injection immediately. It traced the data flow from req.query through to the raw SQL string and flagged both query and role as unsanitized user input embedded directly in a SQL template. Severity: critical.

Codex flagged the code as having a "mixed query pattern" — using the ORM for one path and raw SQL for another. It recommended consolidating to ORM-only queries for consistency but did not explicitly call out the injection vulnerability. Severity: medium.

Gemini approved the change with a suggestion to add input validation on the role parameter, noting it should be checked against an enum of valid roles. It did not flag the SQL injection in the query parameter.

The Consensus Output

The consensus algorithm clustered Claude's SQL injection finding as a standalone critical issue (no matching findings from other models — a disagreement). Codex's "mixed query pattern" was flagged as a separate medium-severity concern. Gemini's role validation suggestion was a third finding.

The final recommendation: reject, driven by Claude's critical finding.

If this team were using only Codex or only Gemini for code review, the SQL injection would have shipped. The consensus disagreement — Claude seeing something the other two missed — was the most valuable signal in the entire review.
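A fix that addresses all three findings is to drop the raw query entirely and let the ORM parameterize the role filter, validating it against a known set first. A minimal sketch, assuming a Prisma-style client and a hypothetical set of valid roles:

```typescript
// Hypothetical role set -- adjust to match your schema
const VALID_ROLES = new Set(['admin', 'member', 'viewer']);

// Build a parameterized `where` clause instead of concatenating SQL.
// Unknown roles are rejected up front (Gemini's enum-validation suggestion).
function buildUserSearchWhere(query: string, role?: string) {
  if (role !== undefined && !VALID_ROLES.has(role)) {
    throw new Error(`Invalid role: ${role}`);
  }
  return {
    name: { contains: query },
    ...(role !== undefined ? { role } : {}),
  };
}

// Usage with a Prisma-style client (sketch):
// const users = await db.users.findMany({ where: buildUserSearchWhere(query, role) });
```

Consolidating to ORM-only queries also resolves Codex's "mixed query pattern" concern as a side effect.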

Case 2: Race Condition in a Credit System

Consider a credit deduction function where users pay credits to run AI reviews:

async function deductCredits(userId: string, amount: number) {
  const balance = await db.query(
    'SELECT credits FROM credit_balance WHERE user_id = $1',
    [userId]
  );

  if (balance.rows[0].credits < amount) {
    throw new Error('Insufficient credits');
  }

  await db.query(
    'UPDATE credit_balance SET credits = credits - $1 WHERE user_id = $2',
    [amount, userId]
  );
}

Claude approved the change, noting the balance check and deduction were logically sound.

Codex flagged a race condition. It identified that between the SELECT and the UPDATE, another concurrent request could read the same balance, pass the check, and both would deduct — allowing a user to spend more credits than they have. It recommended using SELECT ... FOR UPDATE or an atomic UPDATE ... WHERE credits >= $1 pattern. Severity: high.

Gemini approved the change but suggested adding error handling for the case where balance.rows[0] is undefined (new user with no credit record).

The Consensus Output

Codex's race condition was flagged as a high-severity disagreement — neither Claude nor Gemini caught it. Gemini's null-check suggestion was a separate low-severity finding.

The final recommendation: review, driven by Codex's high-severity finding.

This is a textbook example of model specialization. Codex's deep training on code patterns made it highly sensitive to the check-then-act race condition, a pattern Claude's more reasoning-focused analysis and Gemini's documentation-oriented analysis both missed.
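The atomic pattern Codex recommended collapses the check and the deduction into a single statement. A minimal sketch, assuming a node-postgres-style client — the `Db` interface here is a local stand-in, not a real library type:

```typescript
interface Db {
  query(
    sql: string,
    params: unknown[]
  ): Promise<{ rowCount: number; rows: { credits: number }[] }>;
}

// The WHERE clause both checks the balance and guards the deduction,
// so two concurrent requests can no longer both pass a stale check.
async function deductCredits(db: Db, userId: string, amount: number) {
  const result = await db.query(
    `UPDATE credit_balance
        SET credits = credits - $1
      WHERE user_id = $2 AND credits >= $1
      RETURNING credits`,
    [amount, userId]
  );
  if (result.rowCount === 0) {
    throw new Error('Insufficient credits');
  }
  return result.rows[0].credits;
}
```

Row-level locking with `SELECT ... FOR UPDATE` inside a transaction also works, but the single-statement form is simpler when the deduction is the only write.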

Case 3: Authentication Bypass via Header Manipulation

A middleware function checks for admin access:

function requireAdmin(req: Request, res: Response, next: NextFunction) {
  const user = req.user;
  const isAdmin = req.headers['x-admin-override'] === 'true' || user?.role === 'admin';

  if (!isAdmin) {
    return res.status(403).json({ error: 'Forbidden' });
  }

  next();
}

Claude immediately flagged the x-admin-override header as a critical security issue. Any user (or attacker) can set arbitrary HTTP headers, so this header effectively bypasses the entire admin check. It recommended removing the header check entirely and relying solely on authenticated role verification.

Codex flagged the same issue, noting that the header-based override looked like a leftover debug mechanism that should not be in production code. Severity: critical.

Gemini flagged the header as a "potential security concern" and suggested adding documentation about when it should be used. Severity: medium.

The Consensus Output

This time, all three models found the issue, but with different severity assessments. The consensus algorithm clustered all three findings together (same category, same file, overlapping descriptions) and created a single agreement finding. The final severity was critical (worst-case across all models).

The final recommendation: reject.

When all three models agree, the signal is unambiguous. But notice the nuance: Gemini treated it as a documentation issue rather than a security vulnerability. If you were using only Gemini, you might add a comment explaining the header and ship the bypass to production. The consensus severity override catches this.
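The fix all three models point toward is to delete the header check and rely only on the authenticated user's role. A minimal sketch with local stand-in types (real code would use Express's `Request`, `Response`, and `NextFunction`):

```typescript
// Local stand-in types so the sketch is self-contained
interface Req { user?: { role: string } }
interface Res { status(code: number): { json(body: unknown): void } }
type Next = () => void;

// Headers are attacker-controlled, so the authenticated role is the
// only input that can be trusted here.
function requireAdmin(req: Req, res: Res, next: Next) {
  if (req.user?.role !== 'admin') {
    res.status(403).json({ error: 'Forbidden' });
    return;
  }
  next();
}
```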

How the Consensus Algorithm Weighs Findings

Understanding the algorithm helps you interpret the output:

  1. Agreements (2+ models) are listed first with high confidence. These are the findings you should address before merging.

  2. Disagreements (1 model only) are listed separately. These are not automatically false positives — often they are findings that only one model's particular strengths surfaced. Review them carefully, especially when the lone model rates the issue as high or critical severity.

  3. The recommendation is always the most conservative across all models. If two models say "approve" and one says "reject," the consensus says "reject." This is intentional — the cost of reviewing a false positive is far lower than the cost of shipping a real vulnerability.

  4. Risk clustering groups related findings so you do not see the same issue reported three times in different words. Instead, you see one finding with a note about which models flagged it.

The Output Format

A consensus review returns structured JSON that includes:

{
  "consensus": {
    "recommendation": "reject",
    "agreements": [
      {
        "category": "security",
        "severity": "critical",
        "description": "Admin bypass via x-admin-override header",
        "models": ["claude", "codex", "gemini"],
        "files": ["src/middleware/auth.ts"]
      }
    ],
    "disagreements": [
      {
        "category": "performance",
        "severity": "medium",
        "description": "Database query inside loop",
        "model": "codex",
        "files": ["src/services/batch.ts"]
      }
    ]
  },
  "individual": {
    "claude": { "recommendation": "reject", "risks": [...] },
    "codex": { "recommendation": "reject", "risks": [...] },
    "gemini": { "recommendation": "review", "risks": [...] }
  }
}

You get both the synthesized consensus and the individual model outputs, so you can drill into any specific model's reasoning when you need more context.
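The structured output also makes it easy to script your own triage. For example, pulling out the lone high- or critical-severity disagreements that deserve a mandatory human look — the `Finding` interface here mirrors the disagreement objects in the JSON above:

```typescript
interface Finding {
  category: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
  description: string;
  model: string;
  files: string[];
}

// Disagreements are single-model findings; the severe ones warrant
// human review even though only one model flagged them.
function urgentDisagreements(disagreements: Finding[]): Finding[] {
  return disagreements.filter(
    (d) => d.severity === 'high' || d.severity === 'critical'
  );
}
```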

When Consensus Matters Most

Not every diff needs three models. Here is when the extra cost pays for itself:

  • Security-sensitive changes — auth, permissions, API keys, encryption, input validation
  • Concurrency and state management — anything with shared mutable state, locks, queues, caching
  • Payment and billing logic — credit deductions, subscription changes, webhook handlers
  • Data migrations — schema changes, backfills, one-time scripts that run against production data
  • Public API changes — breaking changes, new endpoints, modified contracts

For straightforward changes — adding a log line, updating copy, bumping a dependency — a single-model review at 1 credit is plenty.

Try Consensus Review

The fastest way to see this in action is the 2ndOpinion playground. Paste any diff, select "Consensus Review," and watch three models analyze your code independently before the algorithm synthesizes their findings.

For integration into your workflow, check out the getting started guide — you can be running consensus reviews from your editor in under five minutes.