File triage: classify files before expensive review #29

@haasonsaas

Problem

DiffScope reviews every changed file with the same model and at the same depth. CodeRabbit, by contrast, first uses a cheap model to classify each file as NEEDS_REVIEW or APPROVED (cosmetic/formatting-only change), and spends expensive model tokens only on files that need review. Skipping cosmetic files cuts token cost and removes low-value comments from the review output.

How CodeRabbit Does It

  1. A lightweight model classifies each changed file:
    • "Does this file contain logic/functionality changes, or is it purely cosmetic/formatting?"
    • Classification: NEEDS_REVIEW or APPROVED
  2. Files classified as APPROVED skip the detailed review entirely
  3. Rate limits: Free=150 files max, Pro=300 files max
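The classification step above could be sketched roughly as follows. This is not CodeRabbit's code: the `Verdict` type, the function names, and the exact prompt wording are assumptions made for illustration. The one design choice worth noting is that any reply other than a clear APPROVED falls back to a full review.

```rust
/// Verdict labels taken from the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    NeedsReview,
    Approved,
}

/// Builds the classification prompt for one file (hypothetical wording).
pub fn build_triage_prompt(path: &str, diff: &str) -> String {
    format!(
        "Does this file contain logic/functionality changes, or is it purely \
         cosmetic/formatting?\nAnswer with exactly one word: NEEDS_REVIEW or APPROVED.\n\n\
         File: {path}\n\n{diff}"
    )
}

/// Parses the cheap model's reply. Anything that is not a clear APPROVED
/// is treated as NeedsReview, so a garbled reply never skips a review.
pub fn parse_verdict(reply: &str) -> Verdict {
    if reply.trim().to_ascii_uppercase() == "APPROVED" {
        Verdict::Approved
    } else {
        Verdict::NeedsReview
    }
}

/// Keeps only the paths that still need the detailed review.
pub fn files_needing_review<'a>(verdicts: &[(&'a str, Verdict)]) -> Vec<&'a str> {
    verdicts
        .iter()
        .filter(|(_, verdict)| *verdict == Verdict::NeedsReview)
        .map(|(path, _)| *path)
        .collect()
}
```

Defaulting to `NeedsReview` trades a little cost for safety: a misbehaving cheap model can only cause extra reviews, never missed ones.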

How Qodo Does It

  • Files sorted by main language first, then by token count descending
  • patch_extension_skip_types = [".md", ".txt"] — certain file types auto-skipped
  • Deletion-only hunks removed via omit_deletion_hunks() before the expensive call
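A rough Rust sketch of the three preprocessing steps above, for comparison. It is not Qodo's implementation: `ChangedFile`, `Hunk`, and all function names are hypothetical, and token counts are assumed to be computed elsewhere.

```rust
/// One changed file; the fields are hypothetical stand-ins for real diff types.
#[derive(Debug, Clone)]
pub struct ChangedFile {
    pub path: String,
    pub language: String,
    pub token_count: usize,
}

/// A hunk as raw unified-diff body lines, each prefixed with '+', '-', or ' '.
#[derive(Debug, Clone)]
pub struct Hunk {
    pub lines: Vec<String>,
}

/// Drops files whose path ends with one of `skip_exts` (e.g. ".md", ".txt").
pub fn skip_by_extension(files: Vec<ChangedFile>, skip_exts: &[&str]) -> Vec<ChangedFile> {
    files
        .into_iter()
        .filter(|f| !skip_exts.iter().any(|ext| f.path.ends_with(*ext)))
        .collect()
}

/// Sorts files of the main language first, then by token count, largest first.
pub fn sort_for_review(files: &mut [ChangedFile], main_language: &str) {
    files.sort_by(|a, b| {
        let a_main = a.language == main_language;
        let b_main = b.language == main_language;
        // `true` sorts after `false`, so compare b to a to put main-language files first.
        b_main.cmp(&a_main).then(b.token_count.cmp(&a.token_count))
    });
}

/// Removes hunks that contain no added lines, i.e. hunks that only delete.
pub fn omit_deletion_only_hunks(hunks: Vec<Hunk>) -> Vec<Hunk> {
    hunks
        .into_iter()
        .filter(|h| h.lines.iter().any(|line| line.starts_with('+')))
        .collect()
}
```

All three steps are pure functions over already-parsed data, so they can run before any model call is made.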

Proposed Solution

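Add a triage step that runs before the main review and classifies each changed file, so that only files that need it receive the full, expensive review.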

Implementation

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TriageResult {
    NeedsReview,        // Logic/functionality change: full review
    Cosmetic,           // Formatting, whitespace, or comment-only change: skip
    ConfigChange,       // Config/env change: lightweight review
    TestOnly,           // Test change: review with different rules
    DeletionOnly,       // File deleted, or hunks that only remove lines: skip
    Generated,          // Auto-generated code: skip
}

async fn triage_file(
    diff: &UnifiedDiff,
    model: &ModelConfig,  // use weak/cheap model
) -> TriageResult {
    // Heuristic checks first (no LLM needed):
    // - All-whitespace changes → Cosmetic
    // - Deletion-only hunks → DeletionOnly
    // - Known generated file patterns → Generated
    // - Lock files, vendor dirs → Generated

    // LLM classification for ambiguous cases:
    // - "Is this a logic change or purely cosmetic?"
    // - Use cheap model (Haiku, GPT-4o-mini)

    // Placeholder (assumption) until the steps above are implemented:
    // fall back to a full review so that nothing is skipped by mistake.
    let _ = (diff, model);
    TriageResult::NeedsReview
}
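To make the "heuristics first, model only for ambiguous cases" flow concrete, here is a minimal synchronous sketch. `FileDiff` is a hypothetical simplification of `UnifiedDiff` (just added and removed line contents), the enum is repeated so the block compiles on its own, and the model call is passed in as a closure rather than tied to any real client.

```rust
/// Same variants as the proposed `TriageResult`, repeated so the sketch compiles alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TriageResult {
    NeedsReview,
    Cosmetic,
    ConfigChange,
    TestOnly,
    DeletionOnly,
    Generated,
}

/// Hypothetical simplified diff: line contents without the leading '+' or '-'.
pub struct FileDiff {
    pub added: Vec<String>,
    pub removed: Vec<String>,
}

/// Concatenates all lines with every whitespace character removed.
fn strip_whitespace(lines: &[String]) -> String {
    lines
        .iter()
        .flat_map(|line| line.chars())
        .filter(|c| !c.is_whitespace())
        .collect()
}

/// Returns a verdict when the diff content alone is decisive, `None` otherwise.
pub fn triage_by_content(diff: &FileDiff) -> Option<TriageResult> {
    if diff.added.is_empty() {
        // Nothing added: a pure deletion, or an empty diff that we leave undecided.
        return if diff.removed.is_empty() {
            None
        } else {
            Some(TriageResult::DeletionOnly)
        };
    }
    // Both sides are identical once whitespace is ignored.
    if strip_whitespace(&diff.added) == strip_whitespace(&diff.removed) {
        return Some(TriageResult::Cosmetic);
    }
    None
}

/// Heuristics first; the (cheap) model is consulted only for undecided diffs.
pub fn triage_with_fallback<F>(diff: &FileDiff, classify_with_model: F) -> TriageResult
where
    F: FnOnce(&FileDiff) -> TriageResult,
{
    triage_by_content(diff).unwrap_or_else(|| classify_with_model(diff))
}
```

One caveat with the whitespace check: in indentation-sensitive languages such as Python, or inside string literals, a whitespace-only change can alter behavior, so a real implementation would likely need a per-language opt-out.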

Heuristic-Only Triage (no LLM cost)

Many files can be triaged without any LLM call:

  • Lock files (Cargo.lock, package-lock.json, yarn.lock)
  • Generated code (.generated., _generated/)
  • Binary files
  • Deletion-only changes
  • Whitespace-only changes
  • Comment-only changes (parse for //, #, /* */ patterns)
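Several of the checks listed above need nothing more than the file path or the changed lines. The sketch below shows one possible shape; all function names are hypothetical, the binary check is a common NUL-byte heuristic rather than anything DiffScope-specific, and the comment check is deliberately naive (line comments only).

```rust
/// True for the lock files named in the list above, at any directory depth.
pub fn is_lock_file(path: &str) -> bool {
    let name = path.rsplit('/').next().unwrap_or(path);
    matches!(name, "Cargo.lock" | "package-lock.json" | "yarn.lock")
}

/// True for paths containing ".generated." or a "_generated/" directory.
pub fn is_generated(path: &str) -> bool {
    path.contains(".generated.")
        || path.starts_with("_generated/")
        || path.contains("/_generated/")
}

/// Rough binary check: a NUL byte somewhere in the first 8000 bytes.
pub fn looks_binary(content: &[u8]) -> bool {
    content.iter().take(8000).any(|&b| b == 0)
}

/// True when every changed line is blank or starts with one of `prefixes`
/// after trimming. Block comments (/* */) and comment markers inside string
/// literals are not handled here.
pub fn is_comment_only(changed_lines: &[&str], prefixes: &[&str]) -> bool {
    !changed_lines.is_empty()
        && changed_lines.iter().all(|line| {
            let trimmed = line.trim();
            trimmed.is_empty() || prefixes.iter().any(|p| trimmed.starts_with(*p))
        })
}
```

The comment prefixes are a parameter because they depend on the language: `#` starts a comment in Python or YAML, but in Rust it introduces an attribute such as `#[derive(Debug)]`, which is a real code change.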

Configuration

triage:
  enabled: true
  model: null  # null = use weak model, or specify explicitly
  skip_patterns:
    - "*.lock"
    - "*.generated.*"
    - "vendor/**"
  auto_approve:
    deletion_only: true
    whitespace_only: true
    comment_only: true
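
How `skip_patterns` would be matched is not specified above, so the following is only one possible interpretation. It assumes that `*` matches any run of characters except `/`, that `**` also matches across `/`, and that a pattern without a `/` is compared against the file name only. `glob_match` and `should_skip` are hypothetical names, and the matcher is a small hand-written one rather than an existing glob library.

```rust
/// Matches `text` against `pattern`: `*` matches any run of bytes except '/',
/// `**` matches any run of bytes including '/'. Everything else is literal.
fn glob_match(pattern: &[u8], text: &[u8]) -> bool {
    match pattern {
        [] => text.is_empty(),
        [b'*', b'*', rest @ ..] => {
            // Try every possible split point, slashes included.
            (0..=text.len()).any(|i| glob_match(rest, &text[i..]))
        }
        [b'*', rest @ ..] => {
            // Consume 0..i bytes, stopping before the consumed part would include a '/'.
            (0..=text.len())
                .take_while(|&i| i == 0 || text[i - 1] != b'/')
                .any(|i| glob_match(rest, &text[i..]))
        }
        [p, rest @ ..] => text.first() == Some(p) && glob_match(rest, &text[1..]),
    }
}

/// Assumed semantics (not specified in the issue): a pattern without '/' is
/// matched against the file name only, otherwise against the whole path.
pub fn should_skip(path: &str, skip_patterns: &[&str]) -> bool {
    let file_name = path.rsplit('/').next().unwrap_or(path);
    skip_patterns.iter().any(|pattern| {
        let target = if pattern.contains('/') { path } else { file_name };
        glob_match(pattern.as_bytes(), target.as_bytes())
    })
}
```

Under these assumed semantics, `"*.lock"` covers `Cargo.lock` and `yarn.lock` but not `package-lock.json`, so that file would need its own pattern or a dedicated lock-file check.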

Expected Impact

Priority

Medium — cost and noise reduction. Quick win that compounds with scale.
