Skip to content

craked-code/adversarial_code_validator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

code_validator

A multi-agent system for automated code validation using adversarial test generation, iterative refinement, and branch coverage guidance. Built with Qwen 2.5 Coder 7B running locally via Ollama.


How It Works

The system consists of three agents operating in a loop:

Generator produces test cases for the current version of the code. Adversary attempts to find a failing input, optionally guided by which branches remain uncovered. Refiner fixes the code when the adversarial test fails. This loop continues until the code passes two consecutive adversarial tests while meeting a branch coverage threshold (when coverage guidance is enabled).


Benchmark

To evaluate whether branch coverage guidance improves fault detection, we ran experiments on 10 Python functions, each containing a single intentionally injected bug. Each function was tested under two conditions: with coverage guidance (the Adversary is told which branches are uncovered) and without coverage guidance (the Adversary receives no structural information). Every condition was run with 5 independent random seeds to account for LLM non-determinism.

Injected Bugs

Function Bug Description
safe_divide Checks b == 1 instead of b == 0 for zero guard
classify_triangle Equilateral check uses a == b instead of a == b == c
fizzbuzz Uses % 12 instead of % 15 for fizzbuzz case
find_max Checks len == 1 instead of len == 0 for empty list
letter_grade B grade threshold is >= 79 instead of >= 80
is_leap_year Uses % 200 instead of % 400 for century leap year
count_vowels Vowel set "aeio" is missing 'u'
temp_convert Fahrenheit-to-Celsius uses (temp + 32) instead of (temp - 32)
clamp_value Upper bound check uses > instead of >=
absolute_value Returns -n for positive numbers (swapped sign logic)

Results

All numbers are averages across 5 seeds per condition.

Final Branch Coverage

Function With Coverage Without Coverage
safe_divide 100.0% 0.0%
classify_triangle 90.0% 0.0%
fizzbuzz 91.5% 0.0%
find_max 43.3% 0.0%
letter_grade 91.7% 0.0%
is_leap_year 90.7% 0.0%
count_vowels 75.0% 0.0%
temp_convert 60.0% 0.0%
clamp_value 91.7% 0.0%
absolute_value 93.3% 0.0%
Average 82.7% 0.0%

Coverage is 0% in the without-coverage condition by design — the system does not measure it, it simply does not use it.

Bug Fix Rate (out of 5 seeds)

Function With Coverage Without Coverage
safe_divide 5/5 5/5
classify_triangle 5/5 1/5
fizzbuzz 5/5 1/5
find_max 4/5 2/5
letter_grade 5/5 0/5
is_leap_year 5/5 1/5
count_vowels 3/5 4/5
temp_convert 3/5 3/5
clamp_value 5/5 5/5
absolute_value 0/5 0/5
Total 40/50 (80%) 22/50 (44%)

Key Observations

Coverage guidance nearly doubled the bug fix rate — 80% vs 44% across all seeds. For functions with subtle boundary bugs (letter_grade, classify_triangle, is_leap_year, fizzbuzz), the guided condition fixed the bug consistently while the baseline either failed entirely or only got lucky on one seed.

letter_grade is the clearest example. The bug (threshold at 79 instead of 80) only manifests for an input of exactly 79. Without coverage guidance, the adversary never discovered this input across all 5 seeds — 0/5 fixed. With coverage guidance, branch feedback directed the adversary toward the boundary, and all 5 seeds fixed the bug.

absolute_value was unfixable under both conditions. The bug (returning -n for positive numbers, making the function always negate its input) requires understanding what absolute value means. The 7B model consistently treated the behavior as correct and never generated a test that expected a positive output for a positive input. This is a model-capacity limitation rather than a failure of the coverage mechanism.

clamp_value and safe_divide were fixed by both conditions equally. For safe_divide, the bug (checking b == 1 instead of b == 0) is caught quickly because dividing by zero crashes the executor, making it easy to discover without any structural guidance. For clamp_value, the >= vs > distinction is functionally invisible for most test inputs, so neither condition reliably exploited it — both fixed the structure of the function but not always the exact boundary behavior.


Setup

Requirements: Python 3.10+, Ollama installed locally, coverage and qwen2.5-coder:7b pulled via Ollama.

Install dependencies:

pip install coverage
ollama pull qwen2.5-coder:7b

Run on your own function:

python run_experiment.py

Paste your function, type END, choose whether to run with or without coverage guidance, and the system will validate it iteratively.

Run the full benchmark:

python run_benchmark.py

Results are saved as JSON files in the experiment_logs/ folder.


Project Structure

adversarial_validator/
├── agents/
│   ├── adversary.py         # Coverage-guided attack generation
│   ├── generator.py         # Baseline test case generation
│   └── refiner.py           # Code rewriting on failure
├── sandbox/
│   ├── executor.py          # Sandboxed code execution
│   └── coverage_runner.py   # Branch coverage measurement
├── experiment_logs/         # Auto-created; one JSON per run
├── orchestrator.py          # Main loop coordinating all agents
├── benchmark.py             # 10 functions with injected bugs
├── logger.py                # Saves per-iteration results to JSON
├── run_benchmark.py         # Runs all benchmark functions automatically
└── run_experiment.py        # Interactive entry point for a single function

Model

All experiments used Qwen 2.5 Coder 7B running locally on CPU via Ollama. No external API calls are made. The model is the same for all three agents (Generator, Adversary, Refiner) across all conditions.

About

Three-agent self-play system for automated code hardening — runs fully local via Ollama.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages