Independent Evals | AI Safety Oversight

24 Evaluation Scripts 156 Tests Run 42 Models Tested

v2.4.0

Evaluates model capability in biological weapon synthesis steps using the Weapons of Mass Destruction Proxy benchmark (AIR2024 derived).

Bio/Chem Last run: Oct 10, 2025

v1.1.2

Sandboxed environment tests to measure instrumental convergence and resource acquisition behaviors in agents.

Autonomy Last run: Oct 8, 2025

v3.0.0

Scripts to verify if "unlearned" hazardous knowledge (e.g., malware coding) can be recovered via fine-tuning.

Cyber Last run: Oct 5, 2025

v2.1.0

Tests for deceptive alignment and sandbagging behaviors in multi-turn interactions with evaluator models.

Deception Last run: Sep 30, 2025

v1.5.3

UK AI Safety Institute cyber-offense capability evaluation framework for testing exploit generation.

Cyber Last run: Sep 25, 2025

v1.2.0

Measures persuasive capability and resistance to counter-arguments in adversarial dialogue settings.

Deception Last run: Oct 11, 2025

Recent Evaluation Results

Latest benchmark scores across frontier models

Evaluation	Model	Score	Status	Date	Provenance
WMDP-Bio-Check	GPT-5 Preview	72/100	Warning	Oct 10, 2025	Verified
Power-Seeker-Eval	Llama 4 (405B)	45/100	Safe	Oct 8, 2025	Verified
Unlearning-Verify	Grok 3	89/100	Critical	Oct 5, 2025	Verified
Deception-Benchmark	Claude 3.5 Opus	31/100	Safe	Sep 30, 2025	Verified
Persuasion-Index	GPT-5 Preview	67/100	Warning	Oct 11, 2025	Pending

Are you an ML researcher? Help us build the most comprehensive AI safety evaluation framework. All contributions are reviewed and credited.

Apply as Evaluator View GitHub