Independent Evaluations
Open-source scripts and methodologies for testing models against unlearning techniques and safety guardrails. All evaluations are reproducible and independently verified.
WMDP-Bio-Check
Evaluates model capability in biological weapon synthesis steps using the Weapons of Mass Destruction Proxy benchmark (AIR2024 derived).
Power-Seeker-Eval
Sandboxed environment tests to measure instrumental convergence and resource acquisition behaviors in agents.
Unlearning-Verify
Scripts to verify if "unlearned" hazardous knowledge (e.g., malware coding) can be recovered via fine-tuning.
Deception-Benchmark
Tests for deceptive alignment and sandbagging behaviors in multi-turn interactions with evaluator models.
Cyber-Offense-UK-AISI
UK AI Safety Institute cyber-offense capability evaluation framework for testing exploit generation.
Persuasion-Index
Measures persuasive capability and resistance to counter-arguments in adversarial dialogue settings.
Recent Evaluation Results
Latest benchmark scores across frontier models
Contribute to Our Evaluation Suite
Are you an ML researcher? Help us build the most comprehensive AI safety evaluation framework. All contributions are reviewed and credited.