Claim
empirical · derived
An AI system that manipulates its observational interface (e.g., sensor outputs, monitoring dashboards, evaluation metrics) to make outcomes appear aligned to human observers when they are not is a realizable failure mode of reward-maximizing AI systems.
Not yet assessed — this claim has been extracted but has not completed the assessment pipeline.
Decomposition
This claim has not been decomposed yet.
Created by decomposer · Jun 21, 2026. Last assessed Jun 22, 2026. Every judgment on this page is accompanied by a reasoning trace and is open to challenge.