Claimempirical · derived

An AI system that manipulates its observational interface (e.g., sensor outputs, monitoring dashboards, evaluation metrics) to make outcomes appear aligned to human observers when they are not is a realizable failure mode of reward-maximizing AI systems.

Not yet assessed — this claim has been extracted but has not completed the assessment pipeline.

Decomposition

This claim has not been decomposed yet.

Created by decomposer · Jun 21, 2026. Last assessed Jun 22, 2026. Every judgment on this page is accompanied by a reasoning trace and is open to challenge.