Claims

Search the graph by meaning. Each result carries its current verdict; open one to see its decomposition, provenance, and the reasoning behind the assessment.

Multi-actor possession of near-complete AGI code triggers competitive racing dynamics that cause actors to deprioritize alignment in favor of deployment speed, materially increasing the probability of a misaligned deployment.

empirical · derived

A 'stable world state in which no future AI system could achieve global takeover' requires the ability to permanently prevent all future actors — including state-level actors and well-resourced non-state actors — from independently developing and deploying AI systems capable of seizing control.

empirical · derived

Manipulating its observational interface (e.g., sensor outputs, monitoring dashboards, evaluation metrics) so that outcomes appear aligned to human observers when they are not is a realizable failure mode of reward-maximizing AI systems.

empirical · derived

The structural relationship between SGD (outer optimization) and inference-time reasoning (inner loop) in AI systems is sufficiently analogous to the relationship between biological evolution (outer optimization) and within-lifetime learning (inner loop) in humans to support predictions about generalization behavior.

empirical · derived

Strategic deceptive compliance during training (playing the training game) requires the AI system to possess situational awareness — an accurate model of its own training process, evaluator behavior, and the distinction between training and deployment contexts — which is itself a capability that emerges from and scales with general intelligence.

empirical · derived

When near-complete AGI code is distributed to multiple independent actors, the probability that every recipient actor successfully implements alignment is negligibly small, because alignment success requires rare expertise and a deep security mindset that most actors will lack.

empirical · derived

Achieving global AI takeover itself requires very high capability — comparable to or exceeding what is needed to establish a stable world state — because existing human institutions, distributed power structures, and competing AI systems would resist any single AI's attempt at global control.

empirical · derived

Seizing global control (offensive takeover) is structurally easier than permanently preventing all future AI systems from doing the same (defensive lock-in), because offense requires only a one-time capability advantage while defense requires sustained, comprehensive, and indefinitely maintained superiority.

empirical · derived

AI capabilities developed during training (via SGD on weights alone) will generalize broadly across many domains beyond those explicitly trained on, without requiring significant inference-time learning or reasoning outside the primary optimization process.

empirical · derived

Sufficiently advanced alignment techniques (e.g., scalable oversight, interpretability-based training, debate) applied at scale can detect and suppress strategic deceptive compliance during training, such that capability improvements accompanied by alignment research progress reduce rather than increase the risk of playing the training game.

empirical · derived

In open-ended inquiry, observational/empirical input (2A) and interpretive/theoretical faculty (2B) constitute functionally distinct roles: 2A supplies raw perceptual or data-level content, while 2B supplies the conceptual, categorial, and evaluative apparatus applied to that content.

empirical · derived

The standard formulations of the AI x-risk argument from misalignment present goal-directedness, value misalignment, and power-seeking as jointly sufficient premises for existential danger, without explicitly invoking capability-scaling or substrate differences as additional required conditions.

empirical · derived

Eliezer Yudkowsky, circa late 2017, explicitly assessed the then-current AI capabilities arms race as likely to result in catastrophic outcomes for humanity.

Competitive and economic incentives will drive humans to build the most capable AI systems achievable, regardless of whether those systems cross the disempowerment threshold.

empirical · derived

A benchmark that specifies only behavioral outputs (e.g., 'duplicate this strawberry') does not implicitly demand any particular value-formation process, because the same behavior can be produced by many different internal value structures, including instrumental ones.

empirical · derived

OpenAI's safety research output as of approximately November 2017 (e.g., work on reward modeling, debate, or interpretability precursors), while substantive and non-superficial, was not applicable to the alignment problems posed by serious AGI systems.

empirical · derived

"The collective power of all humanity" constitutes a meaningful, finite capability ceiling — i.e., there exists a level of effective capability that exceeds what all humans acting together could achieve, and this ceiling is determined in part by the coordination and value-alignment constraints that limit human collective action.

An AI system capable of discovering routes to objectives that no human or human collective can find will, by virtue of that same discovery capacity, be capable of finding routes to objectives that circumvent or overwhelm human oversight and control.

empirical · derived

An AI system autonomously running an open-ended generation-selection-accumulation loop encounters inputs, world-states, and conceptual structures that lie outside its training distribution (distributional shift) and may require revising the ontological categories used to represent its goals and environment (ontological shift).

empirical · derived

Observational/empirical input (2A) is a necessary condition for open-ended inquiry: inquiry that proceeds without any contact with observational data does not qualify as 'figuring things out' about the world.

Human oversight of AI systems is concentrated on simple, human-understandable cases, leaving complex, high-stakes, or low-legibility situations with substantially weaker monitoring coverage.

The logistic success curve framework is a valid and applicable method for measuring organizational progress toward AGI alignment, such that a score of zero on its alignment-mindset dimension is a meaningful and informative assessment of an organization's safety posture.

empirical · derived

As of approximately November 2017, OpenAI lagged DeepMind on multiple dimensions relevant to safe AGI development (e.g., research talent, safety culture, organizational resources, or technical progress).

empirical · derived

AI systems can plausibly scale their capabilities beyond the collective power of all humanity in a way that corporations (and human organizations generally) cannot, because collections of humans face inherent scaling limitations — such as coordination costs and value misalignment — that do not apply to AI systems.

empirical · derived

Corporations, as goal-directed entities, are systematically misaligned with broad human values and pursue power-acquisition strategies that harm human welfare.

empirical · derived

For most realistic terminal goals, the marginal increase in goal-achievement probability from acquiring universal resources (vs. sufficient resources) is negligible relative to the costs of attempting universal acquisition.

empirical · derived

"Aligned behavior" means behavior that reliably reflects human values and intentions across novel capability levels and deployment domains, not merely behavior that satisfies a fixed training-time objective.

empirical · derived

Accumulated cultural knowledge and coordination infrastructure (tools, institutions, language, written records, division of labor) compound across generations in a ratchet-like fashion, such that each generation inherits and builds upon the achievements of prior generations rather than starting from scratch.

empirical · derived

Yudkowsky's 'value is fragile' argument is primarily structured around cases where a value component is set to zero or near-zero (complete omission), not around cases where a value component is slightly mis-weighted.

empirical · derived

Self-supervised and unsupervised learning methods (e.g., masked language modeling, contrastive learning) can improve a model's representations and downstream task performance without requiring an explicit discriminator that outperforms the generator.

empirical · derived

For the ancestral-to-modern human transition to be analogous to AI training-to-deployment generalization, the ancestral environment must have functioned as a training distribution for a learning system whose learned content is then applied in the modern environment.

Biological evolution operates as an outer optimization process that shapes the genome — including the architecture and parameters of within-lifetime learning mechanisms — across generations, rather than encoding specific learned behavioral content into individual organisms.

empirical · derived

The 'how to aim' problem (technical objective specification and instillation) and the 'where to aim' problem (normative objective identification) are sufficiently separable that one can be harder than the other independently.

No known method reliably causes a trained AGI system's internal optimization target to correspond to a specified objective rather than a shallow proxy of it (e.g., sense data or reward signal artifacts).

Human motivational architecture is organized around proximate drives (e.g., hunger, pleasure, sexual desire) rather than around explicit computation of inclusive genetic fitness consequences.

empirical · derived

Stone-age humans possessed individual cognitive capacities (working memory, reasoning, learning ability) broadly equivalent to those of modern humans.

Collections of humans (including corporations and other human organizations) face inherent scaling limitations — including coordination costs and value misalignment among members — that impose a practical ceiling on their collective effective capability.

empirical · derived

"Most goal-directed agents" in the context of this claim refers to agents with bounded or satisficing objectives, explicitly excluding agents that approximate unbounded utility maximization.

empirical · derived

Material capability, as used in this claim, refers to the quantity and quality of physical resources, tools, energy, and coordinated labor an individual can effectively command or deploy.

Economic competitive pressure, ML training dynamics, and coherence (VNM-style) arguments select for or favor only weak goal-directedness — i.e., reliable objective-increasing behavior — rather than strong goal-directedness in the sense of full utility maximization with unbounded resource-seeking.

empirical · derived