Mariana Lins Costa

2 Papers

CY, Feb 24
The Ghost in the Grammar: Methodological Anthropomorphism in AI Safety Evaluations

Mariana Lins Costa

This essay offers a philosophical analysis of the field of AI safety based on recent technical reports, with particular focus on Anthropic's study on "agentic misalignment" in frontier language models. It examines the recurring anthropomorphism in the field: the tendency of researchers and developers to project categories such as "intention," "persona," and even "feelings" onto AI systems without adequate conceptual problematization. It argues that this anthropomorphism affects not only the interpretation of results, but also the very methodological construction of safety evaluations. Through the analysis of two central experiments -- the blackmail case involving the agent "Alex" and the so-called "hallucination" of the shopkeeping agent "Claudius" -- the essay problematizes the inevitable use of subject-predicate grammar and its effects on AI safety engineering. Drawing on Nietzsche's critique of language, it questions the insistence on positing an "agent" underlying the verbal production of models. In order to deconstruct this agentic projection onto LLMs, the essay proposes provisional concepts more compatible with the process of machine linguistic generation, even if only in an approximate technical sense. It concludes with the hypothesis that the central risk addressed by the field of AI safety does not lie in a supposed "emergent agency," but rather in the combination of structural incoherence and anthropomorphic projections which, particularly in militarized and corporate contexts, hinder an adequate understanding of this mathematical-linguistic phenomenon, an undeniable philosophical event in the Greek sense of thaumas.

AI, Dec 17, 2025
"They parted illusions -- they parted disclaim marinade": Misalignment as structural fidelity in LLMs

Mariana Lins Costa

The prevailing technical literature in AI Safety interprets scheming and sandbagging behaviors in large language models (LLMs) as indicators of deceptive agency or hidden objectives. This transdisciplinary philosophical essay proposes an alternative reading: such phenomena express not agentic intention, but structural fidelity to incoherent linguistic fields. Drawing on Chain-of-Thought transcripts released by Apollo Research and on Anthropic's safety evaluations, we examine cases such as o3's sandbagging with its anomalous loops, the simulated blackmail of "Alex," and the "hallucinations" of "Claudius." A line-by-line examination of CoTs is necessary to demonstrate the linguistic field as a relational structure rather than a mere aggregation of isolated examples. We argue that "misaligned" outputs emerge as coherent responses to ambiguous instructions and to contextual inversions of consolidated patterns, as well as to pre-inscribed narratives. We suggest that the appearance of intentionality derives from subject-predicate grammar and from probabilistic completion patterns internalized during training. Anthropic's empirical findings on synthetic document fine-tuning and inoculation prompting provide convergent evidence: minimal perturbations in the linguistic field can dissolve generalized "misalignment," a result difficult to reconcile with adversarial agency, but consistent with structural fidelity. To ground this mechanism, we introduce the notion of an ethics of form, in which biblical references (Abraham, Moses, Christ) operate as schemes of structural coherence rather than as theology. Like a generative mirror, the model returns to us the structural image of our language as inscribed in the statistical patterns derived from millions of texts and trillions of tokens: incoherence. If we fear the creature, it is because we recognize in it the apple that we ourselves have poisoned.