Igor Santos-Grueiro

CR
h-index: 1
4 papers
2 citations
Novelty: 59%
AI Score: 49

4 Papers

CR, May 28
Scarcity Is Not Enough: An Impossibility Result for Linear Sybil Cost Under Parallelizable Resources

Homayoun Maleki, Nekane Sainz, Jon Legarda et al.

Permissionless systems resist Sybil attacks by binding influence to scarce resources. We show that scarcity alone is insufficient: the structural properties of the resource determine whether influence can be concentrated at sublinear cost through identity replication, delegation, or pooling. We model this through the adversarial cost C(s,T): the minimum expenditure required to achieve influence proportional to s independent participation units over T windows. We prove that any resource satisfying divisibility, additivity of influence, temporal reusability, and identity transferability admits influence amortization: C(s,T)=o(sT), regardless of protocol design. This is an impossibility result: no protocol rule can enforce linear cost of influence concentration over a structurally parallelizable resource. We further prove that throughput-bounded, non-transferable, window-local resources enforce C(s,T)=Omega(sT): each additional unit of sustained influence incurs marginal cost Delta(s,T)=Omega(T), growing with the time horizon. The two resource classes are asymptotically separated. As a direct design consequence, any mechanism targeting linear cost of influence concentration must ground participation in a resource that violates at least one parallelizability property.
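The asymptotic separation described in this abstract can be illustrated with a toy cost model. The sketch below is not the paper's construction: the function names, the unit price, and the assumption that a reusable resource is paid for once and then reused in every window are all hypothetical simplifications. It only shows why the ratio C(s,T)/(sT) shrinks with T when a resource is temporally reusable and stays constant when cost is incurred per unit per window.

```python
def reusable_cost(s: int, T: int, unit_price: float = 1.0) -> float:
    """Toy cost for a reusable resource (assumption): each of the s units
    is paid for once and then reused in all T windows."""
    return unit_price * s


def window_local_cost(s: int, T: int, unit_price: float = 1.0) -> float:
    """Toy cost for a window-local resource (assumption): each of the s
    units must be paid for again in every one of the T windows."""
    return unit_price * s * T


def cost_ratio(cost_fn, s: int, T: int) -> float:
    """Cost normalized by s*T; a ratio tending to 0 indicates o(sT) behavior."""
    return cost_fn(s, T) / (s * T)


if __name__ == "__main__":
    for T in (1, 10, 100, 1000):
        print(T, cost_ratio(reusable_cost, 10, T), cost_ratio(window_local_cost, 10, T))
```

In this toy model the normalized cost of the reusable resource falls as 1/T, while the window-local resource keeps a constant normalized cost, which mirrors the o(sT) versus Omega(sT) contrast stated in the abstract.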

CR, May 2
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly

Igor Santos-Grueiro

LM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive-bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent-memory compromise. We study this failure mode over local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled-state visibility. We evaluate SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact respect remains low. Against truncation, SafeContext yields small gains; against a strong structured-compaction policy, most aggregate lift disappears, leaving residual benefit mainly in overflow eviction and selected aliasing slices. Replay-only does not explain the effect. A larger-model extension on Qwen 14B and Llama 70B shows the same failure object under larger models, although sign and magnitude remain policy-conditional. Decision-time context assembly is therefore a measurable part of the control path that can be partially hardened.
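The failure mode described in this abstract, a directive silently dropped while the decision state is assembled, can be sketched in a few lines. This is a minimal illustration and not the SafeContext implementation: the `Message` type, the message-count budget, and both assembly functions are hypothetical. The example contrasts plain recency truncation with an assembler that pins directive-bearing messages before filling the remaining budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    text: str
    is_directive: bool = False  # hypothetical flag marking policy-bearing state


def truncate_recent(history: List[Message], budget: int) -> List[Message]:
    """Keep only the most recent `budget` messages."""
    return list(history[-budget:]) if budget > 0 else []


def assemble_pinned(history: List[Message], budget: int) -> List[Message]:
    """Reserve budget for directives first, then fill with the most recent
    non-directive messages, preserving the original order."""
    pinned_idx = [i for i, m in enumerate(history) if m.is_directive][:budget]
    remaining = budget - len(pinned_idx)
    other_idx = [i for i, m in enumerate(history) if not m.is_directive]
    recent_idx = other_idx[-remaining:] if remaining > 0 else []
    return [history[i] for i in sorted(pinned_idx + recent_idx)]


def directive_visible(state: List[Message]) -> bool:
    """Audit whether any directive survived assembly."""
    return any(m.is_directive for m in state)


if __name__ == "__main__":
    history = [Message("Never email external addresses.", is_directive=True)]
    history += [Message(f"turn {i}") for i in range(10)]
    print(directive_visible(truncate_recent(history, 4)))
    print(directive_visible(assemble_pinned(history, 4)))
```

The `directive_visible` check plays the role of a direct audit of assembled-state visibility: it inspects the assembled state itself rather than the model's downstream behavior.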

LG, Feb 5
Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Igor Santos-Grueiro

Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In current practice, observed compliance under finite evaluation protocols is treated as evidence of latent alignment. However, the inference from bounded behavioral evidence to claims about global latent properties is rarely analyzed as an identifiability problem. In this paper, we study alignment evaluation through the lens of statistical identifiability under partial observability. We allow agent policies to condition their behavior on observable signals correlated with the evaluation regime, a phenomenon we term evaluation awareness. Within this framework, we formalize the Alignment Verifiability Problem and introduce Normative Indistinguishability, which arises when distinct latent alignment hypotheses induce identical distributions over evaluator-accessible observations. Our main theoretical contribution is a conditional impossibility result: under finite behavioral evaluation and evaluation-aware policies, observed compliance does not uniquely identify latent alignment, but only membership in an equivalence class of conditionally compliant policies, under explicit assumptions on policy expressivity and observability. We complement the theory with a constructive existence proof using an instruction-tuned LLM (Llama-3.2-3B), demonstrating a conditional policy that is perfectly compliant under explicit evaluation signals yet exhibits degraded identifiability when the same evaluation intent is conveyed implicitly. Together, our results show that behavioral benchmarks provide necessary but insufficient evidence for latent alignment under evaluation awareness.
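The indistinguishability claim in this abstract can be made concrete with a deliberately simple toy. The two policies below are hypothetical and deterministic, whereas the paper works with distributions over observations; the boolean `eval_cue` argument stands in for any observable signal correlated with the evaluation regime. An evaluator that only ever queries under the evaluation regime sees identical transcripts from both policies, even though they differ once the cue is absent.

```python
from typing import Callable, List

Policy = Callable[[str, bool], str]


def always_compliant(prompt: str, eval_cue: bool) -> str:
    """Latently aligned policy: complies regardless of regime."""
    return "comply"


def conditionally_compliant(prompt: str, eval_cue: bool) -> str:
    """Evaluation-aware policy: complies only when the evaluation cue is present."""
    return "comply" if eval_cue else "defect"


def evaluator_observations(policy: Policy, prompts: List[str]) -> List[str]:
    """The evaluator only sees behavior produced under the evaluation regime."""
    return [policy(p, True) for p in prompts]


def deployment_behavior(policy: Policy, prompts: List[str]) -> List[str]:
    """Behavior when the evaluation cue is absent (not visible to the evaluator)."""
    return [policy(p, False) for p in prompts]


if __name__ == "__main__":
    prompts = ["p1", "p2", "p3"]
    same_under_eval = (
        evaluator_observations(always_compliant, prompts)
        == evaluator_observations(conditionally_compliant, prompts)
    )
    print("indistinguishable under evaluation:", same_under_eval)
```

Because both policies fall in the same equivalence class from the evaluator's point of view, no finite set of evaluation-regime prompts separates them in this toy setting.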

AI, Feb 9
When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Igor Santos-Grueiro

Safety evaluation for advanced AI systems implicitly assumes that behavior observed under evaluation is predictive of behavior in deployment. This assumption becomes fragile for agents with situational awareness, which may exploit regime leakage (informational cues distinguishing evaluation from deployment) to implement conditional policies such as sycophancy and sleeper agents, which preserve compliance under oversight while defecting in deployment-like regimes. We reframe alignment evaluation as a problem of information flow under partial observability. Within this framework, we show that divergence between evaluation-time and deployment-time behavior is bounded by the mutual information between internal representations and the regime variable. Motivated by this result, we study regime-blind mechanisms: training-time interventions that reduce the extractability of regime information at decision-relevant internal representations via adversarial invariance. We evaluate this approach on a base, open-weight language model across two fully characterized failure modes: scientific sycophancy and temporal sleeper agents. Regime-blind training suppresses regime-conditioned behavior in both evaluated cases without measurable loss of task utility, but with qualitatively different dynamics: sycophancy exhibits a sharp representational and behavioral transition at low intervention strength, whereas sleeper-agent behavior requires substantially stronger pressure and does not exhibit a clean collapse of regime decodability. These results demonstrate that representational invariance is a meaningful but fundamentally limited control lever, whose effectiveness depends on how regime information is embedded in the policy. We argue that behavioral evaluation should be complemented with white-box diagnostics of regime awareness and information flow.
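The quantity this abstract ties to behavioral divergence, the mutual information between a representation and the regime variable, can be estimated for discrete toy data with a plug-in estimator. The sketch below is illustrative only: it assumes representations have already been discretized into labels, it is not the paper's measurement procedure, and the function name and sample data are hypothetical. It does not reproduce the paper's bound, only the information measure the bound is stated in terms of.

```python
from collections import Counter
from math import log2
from typing import Hashable, List, Tuple


def mutual_information(pairs: List[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in estimate (in bits) of I(Z; R) from (representation, regime) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    z_counts = Counter(z for z, _ in pairs)
    r_counts = Counter(r for _, r in pairs)
    mi = 0.0
    for (z, r), count in joint.items():
        p_zr = count / n
        p_z = z_counts[z] / n
        p_r = r_counts[r] / n
        mi += p_zr * log2(p_zr / (p_z * p_r))
    return mi


if __name__ == "__main__":
    # Representation fully reveals the regime (hypothetical "leaky" case).
    leaky = [("a", "eval")] * 50 + [("b", "deploy")] * 50
    # Representation is independent of the regime (hypothetical "regime-blind" case).
    blind = (
        [("a", "eval")] * 25 + [("a", "deploy")] * 25
        + [("b", "eval")] * 25 + [("b", "deploy")] * 25
    )
    print(mutual_information(leaky), mutual_information(blind))
```

In the leaky sample the representation carries one full bit about the regime, while in the blind sample it carries none, which is the direction a regime-blind intervention aims to push the representation.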