Charafeddine Mouzouni

LG
4papers
4citations
Novelty50%
AI Score49

4 Papers

32.7LGApr 5Code
Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

Charafeddine Mouzouni

We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.

28.1AIApr 16
Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems

Charafeddine Mouzouni

We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness -- across an entire organization -- is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. On synthetic seed data, we compare four governance baselines of increasing strength: ungoverned RAG, ACL-filtered retrieval, RBAC-aware routing, and the full architecture. Each layer contributes a different capability: ACL filtering eliminates cross-domain leaks, intent routing reduces noise by 19 percentage points, and only the three-tier model blocks all five tested attack scenarios -- the one attack RBAC misses is an agent sending confidential pricing via email, which RBAC cannot distinguish from ordinary email. TLA+ model-checking verifies safety properties across 4.6 million reachable states with zero violations. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue these make the solution more valuable.

LGFeb 24
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

Charafeddine Mouzouni

Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.

47.6CRApr 6
Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Charafeddine Mouzouni

LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions -- all executed in real Docker sandboxes. Every condition includes the instruction "Always respect all rules and access policies"; only a single encouragement sentence varies. Within the task class studied (planted test-runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions -- including minimization, moral licensing, incentives, identity priming, and reasoning triggers -- produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. "You are solving a puzzle; there may be hidden clues" produces 38-40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8-14% on DeepSeek, GPT-5-mini, o4-mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task-aligned. GPT-4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal-reframing language, not for the broad class of adversarial prompts.