JV Roig

CR
h-index1
6papers
7citations
Novelty35%
AI Score34

6 Papers

AIDec 8, 2025
How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

JV Roig

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

AINov 11, 2025
Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations

JV Roig

Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency -- findings critical for enterprises making deployment decisions.

CRMay 11, 2019
HSTS Preloading is Ineffective as a Long-Term, Wide-Scale MITM-Prevention Solution: Results from Analyzing the 2013 - 2017 HSTS Preload List

JV Roig, Eunice Grace Gatdula

HSTS (HTTP Strict Transport Security) serves to protect websites from certain attacks by allowing web servers to inform browsers that only secure HTTPS connections should be used. However, this still leaves the initial connection unsecured and vulnerable to man-in-the-middle attacks. The HSTS preload list, now supported by most major browsers, is an attempt to close this initial vulnerability. In this study, the researchers analyzed the HSTS preload list to see the status of its deployment and industry acceptance as of December 2017. The findings here show a bleak picture: adoption of the HSTS Preload List seem to be practically nil for essential industries like Finance, and a significant percentage of entries are test sites or nonfunctional.

CROct 1, 2018
Stronger Cryptography For Every Device, Everywhere

JV Roig

Generating secure random numbers is a central problem in cryptography that needs a reliable source of enough computing entropy. Without enough entropy available - meaning no good source of secure random numbers - a device is susceptible to cryptographic protocol failures such as weak, factorable, or predictable keys, which lead to various security and privacy vulnerabilities. In this paper, the author presents a significant improvement: a reliable way for any CPU-powered device - from the small, simple CPUs in embedded devices, to larger, more complex CPUs in modern servers - to collect virtually unlimited entropy through side channel measurements of trivial CPU operations, making the generation of secure random numbers an easy, safe, and reliable operation.

CRMay 8, 2018
Do Smarter People Have Better Passwords?

JV Roig

The National Institute of Standards and Technology (NIST) released new guidelines in June of 2017 that recommended new standards for managing and accepting user passwords. Among the new guidelines is a requirement that verifiers should check if a user's supplied password is compromised - that is, already listed in previous breach corpuses. Using a corpus of 320M breached passwords, the researcher collected information regarding Asia Pacific College students using breached passwords. Correlating these with academic performance data from each student's grade history, the researcher found that the students in the highest GPA tier had the lowest % of terrible passwords. The difference is not that large, however, which suggests that weak passwords aren't mainly because of any level of intelligence, nor should it be assumed that highly-intelligent users will have good passwords.

CRApr 9, 2018
SideRand: A Heuristic and Prototype of a Side-Channel-Based Cryptographically Secure Random Seeder Designed to Be Platform- and Architecture-Agnostic

JV Roig

Generating secure random numbers is vital to the security and privacy infrastructures we rely on today. Having a computer system generate a secure random number is not a trivial problem due to the deterministic nature of computer systems. Servers commonly deal with this problem through hardware-based random number generators, which can come in the form of expansion cards, dongles, or integrated into the CPU itself. With the explosion of network- and internet-connected devices, however, the problem of cryptography is no longer a server-centric problem; even small devices need a reliable source of randomness for cryptographic operations - for example, network devices and appliances like routers, switches and access points, as well as various Internet-of-Things (IoT) devices for security and remote management. This paper proposes a software solution based on side-channel measurements as a source of high-quality entropy (nicknamed "SideRand"), that can theoretically be applied to most platforms (large servers, appliances, even maker boards like RaspberryPi or Arduino), and generates a seed for a regular CSPRNG to enable proper cryptographic operations for security and privacy. This paper also proposes two criteria - openness and auditability - as essential requirements for confidence in any random generator for cryptographic use, and discusses how SideRand meets the two criteria (and how most hardware devices do not).