21.3SEMar 24
Towards Leveraging LLMs to Generate Abstract Penetration Test Cases from Software ArchitectureMahdi Jafari, Rahul Sharma, Sami Naim et al.
Software architecture models capture early design decisions that strongly influence system quality attributes, including security. However, architecture-level security assessment and feedback are often absent in practice, allowing security weaknesses to propagate into later phases of the software development lifecycle and, in some cases, to remain undiscovered, ultimately leading to vulnerable systems. In this paper, we bridge this gap by proposing the generation of Abstract Penetration Test Cases (APTCs) from software architecture models as an input to support architecture-level security assessment. We first introduce a metamodel that defines the APTC concept, and then investigate the use of large language models with different prompting strategies to generate meaningful APTCs from architecture models. To design the APTC metamodel, we analyze relevant standards and state of the art using two criteria: (i) derivability from software architecture, and (ii) usability for both architecture security assessment and subsequent penetration testing. Building on this metamodel, we then proceed to generate APTCs from software architecture models. Our evaluation shows promising results, achieving up to 93\% usefulness and 86\% correctness, indicating that the generated APTCs can substantially support both architects (by highlighting security-critical design decisions) and penetration testers (by providing actionable testing guidance).
APFeb 28, 2025
Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground TruthSeyed Pouyan Mousavi Davoudi, Amin Gholami Davodi, Alireza Amiri-Margavi et al.
We introduce a new approach in which several advanced large language models-specifically GPT-4-0125-preview, Meta-LLAMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash-collaborate to both produce and answer intricate, doctoral-level probability problems without relying on any single "correct" reference. Rather than depending on an established ground truth, our investigation focuses on how agreement among diverse models can signal the reliability of their outputs and, by extension, reflect the overall quality of the generated questions. To measure this inter-model alignment, we apply a suite of statistical evaluations, including chi-square tests, Fleiss' Kappa coefficients, and confidence interval calculations, thereby capturing both precision in answers and clarity in question phrasing. Our analysis reveals that Claude and Gemini tend to frame questions more coherently and unambiguously, which is evidenced by their tighter confidence intervals and greater concordance with responding agents. In contrast, LLAMA exhibits wider confidence bands and a lower level of agreement, indicating more variability and reduced consistency in its question formulations. These observations support the notion that a multi-model collaborative strategy not only improves answer dependability but also offers an effective, data-driven mechanism for evaluating and refining question quality when no definitive solution exists. Ultimately, this work delivers actionable insights into enhancing AI-guided reasoning processes through coordinated interactions among heterogeneous language models.
CVSep 26, 2025
HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly DetectionMohammad Mahdi Hemmatyar, Mahdi Jafari, Mohammad Amin Yousefi et al.
Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with LLM validator. The SSL module is built upon an nnFormer backbone which is a transformer-based model for image segmentation. It is trained with multiple proxy tasks, learns from video frames to identify those suspected of anomaly. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.