51.7 CO May 22
List Reconstruction Problem with List Size Two
Binh Vu, Shuche Wang, Van Khu Vu
The problem of computing the cardinality of the intersection of multiple balls in the Hamming space has attracted considerable attention recently due to its applications in the list reconstruction problem and in information retrieval in Associative Memories. Most previous results address the case where the radius $r$ of each ball and the distance $k$ between the centers of the balls are fixed while the codeword length $n$ tends to infinity. In this work, we focus on the case where $r = αn$ and $k = βn$ for some constants $α$ and $β$, and compute the maximum asymptotic rate of the cardinality of the intersection of three balls. We provide the maximum asymptotic rate as a function of the two parameters $α$ and $β$. We also provide numerical results and compare them with those for the intersection of two balls.
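For small $n$, the quantity studied in this abstract can be checked by exhaustive enumeration. The following is a minimal sketch, not the authors' method: it counts binary words of length $n$ that lie within Hamming distance $r$ of every given center. The function names `hamming` and `ball_intersection_size` are hypothetical, and the binary alphabet is an assumption.

```python
from itertools import product


def hamming(a, b):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def ball_intersection_size(centers, r):
    """Count binary words within distance r of every center (brute force).

    Enumerates all 2**n words, so it is only practical for small n.
    """
    n = len(centers[0])
    return sum(
        all(hamming(word, c) <= r for c in centers)
        for word in product((0, 1), repeat=n)
    )


if __name__ == "__main__":
    # Three centers of length n = 4 at pairwise distance k = 2, radius r = 1.
    centers = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0)]
    print(ball_intersection_size(centers[:2], 1))  # two balls
    print(ball_intersection_size(centers, 1))      # three balls
```

The enumeration costs $2^n$ distance checks per center, which is why asymptotic results such as those in the abstract are needed for large $n$.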
7.7 SI May 15
CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity
Gowrika Mahesh, Budanur Madappa Darshan Gowda, Kavana Gopladevarahalli Papegowda et al.
Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.
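The agreement statistic reported above, Cohen's kappa, compares observed agreement with the agreement expected by chance. The sketch below shows the standard formula on toy binary labels. The function name `cohens_kappa` and the sample labels are illustrative assumptions and are not taken from the CitePrism study.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human = [1, 1, 0, 0]   # toy labels: 1 = relevant, 0 = irrelevant
    system = [1, 0, 0, 0]
    print(cohens_kappa(human, system))
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported 0.429 sits between the two.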
30.0 SI May 10
Astro Generative Network: A Variational Framework for Controlled Node Insertion in Incomplete Complex Networks
Mehrdad Jalali, Binh Vu, Swati Chandna et al.
Empirical networked systems are often only partially observed: sampling frames, crawling policies, privacy constraints, and temporal gaps can leave actors and edges unobserved. This complicates robustness and sensitivity analysis because many graph-learning pipelines implicitly treat the observed node set as exhaustive. Link prediction and graph completion repair structure among known vertices, whereas full-graph generators synthesize new graphs rather than extending an observed one as a fixed backbone. We study the complementary task of controlled node insertion: generating plausible new actors and attaching them to an existing graph while preserving interpretable global topology. We introduce the Astro Generative Network (AGN), a variational graph autoencoder that samples latent vectors to decode node features and then integrates new vertices through similarity-based attachment to the observed backbone. We distinguish the recommended configuration, AGN, from AGN-original, a diagnostic baseline that permits generated-generated edges. Across three synthetic regimes, AGN-original forms dense generated-generated subgraphs that artificially inflate clustering and density. Disabling those edges removes this artifact while preserving degree and path-length behavior. In our experiments, AGN keeps clustering and modularity changes modest relative to pre-insertion values, while novelty diagnostics show non-trivial separation from existing nodes without claiming domain-grounded identities. Our contribution is methodological: a reproducible insertion protocol and evaluation lens for incomplete network science and engineering.
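The attachment step described above can be illustrated with a small sketch. It is not the AGN implementation: cosine similarity, the top-`k` rule, and the names `cosine` and `attach_new_nodes` are all assumptions made for illustration. The one property taken from the abstract is that generated nodes connect only to the observed backbone, never to each other.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attach_new_nodes(backbone_features, new_features, k=2):
    """Link each generated node to its k most similar backbone nodes.

    Candidate neighbors come only from the backbone, so no
    generated-generated edge can be created.
    """
    edges = []
    for new_id, feat in new_features.items():
        ranked = sorted(
            backbone_features,
            key=lambda node: cosine(feat, backbone_features[node]),
            reverse=True,
        )
        edges.extend((new_id, node) for node in ranked[:k])
    return edges


if __name__ == "__main__":
    backbone = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    generated = {"x": [1.0, 0.1], "y": [0.1, 1.0]}
    print(attach_new_nodes(backbone, generated, k=2))
```

In an actual pipeline the generated feature vectors would come from the decoder of the variational graph autoencoder; here they are hard-coded toy values.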
DL Dec 19, 2025
Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation
Binh Vu
The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.
CL Dec 16, 2025
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang et al.
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, we introduce the Vietnamese Legal Benchmark (VLegal-Bench), the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, in which legal experts label and cross-validate each instance using our annotation system. This ensures that every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems. To facilitate access and reproducibility, we provide a public landing page for this benchmark at https://vilegalbench.cmcai.vn/.
HC Nov 24, 2021
Picasso: Model-free Feature Visualization
Binh Vu, Igor Markov
Today, Machine Learning (ML) applications can have access to tens of thousands of features. With such feature sets, efficiently browsing and curating subsets of the most relevant features is a challenge. In this paper, we present a novel approach to visualizing up to several thousand features in a single image. The image not only shows information on individual features, but also expresses feature interactions via the relative positioning of features.