AI | Jun 10, 2025 | Code
A Sample Efficient Conditional Independence Test in the Presence of Discretization
Boyang Sun, Yu Yao, Xinshuai Dong et al.
In many real-world scenarios, variables of interest are observed only as discretized values due to measurement limitations. Applying conditional independence (CI) tests directly to such discretized data, however, can lead to incorrect conclusions. To address this, recent work has sought to infer the correct CI relationships among the latent variables by binarizing the observed data. However, binarization inevitably loses information, which degrades the test's performance. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on binarization. We find that the independence relationships of latent continuous variables can be established by addressing an over-identifying restriction problem with the Generalized Method of Moments (GMM). Based on this insight, we derive an appropriate test statistic and, by leveraging nodewise regression, establish its asymptotic distribution, which correctly reflects CI. Theoretical findings and empirical results across various datasets demonstrate the superiority and effectiveness of the proposed test. Our code is available at https://github.com/boyangaaaaa/DCT
ML | Apr 26, 2024
A Conditional Independence Test in the Presence of Discretization
Boyang Sun, Yu Yao, Guang-Yuan Hao et al.
Testing conditional independence has many applications, such as Bayesian network learning and causal discovery, and a variety of test methods have been proposed. However, existing methods generally fail when only discretized observations are available. Specifically, suppose $X_1$, $\tilde{X}_2$ and $X_3$ are observed variables, where $\tilde{X}_2$ is a discretization of the latent variable $X_2$. Applying existing test methods to the observations of $X_1$, $\tilde{X}_2$ and $X_3$ can lead to a false conclusion about the underlying conditional independence of the variables $X_1$, $X_2$ and $X_3$. Motivated by this, we propose a conditional independence test specifically designed to accommodate such discretization. To achieve this, we design bridge equations that recover the parameter reflecting the statistical information of the underlying latent continuous variables. We also derive an appropriate test statistic and its asymptotic distribution under the null hypothesis of conditional independence. Theoretical results and empirical validation demonstrate the effectiveness of our test.
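The failure mode described above can be illustrated with a small simulation. This is only a sketch under assumed settings, not the test proposed in the paper: the linear-Gaussian data-generating process, the sample size, the binarization threshold, and the helper name `partial_corr` are all hypothetical choices made for illustration. $X_1$ and $X_3$ are independent given the latent $X_2$, yet conditioning on the discretized $\tilde{X}_2$ leaves a clearly nonzero partial correlation.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing each on z (with intercept)."""
    design = np.column_stack([np.ones_like(z, dtype=float), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

rng = np.random.default_rng(0)
n = 20_000                              # assumed sample size
x2 = rng.standard_normal(n)             # latent continuous variable X_2
x1 = x2 + rng.standard_normal(n)        # X_1 depends on X_2 only
x3 = x2 + rng.standard_normal(n)        # X_3 depends on X_2 only
x2_tilde = (x2 > 0).astype(float)       # observed discretization of X_2 (assumed threshold 0)

r_latent = partial_corr(x1, x3, x2)             # conditioning on the latent variable
r_discretized = partial_corr(x1, x3, x2_tilde)  # conditioning on the discretized variable

print(f"partial correlation given X_2:        {r_latent:.3f}")
print(f"partial correlation given tilde X_2:  {r_discretized:.3f}")
```

Conditioning on $\tilde{X}_2$ removes only part of the variation in $X_2$, so the residual dependence between $X_1$ and $X_3$ would be read by a standard test as conditional dependence.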
LG | Jan 31, 2025
Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data
Xinshuai Dong, Ignavier Ng, Boyang Sun et al.
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
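The idea of estimating a null distribution by permutation can be sketched on a much simpler problem. The code below is not MPRT and does not test the rank of a cross-covariance matrix; it is a generic permutation test of marginal independence between a discretized variable and a continuous one, using the absolute Pearson correlation as the statistic. The function name `permutation_pvalue`, the three-level cut points, and the sample size are assumptions made for illustration.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=499, seed=0):
    """Permutation p-value for independence of x and y, statistic |Pearson correlation|."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        # Under the null, x and y are exchangeable, so shuffling x
        # produces a draw from the null distribution of the statistic.
        stat = abs(np.corrcoef(rng.permutation(x), y)[0, 1])
        exceed += int(stat >= observed)
    return (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
n = 500
latent = rng.standard_normal(n)
# Order-preserving discretization into three levels (e.g. disagree / neutral / agree).
answer = np.digitize(latent, [-0.5, 0.5])
y_dependent = latent + rng.standard_normal(n)   # depends on the latent variable
y_independent = rng.standard_normal(n)          # unrelated to the latent variable

p_dependent = permutation_pvalue(answer, y_dependent)
p_independent = permutation_pvalue(answer, y_independent)
print(f"p-value, dependent pair:   {p_dependent:.3f}")
print(f"p-value, independent pair: {p_independent:.3f}")
```

Because the null distribution is built from the data themselves, no distributional assumption on the discretized variable is needed, which is the property that makes permutation approaches attractive for mixed data.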
LG | Jul 14, 2025
Radial Neighborhood Smoothing Recommender System
Zerui Zhang, Yumou Qiu
Recommender systems inherently exhibit a low-rank structure in latent space. A key challenge is to define meaningful and measurable distances in the latent space that capture user-user, item-item, and user-item relationships effectively. In this work, we establish that distances in the latent space can be systematically approximated using row-wise and column-wise distances in the observed matrix, providing a novel perspective on distance estimation. To refine the distance estimation, we introduce a correction based on an empirical variance estimator to account for noise-induced non-centrality. The novel distance estimation enables a more structured approach to constructing neighborhoods, leading to the Radial Neighborhood Estimator (RNE), which constructs neighborhoods by including both overlapped and partially overlapped user-item pairs and employs neighborhood smoothing via localized kernel regression to improve imputation accuracy. We provide a theoretical asymptotic analysis of the proposed estimator. Evaluations on both simulated and real-world datasets show that RNE outperforms existing collaborative filtering and matrix factorization methods. While our primary focus is on distance estimation in latent space, we find that RNE also mitigates the "cold-start" problem.
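Why a row-wise distance computed from noisy observations needs a variance correction can be seen in a toy setting. The sketch below is not the RNE estimator: it assumes a fully observed rank-3 matrix, independent Gaussian noise, and a known noise level `sigma`, whereas the abstract describes an empirical variance estimator. With independent noise of variance $\sigma^2$, the mean squared difference between two observed rows exceeds the latent one by $2\sigma^2$ in expectation, so subtracting $2\sigma^2$ removes that bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 2000, 3    # assumed toy dimensions
sigma = 1.0                             # noise standard deviation, assumed known here

# Low-rank latent preference matrix and its noisy observation.
user_factors = rng.standard_normal((n_users, rank))
item_factors = rng.standard_normal((n_items, rank))
latent = user_factors @ item_factors.T
observed = latent + sigma * rng.standard_normal((n_users, n_items))

def mean_sq_row_distance(matrix, i, j):
    """Mean squared difference between rows i and j across all columns."""
    return float(np.mean((matrix[i] - matrix[j]) ** 2))

i, j = 0, 1
d_latent = mean_sq_row_distance(latent, i, j)       # target quantity
d_observed = mean_sq_row_distance(observed, i, j)   # inflated by noise
d_corrected = d_observed - 2.0 * sigma**2           # remove the 2*sigma^2 bias

print(f"latent distance:    {d_latent:.3f}")
print(f"observed distance:  {d_observed:.3f}")
print(f"corrected distance: {d_corrected:.3f}")
```

In practice ratings are only partially observed and the noise level is unknown, so the distance would be averaged over co-rated items and `sigma` replaced by an estimate.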