Leizhen Zhang

LG
h-index2
4papers
10citations
Novelty55%
AI Score43

4 Papers

AIMay 27
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Leizhen Zhang, Shuhan Chen, Sheng Chen

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.

CVJul 1, 2024Code
Uncertainty Quantification in Table Structure Recognition

Kehinde Ajayi, Leizhen Zhang, Yi He et al.

Quantifying uncertainties for machine learning models is a critical step to reduce human verification effort by detecting predictions with low confidence. This paper proposes a method for uncertainty quantification (UQ) of table structure recognition (TSR). The proposed UQ method is built upon a mixture-of-expert approach termed Test-Time Augmentation (TTA). Our key idea is to enrich and diversify the table representations, to spotlight the cells with high recognition uncertainties. To evaluate the effectiveness, we proposed two heuristics to differentiate highly uncertain cells from normal cells, namely, masking and cell complexity quantification. Masking involves varying the pixel intensity to deem the detection uncertainty. Cell complexity quantification gauges the uncertainty of each cell by its topological relation with neighboring cells. The evaluation results based on standard benchmark datasets demonstrate that the proposed method is effective in quantifying uncertainty in TSR models. To our best knowledge, this study is the first of its kind to enable UQ in TSR tasks. Our code and data are available at: https://github.com/lamps-lab/UQTTA.git.

LGAug 17, 2024
Fairness-Aware Streaming Feature Selection with Causal Graphs

Leizhen Zhang, Lusi Li, Di Wu et al.

Its crux lies in the optimization of a tradeoff between accuracy and fairness of resultant models on the selected feature subset. The technical challenge of our setting is twofold: 1) streaming feature inputs, such that an informative feature may become obsolete or redundant for prediction if its information has been covered by other similar features that arrived prior to it, and 2) non-associational feature correlation, such that bias may be leaked from those seemingly admissible, non-protected features. To overcome this, we propose Streaming Feature Selection with Causal Fairness (SFCF) that builds two causal graphs egocentric to prediction label and protected feature, respectively, striving to model the complex correlation structure among streaming features, labels, and protected information. As such, bias can be eradicated from predictive modeling by removing those features being causally correlated with the protected feature yet independent to the labels. We theorize that the originally redundant features for prediction can later become admissible, when the learning accuracy is compromised by the large number of removed features (non-protected but can be used to reconstruct bias information). We benchmark SFCF\ on five datasets widely used in streaming feature research, and the results substantiate its performance superiority over six rival models in terms of efficiency and sparsity of feature selection and equalized odds of the resultant predictive models.

LGMar 17, 2025
BLIA: Detect model memorization in binary classification model through passive Label Inference attack

Mohammad Wahiduzzaman Khan, Sheng Chen, Ilya Mironov et al.

Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive label inference attacks (BLIA). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and conclusively demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.