CL Apr 21
PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Hengzhi Li, Justin Zhang, Brendon Jiang et al.
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance in open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
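The abstract reports two metrics, final answer accuracy and stepwise accuracy, without defining them. The sketch below shows one plausible way such metrics could be computed. It assumes stepwise accuracy is the per-puzzle fraction of annotated reasoning steps completed correctly, averaged over puzzles; that definition, the record fields, and the sample data are all hypothetical and may differ from the benchmark's actual scoring.

```python
def final_answer_accuracy(results):
    """Fraction of puzzles whose predicted answer exactly matches the solution."""
    correct = sum(1 for r in results if r["prediction"] == r["answer"])
    return correct / len(results)


def stepwise_accuracy(results):
    """Mean over puzzles of (correct steps / annotated steps).

    Assumed definition for illustration only.
    """
    per_puzzle = [r["steps_correct"] / r["steps_total"] for r in results]
    return sum(per_puzzle) / len(per_puzzle)


# Hypothetical evaluation records (not taken from PuzzleWorld).
results = [
    {"answer": "LANTERN", "prediction": "LANTERN", "steps_correct": 4, "steps_total": 4},
    {"answer": "ORBIT", "prediction": "COMET", "steps_correct": 2, "steps_total": 4},
    {"answer": "MOSAIC", "prediction": "", "steps_correct": 0, "steps_total": 5},
    {"answer": "QUARTZ", "prediction": "QUARTS", "steps_correct": 3, "steps_total": 4},
]

final_acc = final_answer_accuracy(results)
step_acc = stepwise_accuracy(results)
```

Under this reading, a model can make partial progress on many puzzles (higher stepwise accuracy) while rarely reaching the exact final answer, which is consistent with the gap the abstract describes.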
LG Aug 17, 2022
Prediction of Oral Food Challenge Outcomes via Ensemble Learning
Justin Zhang, Deborah Lee, Kylie Jungles et al.
Oral Food Challenges (OFCs) are essential to accurately diagnosing food allergy due to the limitations of existing clinical testing. However, some patients are hesitant to undergo OFCs, while those willing suffer from limited access to allergists in rural/community healthcare settings. Although machine learning has been successful in predicting patient outcomes in other clinical settings, few applications of it to food allergy have been developed. Thus, in this study, we seek to leverage machine learning methodologies for OFC outcome prediction. Retrospective data was gathered from 1,112 patients who collectively underwent a total of 1,284 OFCs, and consisted of clinical factors including serum-specific Immunoglobulin E (IgE), total IgE, skin prick tests (SPTs), comorbidities, sex, and age. Using these features, multiple machine learning models were constructed to predict OFC outcomes for three common allergens: peanut, egg, and milk. The best performing model for each allergen was an ensemble of random forest (egg) or Learning Using Concave and Convex Kernels (LUCCK) (peanut, milk) models, which achieved an Area under the Curve (AUC) of 0.91, 0.96, and 0.94 in predicting OFC outcomes for peanut, egg, and milk, respectively. Moreover, all such models had sensitivity and specificity values of at least 89%. Model interpretation via SHapley Additive exPlanations (SHAP) indicates that specific IgE, along with wheal and flare values from SPTs, are highly predictive of OFC outcomes. The results of this analysis suggest that ensemble learning has the potential to predict OFC outcomes and reveal relevant clinical factors for further study.
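As a rough illustration of the evaluation described above, the following sketch combines several binary classifiers by majority vote and computes sensitivity and specificity. The abstract does not say how the ensemble members are combined, so majority voting is an assumption here, and the labels and per-model predictions are made-up toy values rather than study data.

```python
def majority_vote(predictions_per_model):
    """Combine per-model 0/1 predictions into one prediction per sample.

    A sample is labeled 1 when strictly more than half of the models vote 1.
    """
    n_models = len(predictions_per_model)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*predictions_per_model)]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = positive (failed) challenge, 0 = negative (passed) challenge.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_1 = [1, 1, 0, 0, 0, 1, 1, 0]
model_2 = [1, 0, 1, 0, 0, 0, 1, 0]
model_3 = [1, 1, 1, 0, 1, 0, 0, 0]

ensemble = majority_vote([model_1, model_2, model_3])
single_scores = sensitivity_specificity(y_true, model_1)
ensemble_scores = sensitivity_specificity(y_true, ensemble)
```

In this toy case each model makes different mistakes, so the vote cancels them out; real ensembles only benefit to the extent that member errors are not strongly correlated.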
LG Dec 9, 2021
A Novel Tropical Geometry-based Interpretable Machine Learning Method: Application in Prognosis of Advanced Heart Failure
Heming Yao, Harm Derksen, Jessica R. Golbus et al.
A model's interpretability is essential to many practical applications such as clinical decision support systems. In this paper, a novel interpretable machine learning method is presented, which can model the relationship between input variables and responses in humanly understandable rules. The method is built by applying tropical geometry to fuzzy inference systems, wherein variable encoding functions and salient rules can be discovered by supervised learning. Experiments using synthetic datasets were conducted to investigate the performance and capacity of the proposed algorithm in classification and rule discovery. Furthermore, the proposed method was applied to a clinical application that identified heart failure patients who would benefit from advanced therapies such as heart transplant or durable mechanical circulatory support. Experimental results show that the proposed network achieved strong performance on the classification tasks. In addition to learning humanly understandable rules from the dataset, existing fuzzy domain knowledge can be easily transferred into the network and used to facilitate model training. Our results indicate that the proposed model, together with its ability to incorporate existing domain knowledge, can significantly improve model generalizability. The characteristics of the proposed network make it promising in applications requiring model reliability and justification.
LG Mar 19, 2021
A Physics-Informed Neural Network Framework For Partial Differential Equations on 3D Surfaces: Time-Dependent Problems
Zhiwei Fang, Justin Zhang, Xiu Yang
In this paper, we present a physics-informed neural network (PINN) solver for time-dependent surface PDEs. Unlike traditional numerical solvers, it requires neither an extension of the PDE nor a mesh on the surface. We show a simplified prior estimate of the surface differential operators so that the PINN's loss value serves as an indicator of the residual of the surface PDEs. Numerical experiments verify the efficacy of our algorithm.
LG Jul 9, 2020
Transparency Tools for Fairness in AI (Luskin)
Mingliang Chen, Aria Shahverdi, Sarah Anderson et al.
We propose new tools for policy-makers to use when assessing and correcting fairness and bias in AI algorithms. The three tools are:
- A new definition of fairness called "controlled fairness" with respect to choices of protected features and filters. The definition provides a simple test of fairness of an algorithm with respect to a dataset. This notion of fairness is suitable in cases where fairness is prioritized over accuracy, such as in cases where there is no "ground truth" data, only data labeled with past decisions (which may have been biased).
- Algorithms for retraining a given classifier to achieve "controlled fairness" with respect to a choice of features and filters. Two algorithms are presented, implemented and tested. These algorithms require training two different models in two stages. We experiment with combinations of various types of models for the first and second stage and report on which combinations perform best in terms of fairness and accuracy.
- Algorithms for adjusting model parameters to achieve a notion of fairness called "classification parity". This notion of fairness is suitable in cases where accuracy is prioritized. Two algorithms are presented, one which assumes that protected features are accessible to the model during testing, and one which assumes protected features are not accessible during testing.
We evaluate our tools on three different publicly available datasets. We find that the tools are useful for understanding various dimensions of bias, and that in practice the algorithms are effective in starkly reducing a given observed bias when tested on new data.
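To make the idea of a simple fairness test on a dataset concrete, here is a minimal sketch that compares positive-prediction rates across groups defined by a protected feature. It assumes one common reading of "classification parity" (equal rates of positive predictions across groups); the paper's exact definitions of "controlled fairness" and "classification parity" may differ, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def positive_rates(groups, preds):
    """Return the fraction of positive (1) predictions for each group label."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, pred in zip(groups, preds):
        counts[group][0] += pred
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def parity_gap(groups, preds):
    """Largest difference in positive-prediction rate between any two groups.

    A gap of 0.0 means every group receives positive predictions at the same rate.
    """
    rates = positive_rates(groups, preds)
    return max(rates.values()) - min(rates.values())


# Toy data: protected-feature value and model prediction per individual.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
preds = [1, 1, 1, 0, 1, 0, 0, 0]

rates = positive_rates(groups, preds)
gap = parity_gap(groups, preds)
```

A check like this only measures an observed disparity; it does not by itself say whether the disparity is justified or how to retrain the classifier to reduce it.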
CV Apr 9, 2019
Contextual Attention for Hand Detection in the Wild
Supreeth Narasimhaswamy, Zhengwei Wei, Yang Wang et al.
We present Hand-CNN, a novel convolutional network architecture for detecting hand masks and predicting hand orientations in unconstrained images. Hand-CNN extends MaskRCNN with a novel attention mechanism to incorporate contextual cues in the detection process. This attention mechanism can be implemented as an efficient network module that captures non-local dependencies between features. This network module can be inserted at different stages of an object detection network, and the entire detector can be trained end-to-end. We also introduce a large-scale annotated hand dataset containing hands in unconstrained images for training and evaluation. We show that Hand-CNN outperforms existing methods on several datasets, including our hand detection benchmark and the publicly available PASCAL VOC human layout challenge. We also conduct ablation studies on hand detection to show the effectiveness of the proposed contextual attention module.