Christopher M. White

HC
5 papers
55 citations
Novelty 41%
AI Score 21

5 Papers

LG · Aug 31, 2021
When are Deep Networks really better than Decision Forests at small sample sizes, and how?

Haoyin Xu, Kaleab A. Kinfu, Will LeVine et al.

Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies using the most contemporary best practices has yet to be performed. Conceptually, we illustrate that both can be profitably viewed as "partition and vote" schemes. Specifically, the representation space that they both learn is a partitioning of feature space into a union of convex polytopes. For inference, each decides on the basis of votes from the activated nodes. This formulation allows for a unified basic understanding of the relationship between these methods. Empirically, we compare these two strategies on hundreds of tabular data settings, as well as several vision and auditory settings. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found forests to excel at tabular and structured data (vision and audition) with small sample sizes, whereas deep nets performed better on structured data with larger sample sizes. This suggests that further gains in both scenarios may be realized via further combining aspects of forests and networks. We will continue revising this technical report in the coming months with updated results.
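The "partition and vote" view described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PartitionAndVote`, the helper `grid_cell`, and the use of fixed axis-aligned grids as the partitions are all assumptions made for illustration. Each partition maps a point to a cell (a box, which is a convex polytope), each cell stores the training labels that fell into it, and prediction sums the votes from the activated cells.

```python
from collections import Counter, defaultdict


def grid_cell(x, width=1.0):
    """Hypothetical partition: map a point to the axis-aligned box containing it."""
    return tuple(int(v // width) for v in x)


class PartitionAndVote:
    """Illustrative classifier: several partitions, each cell votes with its label counts."""

    def __init__(self, partitions):
        self.partitions = partitions  # list of functions: point -> cell id
        self.votes = [defaultdict(Counter) for _ in partitions]

    def fit(self, X, y):
        # Record, for every partition, which labels landed in which cell.
        for part, table in zip(self.partitions, self.votes):
            for x, label in zip(X, y):
                table[part(x)][label] += 1
        return self

    def predict(self, x):
        # Sum the votes of the one activated cell in each partition.
        total = Counter()
        for part, table in zip(self.partitions, self.votes):
            total.update(table.get(part(x), Counter()))
        return total.most_common(1)[0][0] if total else None


X = [(0.2, 0.3), (0.8, 0.1), (1.5, 1.2), (1.7, 1.9)]
y = [0, 0, 1, 1]
model = PartitionAndVote([
    lambda x: grid_cell(x, 1.0),  # fine partition
    lambda x: grid_cell(x, 2.0),  # coarse partition
]).fit(X, y)
print(model.predict((0.5, 0.5)), model.predict((1.6, 1.4)))
```

In a forest the partitions would be learned by the trees, and in a ReLU network by the activation patterns; the fixed grids here only stand in for those learned partitions.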

LG · Jun 23, 2021
Leveraging semantically similar queries for ranking via combining representations

Hayden S. Helm, Marah Abdin, Benjamin D. Pedigo et al.

In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade-off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
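As a rough illustration of ranking by combining representations, the sketch below ranks items by a weighted sum of their dissimilarities to the query under each representation. The function name `combined_ranking` and the fixed weights are assumptions; the abstract does not say how the weights are learned or how similar queries enter that step, so neither is shown here.

```python
def combined_ranking(dissimilarities, weights):
    """Rank item indices by a weighted sum of per-representation dissimilarities.

    dissimilarities: one list per representation, giving the distance from
                     the query to each item under that representation.
    weights:         one non-negative weight per representation (assumed given).
    """
    n_items = len(dissimilarities[0])
    scores = [
        sum(w * d[i] for w, d in zip(weights, dissimilarities))
        for i in range(n_items)
    ]
    # Smaller combined dissimilarity means a better (earlier) rank.
    return sorted(range(n_items), key=lambda i: scores[i])


rep_a = [0.1, 0.9, 0.5]  # distances under a first, hypothetical representation
rep_b = [0.8, 0.2, 0.3]  # distances under a second, hypothetical representation
print(combined_ranking([rep_a, rep_b], weights=(0.5, 0.5)))
```

With few labeled items for a query, weights fitted to that query alone can vary widely; borrowing labeled items from similar queries when fitting them is one way to reduce that variance, at the cost of some bias.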

HC · May 12, 2020
Design of a Privacy-Preserving Data Platform for Collaboration Against Human Trafficking

Darren Edge, Weiwei Yang, Kate Lytvynets et al.

Case records on victims of human trafficking are highly sensitive, yet the ability to share such data is critical to evidence-based practice and policy development across government, business, and civil society. We present new methods to anonymize, publish, and explore such data, implemented as a pipeline generating three artifacts: (1) synthetic data mitigating the privacy risk that published attribute combinations might be linked to known individuals or groups; (2) aggregate data mitigating the utility risk that synthetic data might misrepresent statistics needed for official reporting; and (3) visual analytics interfaces to both datasets mitigating the accessibility risk that privacy mechanisms or analysis tools might not be understandable and usable by all stakeholders. We present our work as a design study motivated by the goal of transforming how the world's largest database of identified victims is made available for global collaboration against human trafficking.
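The aggregate-data artifact can be pictured as a table of counts over attribute combinations. The sketch below is an assumption-laden illustration, not the platform's pipeline: the function name `aggregate_counts`, the `max_len` cap on combination length, and the `min_count` threshold for withholding rare combinations are all hypothetical choices that the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def aggregate_counts(records, max_len=2, min_count=2):
    """Count attribute-value combinations and keep only those seen often enough.

    records:   list of dicts mapping attribute name -> value.
    max_len:   longest combination of attributes to count (hypothetical cap).
    min_count: combinations seen fewer times are withheld (hypothetical threshold).
    """
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # stable order so equal combos match
        for r in range(1, max_len + 1):
            for combo in combinations(items, r):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


records = [
    {"age": "18-25", "region": "A"},
    {"age": "18-25", "region": "A"},
    {"age": "26-35", "region": "B"},
]
for combo, n in aggregate_counts(records).items():
    print(combo, n)
```

Withholding rare combinations reflects the linkage concern named in the abstract: an attribute combination that appears only once is the kind most easily tied to a known individual.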

HC · May 1, 2020
Workgroup Mapping: Visual Analysis of Collaboration Culture

Darren Edge, Jonathan Larson, Nikolay Trandev et al.

The digital transformation of work presents new opportunities to understand how informal workgroups organize around the dynamic needs of organizations, potentially in contrast to the formal, static, and idealized hierarchies depicted by org charts. We present a design study that spans multiple enabling capabilities for the visual mapping and analysis of organizational workgroups, including metrics for quantifying two dimensions of collaboration culture: the fluidity of collaborative relationships (measured using network machine learning) and the freedom with which workgroups form across organizational boundaries. These capabilities come together to create a turnkey pipeline that combines the analysis of a target organization, the generation of data graphics and statistics, and their integration in a template-based presentation that enables narrative visualization of results. Our metrics and visuals have supported hundreds of presentations to executives of major US-based and multinational organizations, while our engineering practices have created an ensemble of standalone tools with broad relevance to visualization and visual analytics. We present our work as an example of applied visual analytics research, describing the design iterations that allowed us to move from experimentation to production, as well as the perspectives of the research team and the customer-facing team at each stage in this process.

AI · Apr 27, 2020
Simple Lifelong Learning Machines

Jayanta Dey, Joshua T. Vogelstein, Hayden S. Helm et al.

In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach -- representation ensembling -- demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, and Food1k), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget.
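One way to picture representation ensembling is sketched below. This is an illustration under stated assumptions, not the paper's algorithm: the class `RepresentationEnsemble` is hypothetical, each task's representation is passed in as a ready-made function rather than learned, and each task's data is stored so that earlier tasks can be revisited. Every task gets a voter on every representation, so a new task reuses old representations (forward transfer) and old tasks gain a voter on the new representation (backward transfer).

```python
from collections import Counter, defaultdict


class RepresentationEnsemble:
    """Illustrative sketch: each task votes over the representations of all tasks."""

    def __init__(self):
        self.representers = []           # one function (point -> cell id) per task
        self.data = {}                   # task id -> (X, y), stored for revoting
        self.voters = defaultdict(dict)  # task id -> {representer index: cell -> label counts}

    def _fit_voter(self, task, rep_idx):
        X, y = self.data[task]
        table = defaultdict(Counter)
        for x, label in zip(X, y):
            table[self.representers[rep_idx](x)][label] += 1
        self.voters[task][rep_idx] = table

    def add_task(self, task, X, y, representer):
        self.data[task] = (X, y)
        self.representers.append(representer)
        new_idx = len(self.representers) - 1
        # Every task, old and new, gets a voter on the new representation.
        for known_task in self.data:
            self._fit_voter(known_task, new_idx)
        # The new task also gets voters on all older representations.
        for rep_idx in range(new_idx):
            self._fit_voter(task, rep_idx)

    def predict(self, task, x):
        # Average the per-representation label frequencies for this task.
        total = Counter()
        for rep_idx, table in self.voters[task].items():
            counts = table.get(self.representers[rep_idx](x))
            if counts:
                n = sum(counts.values())
                for label, c in counts.items():
                    total[label] += c / n
        return total.most_common(1)[0][0] if total else None


ens = RepresentationEnsemble()
ens.add_task("a", [(0.2,), (0.8,)], [0, 1], lambda x: x[0] > 0.5)
ens.add_task("b", [(0.1,), (0.9,)], ["lo", "hi"], lambda x: int(x[0] * 4))
print(ens.predict("a", (0.9,)), ens.predict("b", (0.05,)))
```

Storing all past data is a simplification made here for brevity; the abstract notes only that the proposed approach can operate with or without a computational budget.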