LG · Aug 21, 2025
Low-dimensional embeddings of high-dimensional data
Cyril de Bodt, Alex Diaz-Papkovich, Michael Bleher et al.
Large collections of high-dimensional data have become nearly ubiquitous across many academic fields and application domains, ranging from biology to the humanities. Since working directly with high-dimensional data poses challenges, the demand for algorithms that create low-dimensional representations, or embeddings, for data visualization, exploration, and analysis is now greater than ever. In recent years, numerous embedding algorithms have been developed, and their usage has become widespread in research and industry. This surge of interest has resulted in a large and fragmented research field that faces technical challenges alongside fundamental debates, and it has left practitioners without clear guidance on how to effectively employ existing methods. Aiming to increase coherence and facilitate future work, in this review we provide a detailed and critical overview of recent developments, derive a list of best practices for creating and using low-dimensional embeddings, evaluate popular approaches on a variety of datasets, and discuss the remaining challenges and open problems in the field.
CL · Jun 11, 2024
Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát et al.
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010 to 2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have had an unprecedented impact on scientific writing in biomedical research, surpassing the effect of major world events such as the Covid pandemic.
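The idea of an excess word analysis can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `word_doc_frequency` and `excess_gap`, the toy abstracts, and the choice to use the earlier period's raw frequency as the baseline are all assumptions made for illustration. The abstract does not say how the expected frequency is estimated. The intuition is that if a word appears in a fraction of abstracts that exceeds its expected fraction by some gap, then at least that share of abstracts was plausibly affected, provided the baseline is trusted.

```python
from collections import Counter


def word_doc_frequency(abstracts):
    """Fraction of abstracts that contain each word at least once."""
    counts = Counter()
    for text in abstracts:
        # Use a set so a word counts once per abstract, however often it repeats.
        counts.update(set(text.lower().split()))
    n = len(abstracts)
    return {word: c / n for word, c in counts.items()}


def excess_gap(observed, baseline):
    """Per-word gap: observed frequency minus baseline (expected) frequency."""
    return {word: p - baseline.get(word, 0.0) for word, p in observed.items()}


# Hypothetical toy corpora standing in for "earlier" and "later" abstracts.
before = [
    "we study cells",
    "we measure protein levels",
    "results show effects",
    "we study mice",
]
after = [
    "we delve into cells",
    "we delve further",
    "results show effects",
    "we study mice",
]

gaps = excess_gap(word_doc_frequency(after), word_doc_frequency(before))
top_word, top_gap = max(gaps.items(), key=lambda kv: kv[1])
print(top_word, top_gap)
```

In this toy setup the largest gap belongs to a single style word, and that gap would be read as a lower bound on the share of later abstracts that changed. A real analysis would need a more careful baseline (for example, one that accounts for pre-existing trends) and proper tokenization.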
HC · Jan 15, 2021
A Multi-Platform Study of Crowd Signals Associated with Successful Online Fundraising
Henry K. Dambanemuya, Emőke-Ágnes Horvát
The growing popularity of online fundraising (aka "crowdfunding") has attracted significant research on the subject. In contrast to previous studies that attempt to predict the success of crowdfunded projects based on specific characteristics of the projects and their creators, we present a more general approach that focuses on crowd dynamics and is robust to the particularities of different crowdfunding platforms. We rely on a multi-method analysis to investigate the correlates, predictive importance, and quasi-causal effects of features that describe crowd dynamics in determining the success of crowdfunded projects. By applying this analysis to fundraising in three different online markets, we uncover general crowd dynamics that ultimately decide which projects will succeed. In all analyses and across the three different platforms, we consistently find that funders' behavioural signals (1) are significantly correlated with fundraising success; (2) approximate fundraising outcomes better than the characteristics of projects and their creators such as credit grade, company valuation, and subject domain; and (3) have significant quasi-causal effects on fundraising outcomes while controlling for potentially confounding project variables. By showing that universal features deduced from crowd behaviour are predictive of fundraising success on different crowdfunding platforms, our work provides design-relevant insights about novel types of collective decision-making online. This research thus inspires potential ways to leverage cues from the crowd and catalyses research into crowd-aware system design.