CLFeb 6
Inference-Time Rethinking with Latent Thought Vectors for Math ReasoningDeqian Kong, Minglu Zhao, Aoyang Qin et al.
Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
94.0HCMar 12
Human-Centred LLM Privacy Audits: Findings and FrictionsDimitri Staufer, Kirsten Morehouse, David Hartmann et al.
Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{total}{=}458$), GPT-4o predicts 11 of 50 features for everyday people with $\ge$60\% accuracy, and participants report wanting control over LLM-generated associations despite not considering all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also contribute to exposing a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, what model--individual associations even include is under-specified and operationalisation relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable human-centred LLM privacy audits, we identify nine frictions that emerged in our study and offer recommendations for future work and the design of human-centred LLM privacy audits.
CYFeb 26, 2024
Addressing the regulatory gap: moving towards an EU AI audit ecosystem beyond the AI Act by including civil societyDavid Hartmann, José Renato Laranjeira de Pereira, Chiara Streitbörger et al.
The European legislature has proposed the Digital Services Act (DSA) and Artificial Intelligence Act (AIA) to regulate platforms and Artificial Intelligence (AI) products. We review to what extent third-party audits are part of both laws and how is access to information on models and the data provided. By considering the value of third-party audits and third-party data access in an audit ecosystem, we identify a regulatory gap in that the AIA does not provide access to data for researchers and civil society. Our contributions to the literature include: (1) Defining an AI audit ecosystem incorporating compliance and oversight. (2) Highlighting a regulatory gap within the DSA and AIA regulatory framework, preventing the establishment of an AI audit ecosystem that has effective oversight by civil society and academia. (3) Emphasizing that third-party audits by research and civil society must be part of that ecosystem, we call for AIA amendments and delegated acts to include data and model access for certain AI products. Furthermore, we call for the DSA to provide NGOs and investigative journalists with data access to platforms by delegated acts and for adaptions and amendments of the AIA to provide third-party audits and data and model access, at least for high-risk systems. Regulations modeled after EU AI regulations should enable data access and third-party audits, fostering an AI audit ecosystem that promotes compliance and oversight mechanisms.
48.5CLApr 28
Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM EvaluationDavid Hartmann, Manuel Tonneau, Angelie Kraft et al.
The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts. Perspective's model was periodically updated without versioning or disclosure, its annotation structure reflected a single corporate operationalisation of a contested concept, and its scores were used simultaneously as an evaluation target and an evaluation standard. Its closure leaves behind non-updatable benchmarks, irreproducible results, and ultimately a field at risk of perpetuating these issues by turning to closed-source LLMs. We use Perspective's announced termination as an opportunity to call for an independent, valid, adaptable, and reproducible toxicity and hate speech measurement infrastructure, with the technical and governance requirements outlined in this paper.
HCMar 3, 2025
Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic VariationsDavid Hartmann, Amin Oueslati, Dimitri Staufer et al.
Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech. However, the reliance on these APIs risks both silencing legitimate speech, called over-moderation, and failing to protect online platforms from harmful speech, known as under-moderation. To assess such risks, this paper introduces a framework for auditing black-box NLP systems. Using the framework, we systematically evaluate five widely used commercial content moderation APIs. Analyzing five million queries based on four datasets, we find that APIs frequently rely on group identity terms, such as ``black'', to predict hate speech. While OpenAI's and Amazon's services perform slightly better, all providers under-moderate implicit hate speech, which uses codified messages, especially against LGBTQIA+ individuals. Simultaneously, they over-moderate counter-speech, reclaimed slurs and content related to Black, LGBTQIA+, Jewish, and Muslim people. We recommend that API providers offer better guidance on API implementation and threshold setting and more transparency on their APIs' limitations. Warning: This paper contains offensive and hateful terms and concepts. We have chosen to reproduce these terms for reasons of transparency.
CLMay 8, 2025
Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of PerspectiveDaniel Franzen, Jan Disselhoff, David Hartmann
The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for a Nvidia 4090 GPU).
CLApr 17, 2025
Out of Sight Out of Mind, Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional ContextsFatma Elsafoury, David Hartmann
We know that language models (LMs) form biases and stereotypes of minorities, leading to unfair treatments of members of these groups, thanks to research mainly in the US and the broader English-speaking world. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates, that between 600 million to 1.2 billion people worldwide are members of marginalized groups and in need for special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show, LMs indeed show higher bias against many marginalized groups in comparison to dominant groups. However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against Non-binary, LGBTQIA+ and Black women.
CYJun 20, 2024
Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation ServicesDavid Hartmann, Amin Oueslati, Dimitri Staufer
Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.
IVJun 30, 2021
Fast whole-slide cartography in colon cancer histology using superpixels and CNN classificationFrauke Wilm, Michaela Benz, Volker Bruns et al.
Automatic outlining of different tissue types in digitized histological specimen provides a basis for follow-up analyses and can potentially guide subsequent medical decisions. The immense size of whole-slide-images (WSI), however, poses a challenge in terms of computation time. In this regard, the analysis of non-overlapping patches outperforms pixelwise segmentation approaches, but still leaves room for optimization. Furthermore, the division into patches, regardless of the biological structures they contain, is a drawback due to the loss of local dependencies. We propose to subdivide the WSI into coherent regions prior to classification by grouping visually similar adjacent pixels into superpixels. Afterwards, only a random subset of patches per superpixel is classified and patch labels are combined into a superpixel label. We propose a metric for identifying superpixels with an uncertain classification and evaluate two medical applications, namely tumor area and invasive margin estimation and tumor composition analysis. The algorithm has been developed on 159 hand-annotated WSIs of colon resections and its performance is compared to an analysis without prior segmentation. The algorithm shows an average speed-up of 41% and an increase in accuracy from 93.8% to 95.7%. By assigning a rejection label to uncertain superpixels, we further increase the accuracy by 0.4%. Whilst tumor area estimation shows high concordance to the annotated area, the analysis of tumor composition highlights limitations of our approach. By combining superpixel segmentation and patch classification, we designed a fast and accurate framework for whole-slide cartography that is AI-model agnostic and provides the basis for various medical endpoints.
LGApr 3, 2019
Progressive Stochastic Binarization of Deep NetworksDavid Hartmann, Michael Wand
A plethora of recent research has focused on improving the memory footprint and inference speed of deep networks by reducing the complexity of (i) numerical representations (for example, by deterministic or stochastic quantization) and (ii) arithmetic operations (for example, by binarization of weights). We propose a stochastic binarization scheme for deep networks that allows for efficient inference on hardware by restricting itself to additions of small integers and fixed shifts. Unlike previous approaches, the underlying randomized approximation is progressive, thus permitting an adaptive control of the accuracy of each operation at run-time. In a low-precision setting, we match the accuracy of previous binarized approaches. Our representation is unbiased - it approaches continuous computation with increasing sample size. In a high-precision regime, the computational costs are competitive with previous quantization schemes. Progressive stochastic binarization also permits localized, dynamic accuracy control within a single network, thereby providing a new tool for adaptively focusing computational attention. We evaluate our method on networks of various architectures, already pretrained on ImageNet. With representational costs comparable to previous schemes, we obtain accuracies close to the original floating point implementation. This includes pruned networks, except the known special case of certain types of separated convolutions. By focusing computational attention using progressive sampling, we reduce inference costs on ImageNet further by a factor of up to 33% (before network pruning).