CLJun 17, 2022
Automatic Correction of Human TranslationsJessy Lin, Geza Kovacs, Aditya Shastry et al.
We introduce translation error correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translations (MT) have long motivated systems for improving translations post-hoc with automatic post-editing. In contrast, little attention has been devoted to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors that machines would be well-suited to assist with, from typos to inconsistencies in translation conventions. To investigate this, we build and release the Aced corpus with three TEC datasets. We show that human errors in TEC exhibit a more diverse range of errors and far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models that are specialized to correct human errors. We show that pre-training instead on synthetic errors based on human errors improves TEC F-score by as much as 5.1 points. We conducted a human-in-the-loop user study with nine professional translation editors and found that the assistance of our TEC system led them to produce significantly higher quality revised translations.
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
CLFeb 17, 2025Code
SMOL: Professionally translated parallel data for 115 under-represented languagesIsaac Caswell, Elizabeth Nielsen, Jiaming Luo et al. · mit
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
CLFeb 18, 2025
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & DialectsDaniel Deutsch, Eleftheria Briakou, Isaac Caswell et al.
As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.
CLNov 5, 2024
Mitigating Metric Bias in Minimum Bayes Risk DecodingGeza Kovacs, Daniel Deutsch, Markus Freitag
While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as improvements might simply be due to reward hacking rather than reflecting real quality improvements. In this work we find that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
CLNov 23, 2024
From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test SetMara Finkelstein, Dan Deutsch, Parker Riley et al.
As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
CLOct 28, 2025
MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared TaskJuraj Juraska, Tobias Domhan, Mara Finkelstein et al.
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
AIJun 10, 2024
Transforming Wearable Data into Personal Health Insights using Large Language Model AgentsMike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei et al.
Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
CLMay 24, 2023
Large Language Models are Few-Shot Health LearnersXin Liu, Daniel McDuff, Geza Kovacs et al.
Large language models (LLMs) can capture rich representations of concepts that are useful for real-world tasks. However, language alone is limited. While existing LLMs excel at text-based inferences, health applications require that models be grounded in numerical data (e.g., vital signs, laboratory values in clinical domains; steps, movement in the wellness domain) that is not easily or readily expressed as text in existing training corpus. We demonstrate that with only few-shot tuning, a large language model is capable of grounding various physiological and behavioral time-series data and making meaningful inferences on numerous health tasks for both clinical and wellness contexts. Using data from wearable and medical sensor recordings, we evaluate these capabilities on the tasks of cardiac signal analysis, physical activity recognition, metabolic calculation (e.g., calories burned), and estimation of stress reports and mental health screeners.
HCFeb 7, 2021
Reconstructing Detailed Browsing Activities from Browser HistoryGeza Kovacs
Users' detailed browsing activity - such as what sites they are spending time on and for how long, and what tabs they have open and which one is focused at any given time - is useful for a number of research and practical applications. Gathering such data, however, requires that users install and use a monitoring tool over long periods of time. In contrast, browser extensions can gain instantaneous access months of browser history data. However, the browser history is incomplete: it records only navigation events, missing important information such as time spent or tab focused. In this work, we aim to reconstruct time spent on sites with only users' browsing histories. We gathered three months of browsing history and two weeks of ground-truth detailed browsing activity from 185 participants. We developed a machine learning algorithm that predicts whether the browser window is focused and active at one second-level granularity with an F1-score of 0.84. During periods when the browser is active, the algorithm can predict which the domain the user was looking at with 76.2% accuracy. We can use these results to reconstruct the total time spent online for each user with an R^2 value of 0.96, and the total time each user spent on each domain with an R^2 value of 0.92.
HCFeb 3, 2021
Edvertisements: Adding Microlearning to Social News Feeds and WebsitesGeza Kovacs
Many long-term goals, such as learning a language, require people to regularly practice every day to achieve mastery. At the same time, people regularly surf the web and read social news feeds in their spare time. We have built a browser extension that teaches vocabulary to users in the context of Facebook feeds and arbitrary websites, by showing users interactive quizzes they can answer without leaving the website. On Facebook, the quizzes show up as part of the news feed, while on other sites, the quizzes appear where advertisements normally would. In our user study, we examined the effectiveness of inserting microlearning tasks into social news feeds. We compared vocabulary learning rates when we inserted interactive quizzes into feeds, versus inserting links that lead them to a website where they could do the quizzes. Our results suggest that users engage with and learn from our embedded quizzes, and engagement increases when the quizzes can be done directly within their feeds.
HCFeb 3, 2021
QuizCram: A Quiz-Driven Lecture Viewing InterfaceGeza Kovacs, Darren Edge
QuizCram is an interface for navigating lecture videos that uses quizzes to help users determine what they should view. We developed it in response to observing peaks in video seeking behaviors centered around Coursera's in-video quizzes. QuizCram shows users a question to answer, with an associated video segment. Users can use these questions to navigate through video segments, and find video segments they need to review. We also allow users to review using a timeline of previously answered questions and videos. To encourage users to review the material, QuizCram keeps track of their question-answering and video-watching history and schedules sections they likely have not mastered for review. QuizCram-format materials can be generated from existing lectures with in-video quizzes. Our user study comparing QuizCram to in-video quizzes found that users practice answering and reviewing questions more when using QuizCram, and are better able to remember answers to questions they encountered.
HCJan 27, 2021
Not Now, Ask Later: Users Weaken Their Behavior Change Regimen Over Time, But Expect To Re-Strengthen It ImminentlyGeza Kovacs, Zhengxuan Wu, Michael S. Bernstein
How effectively do we adhere to nudges and interventions that help us control our online browsing habits? If we have a temporary lapse and disable the behavior change system, do we later resume our adherence, or has the dam broken? In this paper, we investigate these questions through log analyses of 8,000+ users on HabitLab, a behavior change platform that helps users reduce their time online. We find that, while users typically begin with high-challenge interventions, over time they allow themselves to slip into easier and easier interventions. Despite this, many still expect to return to the harder interventions imminently: they repeatedly choose to be asked to change difficulty again on the next visit, declining to have the system save their preference for easy interventions.
CLNov 11, 2020
The Impact of Text Presentation on Translator PerformanceSamuel Läubli, Patrick Simianer, Joern Wuebker et al.
Widely used computer-aided translation (CAT) tools divide documents into segments such as sentences and arrange them in a side-by-side, spreadsheet-like view. We present the first controlled evaluation of these design choices on translator performance, measuring speed and accuracy in three experimental text processing tasks. We find significant evidence that sentence-by-sentence presentation enables faster text reproduction and within-sentence error identification compared to unsegmented text, and that a top-and-bottom arrangement of source and target sentences enables faster text reproduction compared to a side-by-side arrangement. For revision, on the other hand, our results suggest that presenting unsegmented text results in the highest accuracy and time efficiency. Our findings have direct implications for best practices in designing CAT tools.