Geza Kovacs

h-index: 9

Papers: 5

Citations: 17

Novelty: 48%

AI Score: 29

Ranked #146,370 of 194,257 authors (top 75%) · #1,311 in HC (top 52%)

5 Papers

16.3 · CL · Feb 17, 2025 · Code

SMOL: Professionally translated parallel data for 115 under-represented languages

Isaac Caswell, Elizabeth Nielsen, Jiaming Luo et al. · mit

We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.

3.7 · HC · Feb 7, 2021

Reconstructing Detailed Browsing Activities from Browser History

Geza Kovacs

Users' detailed browsing activity - such as what sites they are spending time on and for how long, and what tabs they have open and which one is focused at any given time - is useful for a number of research and practical applications. Gathering such data, however, requires that users install and use a monitoring tool over long periods of time. In contrast, browser extensions can gain instantaneous access to months of browser history data. However, the browser history is incomplete: it records only navigation events, missing important information such as time spent or tab focused. In this work, we aim to reconstruct time spent on sites with only users' browsing histories. We gathered three months of browsing history and two weeks of ground-truth detailed browsing activity from 185 participants. We developed a machine learning algorithm that predicts whether the browser window is focused and active at one-second granularity with an F1-score of 0.84. During periods when the browser is active, the algorithm can predict which domain the user was looking at with 76.2% accuracy. We can use these results to reconstruct the total time spent online for each user with an R^2 value of 0.96, and the total time each user spent on each domain with an R^2 value of 0.92.

3.7 · HC · Feb 3, 2021

Edvertisements: Adding Microlearning to Social News Feeds and Websites

Geza Kovacs

Many long-term goals, such as learning a language, require people to practice every day to achieve mastery. At the same time, people regularly surf the web and read social news feeds in their spare time. We have built a browser extension that teaches vocabulary to users in the context of Facebook feeds and arbitrary websites, by showing users interactive quizzes they can answer without leaving the website. On Facebook, the quizzes show up as part of the news feed, while on other sites, the quizzes appear where advertisements normally would. In our user study, we examined the effectiveness of inserting microlearning tasks into social news feeds. We compared vocabulary learning rates when we inserted interactive quizzes into feeds, versus inserting links that lead users to a website where they could do the quizzes. Our results suggest that users engage with and learn from our embedded quizzes, and engagement increases when the quizzes can be done directly within their feeds.

3.7 · HC · Feb 3, 2021

QuizCram: A Quiz-Driven Lecture Viewing Interface

Geza Kovacs, Darren Edge

QuizCram is an interface for navigating lecture videos that uses quizzes to help users determine what they should view. We developed it in response to observing peaks in video seeking behaviors centered around Coursera's in-video quizzes. QuizCram shows users a question to answer, with an associated video segment. Users can use these questions to navigate through video segments, and find video segments they need to review. We also allow users to review using a timeline of previously answered questions and videos. To encourage users to review the material, QuizCram keeps track of their question-answering and video-watching history and schedules sections they likely have not mastered for review. QuizCram-format materials can be generated from existing lectures with in-video quizzes. Our user study comparing QuizCram to in-video quizzes found that users practice answering and reviewing questions more when using QuizCram, and are better able to remember answers to questions they encountered.

0.3 · CL · Nov 11, 2020

The Impact of Text Presentation on Translator Performance

Samuel Läubli, Patrick Simianer, Joern Wuebker et al.

Widely used computer-aided translation (CAT) tools divide documents into segments such as sentences and arrange them in a side-by-side, spreadsheet-like view. We present the first controlled evaluation of these design choices on translator performance, measuring speed and accuracy in three experimental text processing tasks. We find significant evidence that sentence-by-sentence presentation enables faster text reproduction and within-sentence error identification compared to unsegmented text, and that a top-and-bottom arrangement of source and target sentences enables faster text reproduction compared to a side-by-side arrangement. For revision, on the other hand, our results suggest that presenting unsegmented text results in the highest accuracy and time efficiency. Our findings have direct implications for best practices in designing CAT tools.