CLJun 2, 2025
WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web TasksAtsuyuki Miyai, Zaiying Zhao, Kazuki Egashira et al.
Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
CLAug 25, 2023
Media of Langue: The Interface for Exploring Word Translation Network/SpaceGoki Muramoto, Atsuki Sato, Takayoshi Koyama
In the human activity of word translation, two languages face each other, mutually searching their own language system for the semantic place of words in the other language. We discover the huge network formed by the chain of these mutual translations as Word Translation Network, a network where words are nodes, and translation volume is represented as edges, and propose Media of Langue, a novel interface for exploring this network. Media of Langue points to the semantic configurations of many words in multiple languages at once, containing the information of existing dictionaries such as bilingual and synonym dictionaries. We have also implemented and published this interface as a web application, focusing on seven language pairs. This paper first defines the Word Translation Network and describes how to actually construct the network from bilingual corpora, followed by an analysis of the properties of the network. Next, we explain how to design a Media of Langue using the Word Translation Network, and finally, we analyze the features of the Media of Langue as a dictionary. Our website is https://www.media-of-langue.org .
LGDec 13, 2025
Optimized Learned Count-Min SketchKyosuke Nishishita, Atsuki Sato, Yusuke Matsui
Count-Min Sketch (CMS) is a memory-efficient data structure for estimating the frequency of elements in a multiset. Learned Count-Min Sketch (LCMS) enhances CMS with a machine learning model to reduce estimation error under the same memory usage, but suffers from slow construction due to empirical parameter tuning and lacks theoretical guarantees on intolerable error probability. We propose Optimized Learned Count-Min Sketch (OptLCMS), which partitions the input domain and assigns each partition to its own CMS instance, with CMS parameters analytically derived for fixed thresholds, and thresholds optimized via dynamic programming with approximate feasibility checks. This reduces the need for empirical validation, enabling faster construction while providing theoretical guarantees under these assumptions. OptLCMS also allows explicit control of the allowable error threshold, improving flexibility in practice. Experiments show that OptLCMS builds faster, achieves lower intolerable error probability, and matches the estimation accuracy of LCMS.
DSFeb 6, 2025
Cascaded Learned Bloom Filter for Optimal Model-Filter Size Balance and Fast RejectionAtsuki Sato, Yusuke Matsui
Recent studies have demonstrated that learned Bloom filters, which combine machine learning with the classical Bloom filter, can achieve superior memory efficiency. However, existing learned Bloom filters face two critical unresolved challenges: the balance between the machine learning model size and the Bloom filter size is not optimal, and the reject time cannot be minimized effectively. We propose the Cascaded Learned Bloom Filter (CLBF) to address these issues. Our dynamic programming-based optimization automatically selects configurations that achieve an optimal balance between the model and filter sizes while minimizing reject time. Experiments on real-world datasets show that CLBF reduces memory usage by up to 24% and decreases reject time by up to 14 times compared to state-of-the-art learned Bloom filters.