HC Apr 10
DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking
Yiheng Bian, Yunpeng Song, Guiyu Ma et al.
Information seeking on mobile devices is often fragmented, trapping users in repetitive cycles of context switching and data re-entry, which increases cognitive load and disrupts workflow. Existing mobile agents provide limited cross-source integration and are largely opaque, presenting progress as a linear feed with few opportunities to intervene, steer, or take control. We present DroidRetriever, a transparent, steerable system for cross-source mobile information seeking. It accepts voice or typed input, and its multi-LLM system decomposes the task, navigates to target pages, takes screenshots, and synthesizes a concise report with citation-linked screenshots. We make the process transparent through a progress dashboard that combines sub-task progress with real-time exploration maps and supports seamless takeover. DroidRetriever also pauses on detected privacy or high-risk screens and prompts intervention. Across 35 tasks over 24 apps, experiments and user studies demonstrate improved coverage and transparency and reduced workload. We release our code at https://github.com/AkimotoAyako/DroidRetriever.
CL Jun 9, 2024
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Weiping Fu, Bifan Wei, Jianxiang Hu et al.
Automatically generated questions often suffer from problems such as unclear expression or factual inaccuracies, requiring a reliable and comprehensive evaluation of their quality. Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics. However, there is a lack of unified human evaluation criteria, which hampers consistent and reliable evaluations of both QG models and automatic metrics. To address this, we propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions: fluency, clarity, conciseness, relevance, consistency, answerability, and answer consistency. We demonstrate the appropriateness of these dimensions by examining their correlations and distinctions. Through consistent evaluations of QG models and automatic metrics with QGEval, we find that 1) most QG models perform unsatisfactorily in terms of answerability and answer consistency, and 2) existing metrics fail to align well with human judgments when evaluating generated questions across the 7 dimensions. We expect this work to foster the development of both QG technologies and their evaluation.
IR Jun 15, 2015
Re-scale AdaBoost for Attack Detection in Collaborative Filtering Recommender Systems
Zhihai Yang, Lin Xu, Zhongmin Cai
Collaborative filtering recommender systems (CFRSs) are key components of successful e-commerce systems. However, CFRSs are highly vulnerable to attacks because of their openness. Moreover, since the number of attack profiles is far smaller than that of genuine users, conventional detection methods based on supervised learning could be too "dull" to handle such imbalanced classification. In this paper, we improve detection performance in the following two respects. First, we extract well-designed features from user profiles based on the statistical properties of diverse attack models, making the hard classification task easier to perform. Then, drawing on the general idea of re-scale Boosting (RBoosting) and AdaBoost, we apply a variant of AdaBoost, called re-scale AdaBoost (RAdaBoost), as our detection method based on the extracted features. RAdaBoost is comparable to the optimal Boosting-type algorithm and can effectively improve performance in some hard scenarios. Finally, a series of experiments on the MovieLens-100K data set demonstrates that RAdaBoost outperforms classical techniques such as SVM, kNN and AdaBoost.