Omar Alonso

h-index: 25

10 papers

140 citations

Novelty: 32%

AI Score: 33

Ranked #118,047 of 194,257 authors (top 61%) · #1,230 in IR (top 57%)

10 Papers

3.9 · CL · Dec 15, 2022 · Code

Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks

Alexander Braylan, Omar Alonso, Matthew Lease

When annotators label data, a key metric for quality assurance is inter-annotator agreement (IAA): the extent to which annotators agree on their labels. Though many IAA measures exist for simple categorical and ordinal labeling tasks, relatively little work has considered more complex labeling tasks, such as structured, multi-object, and free-text annotations. Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability, but little work has studied its efficacy and consistency across complex annotation tasks. We investigate the design and evaluation of IAA measures for complex annotation tasks, with evaluation spanning seven diverse tasks: image bounding boxes, image keypoints, text sequence tagging, ranked lists, free text translations, numeric vectors, and syntax trees. We identify the difficulty of interpretability and the complexity of choosing a distance function as key obstacles in applying Krippendorff's alpha generally across these tasks. We propose two novel, more interpretable measures, showing they yield more consistent IAA measures across tasks and annotation distance functions.
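The distance-based formulation of Krippendorff's alpha mentioned in this abstract can be sketched as alpha = 1 - D_o / D_e, where D_o is the observed disagreement within items and D_e is the disagreement expected by chance across all annotations. The following is a minimal illustration, not the authors' code: the function name `krippendorff_alpha`, the input layout (one list of annotations per item), and the pluggable `distance` argument are assumptions made for this example.

```python
from itertools import permutations


def krippendorff_alpha(items, distance):
    """Distance-based Krippendorff's alpha (illustrative sketch).

    items    : list of lists; each inner list holds the annotations
               that different annotators gave to one item.
    distance : function(a, b) -> non-negative number; this is the
               task-specific choice the abstract calls hard to make.
    """
    # Only items with at least two annotations can be compared.
    units = [labels for labels in items if len(labels) >= 2]
    values = [v for labels in units for v in labels]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable annotations")

    # Observed disagreement: pairwise distances within each item,
    # with each item weighted by 1 / (m_u - 1).
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(labels, 2))
        / (len(labels) - 1)
        for labels in units
    ) / n

    # Expected disagreement: pairwise distances over all annotations,
    # ignoring which item they belong to.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all annotations are identical; alpha is undefined")

    return 1.0 - d_o / d_e


def squared_difference(a, b):
    # Example distance for numeric labels (an assumption for the demo).
    return (a - b) ** 2


perfect = krippendorff_alpha([[0, 0], [1, 1]], squared_difference)
systematic_disagreement = krippendorff_alpha([[0, 1], [0, 1]], squared_difference)
```

Swapping `squared_difference` for a task-specific distance (for example, one minus intersection-over-union for bounding boxes) is what lets the same formula apply to complex annotations; how to choose that distance is the difficulty the abstract highlights.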

7.7 · LG · Dec 20, 2023 · Code

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks

Alexander Braylan, Madalyn Marabella, Omar Alonso et al.

Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels. Many aggregation models have been proposed for categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks involving open-ended, multivariate, or structured responses. While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by devising a task-agnostic method to model distances between labels rather than the labels themselves. This article extends our prior work with investigation of three new research questions. First, how do complex annotation properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices to maximize aggregation accuracy? Finally, what diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct simulation studies and experiments on real, complex datasets. Regarding testing, we introduce unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. Beyond investigating these research questions above, we discuss the foundational concept of annotation complexity, present a new aggregation model as a bridge between traditional models and our own, and contribute a new semi-supervised learning method for complex label aggregation that outperforms prior work.
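To illustrate the idea of working with distances between labels rather than the labels themselves, the sketch below picks the annotation closest on average to all the others (a medoid). This is a simple distance-only baseline assumed for illustration, not the aggregation model proposed in the paper; the name `medoid_label` and the Euclidean distance used in the demo are hypothetical choices.

```python
import math


def medoid_label(labels, distance):
    """Return the annotation with the smallest mean distance to the rest.

    labels   : list of annotations for one item (any type).
    distance : function(a, b) -> non-negative number.
    """
    if not labels:
        raise ValueError("no annotations to aggregate")

    def mean_distance(i):
        # Average distance from annotation i to every other annotation.
        others = [
            distance(labels[i], labels[j])
            for j in range(len(labels))
            if j != i
        ]
        return sum(others) / len(others) if others else 0.0

    best_index = min(range(len(labels)), key=mean_distance)
    return labels[best_index]


# Three annotators give 2-D numeric vectors; the third is an outlier.
annotations = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
aggregated = medoid_label(annotations, math.dist)
```

Because the function only ever calls `distance`, the same code would apply to sequences, rankings, or bounding boxes once a suitable distance is supplied.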

4.9 · CL · Sep 9, 2025

Instance-level Performance Prediction for Long-form Generation Tasks

Chi-Yang Hsu, Alexander Braylan, Yiheng Su et al.

We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.

1.7 · IR · Aug 5, 2019

Local versus Global Strategies in Social Query Expansion

Omar Alonso, Vasileios Kandylas, Serge-Eric Tremblay

Link sharing in social media can be seen as a collaboratively retrieved set of documents for a query or topic expressed by a hashtag. Temporal information plays an important role in identifying the correct context in which such annotations are valid for retrieval purposes. We investigate how social data as temporal context can be used for query expansion and compare global versus local strategies for computing such contextual information for a set of hashtags.

3.1 · IR · Jun 14, 2019

Scalable Knowledge Graph Construction from Twitter

Omar Alonso, Vasileios Kandylas, Serge-Eric Tremblay

We describe a knowledge graph derived from Twitter data with the goal of discovering relationships between people, links, and topics. The goal is to filter out noise from Twitter and surface an inside-out view that relies on high quality content. The generated graph contains many relationships where the user can query and traverse the structure from different angles allowing the development of new applications.

2.7 · IR · Dec 10, 2016

Label Visualization and Exploration in IR

Omar Alonso

There is a renaissance in visual analytics systems for data analysis and sharing, in particular, in the current wave of big data applications. We introduce RAVE, a prototype that automates the generation of an interface that uses facets and visualization techniques for exploring and analyzing relevance assessments data sets collected via crowdsourcing. We present a technical description of the main components and demonstrate its use.

14.0 · AI · Nov 1, 2014

How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels

Ittai Abraham, Omar Alonso, Vasilis Kandylas et al.

Crowdsourcing has been part of the IR toolbox as a cheap and fast mechanism to obtain labels for system development and evaluation. Successful deployment of crowdsourcing at scale involves adjusting many variables, a very important one being the number of workers needed per human intelligence task (HIT). We consider the crowdsourcing task of learning the answer to simple multiple-choice HITs, which are representative of many relevance experiments. In order to provide statistically significant results, one often needs to ask multiple workers to answer the same HIT. A stopping rule is an algorithm that, given a HIT, decides for any given set of worker answers if the system should stop and output an answer or iterate and ask one more worker. Knowing the historic performance of a worker in the form of a quality score can be beneficial in such a scenario. In this paper we investigate how to devise better stopping rules given such quality scores. We also suggest adaptive exploration as a promising approach for scalable and automatic creation of ground truth. We conduct a data analysis on an industrial crowdsourcing platform, and use the observations from this analysis to design new stopping rules that use the workers' quality scores in a non-trivial manner. We then perform a simulation based on a real-world workload, showing that our algorithm performs better than the more naive approaches.
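A stopping rule of the kind described in this abstract can be sketched as a function that looks at the answers collected so far and decides whether to stop or ask one more worker. The version below is an illustrative assumption, not the rule designed in the paper: it sums worker quality scores per choice and stops once the leading choice is ahead of the runner-up by a fixed `margin`, or once a `max_workers` budget is reached. The function name and both parameters are hypothetical.

```python
from collections import defaultdict


def should_stop(answers, margin=1.5, max_workers=7):
    """Decide whether to stop asking workers about one HIT.

    answers : list of (choice, worker_quality) pairs collected so far,
              where worker_quality is a historical score in [0, 1].
    Returns (stop, answer); answer is None while stop is False.
    """
    if not answers:
        return False, None

    # Quality-weighted vote total for each choice.
    totals = defaultdict(float)
    for choice, quality in answers:
        totals[choice] += quality

    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    leader, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Stop when the lead is large enough or the worker budget is spent.
    lead_is_clear = (top_score - runner_up_score) >= margin
    budget_spent = len(answers) >= max_workers
    if lead_is_clear or budget_spent:
        return True, leader
    return False, None


after_one = should_stop([("A", 0.9)])
after_two = should_stop([("A", 0.9), ("A", 0.8)])
contested = should_stop([("A", 0.9), ("B", 0.8), ("A", 0.9)])
```

In this sketch, two high-quality workers who agree end the HIT early, while a contested HIT keeps requesting answers until the lead or the budget condition is met.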

3.5 · HC · Oct 10, 2014

A Study on Placement of Social Buttons in Web Pages

Omar Alonso, Vasilis Kandylas

With the explosion of social media in the last few years, web pages nowadays include different social network buttons where users can express whether they support or recommend content. Those social buttons are very visual and their presentations, along with the counters, mark the importance of the social network and the interest in the content. In this paper, we analyze the presence of four types of social buttons (Facebook, Twitter, Google+1, and LinkedIn) in a large collection of web pages that we tracked over a period of time. We report on the distribution and counts along with some characteristics per domain. Finally, we outline some research directions.

1.6 · LG · Jul 13, 2013

A Data Management Approach for Dataset Selection Using Human Computation

Alexandros Ntoulas, Omar Alonso, Vasilis Kandylas

As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly translates to training and working costs. Crowdsourcing platforms have made labeling cheaper and faster, but they still involve significant costs, especially for the cases where the potential set of candidate data to be labeled is large. In this paper we describe a methodology and a prototype system aiming at addressing this challenge for Web-scale problems in an industrial setting. We discuss ideas on how to efficiently select the data to use for training of machine learning algorithms in an attempt to reduce cost. We show results achieving good performance with reduced cost by carefully selecting which instances to label. Our proposed algorithm is presented as part of a framework for managing and generating training datasets, which includes, among other components, a human computation element.

17.4 · LG · Feb 13, 2013

Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem

Ittai Abraham, Omar Alonso, Vasilis Kandylas et al.

Very recently crowdsourcing has become the de facto platform for distributing and collecting human computation for a wide range of tasks and applications such as information retrieval, natural language processing and machine learning. Current crowdsourcing platforms have some limitations in the area of quality control. Most of the effort to ensure good quality has to be done by the experimenter who has to manage the number of workers needed to reach good results. We propose a simple model for adaptive quality control in crowdsourced multiple-choice tasks which we call the bandit survey problem. This model is related to, but technically different from, the well-known multi-armed bandit problem. We present several algorithms for this problem, and support them with analysis and simulations. Our approach is based on our experience conducting relevance evaluation for a large commercial search engine.