AI · May 22
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Shubham Agarwal, Alexander Krentsel, Shu Liu et al.
AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.
DB · May 22
The Time is Here for Just-in-Time Systems: Challenges and Opportunities
Shu Liu, Alexander Krentsel, Shubham Agarwal et al.
Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.
LG · Sep 4, 2024
NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
Sepanta Zeighami, Zac Wellmer, Aditya Parameswaran
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload at hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides a 3.3x and 4.3x larger increase in accuracy and runs 200x and 3x faster than fine-tuning the pre-trained model and training adaptors, respectively.
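To make the idea of modifying data embeddings directly more concrete, here is a toy sketch. It is not NUDGE's algorithm: the update rule (a step of norm `gamma` toward a training query, followed by renormalization), the two-dimensional vectors, and all function names are assumptions chosen for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(query, embeddings, k=1):
    """Return the ids of the k records with the largest inner product."""
    ranked = sorted(embeddings, key=lambda rid: dot(query, embeddings[rid]), reverse=True)
    return ranked[:k]


def nudge(embedding, query, gamma):
    """Hypothetical update: move a data embedding toward a unit-norm query
    by a step of norm gamma, then renormalize. The bound gamma keeps the
    change to the embedding modest."""
    moved = [e + gamma * q for e, q in zip(embedding, query)]
    return normalize(moved)


# Two data records and one training query whose relevant record is "a".
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
query = normalize([0.6, 0.8])

before = top_k(query, embeddings)   # "b" is closer to the query at first
embeddings["a"] = nudge(embeddings["a"], query, gamma=1.0)
after = top_k(query, embeddings)    # after the update, "a" ranks first
```

Only the stored record embedding changes; the embedding model and the query embedding are left untouched, which is what makes such an approach non-parametric.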
CL · Apr 20, 2025 · Code
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir, Shreya Shankar, Harrison Chase et al.
Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
DB · Apr 10
Horrila: Cost-Based Placement of Semantic Operators in Hybrid Query Plans
Qiuyang Mang, Yufan Xiang, Hangrui Zhou et al.
Recent database systems have introduced semantic operators that leverage large language models (LLMs) to filter, join, and project over structured data using natural language predicates. In practice, these operators are combined with traditional relational operators, e.g., equi-joins, producing hybrid query plans whose execution cost depends on both expensive LLM calls and conventional database processing. A key optimization question is where to place each semantic operator relative to the relational operators in the plan: placing them earlier reduces the data that subsequent operators process, but requires more LLM calls; placing them later reduces LLM calls through deduplication, but forces relational operators to process larger intermediate data. Existing systems either ignore this placement question or apply simple heuristics without considering the full cost trade-off. We present Horrila, a plan-level optimizer for hybrid semantic-relational queries. Horrila reduces hybrid query planning to semantic filter placement via two equivalence-preserving rewrites. We prove that deferring all semantic filters to the latest possible position minimizes LLM invocations under function caching, but show that this can cause relational processing costs to dominate on complex multi-table queries. To balance LLM cost against relational cost, Horrila uses a dynamic-programming-based cost model that finds the placement minimizing their weighted sum. On 44 semantic SQL queries across five schemas and two benchmarks, Horrila achieves up to 1.5$\times$ speedup and 4.29$\times$ cost reduction while maintaining high output quality: an average F1 of 0.85 against the unoptimized baseline and 0.84 against human-annotated ground truth on SemBench. Overall, Horrila achieves a significant cost reduction while preserving the highest accuracy among six publicly available systems.
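The placement trade-off described in this abstract can be seen in a toy example. The sketch below is not Horrila's optimizer or cost model; it only counts calls to a memoized stand-in for an LLM predicate when a semantic filter runs before or after an equi-join. The tables, the `is_wireless` predicate, and the function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical tables: (product_id, description) and (order_id, product_id).
PRODUCTS = [
    (1, "wireless mouse"),
    (2, "wired keyboard"),
    (3, "wireless headset"),
    (4, "usb hub"),
    (5, "wireless charger"),
]
ORDERS = [(101, 1), (102, 1), (103, 2), (104, 1)]


def make_predicate():
    """Return a memoized stand-in for an LLM predicate plus a call counter."""
    counter = {"calls": 0}

    @lru_cache(maxsize=None)
    def is_wireless(description):
        counter["calls"] += 1  # incremented only on cache misses
        return "wireless" in description

    return is_wireless, counter


def join(products, orders):
    """Equi-join orders with products on product_id."""
    by_id = dict(products)
    return [(oid, pid, by_id[pid]) for oid, pid in orders if pid in by_id]


def filter_then_join():
    pred, counter = make_predicate()
    kept = [(pid, desc) for pid, desc in PRODUCTS if pred(desc)]
    joined = join(kept, ORDERS)
    return sorted(joined), counter["calls"], len(joined)


def join_then_filter():
    pred, counter = make_predicate()
    joined = join(PRODUCTS, ORDERS)
    result = [row for row in joined if pred(row[2])]
    return sorted(result), counter["calls"], len(joined)


early_rows, early_calls, early_join_size = filter_then_join()
late_rows, late_calls, late_join_size = join_then_filter()
```

Both plans return the same three rows. Filtering first issues 5 predicate calls and produces a join output of 3 rows, while filtering last issues 2 calls (only two distinct descriptions survive the join) but the join produces 4 rows. A cost-based optimizer has to weigh these two effects against each other.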
AI · Mar 17, 2025
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z. Pan, Shuyi Yang et al. · berkeley
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating the headroom for improvement from better MAS design. Our analysis reveals that the identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
LG · May 4, 2020 · Code
Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development
Angela Lee, Doris Xin, Doris Lee et al.
It is well known that the process of developing machine learning (ML) workflows is a dark art; even experts struggle to find an optimal workflow leading to a high-accuracy model. Users currently rely on empirical trial-and-error to obtain their own set of battle-tested guidelines to inform their modeling decisions. In this study, we aim to demystify this dark art by understanding how people iterate on ML workflows in practice. We analyze over 475k user-generated workflows on OpenML, an open-source platform for tracking and sharing ML workflows. We find that users often adopt a manual, automated, or mixed approach when iterating on their workflows. We observe that manual approaches result in fewer wasted iterations compared to automated approaches. Yet, automated approaches often involve more preprocessing and hyperparameter options explored, resulting in higher performance overall--suggesting potential benefits for a human-in-the-loop ML system that appropriately recommends a clever combination of the two strategies.
HC · Apr 18, 2025
RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines
Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami et al.
Retrieval-augmented generation (RAG) pipelines have become the de facto approach for building AI assistants with access to external, domain-specific knowledge. Given a user query, RAG pipelines typically first retrieve (R) relevant information from external sources, before invoking a Large Language Model (LLM), augmented (A) with this information, to generate (G) responses. Modern RAG pipelines frequently chain multiple retrieval and generation components, in any order. However, developing effective RAG pipelines is challenging because retrieval and generation components are intertwined, making it hard to identify which component(s) cause errors in the eventual output. The parameters with the greatest impact on output quality often require hours of pre-processing after each change, creating prohibitively slow feedback cycles. To address these challenges, we present RAGGY, a developer tool that combines a Python library of composable RAG primitives with an interactive interface for real-time debugging. We contribute the design and implementation of RAGGY, insights into expert debugging patterns through a qualitative study with 12 engineers, and design implications for future RAG tools that better align with developers' natural workflows.
DB · Sep 2, 2025
Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees
Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran
Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM's outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.
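For readers unfamiliar with the model cascade framework mentioned in this abstract, a minimal sketch follows. It is not BARGAIN: it uses a fixed, hand-picked confidence threshold and offers no statistical guarantee, which is exactly the gap the paper targets. The two toy "models" and the records are hypothetical stand-ins for real LLM calls.

```python
def cheap_model(text):
    """Hypothetical affordable model: returns (label, confidence)."""
    words = set(text.split())
    if "great" in words:
        return "pos", 0.95
    if "awful" in words:
        return "neg", 0.90
    return "pos", 0.55  # unsure, falls back to a low-confidence guess


def expensive_model(text):
    """Hypothetical top-of-the-line model: returns a label only."""
    words = set(text.split())
    return "neg" if words & {"awful", "not"} else "pos"


def cascade(records, threshold):
    """Use the cheap model's answer when it is confident enough,
    otherwise escalate the record to the expensive model."""
    outputs, expensive_calls = [], 0
    for text in records:
        label, confidence = cheap_model(text)
        if confidence < threshold:
            label = expensive_model(text)
            expensive_calls += 1
        outputs.append(label)
    return outputs, expensive_calls


records = ["great phone", "awful battery", "it is fine I guess", "not worth it"]
outputs, expensive_calls = cascade(records, threshold=0.8)
```

With a threshold of 0.8, two of the four records are escalated. Lowering the threshold saves more cost but risks accepting wrong cheap-model answers, so the quality of the result hinges on how the threshold is chosen.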
DB · Feb 18, 2025
LLM-Powered Proactive Data Systems
Sepanta Zeighami, Yiming Lin, Shreya Shankar et al.
With the power of LLMs, we now have the ability to query data that was previously impossible to query, including text, images, and video. However, despite this enormous potential, most present-day data systems that leverage LLMs are reactive, reflecting our community's desire to map LLMs to known abstractions. Most data systems treat LLMs as an opaque black box that operates on user inputs and data as is, optimizing them much like any other approximate, expensive UDFs, in conjunction with other relational operators. Such data systems do as they are told, but fail to understand and leverage what the LLM is being asked to do (i.e., the underlying operations, which may be error-prone), the data the LLM is operating on (e.g., long, complex documents), or what the user really needs. They don't take advantage of the characteristics of the operations and/or the data at hand, or ensure correctness of results when there are imprecisions and ambiguities. We argue that data systems instead need to be proactive: they need to be given more agency -- armed with the power of LLMs -- to understand and rework the user inputs and the data and to make decisions on how the operations and the data should be represented and processed. By allowing the data system to parse, rewrite, and decompose user inputs and data, or to interact with the user in ways that go beyond the standard single-shot query-result paradigm, the data system is able to address user needs more efficiently and effectively. These new capabilities lead to a rich design space where data systems take more initiative: they are empowered to perform optimization based on the transformation operations, data characteristics, and user intent. We discuss various successful examples of how this framework has been and can be applied in real-world tasks, and present future directions for this ambitious research agenda.
SE · Aug 31, 2021
Towards Observability for Production Machine Learning Pipelines
Shreya Shankar, Aditya Parameswaran
Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these applications post-deployment is difficult due to lack of real-time feedback (i.e., labels) for predictions and silent failures that could occur at any component of the ML pipeline (e.g., data distribution shift or anomalous features). We propose a new type of data management system that offers end-to-end observability, or visibility into complex system behavior, for deployed ML pipelines through assisted (1) detection, (2) diagnosis, and (3) reaction to ML-related bugs. We describe new research challenges and suggest preliminary solution ideas in all three aspects. Finally, we introduce an example architecture for a "bolt-on" ML observability system, or one that wraps around existing tools in the stack.
DB · Mar 30, 2021
Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities
Doris Xin, Hui Miao, Aditya Parameswaran et al.
Machine learning (ML) is now commonplace, powering data-driven applications in various organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times on overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a specialized data model for representing and reasoning about repeatedly run components in these ML pipelines, which we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., identifying and pruning wasted computation that does not translate to model deployment, can reduce wasted computation cost by 50% without compromising the model deployment cadence.
HC · Feb 14, 2021
Deconstructing Categorization in Visualization Recommendation: A Taxonomy and Comparative Study
Doris Jung-Lin Lee, Vidya Setlur, Melanie Tory et al.
Visualization recommendation (VisRec) systems provide users with suggestions for potentially interesting and useful next steps during exploratory data analysis. These recommendations are typically organized into categories based on their analytical actions, i.e., operations employed to transition from the current exploration state to a recommended visualization. However, despite the emergence of a plethora of VisRec systems in recent work, the utility of the categories employed by these systems in analytical workflows has not been systematically investigated. Our paper explores the efficacy of recommendation categories by formalizing a taxonomy of common categories and developing a system, Frontier, that implements these categories. Using Frontier, we evaluate workflow strategies adopted by users and how categories influence those strategies. Participants found recommendations that add attributes to enhance the current visualization and recommendations that filter to sub-populations to be comparatively most useful during data exploration. Our findings pave the way for next-generation VisRec systems that are adaptive and personalized via carefully chosen, effective recommendation categories.
HC · Jan 13, 2021
Whither AutoML? Understanding the Role of Automation in Machine Learning Workflows
Doris Xin, Eva Yiwei Wu, Doris Jung-Lin Lee et al.
Efforts to make machine learning more widely accessible have led to a rapid increase in Auto-ML tools that aim to automate the process of training and deploying machine learning. To understand how Auto-ML tools are used in practice today, we performed a qualitative study with participants ranging from novice hobbyists to industry researchers who use Auto-ML tools. We present insights into the benefits and deficiencies of existing tools, as well as the respective roles of the human and automation in ML workflows. Finally, we discuss design implications for the future of Auto-ML tool development. We argue that instead of full automation being the ultimate goal of Auto-ML, designers of these tools should focus on supporting a partnership between the user and the Auto-ML tool. This means that a range of Auto-ML tools will need to be developed to support varying user goals such as simplicity, reproducibility, and reliability.
SE · Dec 13, 2020
Fine-Grained Lineage for Safer Notebook Interactions
Stephen Macke, Hongpu Gong, Doris Jung-Lin Lee et al.
Computational notebooks have emerged as the platform of choice for data science and analytical workflows, enabling rapid iteration and exploration. By keeping intermediate program state in memory and segmenting units of execution into so-called "cells", notebooks allow users to execute their workflows interactively and enjoy particularly tight feedback. However, as cells are added, removed, reordered, and rerun, this hidden intermediate state accumulates in a way that is not necessarily correlated with the notebook's visible code, making execution behavior difficult to reason about, and leading to errors and lack of reproducibility. We present NBSafety, a custom Jupyter kernel that uses runtime tracing and static analysis to automatically manage lineage associated with cell execution and global notebook state. NBSafety detects and prevents errors that users make during unaided notebook interactions, all while preserving the flexibility of existing notebook semantics. We evaluate NBSafety's ability to prevent erroneous interactions by replaying and analyzing 666 real notebook sessions. Of these, NBSafety identified 117 sessions with potential safety errors, and in the remaining 549 sessions, the cells that NBSafety identified as resolving safety issues were more than $7\times$ more likely to be selected by users for re-execution compared to a random baseline, even though the users were not using NBSafety and were therefore not influenced by its suggestions.
HC · Jul 26, 2019
SCATTERSEARCH: Visual Querying of Scatterplot Visualizations
Doris Jung-Lin Lee, Jaewoo Kim, Renxuan Wang et al.
Scatterplots are one of the simplest and most commonly used visualizations for understanding quantitative, multidimensional data. However, since scatterplots only depict two attributes at a time, analysts often need to manually generate and inspect large numbers of scatterplots to make sense of large datasets with many attributes. We present a visual query system for scatterplots, SCATTERSEARCH, that enables users to visually search and browse through large collections of scatterplots. Users can query for other visualizations based on a region of interest or find other scatterplots that "look similar" to a selected one. We present two demo scenarios, provide a system overview of SCATTERSEARCH, and outline future directions.
DB · Dec 14, 2018
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
Doris Xin, Stephen Macke, Litian Ma et al.
Machine learning workflow development is a process of trial-and-error: developers iterate on workflows by testing out small modifications until the desired accuracy is achieved. Unfortunately, existing machine learning systems focus narrowly on model training---a small fraction of the overall development time---and neglect to address iterative development. We propose Helix, a machine learning system that optimizes the execution across iterations---intelligently caching and reusing, or recomputing intermediates as appropriate. Helix captures a wide variety of application needs within its Scala DSL, with succinct syntax defining unified processes for data preprocessing, model specification, and learning. We demonstrate that the reuse problem can be cast as a Max-Flow problem, while the caching problem is NP-Hard. We develop effective lightweight heuristics for the latter. Empirical evaluation shows that Helix is not only able to handle a wide variety of use cases in one unified workflow but also much faster, providing run time reductions of up to 19x over state-of-the-art systems, such as DeepDive or KeystoneML, on four real-world applications in natural language processing, computer vision, social and natural sciences.
LG · Aug 3, 2018
Helix: Accelerating Human-in-the-loop Machine Learning
Doris Xin, Litian Ma, Jialin Liu et al.
Data application developers and data scientists spend an inordinate amount of time iterating on machine learning (ML) workflows -- by modifying the data pre-processing, model training, and post-processing steps -- via trial-and-error to achieve the desired model performance. Existing work on accelerating machine learning focuses on speeding up one-shot execution of workflows, failing to address the incremental and dynamic nature of typical ML development. We propose Helix, a declarative machine learning system that accelerates iterative development by optimizing workflow execution end-to-end and across iterations. Helix minimizes the runtime per iteration via program analysis and intelligent reuse of previous results, which are selectively materialized -- trading off the cost of materialization for potential future benefits -- to speed up future iterations. Additionally, Helix offers a graphical interface to visualize workflow DAGs and compare versions to facilitate iterative development. Through two ML applications, in classification and in structured prediction, attendees will experience the succinctness of the Helix programming interface and the speed and ease of iterative development using Helix. In our evaluations, Helix achieved up to an order of magnitude reduction in cumulative run time compared to state-of-the-art machine learning tools.
LG · Mar 27, 2018
How Developers Iterate on Machine Learning Workflows -- A Survey of the Applied Machine Learning Literature
Doris Xin, Litian Ma, Shuchen Song et al.
Machine learning workflow development is anecdotally regarded as an iterative process of trial-and-error with humans-in-the-loop. However, we are not aware of quantitative evidence corroborating this popular belief. A quantitative characterization of iteration can serve as a benchmark for machine learning workflow development in practice, and can aid the development of human-in-the-loop machine learning systems. To this end, we conduct a small-scale survey of the applied machine learning literature from five distinct application domains. We collect and distill statistics on the role of iteration within machine learning workflow development, and report preliminary trends and insights from our investigation, as a starting point towards this benchmark. Based on our findings, we finally describe desiderata for effective and versatile human-in-the-loop machine learning systems that can cater to users in diverse domains.
LG · Feb 10, 2018
The Importance of Norm Regularization in Linear Graph Embedding: Theoretical Analysis and Empirical Demonstration
Yihan Gao, Chao Zhang, Jian Peng et al.
Learning distributed representations for nodes in graphs is a crucial primitive in network analysis with a wide spectrum of applications. Linear graph embedding methods learn such representations by optimizing the likelihood of both positive and negative edges while constraining the dimension of the embedding vectors. We argue that the generalization performance of these methods is not due to the dimensionality constraint as commonly believed, but rather the small norm of embedding vectors. Both theoretical and empirical evidence are provided to support this argument: (a) we prove that the generalization error of these methods can be bounded by limiting the norm of vectors, regardless of the embedding dimension; (b) we show that the generalization performance of linear graph embedding methods is correlated with the norm of embedding vectors, which is small due to the early stopping of SGD and the vanishing gradients. We performed extensive experiments to validate our analysis and showcased the importance of proper norm regularization in practice.
HC · Jan 11, 2018
Characterizing Scalability Issues in Spreadsheet Software using Online Forums
Kelly Mack, John Lee, Kevin Chang et al.
In traditional usability studies, researchers talk to users of tools to understand their needs and challenges. Insights gained via such interviews offer context, detail, and background. Due to costs in time and money, we are beginning to see a new form of tool interrogation that prioritizes scale, cost, and breadth by utilizing existing data from online forums. In this case study, we set out to apply this method of using online forum data to a specific issue---challenges that users face with Excel spreadsheets. Spreadsheets are a versatile and powerful processing tool if used properly. However, with versatility and power come errors, from both users and the software, which make using spreadsheets less effective. By scraping posts from the website Reddit, we collected a dataset of questions and complaints about Excel. Specifically, we explored and characterized the issues users were facing with spreadsheet software in general, and in particular, as resulting from a large amount of data in their spreadsheets. We discuss the implications of our findings on the design of next-generation spreadsheet software.
DB · Oct 2, 2017
You can't always sketch what you want: Understanding Sensemaking in Visual Query Systems
Doris Jung-Lin Lee, John Lee, Tarique Siddiqui et al.
Visual query systems (VQSs) empower users to interactively search for line charts with desired visual patterns, typically specified using intuitive sketch-based interfaces. Despite decades of past work on VQSs, these efforts have not translated to adoption in practice, possibly because VQSs are largely evaluated in unrealistic lab-based settings. To remedy this gap in adoption, we collaborated with experts from three diverse domains---astronomy, genetics, and material science---via a year-long user-centered design process to develop a VQS that supports their workflow and analytical needs, and evaluate how VQSs can be used in practice. Our study results reveal that ad-hoc sketch-only querying is not as commonly used as prior work suggests, since analysts are often unable to precisely express their patterns of interest. In addition, we characterize three essential sensemaking processes supported by our enhanced VQS. We discover that participants employ all three processes, but in different proportions, depending on the analytical needs in each domain. Our findings suggest that all three sensemaking processes must be integrated in order to make future VQSs useful for a wide range of analytical inquiries.
HC · Jan 22, 2017
Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace
Ayush Jain, Akash Das Sarma, Aditya Parameswaran et al.
We conduct an experimental analysis of a dataset comprising over 27 million microtasks issued to a large crowdsourcing marketplace between 2012 and 2016 and performed by over 70,000 workers. Using this data---never before analyzed in an academic context---we shed light on three crucial aspects of crowdsourcing: (1) Task design --- helping requesters understand what constitutes an effective task, and how to go about designing one; (2) Marketplace dynamics --- helping marketplace administrators and designers understand the interaction between tasks and workers, and the corresponding marketplace load; and (3) Worker behavior --- understanding worker attention spans, lifetimes, and general behavior, for the improvement of the crowdsourcing ecosystem as a whole.
HC · Oct 17, 2016
Optimizing Open-Ended Crowdsourcing: The Next Frontier in Crowdsourced Data Management
Aditya Parameswaran, Akash Das Sarma, Vipul Venkataraman
Crowdsourcing is the primary means to generate training data at scale, and when combined with sophisticated machine learning algorithms, crowdsourcing is an enabler for a variety of emergent automated applications impacting all spheres of our lives. This paper surveys the emerging field of formally reasoning about and optimizing open-ended crowdsourcing, a popular and crucially important, but severely understudied class of crowdsourcing---the next frontier in crowdsourced data management. The underlying challenges include distilling the right answer when none of the workers agree with each other, teasing apart the various perspectives adopted by workers when answering tasks, and effectively selecting between the many open-ended operators appropriate for a problem. We describe the approaches that we've found to be effective for open-ended crowdsourcing, drawing from our experiences in this space.
LG · Jun 9, 2015
On the Interpretability of Conditional Probability Estimates in the Agnostic Setting
Yihan Gao, Aditya Parameswaran, Jian Peng
We study the interpretability of conditional probability estimates for binary classification under the agnostic setting. Under the agnostic setting, conditional probability estimates do not necessarily reflect the true conditional probabilities. Instead, they have a certain calibration property: among all data points for which the classifier has predicted P(Y = 1|X) = p, a p fraction of them actually have label Y = 1. For cost-sensitive decision problems, this calibration property provides adequate support for us to use Bayes Decision Theory. In this paper, we define a novel measure for the calibration property together with its empirical counterpart, and prove a uniform convergence result between them. This new measure enables us to formally justify the calibration property of conditional probability estimates, and provides new insights on the problem of estimating and calibrating conditional probabilities.
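The calibration property stated in this abstract can be checked empirically on a small sample. The sketch below is not the measure defined in the paper; it simply groups points by their predicted probability and compares each predicted value with the observed fraction of positive labels. The function names and the data are made up for illustration.

```python
from collections import defaultdict


def calibration_table(predicted, labels):
    """Map each distinct predicted probability p to
    (number of points, observed fraction of points with Y = 1)."""
    groups = defaultdict(list)
    for p, y in zip(predicted, labels):
        groups[p].append(y)
    return {p: (len(ys), sum(ys) / len(ys)) for p, ys in groups.items()}


def max_calibration_gap(predicted, labels):
    """Largest absolute difference between a predicted probability
    and the observed positive fraction in its group."""
    table = calibration_table(predicted, labels)
    return max(abs(p - fraction) for p, (_, fraction) in table.items())


# A perfectly calibrated toy classifier: 1 of 5 positives at p = 0.2,
# and 4 of 5 positives at p = 0.8.
predicted = [0.2] * 5 + [0.8] * 5
labels = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
well_calibrated_gap = max_calibration_gap(predicted, labels)

# An overconfident toy classifier: predicts 0.9 but only 1 of 4 is positive.
overconfident_gap = max_calibration_gap([0.9] * 4, [1, 0, 0, 0])
```

A gap near zero means the estimates behave as the calibration property requires on this sample, even if they do not equal the true conditional probabilities for individual points.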
LG · Aug 15, 2014
Indexing Cost Sensitive Prediction
Leilani Battle, Edward Benson, Aditya Parameswaran et al.
Predictive models are often used for real-time decision making. However, typical machine learning techniques ignore feature evaluation cost, and focus solely on the accuracy of the machine learning models obtained utilizing all the features available. We develop algorithms and indexes to support cost-sensitive prediction, i.e., making decisions using machine learning models taking feature evaluation cost into account. Given an item and an online computation cost (i.e., time) budget, we present two approaches to return an appropriately chosen machine learning model that will run within the specified time on the given item. The first approach returns the optimal machine learning model, i.e., one with the highest accuracy, that runs within the specified time, but requires significant up-front precomputation time. The second approach returns a possibly suboptimal machine learning model, but requires little up-front precomputation time. We study these two algorithms in detail and characterize the scenarios (using real and synthetic data) in which each performs well. Unlike prior work that focuses on a narrow domain or a specific algorithm, our techniques are very general: they apply to any cost-sensitive prediction scenario on any machine learning algorithm.
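The selection problem in this abstract can be stated in a few lines of code. The sketch below is a brute-force scan over a handful of candidate models, not the indexes or algorithms developed in the paper; the candidate models, their accuracies, and the per-feature costs are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical cost, in milliseconds, of evaluating each feature online.
FEATURE_COST_MS = {"length": 1, "tfidf": 5, "embedding": 20}


@dataclass(frozen=True)
class Candidate:
    name: str
    features: tuple
    accuracy: float


def evaluation_cost(candidate):
    """Total cost of computing every feature the model needs."""
    return sum(FEATURE_COST_MS[f] for f in candidate.features)


def best_within_budget(candidates, budget_ms):
    """Return the most accurate model whose feature cost fits the budget,
    or None when no model is affordable."""
    feasible = [c for c in candidates if evaluation_cost(c) <= budget_ms]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None


CANDIDATES = [
    Candidate("small", ("length",), 0.70),
    Candidate("medium", ("length", "tfidf"), 0.82),
    Candidate("large", ("length", "tfidf", "embedding"), 0.90),
]

choice = best_within_budget(CANDIDATES, budget_ms=10)
```

With a 10 ms budget the scan picks the `medium` model (cost 6 ms), since `large` costs 26 ms. A linear scan like this is fine for three candidates but does not scale to the large model spaces that motivate precomputed indexes.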