CLOct 27, 2023Code
ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language ModelsBenjamin Feuer, Yurong Liu, Chinmay Hegde et al.
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
LGApr 17, 2023
eTOP: Early Termination of Pipelines for Faster Training of AutoML SystemsHaoxiang Zhang, Juliana Freire, Yash Garg
Recent advancements in software and hardware technologies have enabled the use of AI/ML models in everyday applications has significantly improved the quality of service rendered. However, for a given application, finding the right AI/ML model is a complex and costly process, that involves the generation, training, and evaluation of multiple interlinked steps (called pipelines), such as data pre-processing, feature engineering, selection, and model tuning. These pipelines are complex (in structure) and costly (both in compute resource and time) to execute end-to-end, with a hyper-parameter associated with each step. AutoML systems automate the search of these hyper-parameters but are slow, as they rely on optimizing the pipeline's end output. We propose the eTOP Framework which works on top of any AutoML system and decides whether or not to execute the pipeline to the end or terminate at an intermediate step. Experimental evaluation on 26 benchmark datasets and integration of eTOPwith MLBox4 reduces the training time of the AutoML system upto 40x than baseline MLBox.
HCJul 28, 2025
BDIViz: An Interactive Visualization System for Biomedical Schema Matching with LLM-Powered ValidationEden Wu, Dishita G Turakhia, Guande Wu et al.
Biomedical data harmonization is essential for enabling exploratory analyses and meta-studies, but the process of schema matching - identifying semantic correspondences between elements of disparate datasets (schemas) - remains a labor-intensive and error-prone task. Even state-of-the-art automated methods often yield low accuracy when applied to biomedical schemas due to the large number of attributes and nuanced semantic differences between them. We present BDIViz, a novel visual analytics system designed to streamline the schema matching process for biomedical data. Through formative studies with domain experts, we identified key requirements for an effective solution and developed interactive visualization techniques that address both scalability challenges and semantic ambiguity. BDIViz employs an ensemble approach that combines multiple matching methods with LLM-based validation, summarizes matches through interactive heatmaps, and provides coordinated views that enable users to quickly compare attributes and their values. Our method-agnostic design allows the system to integrate various schema matching algorithms and adapt to application-specific needs. Through two biomedical case studies and a within-subject user study with domain experts, we demonstrate that BDIViz significantly improves matching accuracy while reducing cognitive load and curation time compared to baseline approaches.
IRApr 12
BDIViz in Action: Interactive Curation and Benchmarking for Schema Matching MethodsEden Wu, Christos Koutras, Cláudio T. Silva et al.
Schema matching remains fundamental to data integration, yet evaluating and comparing matching methods is hindered by limited benchmark diversity and lack of interactive validation frameworks. BDIViz, recently published at IEEE VIS 2025, is an interactive visualization system for schema matching with LLM-assisted validation. Given source and target datasets, BDIViz applies automatic matching methods and visualizes candidates in an interactive heatmap with hierarchical navigation, zoom, and filtering. Users validate matches directly in the heatmap and inspect ambiguous cases using coordinated views that show attribute descriptions, example values, and distributions. An LLM assistant generates structured explanations for selected candidates to support decision-making. This demonstration showcases a new extension to BDIViz that addresses a critical need in data integration research: human-in-the-loop benchmarking and iterative matcher development. New matchers can be integrated through a standardized interface, while user validations become evolving ground truth for real-time performance evaluation. This enables benchmarking new algorithms, constructing high-quality ground-truth datasets through expert validation, and comparing matcher behavior across diverse schemas and domains. We demonstrate two complementary scenarios: (i) data harmonization, where users map a large tabular dataset to a target schema with value-level inspection and LLM-generated explanations; and (ii) developer-in-the-loop benchmarking, where developers integrate custom matchers, observe performance metrics, and refine their algorithms.
SESep 6, 2013Code
Enabling Reproducible Science with VisTrailsDavid Koop, Juliana Freire, Claudio T. Silva
With the increasing amount of data and use of computation in science, software has become an important component in many different domains. Computing is now being used more often and in more aspects of scientific work including data acquisition, simulation, analysis, and visualization. To ensure reproducibility, it is important to capture the different computational processes used as well as their executions. VisTrails is an open-source scientific workflow system for data analysis and visualization that seeks to address the problem of integrating varied tools as well as automatically documenting the methods and parameters employed. Growing from a specific project need to supporting a wide array of users required close collaborations in addition to new research ideas to design a usable and efficient system. The VisTrails project now includes standard software processes like unit testing and developer documentation while serving as a base for further research. In this paper, we describe how VisTrails has developed and how our efforts in structuring and advertising the system have contributed to its adoption in many domains.
DBDec 11, 2024
Magneto: Combining Small and Large Language Models for Schema MatchingYurong Liu, Eduardo Pena, Aecio Santos et al.
Recent advances in language models opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, but they have also uncovered important limitations: Small language models (SLMs) require training data (which can be both expensive and challenging to obtain), and large language models (LLMs) often incur high computational costs and must deal with constraints imposed by context windows. We present Magneto, a cost-effective and accurate solution for schema matching that combines the advantages of SLMs and LLMs to address their limitations. By structuring the schema matching pipeline in two phases, retrieval and reranking, Magneto can use computationally efficient SLM-based strategies to derive candidate matches which can then be reranked by LLMs, thus making it possible to reduce runtime without compromising matching accuracy. We propose a self-supervised approach to fine-tune SLMs which uses LLMs to generate syntactically diverse training data, and prompting strategies that are effective for reranking. We also introduce a new benchmark, developed in collaboration with domain experts, which includes real biomedical datasets and presents new challenges to schema matching methods. Through a detailed experimental evaluation, using both our new and existing benchmarks, we show that Magneto is scalable and attains high accuracy for datasets from different domains.
AIApr 7
BDI-Kit Demo: A Toolkit for Programmable and Conversational Data HarmonizationRoque Lopez, Yurong Liu, Christos Koutras et al.
Data harmonization remains a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions. BDI-Kit provides an extensible toolkit for schema and value matching. It exposes two complementary interfaces tailored to different user needs: a Python API enabling developers to construct harmonization pipelines programmatically, and an AI-assisted chat interface allowing domain experts to harmonize data through natural language dialogue. This demonstration showcases how users interact with BDI-Kit to iteratively explore, validate, and refine schema and value matches through a combination of automated matching, AI-assisted reasoning, and user-driven refinement. We present two scenarios: (i) using the Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and (ii) conversing with the AI assistant in natural language to access BDI-Kit's capabilities and iteratively refine outputs based on the assistant's suggestions.
AIFeb 10, 2025
Interactive Data Harmonization with LLM Agents: Opportunities and ChallengesAécio Santos, Eduardo H. M. Pena, Roque Lopez et al.
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
LGApr 29, 2025
A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online MarketplacesJuliana Barbosa, Ulhas Gondhali, Gohar Petrossian et al.
Wildlife trafficking remains a critical global issue, significantly impacting biodiversity, ecological stability, and public health. Despite efforts to combat this illicit trade, the rise of e-commerce platforms has made it easier to sell wildlife products, putting new pressure on wild populations of endangered and threatened species. The use of these platforms also opens a new opportunity: as criminals sell wildlife products online, they leave digital traces of their activity that can provide insights into trafficking activities as well as how they can be disrupted. The challenge lies in finding these traces. Online marketplaces publish ads for a plethora of products, and identifying ads for wildlife-related products is like finding a needle in a haystack. Learning classifiers can automate ad identification, but creating them requires costly, time-consuming data labeling that hinders support for diverse ads and research questions. This paper addresses a critical challenge in the data science pipeline for wildlife trafficking analytics: generating quality labeled data for classifiers that select relevant data. While large language models (LLMs) can directly label advertisements, doing so at scale is prohibitively expensive. We propose a cost-effective strategy that leverages LLMs to generate pseudo labels for a small sample of the data and uses these labels to create specialized classification models. Our novel method automatically gathers diverse and representative samples to be labeled while minimizing the labeling costs. Our experimental evaluation shows that our classifiers achieve up to 95% F1 score, outperforming LLMs at a lower cost. We present real use cases that demonstrate the effectiveness of our approach in enabling analyses of different aspects of wildlife trafficking.
GRJul 25, 2025
TiVy: Time Series Visual Summary for Scalable VisualizationGromit Yeuk-Yin Chan, Luis Gustavo Nonato, Themis Palpanas et al.
Visualizing multiple time series presents fundamental tradeoffs between scalability and visual clarity. Time series capture the behavior of many large-scale real-world processes, from stock market trends to urban activities. Users often gain insights by visualizing them as line charts, juxtaposing or superposing multiple time series to compare them and identify trends and patterns. However, existing representations struggle with scalability: when covering long time spans, leading to visual clutter from too many small multiples or overlapping lines. We propose TiVy, a new algorithm that summarizes time series using sequential patterns. It transforms the series into a set of symbolic sequences based on subsequence visual similarity using Dynamic Time Warping (DTW), then constructs a disjoint grouping of similar subsequences based on the frequent sequential patterns. The grouping result, a visual summary of time series, provides uncluttered superposition with fewer small multiples. Unlike common clustering techniques, TiVy extracts similar subsequences (of varying lengths) aligned in time. We also present an interactive time series visualization that renders large-scale time series in real-time. Our experimental evaluation shows that our algorithm (1) extracts clear and accurate patterns when visualizing time series data, (2) achieves a significant speed-up (1000X) compared to a straightforward DTW clustering. We also demonstrate the efficiency of our approach to explore hidden structures in massive time series data in two usage scenarios.
DSJan 29, 2025
Matrix Product Sketching via Coordinated SamplingMajid Daliri, Juliana Freire, Danrong Li et al.
We revisit the well-studied problem of approximating a matrix product, $\mathbf{A}^T\mathbf{B}$, based on small space sketches $\mathcal{S}(\mathbf{A})$ and $\mathcal{S}(\mathbf{B})$ of $\mathbf{A} \in \R^{n \times d}$ and $\mathbf{B}\in \R^{n \times m}$. We are interested in the setting where the sketches must be computed independently of each other, except for the use of a shared random seed. We prove that, when $\mathbf{A}$ and $\mathbf{B}$ are sparse, methods based on \emph{coordinated random sampling} can outperform classical linear sketching approaches, like Johnson-Lindenstrauss Projection or CountSketch. For example, to obtain Frobenius norm error $ε\|\mathbf{A}\|_F\|\mathbf{B}\|_F$, coordinated sampling requires sketches of size $O(s/ε^2)$ when $\mathbf{A}$ and $\mathbf{B}$ have at most $s \leq d,m$ non-zeros per row. In contrast, linear sketching leads to sketches of size $O(d/ε^2)$ and $O(m/ε^2)$ for $\mathbf{A}$ and $\mathbf{B}$. We empirically evaluate our approach on two applications: 1) distributed linear regression in databases, a problem motivated by tasks like dataset discovery and augmentation, and 2) approximating attention matrices in transformer-based language models. In both cases, our sampling algorithms yield an order of magnitude improvement over linear sketching.
LGNov 3, 2021
AlphaD3M: Machine Learning Pipeline SynthesisIddo Drori, Yamuna Krishnamurthy, Remi Rampin et al.
We introduce AlphaD3M, an automatic machine learning (AutoML) system based on meta reinforcement learning using sequence models with self play. AlphaD3M is based on edit operations performed over machine learning pipeline primitives providing explainability. We compare AlphaD3M with state-of-the-art AutoML systems: Autosklearn, Autostacker, and TPOT, on OpenML datasets. AlphaD3M achieves competitive performance while being an order of magnitude faster, reducing computation time from hours to minutes, and is explainable by design.
DBApr 7, 2021
Correlation Sketches for Approximate Join-Correlation QueriesAécio Santos, Aline Bessa, Fernando Chirigati et al.
The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities~to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
IRFeb 10, 2021
Auctus: A Dataset Search Engine for Data AugmentationSonia Castelo, Rémi Rampin, Aécio Santos et al.
The large volumes of structured data currently available, from Web tables to open-data portals and enterprise data, open up new opportunities for progress in answering many important scientific, societal, and business questions. However, finding relevant data is difficult. While search engines have addressed this problem for Web documents, there are many new challenges involved in supporting the discovery of structured data. We demonstrate how the Auctus dataset search engine addresses some of these challenges. We describe the system architecture and how users can explore datasets through a rich set of queries. We also present case studies which show how Auctus supports data augmentation to improve machine learning models as well as to enrich analytics.
HCSep 1, 2020
Towards Evaluating Exploratory Model Building Process with AutoML SystemsSungsoo Ray Hong, Sonia Castelo, Vito D'Orazio et al.
The use of Automated Machine Learning (AutoML) systems are highly open-ended and exploratory. While rigorously evaluating how end-users interact with AutoML is crucial, establishing a robust evaluation methodology for such exploratory systems is challenging. First, AutoML is complex, including multiple sub-components that support a variety of sub-tasks for synthesizing ML pipelines, such as data preparation, problem specification, and model generation, making it difficult to yield insights that tell us which components were successful or not. Second, because the usage pattern of AutoML is highly exploratory, it is not possible to rely solely on widely used task efficiency and effectiveness metrics as success metrics. To tackle the challenges in evaluation, we propose an evaluation methodology that (1) guides AutoML builders to divide their AutoML system into multiple sub-system components, and (2) helps them reason about each component through visualization of end-users' behavioral patterns and attitudinal data. We conducted a study to understand when, how, why, and applying our methodology can help builders to better understand their systems and end-users. We recruited 3 teams of professional AutoML builders. The teams prepared their own systems and let 41 end-users use the systems. Using our methodology, we visualized end-users' behavioral and attitudinal data and distributed the results to the teams. We analyzed the results in two directions: what types of novel insights the AutoML builders learned from end-users, and (2) how the evaluation methodology helped the builders to understand workflows and the effectiveness of their systems. Our findings suggest new insights explaining future design opportunities in the AutoML domain as well as how using our methodology helped the builders to determine insights and let them draw concrete directions for improving their systems.
HCMay 1, 2020
PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML PipelinesJorge Piazentin Ono, Sonia Castelo, Roque Lopez et al.
In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to search and generate end-to-end learning pipelines. While these techniques facilitate the creation of models for real-world applications, given their black-box nature, the complexity of the underlying algorithms, and the large number of pipelines they derive, it is difficult for their developers to debug these systems. It is also challenging for machine learning experts to select an AutoML system that is well suited for a given problem or class of problems. In this paper, we present the PipelineProfiler, an interactive visualization tool that allows the exploration and comparison of the solution space of machine learning (ML) pipelines produced by AutoML systems. PipelineProfiler is integrated with Jupyter Notebook and can be used together with common data science tools to enable a rich set of analyses of the ML pipelines and provide insights about the algorithms that generated them. We demonstrate the utility of our tool through several use cases where PipelineProfiler is used to better understand and improve a real-world AutoML system. Furthermore, we validate our approach by presenting a detailed analysis of a think-aloud experiment with six data scientists who develop and evaluate AutoML tools.
LGFeb 11, 2020
Debugging Machine Learning PipelinesRaoni Lourenço, Juliana Freire, Dennis Shasha
Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time-consuming and error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.
LGOct 8, 2019
AutoML using Metadata Language EmbeddingsIddo Drori, Lu Liu, Yi Nian et al.
As a human choosing a supervised learning algorithm, it is natural to begin by reading a text description of the dataset and documentation for the algorithms you might use. We demonstrate that the same idea improves the performance of automated machine learning methods. We use language embeddings from modern NLP to improve state-of-the-art AutoML systems by augmenting their recommendations with vector embeddings of datasets and of algorithms. We use these embeddings in a neural architecture to learn the distance between best-performing pipelines. The resulting (meta-)AutoML framework improves on the performance of existing AutoML frameworks. Our zero-shot AutoML system using dataset metadata embeddings provides good solutions instantaneously, running in under one second of computation. Performance is competitive with AutoML systems OBOE, AutoSklearn, AlphaD3M, and TPOT when each framework is allocated a minute of computation. We make our data, models, and code publicly available.
LGJul 5, 2019
Visus: An Interactive System for Automatic Machine Learning Model Building and CurationAécio Santos, Sonia Castelo, Cristian Felix et al.
While the demand for machine learning (ML) applications is booming, there is a scarcity of data scientists capable of building such models. Automatic machine learning (AutoML) approaches have been proposed that help with this problem by synthesizing end-to-end ML data processing pipelines. However, these follow a best-effort approach and a user in the loop is necessary to curate and refine the derived pipelines. Since domain experts often have little or no expertise in machine learning, easy-to-use interactive interfaces that guide them throughout the model building process are necessary. In this paper, we present Visus, a system designed to support the model building process and curation of ML data processing pipelines generated by AutoML systems. We describe the framework used to ground our design choices and a usage scenario enabled by Visus. Finally, we discuss the feedback received in user testing sessions with domain experts.
LGMay 24, 2019
Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a GrammarIddo Drori, Yamuna Krishnamurthy, Raoni Lourenco et al.
Automatic machine learning is an important problem in the forefront of machine learning. The strongest AutoML systems are based on neural networks, evolutionary algorithms, and Bayesian optimization. Recently AlphaD3M reached state-of-the-art results with an order of magnitude speedup using reinforcement learning with self-play. In this work we extend AlphaD3M by using a pipeline grammar and a pre-trained model which generalizes from many different datasets and similar tasks. Our results demonstrate improved performance compared with our earlier work and existing methods on AutoML benchmark datasets for classification and regression tasks. In the spirit of reproducible research we make our data, models, and code publicly available.
CLMay 2, 2019
A Topic-Agnostic Approach for Identifying Fake News PagesSonia Castelo, Thais Almeida, Anas Elghafari et al.
Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how they are propagated, and how to counter their effect, it is necessary to first identify them. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change and thus, a classifier trained using content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time.
IRFeb 25, 2019
Bootstrapping Domain-Specific Content Discovery on the WebKien Pham, Aécio Santos, Juliana Freire
The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest $D$, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in $D$, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
SEAug 4, 2018
ReproServer: Making Reproducibility Easier and Less IntensiveRemi Rampin, Fernando Chirigati, Vicky Steeves et al.
Reproducibility in the computational sciences has been stymied because of the complex and rapidly changing computational environments in which modern research takes place. While many will espouse reproducibility as a value, the challenge of making it happen (both for themselves and testing the reproducibility of others' work) often outweigh the benefits. There have been a few reproducibility solutions designed and implemented by the community. In particular, the authors are contributors to ReproZip, a tool to enable computational reproducibility by tracing and bundling together research in the environment in which it takes place (e.g. one's computer or server). In this white paper, we introduce a tool for unpacking ReproZip bundles in the cloud, ReproServer. ReproServer takes an uploaded ReproZip bundle (.rpz file) or a link to a ReproZip bundle, and users can then unpack them in the cloud via their browser, allowing them to reproduce colleagues' work without having to install anything locally. This will help lower the barrier to reproducing others' work, which will aid reviewers in verifying the claims made in papers and reusing previously published research.
HCJan 22, 2016
RioBusData: Outlier Detection in Bus Routes of Rio de JaneiroAline Bessa, Fernando de Mesentier Silva, Rodrigo Frassetto Nogueira et al.
Buses are the primary means of public transportation in the city of Rio de Janeiro, carrying around 100 million passengers every month. Recently, real-time GPS coordinates of all operating public buses has been made publicly available - roughly 1 million GPS entries each captured each day. In an initial study, we observed that a substantial number of buses follow trajectories that do not follow the expected behavior. In this paper, we present RioBusData, a tool that helps users identify and explore, through different visualizations, the behavior of outlier trajectories. We describe how the system automatically detects these outliers using a Convolutional Neural Network (CNN) and we also discuss a series of case studies which show how RioBusData helps users better understand not only the flow and service of outlier buses but also the bus system as a whole.
SEFeb 9, 2015
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from ScriptsTimothy McPhillips, Tianhong Song, Tyler Kolisnik et al.
Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.