CL · Nov 29, 2022
Improving astroBERT using Semantic Textual Similarity
Felix Grezes, Thomas Allen, Sergi Blanco-Cuaresma et al. · cambridge, harvard
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we:
- announce the first public release of the astroBERT language model;
- show how astroBERT improves over existing public language models on astrophysics specific tasks;
- and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
CL · Dec 21, 2023
Experimenting with Large Language Models and vector embeddings in NASA SciX
Sergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi et al. · cambridge, harvard
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think outside the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
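The retrieval step described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the sample chunks, and especially the bag-of-words "embedding", which only stands in for the semantic vectors the authors mention. The abstract does not say which embedding model, vector store, or prompt wording was used.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic vector: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt that supplies retrieved chunks as context."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical chunks; a real system would index abstracts and full text.
chunks = [
    "astroBERT is a language model trained on astronomy papers in ADS.",
    "Scrum is an Agile development methodology based on roles and events.",
    "Download logs can be used to evaluate document similarity models.",
]
prompt = build_prompt("What is astroBERT trained on?", chunks, k=1)
print(prompt)
```

The prompt would then be sent to a language model; that call is omitted because the abstract does not name the model or its interface.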
IM · Mar 20
Astrophysics Research Organizations in the 21st Century: Database and Comparative Dashboards
Michael J. Kurtz, Carolyn S. Grant, Matthew R. Templeton et al.
As many research papers in astronomy have been written since the beginning of the 21st century as had been written previously. This exponential growth has been accompanied by substantial changes in the structure of astrophysics research, in which organizations perform it, and in where they are located. Using data from the Smithsonian/NASA Astrophysics Data System/Science Explorer (ADS/SciX), we have obtained a set of metrics based on article counts and citations as a function of the institutional affiliation of the first author; nearly every organization which has produced recent astronomy research is included. We use these data to examine changes in where astronomy research is being done. We demonstrate how to create custom rankings for the organizations. We develop a dashboard of key performance indicators (KPI) to examine the relative and absolute changes in the research performance for each of the 1949 organizations which have produced at least one first-authored, refereed astronomy journal article since 1997. We also present KPI dashboards for 65 countries and three regions.
HC · Feb 1, 2022
Web accessibility trends and implementation in dynamic web applications
Timothy W. Hostetler, Shinyi Chen, Sergi Blanco-Cuaresma et al.
The NASA Astrophysics Data System (ADS), a critical research service for the astrophysics community, strives to provide the most accessible and inclusive environment for the discovery and exploration of the astronomical literature. Part of this goal involves creating a digital platform that can accommodate everybody, including those with disabilities that would benefit from alternative ways to present the information provided by the website. NASA ADS follows the official Web Content Accessibility Guidelines (WCAG) standard for ensuring accessibility of all its applications, striving to exceed this standard where possible. Through the use of both internal audits and external expert review based on these guidelines, we have identified many areas for improving accessibility in our current web application, and have implemented a number of updates to the UI as a result of this. We present an overview of some current web accessibility trends, discuss our experience incorporating these trends in our web application, and discuss the lessons learned and recommendations for future projects.
CL · Dec 1, 2021
Building astroBERT, a language model for Astronomy & Astrophysics
Felix Grezes, Sergi Blanco-Cuaresma, Alberto Accomazzi et al.
The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
SE · Sep 10, 2020
Agile methodologies in teams with highly creative and autonomous members
Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz et al.
The Agile manifesto encourages us to value individuals and interactions over processes and tools, while Scrum, the most adopted Agile development methodology, is essentially based on roles, events, artifacts, and the rules that bind them together (i.e., processes). Moreover, it is generally proclaimed that whenever a Scrum project does not succeed, the reason is that Scrum was not implemented correctly and not that Scrum may have its own flaws. This grants irrefutability to the methodology, discouraging deviations to fit the actual needs and peculiarities of the developers. In particular, the members of the NASA ADS team are highly creative and autonomous individuals whose motivation can be affected if their freedom is too strongly constrained. We present our experience following Agile principles, reusing certain Scrum elements and seeking the satisfaction of the team members, while rapidly reacting and keeping the project in line with our stakeholders' expectations.
AI · Jan 2, 2018
Advice from the Oracle: Really Intelligent Information Retrieval
Michael J. Kurtz
What is "intelligent" information retrieval? Essentially, this is asking what intelligence is. In this article I will attempt to show some of the aspects of human intelligence as related to information retrieval. I will do this by the device of a semi-imaginary Oracle. Every Observatory has an oracle, someone who is a distinguished scientist, has great administrative responsibilities, acts as mentor to a number of less senior people and as trusted advisor to even the most accomplished scientists, and knows essentially everyone in the field. In an appendix I will present a brief summary of the Statistical Factor Space method for text indexing and retrieval, and indicate how it will be used in the Astrophysics Data System Abstract Service. 2018 Keywords: Personal Digital Assistant; Supervised Topic Models
ML · Dec 18, 2017
Multilingual Topic Models
Kriste Krstovski, Michael J. Kurtz, David A. Smith et al.
Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.
DL · Jun 7, 2017
Usage Bibliometrics as a Tool to Measure Research Activity
Edwin A. Henneken, Michael J. Kurtz
Measures for research activity and impact have become an integral ingredient in the assessment of a wide range of entities (individual researchers, organizations, instruments, regions, disciplines). Traditional bibliometric indicators, like publication- and citation-based indicators, provide an essential part of this picture, but cannot describe the complete picture. Since reading scholarly publications is an essential part of the research life cycle, it is only natural to introduce measures for this activity in attempts to quantify the efficiency, productivity and impact of an entity. Citations and reads are significantly different signals, so taken together, they provide a more complete picture of research activity. Most scholarly publications are now accessed online, making the study of reads and their patterns possible. Click-stream logs allow us to follow information access by the entire research community in real time. Publication and citation datasets just reflect activity by authors. In addition, download statistics will help us identify publications with significant impact, but which do not attract many citations. Click-stream signals are arguably more complex than, say, citation signals. For one, they are a superposition of different classes of readers. Systematic downloads by crawlers also contaminate the signal, as does browsing behavior. We discuss the complexities associated with clickstream data and how, with proper filtering, statistically significant relations and conclusions can be inferred from download statistics. We describe how download statistics can be used to describe research activity at different levels of aggregation, ranging from organizations to countries. These statistics show a correlation with socio-economic indicators. A comparison will be made with traditional bibliometric indicators. We will argue that astronomy is representative of more general trends.
IR · Jan 7, 2016
Automatic Construction of Evaluation Sets and Evaluation of Document Similarity Models in Large Scholarly Retrieval Systems
Kriste Krstovski, David A. Smith, Michael J. Kurtz
Retrieval systems for scholarly literature offer the ability for the scientific community to search, explore and download scholarly articles across various scientific disciplines. Mostly used by the experts in the particular field, these systems contain user community logs including information on user-specific downloaded articles. In this paper we present a novel approach for automatically evaluating document similarity models in large collections of scholarly publications. Unlike typical evaluation settings that use test collections consisting of query documents and human-annotated relevance judgments, we use download logs to automatically generate a pseudo-relevant set of similar document pairs. More specifically, we show that consecutively downloaded document pairs, extracted from a scholarly information retrieval (IR) system, could be utilized as a test collection for evaluating document similarity models. Another novel aspect of our approach lies in the method that we employ for evaluating the performance of the model by comparing the distribution of consecutively downloaded document pairs and random document pairs in log space. Across two families of similarity models, which represent documents in the term vector and topic spaces, we show that our evaluation approach achieves very high correlation with traditional performance metrics such as Mean Average Precision (MAP), while being more efficient to compute.
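The core idea in this abstract, scoring a similarity model by how well it separates consecutively downloaded pairs from random pairs, can be sketched as follows. This is only an illustration under stated assumptions: the documents, the sessions, the Jaccard similarity, and the use of a difference of mean log similarities as the comparison statistic are all hypothetical choices, since the abstract does not specify the exact statistic the authors use.

```python
import math
import random

# Hypothetical document collection: id -> text.
docs = {
    "a": "galaxy cluster mass lensing",
    "b": "weak lensing galaxy cluster",
    "c": "stellar spectra abundance",
    "d": "stellar abundance spectra survey",
}
# Hypothetical download sessions: ordered lists of document ids per user.
sessions = [["a", "b"], ["c", "d"], ["b", "a"]]


def similarity(x, y):
    """Toy similarity model: Jaccard overlap of the two term sets."""
    tx, ty = set(docs[x].split()), set(docs[y].split())
    return len(tx & ty) / len(tx | ty)


def consecutive_pairs(sessions):
    """Pairs of documents downloaded one right after the other."""
    return [(s[i], s[i + 1]) for s in sessions for i in range(len(s) - 1)]


def random_pairs(n, seed=0):
    """Baseline: n pairs of distinct documents drawn uniformly at random."""
    rng = random.Random(seed)
    ids = sorted(docs)
    return [tuple(rng.sample(ids, 2)) for _ in range(n)]


def mean_log_similarity(pairs, eps=1e-6):
    """Average similarity in log space; eps avoids log(0)."""
    return sum(math.log(similarity(x, y) + eps) for x, y in pairs) / len(pairs)


downloaded = consecutive_pairs(sessions)
baseline = random_pairs(200)
# Assumed statistic: a larger gap means the model separates the two sets better.
separation = mean_log_similarity(downloaded) - mean_log_similarity(baseline)
print(f"separation in log space: {separation:.2f}")
```

Under this sketch, a better similarity model would be expected to yield a larger separation, with no human relevance judgments required.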
IR · Sep 6, 2012
Finding and Recommending Scholarly Articles
Michael J. Kurtz, Edwin A. Henneken
The rate at which scholarly literature is being produced has been increasing at approximately 3.5 percent per year for decades. This means that during a typical 40 year career the amount of new literature produced each year increases by a factor of four. The methods scholars use to discover relevant literature must change. Just like everybody else involved in information discovery, scholars are confronted with information overload. Two decades ago, this discovery process essentially consisted of paging through abstract books, talking to colleagues and librarians, and browsing journals. This was a time-consuming process, which could take even longer if material had to be shipped from elsewhere. Now much of this discovery process is mediated by online scholarly information systems. All these systems are relatively new, and all are still changing. They all share a common goal: to provide their users with access to the literature relevant to their specific needs. To achieve this, each system responds to actions by the user by displaying articles which the system judges relevant to the user's current needs. Recently, search systems which use particularly sophisticated methodologies to recommend a few specific papers to the user have been called "recommender systems". These methods are in line with the current use of the term "recommender system" in computer science. We do not adopt this definition; rather, we view systems like these as components in a larger whole, which is presented by the scholarly information systems themselves. In what follows we view the recommender system as an aspect of the entire information system; one which combines the massive memory capacities of the machine with the cognitive abilities of the human user to achieve a human-machine synergy.
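The "factor of four" figure in this abstract follows from compound growth, and a short check makes the arithmetic explicit. The variable names are illustrative, and the doubling-time line is an added derived quantity, not a number stated in the abstract.

```python
import math

annual_growth = 0.035  # 3.5 percent per year, as stated in the abstract
career_years = 40      # typical career length, as stated in the abstract

# Compound growth: yearly output is multiplied by (1 + rate) each year.
factor = (1 + annual_growth) ** career_years

# Derived quantity (not in the abstract): years needed for output to double.
doubling_time = math.log(2) / math.log(1 + annual_growth)

print(f"growth factor over {career_years} years: {factor:.2f}")
print(f"doubling time: {doubling_time:.1f} years")
```

The factor comes out just under four, consistent with the abstract's rounding, and corresponds to two doublings of roughly twenty years each.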
DL · Sep 1, 2012
A History of Cluster Analysis Using the Classification Society's Bibliography Over Four Decades
Fionn Murtagh, Michael J. Kurtz
The Classification Literature Automated Search Service, an annual bibliography based on citation of one or more of a set of around 80 book or journal publications, ran from 1972 to 2012. We analyze here the years 1994 to 2011. The Classification Society's Service, as it was termed, has been produced by the Classification Society. In earlier decades it was distributed as a diskette or CD with the Journal of Classification. Among our findings are the following: an enormous increase in scholarly production post approximately 2000; a very major increase in quantity, coupled with work in different disciplines, from approximately 2004; and a major shift also from cluster analysis in earlier times having mathematics and psychology as disciplines of the journals published in, and affiliations of authors, contrasted with, in more recent times, a "centre of gravity" in management and engineering.