CLDec 21, 2023Code
Experimenting with Large Language Models and vector embeddings in NASA SciXSergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi et al. · cambridge, harvard
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think out of the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverages this technology while respecting the high level of trust and quality that the project holds.
DLJun 7, 2017
Usage Bibliometrics as a Tool to Measure Research ActivityEdwin A. Henneken, Michael J. Kurtz
Measures for research activity and impact have become an integral ingredient in the assessment of a wide range of entities (individual researchers, organizations, instruments, regions, disciplines). Traditional bibliometric indicators, like publication and citation based indicators, provide an essential part of this picture, but cannot describe the complete picture. Since reading scholarly publications is an essential part of the research life cycle, it is only natural to introduce measures for this activity in attempts to quantify the efficiency, productivity and impact of an entity. Citations and reads are significantly different signals, so taken together, they provide a more complete picture of research activity. Most scholarly publications are now accessed online, making the study of reads and their patterns possible. Click-stream logs allow us to follow information access by the entire research community, real-time. Publication and citation datasets just reflect activity by authors. In addition, download statistics will help us identify publications with significant impact, but which do not attract many citations. Click-stream signals are arguably more complex than, say, citation signals. For one, they are a superposition of different classes of readers. Systematic downloads by crawlers also contaminate the signal, as does browsing behavior. We discuss the complexities associated with clickstream data and how, with proper filtering, statistically significant relations and conclusions can be inferred from download statistics. We describe how download statistics can be used to describe research activity at different levels of aggregation, ranging from organizations to countries. These statistics show a correlation with socio-economic indicators. A comparison will be made with traditional bibliometric indicators. We will argue that astronomy is representative of more general trends.
IRSep 6, 2012
Finding and Recommending Scholarly ArticlesMichael J. Kurtz, Edwin A. Henneken
The rate at which scholarly literature is being produced has been increasing at approximately 3.5 percent per year for decades. This means that during a typical 40 year career the amount of new literature produced each year increases by a factor of four. The methods scholars use to discover relevant literature must change. Just like everybody else involved in information discovery, scholars are confronted with information overload. Two decades ago, this discovery process essentially consisted of paging through abstract books, talking to colleagues and librarians, and browsing journals. A time-consuming process, which could even be longer if material had to be shipped from elsewhere. Now much of this discovery process is mediated by online scholarly information systems. All these systems are relatively new, and all are still changing. They all share a common goal: to provide their users with access to the literature relevant to their specific needs. To achieve this each system responds to actions by the user by displaying articles which the system judges relevant to the user's current needs. Recently search systems which use particularly sophisticated methodologies to recommend a few specific papers to the user have been called "recommender systems". These methods are in line with the current use of the term "recommender system" in computer science. We do not adopt this definition, rather we view systems like these as components in a larger whole, which is presented by the scholarly information systems themselves. In what follows we view the recommender system as an aspect of the entire information system; one which combines the massive memory capacities of the machine with the cognitive abilities of the human user to achieve a human-machine synergy.