Rodrigo Costas

DL
3 papers
32 citations
Novelty 40%
AI Score 36

3 Papers

36.9 DL Mar 22
Use of diverse data sources to control which topics emerge in a science map

Juan Pablo Bascur, Rodrigo Costas, Suzan Verberne

Traditional science maps visualize topics by clustering documents within a network, but they are inherently biased toward clustering certain topics over others. If these topics could be chosen, then the science maps could be tailored for different needs. In this paper, we explore the extent to which the topic bias of a science map can be changed by choosing different data sources to build the document network. We analyze this by evaluating the clustering effectiveness of several topic categories over two sources that are traditionally used for the creation of science maps (citations and text similarity) and six non-traditional data sources, which we found favor different kinds of topics: health issues for Facebook users, biotechnology topics for patent families, government and social issues for policy documents, food topics for Twitter conversations, nursing topics for Twitter users, and geographical entities for document authors (the favoring in this latter source was particularly strong). Our results show that diverse data sources can be used to control topic bias, which opens up the possibility of creating science maps tailored for different needs.
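As a rough illustration of what comparing data sources on topic clustering could look like, the sketch below scores how concentrated a topic's documents are in a single cluster under two different clusterings. This is not the paper's effectiveness measure, which the abstract does not specify; the function name `topic_concentration`, the document IDs, and the cluster assignments are all hypothetical.

```python
from collections import Counter

def topic_concentration(cluster_of, topic_docs):
    """Share of a topic's documents that fall in its single largest cluster.

    cluster_of: dict mapping document ID -> cluster label (hypothetical input).
    topic_docs: iterable of document IDs belonging to the topic.
    Returns a value in [0, 1]; 1.0 means the topic sits entirely in one cluster.
    """
    counts = Counter(cluster_of[d] for d in topic_docs if d in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.most_common(1)[0][1] / total

# Hypothetical clusterings of the same four documents built from two data sources.
citation_clusters = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
author_clusters = {"d1": 0, "d2": 1, "d3": 1, "d4": 1}

# Hypothetical topic, e.g. documents about one geographical entity.
topic = ["d2", "d3", "d4"]

citation_score = topic_concentration(citation_clusters, topic)
author_score = topic_concentration(author_clusters, topic)
```

In this toy input the author-based clustering keeps the topic together while the citation-based one splits it, which is the kind of difference between sources the abstract describes.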

LG Dec 4, 2020
Unsupervised embedding of trajectories captures the latent structure of scientific migration

Dakota Murray, Jisung Yoon, Sadamori Kojaku et al.

Human migration and mobility drive major societal phenomena including epidemics, economies, innovation, and the diffusion of ideas. Although human mobility and migration have been heavily constrained by geographic distance throughout history, advances and globalization are making other factors such as language and culture increasingly important. Advances in neural embedding models, originally designed for natural language, provide an opportunity to tame this complexity and open new avenues for the study of migration. Here, we demonstrate the ability of the model word2vec to encode nuanced relationships between discrete locations from migration trajectories, producing an accurate, dense, continuous, and meaningful vector-space representation. The resulting representation provides a functional distance between locations, as well as a digital double that can be distributed, re-used, and itself interrogated to understand the many dimensions of migration. We show that the unique power of word2vec to encode migration patterns stems from its mathematical equivalence with the gravity model of mobility. Focusing on the case of scientific migration, we apply word2vec to a database of three million migration trajectories of scientists derived from the affiliations listed on their publication records. Using techniques that leverage its semantic structure, we demonstrate that embeddings can learn the rich structure that underpins scientific migration, such as cultural, linguistic, and prestige relationships at multiple levels of granularity. Our results provide a theoretical foundation and methodological framework for using neural embeddings to represent and understand migration both within and beyond science.
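To make the idea of treating trajectories like sentences concrete, the sketch below generates the (target, context) pairs that a skip-gram word2vec model trains on, using one trajectory of affiliations as the "sentence". It only illustrates the input side of skip-gram; the window size, the function name `skipgram_pairs`, and the location names are assumptions, and the authors' actual preprocessing is not described in the abstract.

```python
def skipgram_pairs(trajectory, window=1):
    """Return (target, context) pairs for each location in a trajectory.

    Each location is paired with the locations at most `window` steps
    before or after it, as in skip-gram training on word sequences.
    """
    pairs = []
    for i, target in enumerate(trajectory):
        lo = max(0, i - window)
        hi = min(len(trajectory), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, trajectory[j]))
    return pairs

# Hypothetical career trajectory of one scientist, in chronological order.
trajectory = ["Leiden", "Boston", "Seoul"]
pairs = skipgram_pairs(trajectory, window=1)
```

Locations that often appear near each other in trajectories generate many shared pairs, which is what pulls their vectors together during training.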

DL Jan 22, 2013
"Seed+Expand": A validated methodology for creating high quality publication oeuvres of individual researchers

Linda Reijnhoudt, Rodrigo Costas, Ed Noyons et al.

The study of science at the individual micro-level frequently requires the disambiguation of author names. The creation of authors' publication oeuvres involves matching the list of unique author names to names used in publication databases. Despite recent progress in the development of unique author identifiers, e.g., ORCID, VIVO, or DAI, author disambiguation remains a key problem when it comes to large-scale bibliometric analysis using data from multiple databases. This study introduces and validates a new methodology called seed+expand for semi-automatic bibliographic data collection for a given set of individual authors. Specifically, we identify the oeuvre of a set of Dutch full professors during the period 1980-2011. In particular, we combine author records from the National Research Information System (NARCIS) with publication records from the Web of Science. Starting with an initial list of 8,378 names, we identify "seed publications" for each author using five different approaches. Subsequently, we "expand" the set of publications using three different approaches. The different approaches are compared and the resulting oeuvres are evaluated on precision and recall using a "gold standard" dataset of authors for whom verified publications in the period 2001-2010 are available.
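The precision and recall evaluation mentioned above can be sketched for a single author as a set comparison between the retrieved oeuvre and the verified one. This is a minimal illustration using the standard definitions of the two measures; the publication IDs and the function name `precision_recall` are hypothetical, and the study's own aggregation across authors is not shown.

```python
def precision_recall(retrieved, gold):
    """Precision and recall of a retrieved publication set against a verified set."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    # Precision: share of retrieved publications that are really the author's.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of the author's verified publications that were retrieved.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical oeuvre produced by a seed+expand run, and the verified list.
retrieved = {"p1", "p2", "p3", "p4"}
gold = {"p2", "p3", "p4", "p5", "p6"}

precision, recall = precision_recall(retrieved, gold)
```

Here three of the four retrieved publications are correct (precision 0.75) and three of the five verified publications were found (recall 0.6), showing how an expansion step can trade one measure against the other.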