DLJul 24, 2023
BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and WorkshopsParis Koloveas, Serafeim Chatzopoulos, Christos Tryfonopoulos et al.
In the field of Computer Science, conference and workshop papers serve as important contributions, carrying substantial weight in research assessment processes, compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI), hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, raising limitations to citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance the research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus, and by performing text analysis, it extracts citation information directly from their full text. The current version of the dataset contains more than 510K citations made by approximately 60K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.
CRSep 14, 2021Code
A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat IntelligenceParis Koloveas, Thanasis Chantzios, Christos Tryfonopoulos et al.
The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that -given the appropriate tools and methods-may be identified, crawled and subsequently leveraged to actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
CLFeb 20, 2025
Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMsParis Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis et al.
This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.
AISep 24, 2025
InsightGUIDE: An Opinionated AI Assistant for Guided Critical Reading of Scientific LiteratureParis Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis et al.
The proliferation of scientific literature presents an increasingly significant challenge for researchers. While Large Language Models (LLMs) offer promise, existing tools often provide verbose summaries that risk replacing, rather than assisting, the reading of the source material. This paper introduces InsightGUIDE, a novel AI-powered tool designed to function as a reading assistant, not a replacement. Our system provides concise, structured insights that act as a "map" to a paper's key elements by embedding an expert's reading methodology directly into its core AI logic. We present the system's architecture, its prompt-driven methodology, and a qualitative case study comparing its output to a general-purpose LLM. The results demonstrate that InsightGUIDE produces more structured and actionable guidance, serving as a more effective tool for the modern researcher.
DLAug 5, 2025
Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked PapersParis Koloveas, Serafeim Chatzopoulos, Dionysis Diamantis et al.
The growing volume of scientific literature makes it challenging for scientists to move from a list of papers to a synthesized understanding of a topic. Because of the constant influx of new papers on a daily basis, even if a scientist identifies a promising set of papers, they still face the tedious task of individually reading through dozens of titles and abstracts to make sense of occasionally conflicting findings. To address this critical bottleneck in the research workflow, we introduce a summarization feature to BIP! Finder, a scholarly search engine that ranks literature based on distinct impact aspects like popularity and influence. Our approach enables users to generate two types of summaries from top-ranked search results: a concise summary for an instantaneous at-a-glance comprehension and a more comprehensive literature review-style summary for greater, better-organized comprehension. This ability dynamically leverages BIP! Finder's already existing impact-based ranking and filtering features to generate context-sensitive, synthesized narratives that can significantly accelerate literature discovery and comprehension.
DBJan 11, 2022
Atrapos: Real-time Evaluation of Metapath Query WorkloadsSerafeim Chatzopoulos, Thanasis Vergoulis, Dimitrios Skoutas et al.
Heterogeneous information networks (HINs) represent different types of entities and relationships between them. Exploring, analysing, and extracting knowledge from such networks relies on metapath queries that identify pairs of entities connected by relationships of diverse semantics. While the real-time evaluation of metapath query workloads on large, web-scale HINs is highly demanding in computational cost, current approaches do not exploit interrelationships among the queries. In this paper, we present ATRAPOS, a new approach for the real-time evaluation of metapath query workloads that leverages a combination of efficient sparse matrix multiplication and intermediate result caching. ATRAPOS selects intermediate results to cache and reuse by detecting frequent sub-metapaths among workload queries in real time, using a tailor-made data structure, the Overlap Tree, and an associated caching policy. Our experimental study on real data shows that ATRAPOS accelerates exploratory data analysis and mining on HINs, outperforming off-the-shelf caching approaches and state-of-the-art research prototypes in all examined scenarios. -- Note that this version of our work is more extended than the one presented in TheWebConf 2023 (doi: 10.1145/3543507.3583322)
CVNov 18, 2021
Benchmarking and scaling of deep learning models for land cover image classificationIoannis Papoutsis, Nikolaos-Ioannis Bountos, Angelos Zavras et al.
The availability of the sheer volume of Copernicus Sentinel-2 imagery has created new opportunities for exploiting deep learning (DL) methods for land use land cover (LULC) image classification. However, an extensive set of benchmark experiments is currently lacking, i.e. DL models tested on the same dataset, with a common and consistent set of metrics, and in the same hardware. In this work, we use the BigEarthNet Sentinel-2 dataset to benchmark for the first time different state-of-the-art DL models for the multi-label, multi-class LULC image classification problem, contributing with an exhaustive zoo of 60 trained models. Our benchmark includes standard CNNs, as well as non-convolutional methods. We put to the test EfficientNets and Wide Residual Networks (WRN) architectures, and leverage classification accuracy, training time and inference rate. Furthermore, we propose to use the EfficientNet framework for the compound scaling of a lightweight WRN. Enhanced with an Efficient Channel Attention mechanism, our scaled lightweight model emerged as the new state-of-the-art. It achieves 4.5% higher averaged F-Score classification accuracy for all 19 LULC classes compared to a standard ResNet50 baseline model, with an order of magnitude less trainable parameters. We provide access to all trained models, along with our code for distributed training on multiple GPU nodes. This model zoo of pre-trained encoders can be used for transfer learning and rapid prototyping in different remote sensing tasks that use Sentinel-2 data, instead of exploiting backbone models trained with data from a different domain, e.g., from ImageNet. We validate their suitability for transfer learning in different datasets of diverse volumes. Our top-performing WRN achieves state-of-the-art performance (71.1% F-Score) on the SEN12MS dataset while being exposed to only a small fraction of the training dataset.
CRSep 9, 2021
Social Media Monitoring for IoT Cyber-ThreatsSofia Alevizopoulou, Paris Koloveas, Christos Tryfonopoulos et al.
The rapid development of IoT applications and their use in various fields of everyday life has resulted in an escalated number of different possible cyber-threats, and has consequently raised the need of securing IoT devices. Collecting Cyber-Threat Intelligence (e.g., zero-day vulnerabilities or trending exploits) from various online sources and utilizing it to proactively secure IoT systems or prepare mitigation scenarios has proven to be a promising direction. In this work, we focus on social media monitoring and investigate real-time Cyber-Threat Intelligence detection from the Twitter stream. Initially, we compare and extensively evaluate six different machine-learning based classification alternatives trained with vulnerability descriptions and tested with real-world data from the Twitter stream to identify the best-fitting solution. Subsequently, based on our findings, we propose a novel social media monitoring system tailored to the IoT domain; the system allows users to identify recent/trending vulnerabilities and exploits on IoT devices. Finally, to aid research on the field and support the reproducibility of our results we publicly release all annotated datasets created during this process.
HCJul 8, 2013
OntoFM: A Personal Ontology-based File Manager for the DesktopJenny Rompa, Giorgos Lepouras, Costas Vassilakis et al.
Personal ontologies have been proposed as a means to support the semantic management of user information. Assuming that a personal ontology system is in use, new tools have to be developed at user interface level to exploit the enhanced capabilities offered by the system. In this work, we present an ontology-based file manager that allows semantic searching on the user's personal information space. The file manager exploits the ontology relations to present files associated with specific concepts, proposes new related concepts to users, and helps them explore the information space and locate the required file.
IRJul 8, 2013
Full-text Support for Publish/Subscribe Ontology SystemsLefteris Zervakis, Christos Tryfonopoulos, Antonios Papadakis-Pesaresi et al.
We envision a publish/subscribe ontology system that is able to index millions of user subscriptions and filter them against ontology data that arrive in a streaming fashion. In this work, we propose a SPARQL extension appropriate for a publish/subscribe setting; our extension builds on the natural semantic graph matching of the language and supports the creation of full-text subscriptions. Subsequently, we propose a main-memory subscription indexing algorithm which performs both semantic and full-text matching at low complexity and minimal filtering time. Thus, when ontology data are published matching subscriptions are identified and notifications are forwarded to users.