CLSep 28, 2023
Augmenting LLMs with Knowledge: A survey on hallucination preventionKonstantinos Andriopoulos, Johan Pouwelse
Large pre-trained language models have demonstrated their proficiency in storing factual knowledge within their parameters and achieving remarkable results when fine-tuned for downstream natural language processing tasks. Nonetheless, their capacity to access and manipulate knowledge with precision remains constrained, resulting in performance disparities on knowledge-intensive tasks when compared to task-specific architectures. Additionally, the challenges of providing provenance for model decisions and maintaining up-to-date world knowledge persist as open research frontiers. To address these limitations, the integration of pre-trained models with differentiable access mechanisms to explicit non-parametric memory emerges as a promising solution. This survey delves into the realm of language models (LMs) augmented with the ability to tap into external knowledge sources, including external knowledge bases and search engines. While adhering to the standard objective of predicting missing tokens, these augmented LMs leverage diverse, possibly non-parametric external modules to augment their contextual processing capabilities, departing from the conventional language modeling paradigm. Through an exploration of current advancements in augmenting large language models with knowledge, this work concludes that this emerging research direction holds the potential to address prevalent issues in traditional LMs, such as hallucinations, un-grounded responses, and scalability challenges.
LGJan 29, 2023
G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P NetworkAndrew Gold, Johan Pouwelse
Ranking algorithms in traditional search engines are powered by enormous training data sets that are meticulously engineered and curated by a centralized entity. Decentralized peer-to-peer (p2p) networks such as torrenting applications and Web3 protocols deliberately eschew centralized databases and computational architectures when designing services and features. As such, robust search-and-rank algorithms designed for such domains must be engineered specifically for decentralized networks, and must be lightweight enough to operate on consumer-grade personal devices such as a smartphone or laptop computer. We introduce G-Rank, an unsupervised ranking algorithm designed exclusively for decentralized networks. We demonstrate that accurate, relevant ranking results can be achieved in fully decentralized networks without any centralized data aggregation, feature engineering, or model training. Furthermore, we show that such results are obtainable with minimal data preprocessing and computational overhead, and can still return highly relevant results even when a user's device is disconnected from the network. G-Rank is highly modular in design, is not limited to categorical data, and can be implemented in a variety of domains with minimal modification. The results herein show that unsupervised ranking models designed for decentralized p2p networks are not only viable, but worthy of further research.
DCJun 26, 2023
Towards Sybil Resilience in Decentralized LearningThomas Werthenbach, Johan Pouwelse
Federated learning is a privacy-enforcing machine learning technology but suffers from limited scalability. This limitation mostly originates from the internet connection and memory capacity of the central parameter server, and the complexity of the model aggregation function. Decentralized learning has recently been emerging as a promising alternative to federated learning. This novel technology eliminates the need for a central parameter server by decentralizing the model aggregation across all participating nodes. Numerous studies have been conducted on improving the resilience of federated learning against poisoning and Sybil attacks, whereas the resilience of decentralized learning remains largely unstudied. This research gap serves as the main motivator for this study, in which our objective is to improve the Sybil poisoning resilience of decentralized learning. We present SybilWall, an innovative algorithm focused on increasing the resilience of decentralized learning against targeted Sybil poisoning attacks. By combining a Sybil-resistant aggregation function based on similarity between Sybils with a novel probabilistic gossiping mechanism, we establish a new benchmark for scalable, Sybil-resilient decentralized learning. A comprehensive empirical evaluation demonstrated that SybilWall outperforms existing state-of-the-art solutions designed for federated learning scenarios and is the only algorithm to obtain consistent accuracy over a range of adversarial attack scenarios. We also found SybilWall to diminish the utility of creating many Sybils, as our evaluations demonstrate a higher success rate among adversaries employing fewer Sybils. Finally, we suggest a number of possible improvements to SybilWall and highlight promising future research directions.
CRFeb 5, 2015Code
A Self-Compiling Android Data Obfuscation ToolOlivier Hokke, Alex Kolpa, Joris van den Oever et al.
Smartphones are becoming more significant in storing and transferring data. However, techniques ensuring this data is not compromised after a confiscation of the device are not readily available. DroidStealth is an open source Android application which combines data encryption and application obfuscation techniques to provide users with a way to securely hide content on their smartphones. This includes hiding the application's default launch methods and providing methods such as dial-to-launch or invisible launch buttons. A novel technique provided by DroidStealth is the ability to transform its appearance to be able to hide in plain sight on devices. To achieve this, it uses self-compilation, without requiring any special permissions. This Two-Layer protection aims to protect the user and its data from casual search in various situations.
IRApr 18, 2024
De-DSI: Decentralised Differentiable Search IndexPetru Neague, Marcel Gregoriadis, Johan Pouwelse
This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach not only maintains accuracy by reducing the number of data each model needs to handle but also facilitates scalability by aggregating outcomes from multiple models. This aggregation uses a beam search to identify top docids and applies a softmax function for score normalization, selecting documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit of the possibility of distributing computational complexity across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.
IRAug 17, 2025
A Large-Scale Web Search Dataset for Federated Online Learning to RankMarcel Gregoriadis, Jingwei Kang, Johan Pouwelse
The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynamics and undermines the realism of experimental results. We present AOL4FOLTR, a large-scale web search dataset with 2.6 million queries from 10,000 users. Our dataset addresses key limitations of existing benchmarks by including user identifiers, real click data, and query timestamps, enabling realistic user partitioning, behavior modeling, and asynchronous federated learning scenarios.
DCOct 21, 2021
Bristle: Decentralized Federated Learning in Byzantine, Non-i.i.d. EnvironmentsJoost Verbraeken, Martijn de Vos, Johan Pouwelse
Federated learning (FL) is a privacy-friendly type of machine learning where devices locally train a model on their private data and typically communicate model updates with a server. In decentralized FL (DFL), peers communicate model updates with each other instead. However, DFL is challenging since (1) the training data possessed by different peers is often non-i.i.d. (i.e., distributed differently between the peers) and (2) malicious, or Byzantine, attackers can share arbitrary model updates with other peers to subvert the training process. We address these two challenges and present Bristle, middleware between the learning application and the decentralized network layer. Bristle leverages transfer learning to predetermine and freeze the non-output layers of a neural network, significantly speeding up model training and lowering communication costs. To securely update the output layer with model updates from other peers, we design a fast distance-based prioritizer and a novel performance-based integrator. Their combined effect results in high resilience to Byzantine attackers and the ability to handle non-i.i.d. classes. We empirically show that Bristle converges to a consistent 95% accuracy in Byzantine environments, outperforming all evaluated baselines. In non-Byzantine environments, Bristle requires 83% fewer iterations to achieve 90% accuracy compared to state-of-the-art methods. We show that when the training classes are non-i.i.d., Bristle significantly outperforms the accuracy of the most Byzantine-resilient baselines by 2.3x while reducing communication costs by 90%.
CRApr 6, 2021
ASTANA: Practical String Deobfuscation for Android Applications Using Program SlicingMartijn de Vos, Johan Pouwelse
Software obfuscation is widely used by Android developers to protect the source code of their applications against adversarial reverse-engineering efforts. A specific type of obfuscation, string obfuscation, transforms the content of all string literals in the source code to non-interpretable text and inserts logic to deobfuscate these string literals at runtime. In this work, we demonstrate that string obfuscation is easily reversible. We present ASTANA, a practical tool for Android applications to recovers the human-readable content from obfuscated string literals. ASTANA makes minimal assumptions about the obfuscation logic or application structure. The key idea is to execute the deobfuscation logic for a specific (obfuscated) string literal, which yields the original string value. To obtain the relevant deobfuscation logic, we present a lightweight and optimistic algorithm, based on program slicing techniques. By an experimental evaluation with 100 popular real-world financial applications, we demonstrate the practicality of ASTANA. We verify the correctness of our deobfuscation tool and provide insights in the behaviour of string obfuscators applied by the developers of the evaluated Android applications.
CRJul 1, 2020
A Truly Self-Sovereign Identity SystemQuinten Stokkink, Georgy Ishmaev, Dick Epema et al.
Existing digital identity management systems fail to deliver the desirable properties of control by the users of their own identity data, credibility of disclosed identity data, and network-level anonymity. The recently proposed Self-Sovereign Identity (SSI) approach promises to give users these properties. However, we argue that without addressing privacy at the network level, SSI systems cannot deliver on this promise. In this paper we present the design and analysis of our solution TCID, created in collaboration with the Dutch government. TCID is a system consisting of a set of components that together satisfy seven functional requirements to guarantee the desirable system properties. We show that the latency incurred by network-level anonymization in TCID is significantly larger than that of identity data disclosure protocols but is still low enough for practical situations. We conclude that current research on SSI is too narrowly focused on these data disclosure protocols.
DCJun 5, 2018
Deployment of a Blockchain-Based Self-Sovereign IdentityQuinten Stokkink, Johan Pouwelse
Digital identity is unsolved: after many years of research there is still no trusted communication over the Internet. To provide identity within the context of mutual distrust, this paper presents a blockchain-based digital identity solution. Without depending upon a single trusted third party, the proposed solution achieves passport-level legally valid identity. This solution for making identities Self-Sovereign, builds on a generic provable claim model for which attestations of truth from third parties need to be collected. The claim model is then shown to be both blockchain structure and proof method agnostic. Four different implementations in support of these two claim model properties are shown to offer sub-second performance for claim creation and claim verification. Through the properties of Self-Sovereign Identity, legally valid status and acceptable performance, our solution is considered to be fit for adoption by the general public.
CYNov 2, 2015
Autonomous smartphone apps: self-compilation, mutation, and viral spreadingPaul Brussee, Johan Pouwelse
We present the first smart phone tool that is capable of self-compilation, mutation and viral spreading. Our autonomous app does not require a host computer to alter its functionality, change its appearance and lacks the normal necessity of a central app store to spread among hosts. We pioneered survival skills for mobile software in order to overcome disrupted Internet access due to natural disasters and human made interference, like Internet kill switches or censored networks. Internet kill switches have proven to be an effective tool to eradicate open Internet access and all forms of digital communication within an hour on a country-wide basis. We present the first operational tool that is capable of surviving such digital eradication.
DCJul 1, 2015
Performance analysis of a Tor-like onion routing implementationQuinten Stokkink, Harmjan Treep, Johan Pouwelse
The current onion routing implementation of Tribler works as expected but throttles the overall throughput of the Tribler system. This article discusses a measuring procedure to reproducibly profile the tunnel implementation so further optimizations of the tunnel community can be made. Our work has been integrated into the Tribler eco-system.
CRMay 27, 2015
Anonymous online purchases with exhaustive operational securityVincent Van Mieghem, Johan Pouwelse
This paper describes the process of remaining anonymous online and its concurrent operational security that has to be performed. It focusses particularly on remaining anonymous while purchasing online goods, resulting in anonymously bought items. Different aspects of the operational security process as well as anonymously funding with cryptocurrencies are described. Eventually it is shown how to anonymously purchase items and services from the hidden web, as well as the delivery. It is shown that, while becoming increasingly difficult, it is still possible to make anonymous purchases. Our presented work combines existing best-practices and deliberately avoids untested novel approaches when possible.
CRJun 20, 2014
Operational Distributed Regulation for BitcoinDinesh, Erlich, Gilfoyle et al.
On February 2014, $650.000.000 worth of Bitcoins disappeared. Currently it is unclear whether hackers or MtGox, the largest Bitcoin exchange, are to be blamed. In either case, the anonymous and unregulated nature of the Bitcoin system makes it practically impossible for innocent victims to get their money back. We have investigated the technical possibilities, solutions and implications of introducing a regulatory framework based on redlisting Bitcoin accounts. Despite numerous proposals, the Bitcoin community has voiced a strong opinion against any form of regulation. However, most of the discussions were based on speculations rather than facts. We strive to contribute a scientific foundation to these discussions and illuminate the path to crypto-justice.
CYApr 18, 2014
The fifteen year struggle of decentralizing privacy-enhancing technologyRolf Jagerman, Wendo Sabée, Laurens Versluis et al.
Ever since the introduction of the internet, it has been void of any privacy. The majority of internet traffic currently is and always has been unencrypted. A number of anonymous communication overlay networks exist whose aim it is to provide privacy to its users. However, due to the nature of the internet, there is major difficulty in getting these networks to become both decentralized and anonymous. We list reasons for having anonymous networks, discern the problems in achieving decentralization and sum up the biggest initiatives in the field and their current status. To do so, we use one exemplary network, the Tor network. We explain how Tor works, what vulnerabilities this network currently has, and possible attacks that could be used to violate privacy and anonymity. The Tor network is used as a key comparison network in the main part of the report: a tabular overview of the major anonymous networking technologies in use today.