CRNov 8, 2023
Local Differential Privacy for Smart Meter Data SharingYashothara Shanmugarasa, M. A. P. Chamikara, Hye-young Paik et al.
Energy disaggregation techniques, which use smart meter data to infer appliance energy usage, can provide consumers and energy companies valuable insights into energy management. However, these techniques also present privacy risks, such as the potential for behavioral profiling. Local differential privacy (LDP) methods provide strong privacy guarantees with high efficiency in addressing privacy concerns. However, existing LDP methods focus on protecting aggregated energy consumption data rather than individual appliances. Furthermore, these methods do not consider the fact that smart meter data are a form of streaming data, and its processing methods should account for time windows. In this paper, we propose a novel LDP approach (named LDP-SmartEnergy) that utilizes randomized response techniques with sliding windows to facilitate the sharing of appliance-level energy consumption data over time while not revealing individual users' appliance usage patterns. Our evaluations show that LDP-SmartEnergy runs efficiently compared to baseline methods. The results also demonstrate that our solution strikes a balance between protecting privacy and maintaining the utility of data for effective analysis.
SEAug 11, 2023
Decentralised Governance-Driven Architecture for Designing Foundation Model based Systems: Exploring the Role of Blockchain in Responsible AIYue Liu, Qinghua Lu, Liming Zhu et al.
Foundation models including large language models (LLMs) are increasingly attracting interest worldwide for their distinguished capabilities and potential to perform a wide variety of tasks. Nevertheless, people are concerned about whether foundation model based AI systems are properly governed to ensure the trustworthiness and to prevent misuse that could harm humans, society and the environment. In this paper, we identify eight governance challenges of foundation model based AI systems regarding the three fundamental dimensions of governance: decision rights, incentives, and accountability. Furthermore, we explore the potential of blockchain as an architectural solution to address the challenges by providing a distributed ledger to facilitate decentralised governance. We present an architecture that demonstrates how blockchain can be leveraged to realise governance in foundation model based AI systems.
LGApr 28, 2022
Decision Models for Selecting Federated Learning Architecture PatternsSin Kit Lo, Qinghua Lu, Hye-Young Paik et al.
Federated machine learning is growing fast in academia and industries as a solution to solve data hungriness and privacy issues in machine learning. Being a widely distributed system, federated machine learning requires various system design thinking. To better design a federated machine learning system, researchers have introduced multiple patterns and tactics that cover various system design aspects. However, the multitude of patterns leaves the designers confused about when and which pattern to adopt. In this paper, we present a set of decision models for the selection of patterns for federated machine learning architecture design based on a systematic literature review on federated machine learning, to assist designers and architects who have limited knowledge of federated machine learning. Each decision model maps functional and non-functional requirements of federated machine learning systems to a set of patterns. We also clarify the drawbacks of the patterns. We evaluated the decision models by mapping the decision patterns to concrete federated machine learning architectures by big tech firms to assess the models' correctness and usefulness. The evaluation results indicate that the proposed decision models are able to bring structure to the federated machine learning architecture design process and help explicitly articulate the design rationale.
CRAug 18, 2022
Profiler: Profile-Based Model to Detect Phishing EmailsMariya Shmalko, Alsharif Abuadbba, Raj Gaire et al.
Email phishing has become more prevalent and grows more sophisticated over time. To combat this rise, many machine learning (ML) algorithms for detecting phishing emails have been developed. However, due to the limited email data sets on which these algorithms train, they are not adept at recognising varied attacks and, thus, suffer from concept drift; attackers can introduce small changes in the statistical characteristics of their emails or websites to successfully bypass detection. Over time, a gap develops between the reported accuracy from literature and the algorithm's actual effectiveness in the real world. This realises itself in frequent false positive and false negative classifications. To this end, we propose a multidimensional risk assessment of emails to reduce the feasibility of an attacker adapting their email and avoiding detection. This horizontal approach to email phishing detection profiles an incoming email on its main features. We develop a risk assessment framework that includes three models which analyse an email's (1) threat level, (2) cognitive manipulation, and (3) email type, which we combine to return the final risk assessment score. The Profiler does not require large data sets to train on to be effective and its analysis of varied email features reduces the impact of concept drift. Our Profiler can be used in conjunction with ML approaches, to reduce their misclassifications or as a labeller for large email data sets in the training stage. We evaluate the efficacy of the Profiler against a machine learning ensemble using state-of-the-art ML algorithms on a data set of 9000 legitimate and 900 phishing emails from a large Australian research organisation. Our results indicate that the Profiler's mitigates the impact of concept drift, and delivers 30% less false positive and 25% less false negative email classifications over the ML ensemble's approach.
CLJan 20, 2025Code
RACCOON: A Retrieval-Augmented Generation Approach for Location Coordinate Capture from News ArticlesJonathan Lin, Aditya Joshi, Hye-young Paik et al.
Geocoding involves automatic extraction of location coordinates of incidents reported in news articles, and can be used for epidemic intelligence or disaster management. This paper introduces Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), an open-source geocoding approach that extracts geolocations from news articles. RACCOON uses a retrieval-augmented generation (RAG) approach where candidate locations and associated information are retrieved in the form of context from a location database, and a prompt containing the retrieved context, location mentions and news articles is fed to an LLM to generate the location coordinates. Our evaluation on three datasets, two underlying LLMs, three baselines and several ablation tests based on the components of RACCOON demonstrate the utility of RACCOON. To the best of our knowledge, RACCOON is the first RAG-based approach for geocoding using pre-trained LLMs.
SDApr 11, 2025
Generalized Multilingual Text-to-Speech Generation with Language-Aware Style AdaptationHaowei Lou, Hye-young Paik, Sheng Li et al.
Text-to-Speech (TTS) models can generate natural, human-like speech across multiple languages by transforming phonemes into waveforms. However, multilingual TTS remains challenging due to discrepancies in phoneme vocabularies and variations in prosody and speaking style across languages. Existing approaches either train separate models for each language, which achieve high performance at the cost of increased computational resources, or use a unified model for multiple languages that struggles to capture fine-grained, language-specific style variations. In this work, we propose LanStyleTTS, a non-autoregressive, language-aware style adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages. This design supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models. We evaluate LanStyleTTS by integrating it with several state-of-the-art non-autoregressive TTS architectures. Results show consistent performance improvements across different model backbones. Furthermore, we investigate a range of acoustic feature representations, including mel-spectrograms and autoencoder-derived latent features. Our experiments demonstrate that latent encodings can significantly reduce model size and computational cost while preserving high-quality speech generation.
CVOct 1, 2025
MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent PromptsYifan Shen, Yangyang Shu, Hye-young Paik et al.
Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model's logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even the SOTA text-to-image models like Flux.dev and DALLE-3 demonstrate a 59 percent and 71 percent misalignment rate, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
CLAug 11, 2025
LLMs for Law: Evaluating Legal-Specific LLMs on Contract UnderstandingAmrita Singh, H. Suhan Karaca, Aditya Joshi et al.
Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.
CLJul 9, 2025
A Survey of Classification Tasks and Approaches for Legal ContractsAmrita Singh, Aditya Joshi, Jiaojiao Jiang et al.
Given the large size and volumes of contracts and their underlying inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey delves into the challenges of automatic LCC and a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.
SEOct 26, 2021
Defining Blockchain Governance Principles: A Comprehensive FrameworkYue Liu, Qinghua Lu, Guangsheng Yu et al.
Blockchain eliminates the need for trusted third-party intermediaries in business by enabling decentralised architecture design in software applications. However, the vulnerabilities in on-chain autonomous decision-makings and cumbersome off-chain coordination lead to serious concerns about blockchain's ability to behave in a trustworthy and efficient way. Blockchain governance has received considerable attention to support the decision-making process during the use and evolution of blockchain. Nevertheless, the conventional governance frameworks do not apply to blockchain due to its distributed architecture and decentralised decision process. These inherent features lead to the absence of a clear source of authority in blockchain ecosystem. Currently, there is a lack of systematic guidance on the governance of blockchain. Therefore, in this paper, we present a comprehensive blockchain governance framework, which elucidates an integrated view of the degree of decentralisation, decision rights, incentives, accountability, ecosystem, and legal and ethical responsibilities. The above aspects are formulated as six high-level principles for blockchain governance. We demonstrate a qualitative analysis of the proposed framework, including case studies on five extant blockchain platforms, and comparison with existing blockchain governance frameworks. The results show that our proposed framework is feasible and applicable in a real-world context.
LGSep 2, 2021
Global Convolutional Neural ProcessesXuesong Wang, Lina Yao, Xianzhi Wang et al.
The ability to deal with uncertainty in machine learning models has become equally, if not more, crucial to their predictive ability itself. For instance, during the pandemic, governmental policies and personal decisions are constantly made around uncertainties. Targeting this, Neural Process Families (NPFs) have recently shone a light on prediction with uncertainties by bridging Gaussian processes and neural networks. Latent neural process, a member of NPF, is believed to be capable of modelling the uncertainty on certain points (local uncertainty) as well as the general function priors (global uncertainties). Nonetheless, some critical questions remain unresolved, such as a formal definition of global uncertainties, the causality behind global uncertainties, and the manipulation of global uncertainties for generative models. Regarding this, we build a member GloBal Convolutional Neural Process(GBCoNP) that achieves the SOTA log-likelihood in latent NPFs. It designs a global uncertainty representation p(z), which is an aggregation on a discretized input space. The causal effect between the degree of global uncertainty and the intra-task diversity is discussed. The learnt prior is analyzed on a variety of scenarios, including 1D, 2D, and a newly proposed spatial-temporal COVID dataset. Our manipulation of the global uncertainty not only achieves generating the desired samples to tackle few-shot learning, but also enables the probability evaluation on the functional priors.
LGAug 16, 2021
Blockchain-based Trustworthy Federated Learning ArchitectureSin Kit Lo, Yue Liu, Qinghua Lu et al.
Federated learning is an emerging privacy-preserving AI technique where clients (i.e., organisations or devices) train models locally and formulate a global model based on the local model updates without transferring local data externally. However, federated learning systems struggle to achieve trustworthiness and embody responsible AI principles. In particular, federated learning systems face accountability and fairness challenges due to multi-stakeholder involvement and heterogeneity in client data distribution. To enhance the accountability and fairness of federated learning systems, we present a blockchain-based trustworthy federated learning architecture. We first design a smart contract-based data-model provenance registry to enable accountability. Additionally, we propose a weighted fair data sampler algorithm to enhance fairness in training data. We evaluate the proposed approach using a COVID-19 X-ray detection use case. The evaluation results show that the approach is feasible to enable accountability and improve fairness. The proposed algorithm can achieve better performance than the default federated learning setting in terms of the model's generalisation and accuracy.
LGJun 22, 2021
FLRA: A Reference Architecture for Federated Learning SystemsSin Kit Lo, Qinghua Lu, Hye-Young Paik et al.
Federated learning is an emerging machine learning paradigm that enables multiple devices to train models locally and formulate a global model, without sharing the clients' local data. A federated learning system can be viewed as a large-scale distributed system, involving different components and stakeholders with diverse requirements and constraints. Hence, developing a federated learning system requires both software system design thinking and machine learning knowledge. Although much effort has been put into federated learning from the machine learning perspectives, our previous systematic literature review on the area shows that there is a distinct lack of considerations for software architecture design for federated learning. In this paper, we propose FLRA, a reference architecture for federated learning systems, which provides a template design for federated learning-based solutions. The proposed FLRA reference architecture is based on an extensive review of existing patterns of federated learning systems found in the literature and existing industrial implementation. The FLRA reference architecture consists of a pool of architectural patterns that could address the frequently recurring design problems in federated learning architectures. The FLRA reference architecture can serve as a design guideline to assist architects and developers with practical solutions for their problems, which can be further customised.
SEMay 12, 2021
A Systematic Literature Review on Blockchain GovernanceYue Liu, Qinghua Lu, Liming Zhu et al.
Blockchain has been increasingly used as a software component to enable decentralisation in software architecture for a variety of applications. Blockchain governance has received considerable attention to ensure the safe and appropriate use and evolution of blockchain, especially after the Ethereum DAO attack in 2016. However, there are no systematic efforts to analyse existing governance solutions. To understand the state-of-the-art of blockchain governance, we conducted a systematic literature review with 37 primary studies. The extracted data from primary studies are synthesised to answer identified research questions. The study results reveal several major findings: 1) governance can improve the adaptability and upgradability of blockchain, whilst the current studies neglect broader ethical responsibilities as the objectives of blockchain governance; 2) governance is along with the development process of a blockchain platform, while ecosystem-level governance process is missing, and; 3) the responsibilities and capabilities of blockchain stakeholders are briefly discussed, whilst the decision rights, accountability, and incentives of blockchain stakeholders are still under studied. We provide actionable guidelines for academia and practitioners to use throughout the lifecycle of blockchain, and identify future trends to support researchers in this area.
LGMar 13, 2021
Simeon -- Secure Federated Machine Learning Through Iterative FilteringNicholas Malecki, Hye-young Paik, Aleksandar Ignjatovic et al.
Federated learning enables a global machine learning model to be trained collaboratively by distributed, mutually non-trusting learning agents who desire to maintain the privacy of their training data and their hardware. A global model is distributed to clients, who perform training, and submit their newly-trained model to be aggregated into a superior model. However, federated learning systems are vulnerable to interference from malicious learning agents who may desire to prevent training or induce targeted misclassification in the resulting global model. A class of Byzantine-tolerant aggregation algorithms has emerged, offering varying degrees of robustness against these attacks, often with the caveat that the number of attackers is bounded by some quantity known prior to training. This paper presents Simeon: a novel approach to aggregation that applies a reputation-based iterative filtering technique to achieve robustness even in the presence of attackers who can exhibit arbitrary behaviour. We compare Simeon to state-of-the-art aggregation techniques and find that Simeon achieves comparable or superior robustness to a variety of attacks. Notably, we show that Simeon is tolerant to sybil attacks, where other algorithms are not, presenting a key advantage of our approach.
LGJan 7, 2021
Architectural Patterns for the Design of Federated Learning SystemsSin Kit Lo, Qinghua Lu, Liming Zhu et al.
Federated learning has received fast-growing interests from academia and industry to tackle the challenges of data hungriness and privacy in machine learning. A federated learning system can be viewed as a large-scale distributed system with different components and stakeholders as numerous client devices participate in federated learning. Designing a federated learning system requires software system design thinking apart from machine learning knowledge. Although much effort has been put into federated learning from the machine learning technique aspects, the software architecture design concerns in building federated learning systems have been largely ignored. Therefore, in this paper, we present a collection of architectural patterns to deal with the design challenges of federated learning systems. Architectural patterns present reusable solutions to a commonly occurring problem within a given context during software architecture design. The presented patterns are based on the results of a systematic literature review and include three client management patterns, four model management patterns, three model training patterns, and four model aggregation patterns. The patterns are associated to the particular state transitions in a federated learning model lifecycle, serving as a guidance for effective use of the patterns in the design of federated learning systems.
SEJul 22, 2020
A Systematic Literature Review on Federated Machine Learning: From A Software Engineering PerspectiveSin Kit Lo, Qinghua Lu, Chen Wang et al.
Federated learning is an emerging machine learning paradigm where clients train models locally and formulate a global model based on the local model updates. To identify the state-of-the-art in federated learning and explore how to develop federated learning systems, we perform a systematic literature review from a software engineering perspective, based on 231 primary studies. Our data synthesis covers the lifecycle of federated learning system development that includes background understanding, requirement analysis, architecture design, implementation, and evaluation. We highlight and summarise the findings from the results, and identify future trends to encourage researchers to advance their current work.
CRMay 25, 2020
Design Patterns for Blockchain-based Self-Sovereign IdentityYue Liu, Qinghua Lu, Hye-Young Paik et al.
Self-sovereign identity is a new identity management paradigm that allows entities to really have the ownership of their identity data and control their use without involving any intermediary. Blockchain is an enabling technology for building self-sovereign identity systems by providing a neutral and trustable storage and computing infrastructure and can be viewed as a component of the systems. Both blockchain and self-sovereign identity are emerging technologies which could present a steep learning curve for architects. We collect and propose 12 design patterns for blockchain-based self-sovereign identity systems to help the architects understand and easily apply the concepts in system design. Based on the lifecycles of three main objects involved in self-sovereign identity, we categorise the patterns into three groups: key management patterns, decentralised identifier management patterns, and credential design patterns. The proposed patterns provide a systematic and holistic guide for architects to design the architecture of blockchain-based self-sovereign identity systems.
SEMay 4, 2020
Design-Pattern-as-a-Service for Blockchain-based Self-Sovereign IdentityYue Liu, Qinghua Lu, Hye-Young Paik et al.
Self-sovereign identity (SSI) is considered to be a "killer application" of blockchain. However, there is a lack of systematic architecture designs for blockchain-based SSI systems to support methodical development. An aspect of such gap is demonstrated in current solutions, which are considered coarse grained and may increase data security risks. In this paper, we first identify the lifecycles of three major SSI objects (i.e., key, identifier, and credential) and present fine-grained design patterns critical for application development. These patterns are associated with particular state transitions, providing a systematic view of system interactions and serving as a guidance for effective use of these patterns. Further, we present an SSI platform architecture, which advocates the notion of Design-Pattern-as-a-Service. Each design pattern serves as an API by wrapping the respective pattern code to ease application development and improve scalability and security. We implement a prototype and evaluate it on feasibility and scalability.
IRJan 15, 2019
Integrating and querying similar tables from PDF documents using deep learningRahul Anand, Hye-Young Paik, Cheng Wang
Large amount of public data produced by enterprises are in semi-structured PDF form. Tabular data extraction from reports and other published data in PDF format is of interest for various data consolidation purposes such as analysing and aggregating financial reports of a company. Queries into the structured tabular data in PDF format are normally processed in an unstructured manner through means like text-match. This is mainly due to that the binary format of PDF documents is optimized for layout and rendering and do not have great support for automated parsing of data. Moreover, even the same table type in PDF files varies in schema, row or column headers, which makes it difficult for a query plan to cover all relevant tables. This paper proposes a deep learning based method to enable SQL-like query and analysis of financial tables from annual reports in PDF format. This is achieved through table type classification and nearest row search. We demonstrate that using word embedding trained on Google news for header match clearly outperforms the text-match based approach in traditional database. We also introduce a practical system that uses this technology to query and analyse finance tables in PDF documents from various sources.