18.7CLJun 3
Multilingual Coreference Resolution via Cycle-Consistent Machine TranslationAdriana-Valentina Costache, Eduard Poesina, Silviu-Florin Gheorghe et al.
Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g.~machine translation, question answering, document summarization, etc. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language, to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
SYApr 21, 2023
Learning Dictionaries from Physical-Based Interpolation for Water Network Leak LocalizationPaul Irofti, Luis Romero-Ben, Florin Stoican et al.
This article presents a leak localization methodology based on state estimation and learning. The first is handled by an interpolation scheme, whereas dictionary learning is considered for the second stage. The novel proposed interpolation technique exploits the physics of the interconnections between hydraulic heads of neighboring nodes in water distribution networks. Additionally, residuals are directly interpolated instead of hydraulic head values. The results of applying the proposed method to a well-known case study (Modena) demonstrated the improvements of the new interpolation method with respect to a state-of-the-art approach, both in terms of interpolation error (considering state and residual estimation) and posterior localization.
CVNov 29, 2024Code
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and OutlookFlorinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru et al.
With the recent advancements in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online, thus being deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio and multimodal (audio-visual) content. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions to obtain robust and powerful deepfake detectors. Our project page and new benchmark are available at https://github.com/CroitoruAlin/biodeep.
SYNov 27, 2023
Nodal Hydraulic Head Estimation through Unscented Kalman Filter for Data-driven Leak Localization in Water NetworksLuis Romero-Ben, Paul Irofti, Florin Stoican et al.
In this paper, we present a nodal hydraulic head estimation methodology for water distribution networks (WDN) based on an Unscented Kalman Filter (UKF) scheme with application to leak localization. The UKF refines an initial estimation of the hydraulic state by considering the prediction model, as well as available pressure and demand measurements. To this end, it provides customized prediction and data assimilation steps. Additionally, the method is enhanced by dynamically updating the prediction function weight matrices. Performance testing on the Modena benchmark under realistic conditions demonstrates the method's effectiveness in enhancing state estimation and data-driven leak localization.
LGMay 14, 2022
Unsupervised Abnormal Traffic Detection through Topological Flow AnalysisPaul Irofti, Andrei Pătraşcu, Andrei Iulian Hîji
Cyberthreats are a permanent concern in our modern technological world. In the recent years, sophisticated traffic analysis techniques and anomaly detection (AD) algorithms have been employed to face the more and more subversive adversarial attacks. A malicious intrusion, defined as an invasive action intending to illegally exploit private resources, manifests through unusual data traffic and/or abnormal connectivity pattern. Despite the plethora of statistical or signature-based detectors currently provided in the literature, the topological connectivity component of a malicious flow is less exploited. Furthermore, a great proportion of the existing statistical intrusion detectors are based on supervised learning, that relies on labeled data. By viewing network flows as weighted directed interactions between a pair of nodes, in this paper we present a simple method that facilitate the use of connectivity graph features in unsupervised anomaly detection algorithms. We test our methodology on real network traffic datasets and observe several improvements over standard AD.
ROMar 3
Real-time loosely coupled GNSS and IMU integration via Factor Graph OptimizationRadu-Andrei Cioaca, Cristian Rusu, Paul Irofti et al.
Accurate positioning, navigation, and timing (PNT) is fundamental to the operation of modern technologies and a key enabler of autonomous systems. A very important component of PNT is the Global Navigation Satellite System (GNSS) which ensures outdoor positioning. Modern research directions have pushed the performance of GNSS localization to new heights by fusing GNSS measurements with other sensory information, mainly measurements from Inertial Measurement Units (IMU). In this paper, we propose a loosely coupled architecture to integrate GNSS and IMU measurements using a Factor Graph Optimization (FGO) framework. Because the FGO method can be computationally challenging and often used as a post-processing method, our focus is on assessing its localization accuracy and service availability while operating in real-time in challenging environments (urban canyons). Experimental results on the UrbanNav-HK-MediumUrban-1 dataset show that the proposed approach achieves real-time operation and increased service availability compared to batch FGO methods. While this improvement comes at the cost of reduced positioning accuracy, the paper provides a detailed analysis of the trade-offs between accuracy, availability, and computational efficiency that characterize real-time FGO-based GNSS/IMU fusion.
ROMar 3
Real-time tightly coupled GNSS and IMU integration via Factor Graph OptimizationRadu-Andrei Cioaca, Paul Irofti, Cristian Rusu et al.
Reliable positioning in dense urban environments remains challenging due to frequent GNSS signal blockage, multipath, and rapidly varying satellite geometry. While factor graph optimization (FGO)-based GNSS-IMU fusion has demonstrated strong robustness and accuracy, most formulations remain offline. In this work, we present a real-time tightly coupled GNSS-IMU FGO method that enables causal state estimation via incremental optimization with fixed-lag marginalization, and we evaluate its performance in a highly urbanized GNSS-degraded environment using the UrbanNav dataset.
SYSep 13, 2025Code
Factor Graph Optimization for Leak Localization in Water Distribution NetworksPaul Irofti, Luis Romero-Ben, Florin Stoican et al.
Detecting and localizing leaks in water distribution network systems is an important topic with direct environmental, economic, and social impact. Our paper is the first to explore the use of factor graph optimization techniques for leak localization in water distribution networks, enabling us to perform sensor fusion between pressure and demand sensor readings and to estimate the network's temporal and structural state evolution across all network nodes. The methodology introduces specific water network factors and proposes a new architecture composed of two factor graphs: a leak-free state estimation factor graph and a leak localization factor graph. When a new sensor reading is obtained, unlike Kalman and other interpolation-based methods, which estimate only the current network state, factor graphs update both current and past states. Results on Modena, L-TOWN and synthetic networks show that factor graphs are much faster than nonlinear Kalman-based alternatives such as the UKF, while also providing improvements in localization compared to state-of-the-art estimation-localization approaches. Implementation and benchmarks are available at https://github.com/pirofti/FGLL.
SYDec 16, 2024Code
Dual Unscented Kalman Filter Architecture for Sensor Fusion in Water Networks Leak LocalizationLuis Romero-Ben, Paul Irofti, Florin Stoican et al.
Leakage in water systems results in significant daily water losses, degrading service quality, increasing costs, and aggravating environmental problems. Most leak localization methods rely solely on pressure data, missing valuable information from other sensor types. This article proposes a hydraulic state estimation methodology based on a dual Unscented Kalman Filter (UKF) approach, which enhances the estimation of both nodal hydraulic heads, critical in localization tasks, and pipe flows, useful for operational purposes. The approach enables the fusion of different sensor types, such as pressure, flow and demand meters. The strategy is evaluated in well-known open source case studies, namely Modena and L-TOWN, showing improvements over other state-of-the-art estimation approaches in terms of interpolation accuracy, as well as more precise leak localization performance in L-TOWN.
CLJan 19Code
MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text CategorizationAdriana-Valentina Costache, Daria-Nicoleta Dragomir, Silviu-Florin Gheorghe et al.
Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at https://github.com/Adriana19Valentina/MOSLD-Bench.
SDMay 31, 2025Code
XMAD-Bench: Cross-Domain Multilingual Audio Deepfake BenchmarkIoan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea et al.
Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild''. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
CLFeb 18, 2025Code
A Survey of Text Classification Under Class Distribution ShiftAdriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina et al.
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e.~the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e.~learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
CRMar 22, 2025
Detecting and Mitigating DDoS Attacks with AI: A SurveyAlexandru Apostu, Silviu Gheorghe, Andrei Hîji et al.
Distributed Denial of Service attacks represent an active cybersecurity research problem. Recent research shifted from static rule-based defenses towards AI-based detection and mitigation. This comprehensive survey covers several key topics. Preeminently, state-of-the-art AI detection methods are discussed. An in-depth taxonomy based on manual expert hierarchies and an AI-generated dendrogram are provided, thus settling DDoS categorization ambiguities. An important discussion on available datasets follows, covering data format options and their role in training AI detection methods together with adversarial training and examples augmentation. Beyond detection, AI based mitigation techniques are surveyed as well. Finally, multiple open research directions are proposed.
SYOct 28, 2025
A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networksLuis Romero-Ben, Paul Irofti, Florin Stoican et al.
The sustainability of modern cities highly depends on efficient water distribution management, including effective pressure control and leak detection and localization. Accurate information about the network hydraulic state is therefore essential. This article presents a comparison between two data-driven state estimation methods based on the Unscented Kalman Filter (UKF), fusing pressure, demand and flow data for head and flow estimation. One approach uses a joint state vector with a single estimator, while the other uses a dual-estimator scheme. We analyse their main characteristics, discussing differences, advantages and limitations, and compare them theoretically in terms of accuracy and complexity. Finally, we show several estimation results for the L-TOWN benchmark, allowing to discuss their properties in a real implementation.
LGApr 5, 2024
Fusing Dictionary Learning and Support Vector Machines for Unsupervised Anomaly DetectionPaul Irofti, Iulian-Andrei Hîji, Andrei Pătraşcu et al.
We study in this paper the improvement of one-class support vector machines (OC-SVM) through sparse representation techniques for unsupervised anomaly detection. As Dictionary Learning (DL) became recently a common analysis technique that reveals hidden sparse patterns of data, our approach uses this insight to endow unsupervised detection with more control on pattern finding and dimensions. We introduce a new anomaly detection model that unifies the OC-SVM and DL residual functions into a single composite objective, subsequently solved through K-SVD-type iterative algorithms. A closed-form of the alternating K-SVD iteration is explicitly derived for the new composite model and practical implementable schemes are discussed. The standard DL model is adapted for the Dictionary Pair Learning (DPL) context, where the usual sparsity constraints are naturally eliminated. Finally, we extend both objectives to the more general setting that allows the use of kernel functions. The empirical convergence properties of the resulting algorithms are provided and an in-depth analysis of their parametrization is performed while also demonstrating their numerical performance in comparison with existing methods.
NAMar 5, 2024
Learning Explicitly Conditioned Sparsifying TransformsAndrei Pătraşcu, Cristian Rusu, Paul Irofti
Sparsifying transforms became in the last decades widely known tools for finding structured sparse representations of signals in certain transform domains. Despite the popularity of classical transforms such as DCT and Wavelet, learning optimal transforms that guarantee good representations of data into the sparse domain has been recently analyzed in a series of papers. Typically, the conditioning number and representation ability are complementary key features of learning square transforms that may not be explicitly controlled in a given optimization model. Unlike the existing approaches from the literature, in our paper, we consider a new sparsifying transform model that enforces explicit control over the data representation quality and the condition number of the learned transforms. We confirm through numerical experiments that our model presents better numerical behavior than the state-of-the-art.
LGJan 11, 2022
Dictionary Learning with Uniform Sparse Representations for Anomaly DetectionPaul Irofti, Cristian Rusu, Andrei Pătraşcu
Many applications like audio and image processing show that sparse representations are a powerful and efficient signal modeling technique. Finding an optimal dictionary that generates at the same time the sparsest representations of data and the smallest approximation error is a hard problem approached by dictionary learning (DL). We study how DL performs in detecting abnormal samples in a dataset of signals. In this paper we use a particular DL formulation that seeks uniform sparse representations model to detect the underlying subspace of the majority of samples in a dataset, using a K-SVD-type algorithm. Numerical simulations show that one can efficiently use this resulted subspace to discriminate the anomalies over the regular data points.
LGOct 12, 2021
Data-driven Leak Localization in Water Distribution Networks via Dictionary Learning and Graph-based InterpolationPaul Irofti, Luis Romero-Ben, Florin Stoican et al.
In this paper, we propose a data-driven leak localization method for water distribution networks (WDNs) which combines two complementary approaches: graph-based interpolation and dictionary classification. The former estimates the complete WDN hydraulic state (i.e., hydraulic heads) from real measurements at certain nodes and the network graph. Then, these actual measurements, together with a subset of valuable estimated states, are used to feed and train the dictionary learning scheme. Thus, the meshing of these two methods is explored, showing that its performance is superior to either approach alone, even deriving different mechanisms to increase its resilience to classical problems (e.g., dimensionality, interpolation errors, etc.). The approach is validated using the L-TOWN benchmark proposed at BattLeDIM2020.
LGAug 10, 2021
Complexity of Inexact Proximal Point Algorithm for minimizing convex functions with Holderian GrowthAndrei Pătraşcu, Paul Irofti
Several decades ago the Proximal Point Algorithm (PPA) started to gain a long-lasting attraction for both abstract operator theory and numerical optimization communities. Even in modern applications, researchers still use proximal minimization theory to design scalable algorithms that overcome nonsmoothness. Remarkable works as \cite{Fer:91,Ber:82constrained,Ber:89parallel,Tom:11} established tight relations between the convergence behaviour of PPA and the regularity of the objective function. In this manuscript we derive nonasymptotic iteration complexity of exact and inexact PPA to minimize convex functions under $γ-$Holderian growth: $\BigO{\log(1/ε)}$ (for $γ\in [1,2]$) and $\BigO{1/ε^{γ- 2}}$ (for $γ> 2$). In particular, we recover well-known results on PPA: finite convergence for sharp minima and linear convergence for quadratic growth, even under presence of deterministic noise. Moreover, when a simple Proximal Subgradient Method is recurrently called as an inner routine for computing each IPPA iterate, novel computational complexity bounds are obtained for Restarting Inexact PPA. Our numerical tests show improvements over existing restarting versions of the Subgradient Method.
LGJul 7, 2020
Efficient and Parallel Separable Dictionary LearningCristian Rusu, Paul Irofti
Separable, or Kronecker product, dictionaries provide natural decompositions for 2D signals, such as images. In this paper, we describe a highly parallelizable algorithm that learns such dictionaries which reaches sparse representations competitive with the previous state of the art dictionary learning algorithms from the literature but at a lower computational cost. We highlight the performance of the proposed method to sparsely represent image and hyperspectral data, and for image denoising.
LGMar 30, 2020
Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning ModelsAndrei Patrascu, Ciprian Paduraru, Paul Irofti
Stochastic optimization lies at the core of most statistical learning models. The recent great development of stochastic algorithmic tools focused significantly onto proximal gradient iterations, in order to find an efficient approach for nonsmooth (composite) population risk functions. The complexity of finding optimal predictors by minimizing regularized risk is largely understood for simple regularizations such as $\ell_1/\ell_2$ norms. However, more complex properties desired for the predictor necessitates highly difficult regularizers as used in grouped lasso or graph trend filtering. In this chapter we develop and analyze minibatch variants of stochastic proximal gradient algorithm for general composite objective functions with stochastic nonsmooth components. We provide iteration complexity for constant and variable stepsize policies obtaining that, for minibatch size $N$, after $\mathcal{O}(\frac{1}{Nε})$ iterations $ε-$suboptimality is attained in expected quadratic distance to optimal solution. The numerical tests on $\ell_2-$regularized SVMs and parametric sparse representation problems confirm the theoretical behaviour and surpasses minibatch SGD performance.
SYMar 18, 2020
Fault Handling in Large Water Networks with Online Dictionary LearningPaul Irofti, Florin Stoican, Vicenç Puig
Fault detection and isolation in water distribution networks is an active topic due to its model's mathematical complexity and increased data availability through sensor placement. Here we simplify the model by offering a data driven alternative that takes the network topology into account when performing sensor placement and then proceeds to build a network model through online dictionary learning based on the incoming sensor data. Online learning is fast and allows tackling large networks as it processes small batches of signals at a time and has the benefit of continuous integration of new data into the existing network model, be it in the beginning for training or in production when new data samples are encountered. The algorithms show good performance when tested on both small and large-scale networks.
LGFeb 29, 2020
Unsupervised Dictionary Learning for Anomaly DetectionPaul Irofti, Andra Băltoiu
We investigate the possibilities of employing dictionary learning to address the requirements of most anomaly detection applications, such as absence of supervision, online formulations, low false positive rates. We present new results of our recent semi-supervised online algorithm, TODDLeR, on a anti-money laundering application. We also introduce a novel unsupervised method of using the performance of the learning algorithm as indication of the nature of the samples.
OCDec 4, 2019
Stochastic proximal splitting algorithm for composite minimizationAndrei Patrascu, Paul Irofti
Supported by the recent contributions in multiple branches, the first-order splitting algorithms became central for structured nonsmooth optimization. In the large-scale or noisy contexts, when only stochastic information on the smooth part of the objective function is available, the extension of proximal gradient schemes to stochastic oracles is based on proximal tractability of the nonsmooth component and it has been deeply analyzed in the literature. However, there remained gaps illustrated by composite models where the nonsmooth term is not proximally tractable anymore. In this note we tackle composite optimization problems, where the access only to stochastic information on both smooth and nonsmooth components is assumed, using a stochastic proximal first-order scheme with stochastic proximal updates. We provide $\mathcal{O}\left( \frac{1}{k} \right)$ the iteration complexity (in expectation of squared distance to the optimal set) under the strong convexity assumption on the objective function. Empirical behavior is illustrated by numerical tests on parametric sparse representation models.
LGOct 24, 2019
Community-Level Anomaly Detection for Anti-Money LaunderingAndra Baltoiu, Andrei Patrascu, Paul Irofti
Anomaly detection in networks often boils down to identifying an underlying graph structure on which the abnormal occurrence rests on. Financial fraud schemes are one such example, where more or less intricate schemes are employed in order to elude transaction security protocols. We investigate the problem of learning graph structure representations using adaptations of dictionary learning aimed at encoding connectivity patterns. In particular, we adapt dictionary learning strategies to the specificity of network topologies and propose new methods that impose Laplacian structure on the dictionaries themselves. In one adaption we focus on classifying topologies by working directly on the graph Laplacian and cast the learning problem to accommodate its 2D structure. We tackle the same problem by learning dictionaries which consist of vectorized atomic Laplacians, and provide a block coordinate descent scheme to solve the new dictionary learning formulation. Imposing Laplacian structure on the dictionaries is also proposed in an adaptation of the Single Block Orthogonal learning method. Results on synthetic graph datasets comprising different graph topologies confirm the potential of dictionaries to directly represent graph structure information.
LGOct 24, 2019
Quick survey of graph-based fraud detection methodsPaul Irofti, Andrei Patrascu, Andra Baltoiu
In general, anomaly detection is the problem of distinguishing between normal data samples with well defined patterns or signatures and those that do not conform to the expected profiles. Financial transactions, customer reviews, social media posts are all characterized by relational information. In these networks, fraudulent behaviour may appear as a distinctive graph edge, such as spam message, a node or a larger subgraph structure, such as when a group of clients engage in money laundering schemes. Most commonly, these networks are represented as attributed graphs, with numerical features complementing relational information. We present a survey on anomaly detection techniques used for fraud detection that exploit both the graph structure underlying the data and the contextual information contained in the attributes.
CVSep 16, 2015
Overcomplete Dictionary Learning with Jacobi Atom UpdatesPaul Irofti, Bogdan Dumitrescu
Dictionary learning for sparse representations is traditionally approached with sequential atom updates, in which an optimized atom is used immediately for the optimization of the next atoms. We propose instead a Jacobi version, in which groups of atoms are updated independently, in parallel. Extensive numerical evidence for sparse image representation shows that the parallel algorithms, especially when all atoms are updated simultaneously, give better dictionaries than their sequential counterparts.
CVDec 16, 2014
Efficient GPU Implementation for Single Block Orthogonal Dictionary LearningPaul Irofti
Dictionary training for sparse representations involves dealing with large chunks of data and complex algorithms that determine time consuming implementations. SBO is an iterative dictionary learning algorithm based on constructing unions of orthonormal bases via singular value decomposition, that represents each data item through a single best fit orthobase. In this paper we present a GPGPU approach of implementing SBO in OpenCL. We provide a lock-free solution that ensures full-occupancy of the GPU by following the map-reduce model for the sparse-coding stage and by making use of the Partitioned Global Address Space (PGAS) model for developing parallel dictionary updates. The resulting implementation achieves a favourable trade-off between algorithm complexity and data representation quality compared to PAK-SVD which is the standard overcomplete dictionary learning approach. We present and discuss numerical results showing a significant acceleration of the execution time for the dictionary learning process.