99.2CLMar 30Code
Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim VerificationMasnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan et al.
Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
CLJun 24, 2023
Math Word Problem Solving by Generating Linguistic Variants of Problem StatementsSyed Rifat Raiyan, Md. Nafis Faiyaz, Shah Md. Jawad Kabir et al.
The art of mathematical reasoning stands as a fundamental pillar of intellectual progress and is a central catalyst in cultivating human ingenuity. Researchers have recently published a plethora of works centered around the task of solving Math Word Problems (MWP) $-$ a crucial stride towards general AI. These existing models are susceptible to dependency on shallow heuristics and spurious correlations to derive the solution expressions. In order to ameliorate this issue, in this paper, we propose a framework for MWP solvers based on the generation of linguistic variants of the problem text. The approach involves solving each of the variant problems and electing the predicted expression with the majority of the votes. We use DeBERTa (Decoding-enhanced BERT with disentangled attention) as the encoder to leverage its rich textual representations and enhanced mask decoder to construct the solution expressions. Furthermore, we introduce a challenging dataset, $\mathrm{P\small{ARA}\normalsize{MAWPS}}$, consisting of paraphrased, adversarial, and inverse variants of selectively sampled MWPs from the benchmark $\mathrm{M\small{AWPS}}$ dataset. We extensively experiment on this dataset along with other benchmark datasets using some baseline MWP solver models. We show that training on linguistic variants of problem statements and voting on candidate predictions improve the mathematical reasoning and robustness of the model. We make our code and data publicly available.
CLMar 27, 2023
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language ModelsMd Kamrul Hasan, Md Saiful Islam, Sangwu Lee et al.
Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model can not be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases the model complexity, and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive, both in terms of time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when they are presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in a low-resource setting.
CYJan 8
Political Alignment in Large Language Models: A Multidimensional Audit of Psychometric Identity and Behavioral BiasAdib Sakhawat, Tahsin Islam, Takia Farhin et al.
As large language models (LLMs) are increasingly integrated into social decision-making, understanding their political positioning and alignment behavior is critical for safety and fairness. This study presents a sociotechnical audit of 26 prominent LLMs, triangulating their positions across three psychometric inventories (Political Compass, SapplyValues, 8 Values) and evaluating their performance on a large-scale news labeling task ($N \approx 27{,}000$). Our results reveal a strong clustering of models in the Libertarian-Left region of the ideological space, encompassing 96.3% of the cohort. Alignment signals appear to be consistent architectural traits rather than stochastic noise ($η^2 > 0.90$); however, we identify substantial discrepancies in measurement validity. In particular, the Political Compass exhibits a strong negative correlation with cultural progressivism ($r=-0.64$) when compared against multi-axial instruments, suggesting a conflation of social conservatism with authoritarianism in this context. We further observe a significant divergence between open-weights and closed-source models, with the latter displaying markedly higher cultural progressivism scores ($p<10^{-25}$). In downstream media analysis, models exhibit a systematic "center-shift," frequently categorizing neutral articles as left-leaning, alongside an asymmetric detection capability in which "Far Left" content is identified with greater accuracy (19.2%) than "Far Right" content (2.0%). These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are necessary to characterize alignment behavior in deployed LLMs. Our code and data will be made public.
CLJan 8
BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text GenerationAmit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury et al.
As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse below RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
CLNov 26, 2025
Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and ProspectsHusne Ara Rubaiyeat, Hasan Mahmud, Md Kamrul Hasan
Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creation for BdSLT is of immense importance for developing AI based assistive tools for deaf and hard of hearing people of Bangla speaking community. In this paper, we present a dataset, IsharaKhobor , and two subset of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].
CLNov 3, 2025
BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text DetoxificationAyesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat et al.
Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
CLJul 31, 2025Code
PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics ProblemsOshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy et al.
The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, ${\rm P{\small HYSICS}E{\small VAL}}$, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.
CLMay 11, 2023Code
BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book ReviewsMohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan et al.
The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While the study of sentiment analysis has been widely explored in many popular languages, relatively less attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our codes and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
55.2CLMay 8
Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM EvaluationAdib Sakhawat, Tahsin Islam, Takia Farhin et al.
The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
CVMar 4, 2025
BdSLW401: Transformer-Based Word-Level Bangla Sign Language Recognition Using Relative Quantization Encoding (RQE)Husne Ara Rubaiyeat, Njayou Youssouf, Md Kamrul Hasan et al.
Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach anchoring landmarks to physiological reference points and quantize motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in 44.3% WER reduction in WLASL100, 21.0% in SignBD-200, and significant gains in BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, achieves improvements in pose consistency at the cost of small trade-offs in lateral view recognition. The attention graphs prove that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. Introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.
HCAug 22, 2025
Prompting with Sign Parameters for Low-resource Sign Language Instruction GenerationMd Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna et al.
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
CLDec 11, 2024
BDA: Bangla Text Data Augmentation FrameworkMd. Tariquzzaman, Audwit Nafi Anam, Naimul Haque et al.
Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.
64.7AIApr 1
Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language ModelsMd. Abu Bakor Siddique, Shahrin Hossain, Sadman Ahmed Siam et al.
Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have either taken a neural, symbolic or neuro-symbolic approach. But this solves only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in hitherto existing models, this paper proposes MARS-GPS, that generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: https://anonymous.4open.science/r/MARS-GPS-DE55.
CLOct 27, 2025
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque et al.
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available
CVMay 1, 2023
Venn Diagram Multi-label Class Interpretation of Diabetic Foot Ulcer with Color and Sharpness EnhancementMd Mahamudul Hasan, Moi Hoon Yap, Md Kamrul Hasan
DFU is a severe complication of diabetes that can lead to amputation of the lower limb if not treated properly. Inspired by the 2021 Diabetic Foot Ulcer Grand Challenge, researchers designed automated multi-class classification of DFU, including infection, ischaemia, both of these conditions, and none of these conditions. However, it remains a challenge as classification accuracy is still not satisfactory. This paper proposes a Venn Diagram interpretation of multi-label CNN-based method, utilizing different image enhancement strategies, to improve the multi-class DFU classification. We propose to reduce the four classes into two since both class wounds can be interpreted as the simultaneous occurrence of infection and ischaemia and none class wounds as the absence of infection and ischaemia. We introduce a novel Venn Diagram representation block in the classifier to interpret all four classes from these two classes. To make our model more resilient, we propose enhancing the perceptual quality of DFU images, particularly blurry or inconsistently lit DFU images, by performing color and sharpness enhancements on them. We also employ a fine-tuned optimization technique, adaptive sharpness aware minimization, to improve the CNN model generalization performance. The proposed method is evaluated on the test dataset of DFUC2021, containing 5,734 images and the results are compared with the top-3 winning entries of DFUC2021. Our proposed approach outperforms these existing approaches and achieves Macro-Average F1, Recall and Precision scores of 0.6592, 0.6593, and 0.6652, respectively.Additionally, We perform ablation studies and image quality measurements to further interpret our proposed method. This proposed method will benefit patients with DFUs since it tackles the inconsistencies in captured images and can be employed for a more robust remote DFU wound classification.
IVOct 7, 2019
A Novel Technique of Noninvasive Hemoglobin Level Measurement Using HSV Value of Fingertip ImageMd Kamrul Hasan, Nazmus Sakib, Joshua Field et al.
Over the last decade, smartphones have changed radically to support us with mHealth technology, cloud computing, and machine learning algorithm. Having its multifaceted facilities, we present a novel smartphone-based noninvasive hemoglobin (Hb) level prediction model by analyzing hue, saturation and value (HSV) of a fingertip video. Here, we collect 60 videos of 60 subjects from two different locations: Blood Center of Wisconsin, USA and AmaderGram, Bangladesh. We extract red, green, and blue (RGB) pixel intensities of selected images of those videos captured by the smartphone camera with flash on. Then we convert RGB values of selected video frames of a fingertip video into HSV color space and we generate histogram values of these HSV pixel intensities. We average these histogram values of a fingertip video and consider as an observation against the gold standard Hb concentration. We generate two input feature matrices based on observation of two different data sets. Partial Least Squares (PLS) algorithm is applied on the input feature matrix. We observe R2=0.95 in both data sets through our research. We analyze our data using Python OpenCV, Matlab, and R statistics tool.
LGMay 21, 2019
A Causality-Guided Prediction of the TED Talk Ratings from the Speech-Transcripts using Neural NetworksMd Iftekhar Tanveer, Md Kamrul Hasan, Daniel Gildea et al.
Automated prediction of public speaking performance enables novel systems for tutoring public speaking skills. We use the largest open repository---TED Talks---to predict the ratings provided by the online viewers. The dataset contains over 2200 talk transcripts and the associated meta information including over 5.5 million ratings from spontaneous visitors to the website. We carefully removed the bias present in the dataset (e.g., the speakers' reputations, popularity gained by publicity, etc.) by modeling the data generating process using a causal diagram. We use a word sequence based recurrent architecture and a dependency tree based recursive architecture as the neural networks for predicting the TED talk ratings. Our neural network models can predict the ratings with an average F-score of 0.77 which largely outperforms the competitive baseline method.
LGApr 14, 2019
UR-FUNNY: A Multimodal Language Dataset for Understanding HumorMd Kamrul Hasan, Wasifur Rahman, Amir Zadeh et al.
Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within boundaries of multimodal language; a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, in a multimodal context it is an understudied area. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies, present a framework in multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.
HCSep 7, 2017
Automated Dyadic Data Recorder (ADDR) Framework and Analysis of Facial Cues in Deceptive CommunicationTayan Sen, Md Kamrul Hasan, Zach Teicher et al.
We developed an online framework that can automatically pair two crowd-sourced participants, prompt them to follow a research protocol, and record their audio and video on a remote server. The framework comprises two web applications: an Automatic Quality Gatekeeper for ensuring only high quality crowd-sourced participants are recruited for the study, and a Session Controller which directs participants to play a research protocol, such as an interrogation game. This framework was used to run a research study for analyzing facial expressions during honest and deceptive communication using a novel interrogation protocol. The protocol gathers two sets of nonverbal facial cues in participants: features expressed during questions relating to the interrogation topic and features expressed during control questions. The framework and protocol were used to gather 151 dyadic conversations (1.3 million video frames). Interrogators who were lied to expressed the smile-related lip corner puller cue more often than interrogators who were being told the truth, suggesting that facial cues from interrogators may be useful in evaluating the honesty of witnesses in some contexts. Overall, these results demonstrate that this framework is capable of gathering high quality data which can identify statistically significant results in a communication study.
SIJul 6, 2017
Buildup of Speaking Skills in an Online Learning Community: A Network-Analytic ExplorationRasoul Shafipour, Raiyan Abdul Baten, Md Kamrul Hasan et al.
In this study, we explore peer-interaction effects in online networks on speaking skill development. In particular, we present an evidence for gradual buildup of skills in a small-group setting that has not been reported in the literature. We introduce a novel dataset of six online communities consisting of 158 participants focusing on improving their speaking skills. They video-record speeches for 5 prompts in 10 days and exchange comments and performance-ratings with their peers. We ask (i) whether the participants' ratings are affected by their interaction patterns with peers, and (ii) whether there is any gradual buildup of speaking skills in the communities towards homogeneity. To analyze the data, we employ tools from the emerging field of Graph Signal Processing (GSP). GSP enjoys a distinction from Social Network Analysis in that the latter is concerned primarily with the connection structures of graphs, while the former studies signals on top of graphs. We study the performance ratings of the participants as graph signals atop underlying interaction topologies. Total variation analysis of the graph signals show that the participants' rating differences decrease with time (slope=-0.04, p<0.01), while average ratings increase (slope=0.07, p<0.05)--thereby gradually building up the ratings towards community-wide homogeneity. We provide evidence for peer-influence through a prediction formulation. Our consensus-based prediction model outperforms baseline network-agnostic regression models by about 23% in predicting performance ratings. This, in turn, shows that participants' ratings are affected by their peers' ratings and the associated interaction patterns, corroborating previous findings. Then, we formulate a consensus-based diffusion model that captures these observations of peer-influence from our analyses.
CVDec 27, 2015
Improving Facial Analysis and Performance Driven Animation through Disentangling Identity and ExpressionDavid Rim, Sina Honari, Md Kamrul Hasan et al.
We present techniques for improving performance driven facial animation, emotion recognition, and facial key-point or landmark prediction using learned identity invariant representations. Established approaches to these problems can work well if sufficient examples and labels for a particular identity are available and factors of variation are highly controlled. However, labeled examples of facial expressions, emotions and key-points for new individuals are difficult and costly to obtain. In this paper we improve the ability of techniques to generalize to new and unseen individuals by explicitly modeling previously seen variations related to identity and expression. We use a weakly-supervised approach in which identity labels are used to learn the different factors of variation linked to identity separately from factors related to expression. We show how probabilistic modeling of these sources of variation allows one to learn identity-invariant representations for expressions which can then be used to identity-normalize various procedures for facial expression analysis and animation control. We also show how to extend the widely used techniques of active appearance models and constrained local models through replacing the underlying point distribution models which are typically constructed using principal component analysis with identity-expression factorized representations. We present a wide variety of experiments in which we consistently improve performance on emotion recognition, markerless performance-driven facial animation and facial key-point tracking.
CVNov 18, 2015
A New Smooth Approximation to the Zero One Loss with a Probabilistic InterpretationMd Kamrul Hasan, Christopher J. Pal
We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach: 1) is more robust to outliers compared to the logistic and hinge losses; 2) outperforms comparable logistic and max margin models on larger scale benchmark problems; 3) when combined with Gaussian- Laplacian mixture prior on parameters the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and 4) when integrated into a probabilistic structured prediction technique our approach provides more accurate probabilities yielding improved inference and increasing information extraction performance.