Zhuohao Chen

CL
9papers
440citations
Novelty44%
AI Score45

9 Papers

CLOct 25, 2022
Leveraging Open Data and Task Augmentation to Automated Behavioral Coding of Psychotherapy Conversations in Low-Resource Scenarios

Zhuohao Chen, Nikolaos Flemotomos, Zac E. Imel et al.

In psychotherapy interactions, the quality of a session is assessed by codifying the communicative behaviors of participants during the conversation through manual observation and annotation. Developing computational approaches for automated behavioral coding can reduce the burden on human coders and facilitate the objective evaluation of the intervention. In the real world, however, implementing such algorithms is associated with data sparsity challenges since privacy concerns lead to limited available in-domain data. In this paper, we leverage a publicly available conversation-based dataset and transfer knowledge to the low-resource behavioral coding task by performing an intermediate language model training via meta-learning. We introduce a task augmentation method to produce a large number of "analogy tasks" - tasks similar to the target one - and demonstrate that the proposed framework predicts target behaviors more accurately than all the other baseline models.

60.7CVMay 14Code
Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Zhuohao Chen, Zeng Li, Yifei Zhang et al.

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2\% average accuracy on the challenging Union14M benchmark and 96.7\% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP

CLMar 1
Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation

Liwen Sun, Xiang Yu, Ming Tan et al.

Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce a Knowledge Graph-augmented LLM with active in-context learning to generate relevant and important follow-up questions, KG-Followup, serving as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks in recall.

CLJun 15, 2021
An Automated Quality Evaluation Framework of Psychotherapy Conversations with Local Quality Estimates

Zhuohao Chen, Nikolaos Flemotomos, Karan Singla et al.

Text-based computational approaches for assessing the quality of psychotherapy are being developed to support quality assurance and clinical training. However, due to the long durations of typical conversation based therapy sessions, and due to limited annotated modeling resources, computational methods largely rely on frequency-based lexical features or dialogue acts to assess the overall session level characteristics. In this work, we propose a hierarchical framework to automatically evaluate the quality of transcribed Cognitive Behavioral Therapy (CBT) interactions. Given the richly dynamic nature of the spoken dialog within a talk therapy session, to evaluate the overall session level quality, we propose to consider modeling it as a function of local variations across the interaction. To implement that empirically, we divide each psychotherapy session into conversation segments and initialize the segment-level qualities with the session-level scores. First, we produce segment embeddings by fine-tuning a BERT-based model, and predict segment-level (local) quality scores. These embeddings are used as the lower-level input to a Bidirectional LSTM-based neural network to predict the session-level (global) quality estimates. In particular, we model the global quality as a linear function of the local quality scores, which allows us to update the segment-level quality estimates based on the session-level quality prediction. These newly estimated segment-level scores benefit the BERT fine-tuning process, which in turn results in better segment embeddings. We evaluate the proposed framework on automatically derived transcriptions from real-world CBT clinical recordings to predict session-level behavior codes. The results indicate that our approach leads to improved evaluation accuracy for most codes when used for both regression and classification tasks.

CLFeb 23, 2021
Automated Quality Assessment of Cognitive Behavioral Therapy Sessions Through Highly Contextualized Language Representations

Nikolaos Flemotomos, Victor R. Martinez, Zhuohao Chen et al.

During a psychotherapy session, the counselor typically adopts techniques which are codified along specific dimensions (e.g., 'displays warmth and confidence', or 'attempts to set up collaboration') to facilitate the evaluation of the session. Those constructs, traditionally scored by trained human raters, reflect the complex nature of psychotherapy and highly depend on the context of the interaction. Recent advances in deep contextualized language models offer an avenue for accurate in-domain linguistic representations which can lead to robust recognition and scoring of such psychotherapy-relevant behavioral constructs, and support quality assurance and supervision. In this work, we propose a BERT-based model for automatic behavioral scoring of a specific type of psychotherapy, called Cognitive Behavioral Therapy (CBT), where prior work is limited to frequency-based language features and/or short text excerpts which do not capture the unique elements involved in a spontaneous long conversational interaction. The model focuses on the classification of therapy sessions with respect to the overall score achieved on the widely-used Cognitive Therapy Rating Scale (CTRS), but is trained in a multi-task manner in order to achieve higher interpretability. BERT-based representations are further augmented with available therapy metadata, providing relevant non-linguistic context and leading to consistent performance improvements. We train and evaluate our models on a set of 1,118 real-world therapy sessions, recorded and automatically transcribed. Our best model achieves an F1 score equal to 72.61% on the binary classification task of low vs. high total CTRS.

ASFeb 22, 2021
Automated Evaluation Of Psychotherapy Skills Using Speech And Language Technologies

Nikolaos Flemotomos, Victor R. Martinez, Zhuohao Chen et al.

With the growing prevalence of psychological interventions, it is vital to have measures which rate the effectiveness of psychological care to assist in training, supervision, and quality assurance of services. Traditionally, quality assessment is addressed by human raters who evaluate recorded sessions along specific dimensions, often codified through constructs relevant to the approach and domain. This is however a cost-prohibitive and time-consuming method that leads to poor feasibility and limited use in real-world settings. To facilitate this process, we have developed an automated competency rating tool able to process the raw recorded audio of a session, analyzing who spoke when, what they said, and how the health professional used language to provide therapy. Focusing on a use case of a specific type of psychotherapy called Motivational Interviewing, our system gives comprehensive feedback to the therapist, including information about the dynamics of the session (e.g., therapist's vs. client's talking time), low-level psychological language descriptors (e.g., type of questions asked), as well as other high-level behavioral constructs (e.g., the extent to which the therapist understands the clients' perspective). We describe our platform and its performance using a dataset of more than 5,000 recordings drawn from its deployment in a real-world clinical setting used to assist training of new therapists. Widespread use of automated psychotherapy rating tools may augment experts' capabilities by providing an avenue for more effective training and skill improvement, eventually leading to more positive clinical outcomes.

ASJul 1, 2020
Automated Empathy Detection for Oncology Encounters

Zhuohao Chen, James Gibson, Ming-Chang Chiu et al.

Empathy involves understanding other people's situation, perspective, and feelings. In clinical interactions, it helps clinicians establish rapport with a patient and support patient-centered care and decision making. Understanding physician communication through observation of audio-recorded encounters is largely carried out with manual annotation and analysis. However, manual annotation has a prohibitively high cost. In this paper, a multimodal system is proposed for the first time to automatically detect empathic interactions in recordings of real-world face-to-face oncology encounters that might accelerate manual processes. An automatic speech and language processing pipeline is employed to segment and diarize the audio as well as for transcription of speech into text. Lexical and acoustic features are derived to help detect both empathic opportunities offered by the patient, and the expressed empathy by the oncologist. We make the empathy predictions using Support Vector Machines (SVMs) and evaluate the performance on different combinations of features in terms of average precision (AP).

ASMay 15, 2020
Feature Fusion Strategies for End-to-End Evaluation of Cognitive Behavior Therapy Sessions

Zhuohao Chen, Nikolaos Flemotomos, Victor Ardulov et al.

Cognitive Behavioral Therapy (CBT) is a goal-oriented psychotherapy for mental health concerns implemented in a conversational setting with broad empirical support for its effectiveness across a range of presenting problems and client populations. The quality of a CBT session is typically assessed by trained human raters who manually assign pre-defined session-level behavioral codes. In this paper, we develop an end-to-end pipeline that converts speech audio to diarized and transcribed text and extracts linguistic features to code the CBT sessions automatically. We investigate both word-level and utterance-level features and propose feature fusion strategies to combine them. The utterance level features include dialog act tags as well as behavioral codes drawn from another well-known talk psychotherapy called Motivational Interviewing (MI). We propose a novel method to augment the word-based features with the utterance level tags for subsequent CBT code estimation. Experiments show that our new fusion strategy outperforms all the studied features, both when used individually and when fused by direct concatenation. We also find that incorporating a sentence segmentation module can further improve the overall system given the preponderance of multi-utterance conversational turns in CBT sessions.

CLMar 16, 2020
A Label Proportions Estimation Technique for Adversarial Domain Adaptation in Text Classification

Zhuohao Chen, Singla Karan, David C. Atkins et al.

Many text classification tasks are domain-dependent, and various domain adaptation approaches have been proposed to predict unlabeled data in a new domain. Domain-adversarial neural networks (DANN) and their variants have been used widely recently and have achieved promising results for this problem. However, most of these approaches assume that the label proportions of the source and target domains are similar, which rarely holds in most real-world scenarios. Sometimes the label shift can be large and the DANN fails to learn domain-invariant features. In this study, we focus on unsupervised domain adaptation of text classification with label shift and introduce a domain adversarial network with label proportions estimation (DAN-LPE) framework. The DAN-LPE simultaneously trains a domain adversarial net and processes label proportions estimation by the confusion of the source domain and the predictions of the target domain. Experiments show the DAN-LPE achieves a good estimate of the target label distributions and reduces the label shift to improve the classification performance.