Nilam Ram

CL
h-index: 6
7 papers
78 citations
Novelty: 44%
AI Score: 47

7 Papers

CL | Mar 6 | Code
Learning Next Action Predictors from Human-Computer Interaction

Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi et al.

Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts: it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
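
The thresholded metric in this abstract (share of predictions with an LLM-judge score of at least 0.5, overall and after keeping only confident predictions) can be sketched as below. This is only an illustration: the function names, the per-prediction confidence value, the 0.7 confidence cutoff, and the sample data are all assumptions, not details from the paper.

```python
from typing import List, Tuple


def aligned_fraction(scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of predictions whose judge score is at least `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s >= threshold for s in scores) / len(scores)


def filter_confident(
    preds: List[Tuple[float, float]], min_confidence: float
) -> List[float]:
    """Keep judge scores of predictions with confidence >= min_confidence.

    Each item is a (judge_score, confidence) pair. How confidence is
    obtained is not stated in the abstract, so it is hypothetical here.
    """
    return [score for score, conf in preds if conf >= min_confidence]


# Made-up (judge_score, confidence) pairs, for illustration only.
preds = [(0.9, 0.8), (0.2, 0.3), (0.6, 0.9), (0.1, 0.7), (0.4, 0.2)]

overall = aligned_fraction([score for score, _ in preds])
confident = aligned_fraction(filter_confident(preds, min_confidence=0.7))
```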

82.1 | HC | Mar 13
Daily Affect Fluctuations in Phone Screen Content Predict Anxiety and Depressive Symptoms

Christopher A. Kelly, Yikun Chi, Nicholas Haber et al.

The relationship between digital media use and mental health remains poorly understood, in part because real-world digital behavior is rarely captured at scale. This intensive longitudinal study tracked participants' complete natural smartphone interactions over one year. We collected screenshots every 5 seconds from 145 adults (yielding 111 million screenshots), alongside biweekly assessments of anxiety and depression (mean = 24 surveys). The valence and arousal of each screenshot were assessed using a deep learning affect model. Individuals showed highly idiosyncratic media patterns, with substantially more variance in anxiety and depression accounted for within-person than between-person. Day-to-day fluctuations in the valence and arousal of a person's screen content predicted subsequent changes in depression and anxiety, whereas between-person differences did not. Specifically, greater exposure to low-arousal negative content was associated with higher depression and anxiety. These findings underscore the dynamic, idiosyncratic nature of digital consumption and the need for targeted measurement and intervention.
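
The within-person versus between-person distinction drawn in this abstract is commonly handled by person-mean centering: each observation is split into that person's average level and a deviation from it. The sketch below shows that general technique on made-up values. It is not the authors' analysis code, and the function name and data layout are assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def person_mean_center(
    observations: List[Tuple[str, float]],
) -> Tuple[Dict[str, float], List[Tuple[str, float]]]:
    """Split (person_id, value) observations into two components.

    Returns:
      person_means: the between-person component (one mean per person).
      deviations: the within-person component (value minus that
        person's own mean), in the original observation order.
    """
    by_person: Dict[str, List[float]] = defaultdict(list)
    for person_id, value in observations:
        by_person[person_id].append(value)

    person_means = {pid: mean(vals) for pid, vals in by_person.items()}
    deviations = [(pid, value - person_means[pid]) for pid, value in observations]
    return person_means, deviations


# Hypothetical daily valence scores for two participants.
obs = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]
person_means, deviations = person_mean_center(obs)
```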

IR | Jan 4, 2018 | Code
Text Extraction and Retrieval from Smartphone Screenshots: Building a Repository for Life in Media

Agnese Chiatti, Mu Jung Cho, Anupriya Gagneja et al.

Daily engagement in life experiences is increasingly interwoven with mobile device use. Screen capture at the scale of seconds is being used in behavioral studies and to implement "just-in-time" health interventions. The increasing psychological breadth of digital information will continue to make the actual screens that people view a preferred if not required source of data about life experiences. Effective and efficient Information Extraction and Retrieval from digital screenshots is a crucial prerequisite to successful use of screen data. In this paper, we present the experimental workflow we exploited to: (i) pre-process a unique collection of screen captures, (ii) extract unstructured text embedded in the images, (iii) organize image text and metadata based on a structured schema, (iv) index the resulting document collection, and (v) allow for Image Retrieval through a dedicated vertical search engine application. The adopted procedure integrates different open source libraries for traditional image processing, Optical Character Recognition (OCR), and Image Retrieval. Our aim is to assess whether and how state-of-the-art methodologies can be applied to this novel data set. We show how combining OpenCV-based pre-processing modules with a long short-term memory (LSTM)-based release of Tesseract OCR, without ad hoc training, led to a 74% character-level accuracy of the extracted text. Further, we used the processed repository as a baseline for a dedicated Image Retrieval system, for immediate use by behavioral and prevention scientists. We discuss issues of Text Information Extraction and Retrieval that are particular to the screenshot image case and suggest important future work.
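
The abstract reports character-level accuracy but does not say how it was computed. One common definition, assumed here, is one minus the edit distance between the OCR output and a reference transcript, divided by the reference length. The sketch below implements that definition and is not necessarily the paper's metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete char_a
                    curr[j - 1] + 1,     # insert char_b
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return max(0.0, 1.0 - levenshtein(reference, hypothesis) / len(reference))
```

For instance, `char_accuracy("abcd", "abxd")` counts one substitution against four reference characters.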

49.6 | HC | Apr 24
Within-person prediction of depressive symptom change using year-long Screenome data and CES-D assessments

Merve Cerit, Andrea Mock, Vryan Almanon Feliciano et al.

Predicting whether an individual's depressive symptoms will worsen, remain stable, or improve over the coming weeks can enable earlier and more targeted care, yet prospective within-person trajectory prediction remains largely unaddressed in digital phenotyping. We combine fortnightly CES-D assessments with over 100 million screenshots captured every five seconds via the Stanford Screenomics platform from 96 adults followed for approximately one year (M = 20.9, SD = 3.9 assessments per participant, 2,002 total observations). We frame prediction as a within-person classification task: whether symptoms will worsen, remain stable, or improve over the subsequent fortnight, operationalized in three ways to capture clinically meaningful change. Under temporal holdout, XGBoost achieves an AUC of 0.906 for crossings of established CES-D severity bands and 0.755 for change relative to each participant's own within-person variability, generalizing to unseen individuals (AUC = 0.821). Each person's typical symptom level was the only statistically significant predictor above the most recent CES-D score; without it, the most consequential worsening transitions go undetected. Screenome-derived behavioral features revealed prodromal patterns of worsening, including escalating social media use, fragmented device engagement, and changes in overnight activity, with substantial individual heterogeneity. These findings establish a proof-of-concept foundation for monitoring systems that could identify individuals approaching clinical deterioration before symptoms reach a crisis point.
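
One of the three operationalizations mentioned in this abstract defines change relative to each participant's own variability. A minimal sketch of that idea follows, assuming a rule in which a fortnight-to-fortnight change larger than `k` within-person standard deviations counts as worsening or improving. The rule, the value of `k`, and the function name are assumptions, since the abstract does not give the exact definition.

```python
from statistics import stdev
from typing import List


def label_changes(scores: List[float], k: float = 1.0) -> List[str]:
    """Label each consecutive change in one person's CES-D scores.

    Higher CES-D means more depressive symptoms, so an increase larger
    than k * SD is "worsen" and a decrease larger than k * SD is
    "improve"; anything else is "stable".

    Note: the SD here uses the person's full series, which is a
    simplification. A prospective model would estimate it from past
    assessments only.
    """
    if len(scores) < 2:
        raise ValueError("need at least two assessments")
    sd = stdev(scores)
    labels = []
    for previous, current in zip(scores, scores[1:]):
        delta = current - previous
        if delta > k * sd:
            labels.append("worsen")
        elif delta < -k * sd:
            labels.append("improve")
        else:
            labels.append("stable")
    return labels


# Hypothetical fortnightly scores for one participant (SD = 4).
example_labels = label_changes([10, 11, 20, 12, 12])
```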

CL | Oct 14, 2024
Personality Differences Drive Conversational Dynamics: A High-Dimensional NLP Approach

Julia R. Fischer, Nilam Ram

This paper investigates how the topical flow of dyadic conversations emerges over time and how differences in interlocutors' personality traits contribute to this topical flow. Leveraging text embeddings, we map the trajectories of $N = 1655$ conversations between strangers into a high-dimensional space. Using nonlinear projections and clustering, we then identify when each interlocutor enters and exits various topics. Differences in conversational flow are quantified via $\textit{topic entropy}$, a summary measure of the "spread" of topics covered during a conversation, and $\textit{linguistic alignment}$, a time-varying measure of the cosine similarity between interlocutors' embeddings. Our findings suggest that interlocutors with a larger difference in the personality dimension of openness influence each other to spend more time discussing a wider range of topics and that interlocutors with a larger difference in extraversion experience a larger decrease in linguistic alignment throughout their conversation. We also examine how participants' affect (emotion) changes from before to after a conversation, finding that a larger difference in extraversion predicts a larger difference in affect change and that a greater topic entropy predicts a larger affect increase. This work demonstrates how communication research can be advanced through the use of high-dimensional NLP methods and identifies personality difference as an important driver of social influence.
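
The two summary measures named in this abstract can be illustrated as follows, assuming that topic entropy is the Shannon entropy of the share of turns spent in each topic and that linguistic alignment at a given moment is the cosine similarity between the two interlocutors' embedding vectors. The paper's exact formulas may differ, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def topic_entropy(topic_labels: List[str]) -> float:
    """Shannon entropy (in bits) of the topic distribution over turns."""
    if not topic_labels:
        raise ValueError("topic_labels must be non-empty")
    total = len(topic_labels)
    entropy = 0.0
    for count in Counter(topic_labels).values():
        share = count / total
        entropy -= share * math.log2(share)
    return entropy


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)
```

Under this definition, a conversation that stays on one topic has entropy 0, while one that splits its turns evenly between two topics has entropy 1 bit.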

CV | Jan 9, 2019
Guess What's on my Screen? Clustering Smartphone Screenshots with Active Learning

Agnese Chiatti, Dolzodmaa Davaasuren, Nilam Ram et al.

A significant proportion of individuals' daily activities is experienced through digital devices. Smartphones in particular have become one of the preferred interfaces for content consumption and social interaction. Identifying the content embedded in frequently captured smartphone screenshots is thus a crucial prerequisite to studies of media behavior and health intervention planning that analyze activity interplay and content switching over time. Screenshot images can depict heterogeneous contents and applications, making the a priori definition of adequate taxonomies a cumbersome task, even for humans. Privacy protection of the sensitive data captured on screens means the costs associated with manual annotation are large, as the effort cannot be crowd-sourced. Thus, there is a need to examine the utility of unsupervised and semi-supervised methods for digital screenshot classification. This work examines the implications of applying clustering to large screenshot sets when only a limited number of labels is available. In this paper we develop a framework for combining K-Means clustering with Active Learning for efficient leveraging of labeled and unlabeled samples, with the goal of discovering latent classes and describing a large collection of screenshot data. We tested whether SVM-embedded or XGBoost-embedded solutions for class probability propagation provide better-formed cluster configurations. Visual and textual vector representations of the screenshot images are derived and combined to assess the relative contribution of multi-modal features to the overall performance.
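
The general idea of pairing K-Means with an active-learning query step can be sketched as below. This is a simplified stand-in, not the paper's framework: it uses the margin between a point's nearest and second-nearest centroid as the uncertainty signal, whereas the paper propagates class probabilities with SVM or XGBoost. The function names, fixed initial centroids, and toy 2-D points are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(
    points: List[Point], init_centroids: List[Point], iters: int = 10
) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iters):
        assignment = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def query_most_ambiguous(
    points: List[Point], centroids: List[List[float]], n_queries: int
) -> List[int]:
    """Indices of the points to send to a human annotator first.

    Ambiguity is the gap between the distances to the two closest
    centroids (requires at least two centroids); a small gap means the
    cluster membership is uncertain.
    """
    margins = []
    for index, point in enumerate(points):
        distances = sorted(math.dist(point, c) for c in centroids)
        margins.append((distances[1] - distances[0], index))
    return [index for _, index in sorted(margins)[:n_queries]]


# Toy feature vectors: two tight groups plus one in-between point.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (4.0, 0.5)]
centroids, assignment = kmeans(points, init_centroids=[(0.0, 0.0), (10.0, 0.0)])
to_label = query_most_ambiguous(points, centroids, n_queries=1)
```

In a full loop, the labels obtained for the queried points would be fed back to refine the cluster or class definitions before the next round of queries.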

SI | Apr 11, 2014
On the Ground Validation of Online Diagnosis with Twitter and Medical Records

Todd Bodnar, Victoria C Barclay, Nilam Ram et al.

Social media has been considered as a data source for tracking disease. However, most analyses are based on models that prioritize strong correlation with population-level disease rates over determining whether or not specific individual users are actually sick. Taking a different approach, we develop a novel system for social-media-based disease detection at the individual level using a sample of professionally diagnosed individuals. Specifically, we develop a system for making an accurate influenza diagnosis based on an individual's publicly available Twitter data. We find that about half (17/35 = 48.57%) of the users in our sample who were sick explicitly discuss their disease on Twitter. By developing a meta classifier that combines text analysis, anomaly detection, and social network analysis, we are able to diagnose an individual with greater than 99% accuracy even if she does not discuss her health.