CVNov 23, 2023Code
PointPCA+: Extending PointPCA objective quality assessment metricXuemei Zhou, Evangelos Alexiou, Irene Viola et al.
A computationally-simplified and descriptor-richer Point Cloud Quality Assessment (PCQA) metric, namely PointPCA+, is proposed in this paper, which is an extension of PointPCA. PointPCA proposed a set of perceptually-relevant descriptors based on PCA decomposition that were applied to both the geometry and texture data of point clouds for full reference PCQA. PointPCA+ employs PCA only on the geometry data while enriching existing geometry and texture descriptors, that are computed more efficiently. Similarly to PointPCA, a total quality score is obtained through a learning-based fusion of individual predictions from geometry and texture descriptors that capture local shape and appearance properties, respectively. Before feature fusion, a feature selection module is introduced to choose the most effective features from a proposed super-set. Experimental results show that PointPCA+ achieves high predictive performance against subjective ground truth scores obtained from publicly available datasets. The code is available at \url{https://github.com/cwi-dis/pointpca_suite/}.
HCMay 14
Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted NewsPooja Prajod, Hannes Cools, Thomas Röggla et al.
As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers' attentional and cognitive load remains underexplored. To address this gap, we conducted a $3\times2\times2$ mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or ``detail-on-demand'' designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers' attentional patterns and news context.
HCJan 14
Full Disclosure, Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers' TrustPooja Prajod, Hannes Cools, Thomas Röggla et al.
As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a ``transparency dilemma'', where disclosure reduces readers' trust. However, little is known about how the \textit{level of detail} in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3$\times$2$\times$2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.
MMNov 24, 2021Code
PointPCA: Point Cloud Objective Quality Assessment Using PCA-Based DescriptorsEvangelos Alexiou, Xuemei Zhou, Irene Viola et al.
Point clouds denote a prominent solution for the representation of 3D photo-realistic content in immersive applications. Similarly to other imaging modalities, quality predictions for point cloud contents are vital for a wide range of applications, enabling trade-off optimizations between data quality and data size in every processing step from acquisition to rendering. In this work, we focus on use cases that consider human end-users consuming point cloud contents and, hence, we concentrate on visual quality metrics. In particular, we propose a set of perceptually relevant descriptors based on Principal Component Analysis (PCA) decomposition, which is applied to both geometry and texture data for full-reference point cloud quality assessment. Statistical features are derived from these descriptors to characterize local shape and appearance properties for both a reference and a distorted point cloud. The extracted statistical features are subsequently compared to provide corresponding predictions of visual quality for the distorted point cloud. As part of our method, a learning-based approach is proposed to fuse these individual predictors to a unified perceptual score. We validate the accuracy of the individual predictors, as well as the unified quality scores obtained after regression against subjectively annotated datasets, showing that our metric outperforms state-of-the-art solutions. Insights regarding design decisions are provided through exploratory studies, evaluating the performance of our metric under different parameter configurations, attribute domains, color spaces, and regression models. A software implementation of the proposed metric is made available at the following link: https://github.com/cwi-dis/pointpca.
CVOct 14, 2021Code
HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive MediaAnargyros Chatzitofis, Leonidas Saroglou, Prodromos Boutis et al.
We introduce HUMAN4D, a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and $2$ male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric and audio data. Despite the existence of multi-view color datasets captured with the use of hardware (HW) synchronization, to the best of our knowledge, HUMAN4D is the first and only public resource that provides volumetric depth maps with high synchronization precision due to the use of intra- and inter-sensor HW-SYNC. Moreover, a spatio-temporally aligned scanned and rigged 3D character complements HUMAN4D to enable joint research on time-varying and high-quality dynamic meshes. We provide evaluation baselines by benchmarking HUMAN4D with state-of-the-art human pose estimation and 3D compression methods. For the former, we apply 2D and 3D pose estimation algorithms both on single- and multi-view data cues. For the latter, we benchmark open-source 3D codecs on volumetric data respecting online volumetric video encoding and steady bit-rates. Furthermore, qualitative and quantitative visual comparison between mesh-based volumetric data reconstructed in different qualities showcases the available options with respect to 4D representations. HUMAN4D is introduced to the computer vision and graphics research communities to enable joint research on spatio-temporally aligned pose, volumetric, mRGBD and audio data cues. The dataset and its code are available https://tofis.github.io/myurls/human4d.
CLMay 16, 2024
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed RealityJiahuan Pei, Irene Viola, Haochen Huang et al.
Autonomous artificial intelligence (AI) agents have emerged as promising protocols for automatically understanding the language-based environment, particularly with the exponential development of large language models (LLMs). However, a fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling agents to decide their actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow served by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Last, we present several prevailing open-resource LLMs as benchmarks, assessing their performance with and without fine-tuning on the proposed dataset. We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both AI and HCI communities.
AIJul 7, 2025
LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly AssistantsHaochen Huang, Jiahuan Pei, Mohammad Aliannejadi et al.
Vision-language models (VLMs) are facing the challenges of understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction-following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL, under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54\% on state detection, highlighting gaps in fine-grained visual understanding. We release the benchmark, codebase, and generation pipeline to support future research on multimodal assembly assistants grounded in real-world workflows.
MMJan 19, 2022
On the impact of VR assessment on the Quality of Experience of Highly Realistic Digital HumansIrene Viola, Shishir Subramanyam, Jie Li et al.
Fuelled by the increase in popularity of virtual and augmented reality applications, point clouds have emerged as a popular 3D format for acquisition and rendering of digital humans, thanks to their versatility and real-time capabilities. Due to technological constraints and real-time rendering limitations, however, the visual quality of dynamic point cloud contents is seldom evaluated using virtual and augmented reality devices, instead relying on prerecorded videos displayed on conventional 2D screens. In this study, we evaluate how the visual quality of point clouds representing digital humans is affected by compression distortions. In particular, we compare three different viewing conditions based on the degrees of freedom that are granted to the viewer: passive viewing (2DTV), head rotation (3DoF), and rotation and translation (6DoF), to understand how interacting in the virtual space affects the perception of quality. We provide both quantitative and qualitative results of our evaluation involving 78 participants, and we make the data publicly available. To the best of our knowledge, this is the first study evaluating the quality of dynamic point clouds in virtual reality, and comparing it to traditional viewing settings. Results highlight the dependency of visual quality on the content under test, and limitations in the way current data sets are used to evaluate compression solutions. Moreover, influencing factors in quality evaluation in VR, and shortcomings in how point cloud encoding solutions handle visually-lossless compression, are discussed.
HCDec 17, 2021
Extending 3-DoF Metrics to Model User Behaviour Similarity in 6-DoF Immersive ApplicationsSilvia Rossi, Irene Viola, Laura Toni et al.
Immersive reality technologies, such as Virtual and Augmented Reality, have ushered a new era of user-centric systems, in which every aspect of the coding--delivery--rendering chain is tailored to the interaction of the users. Understanding the actual interactivity and behaviour of the users is still an open challenge and a key step to enabling such a user-centric system. Our main goal is to extend the applicability of existing behavioural methodologies for studying user navigation in the case of 6 Degree-of-Freedom (DoF). Specifically, we first compare the navigation in 6-DoF with its 3-DoF counterpart highlighting the main differences and novelties. Then, we define new metrics aimed at better modelling behavioural similarities between users in a 6-DoF system. We validate and test our solutions on real navigation paths of users interacting with dynamic volumetric media in 6-DoF Virtual Reality conditions. Our results show that metrics that consider both user position and viewing direction better perform in detecting user similarity while navigating in a 6-DoF system. Having easy-to-use but robust metrics that underpin multiple tools and answer the question ``how do we detect if two users look at the same content?" open the gate to new solutions for a user-centric system.
MMApr 11, 2021
Towards SocialVR: Evaluating a Novel Technology for Watching Videos TogetherMario Montagud, Jie Li, Gianluca Cernigliario et al.
Social VR enables people to interact over distance with others in real-time. It allows remote people, typically represented as avatars, to communicate and perform activities together in a join shared virtual environment, extending the capabilities of traditional social platforms like Facebook and Netflix. This paper explores the benefits and drawbacks provided by a lightweight and low-cost Social VR platform (SocialVR), in which users are captured by several cameras and reconstructed in real-time. In particular, the paper contributes with (1) the design and evaluation of an experimental protocol for Social VR experiences; (2) the report of a production workflow for this new type of media experiences; and (3) the results of experiments with both end-users (N=15 pairs) and professionals (N=25) to evaluate the potential of the SocialVR platform. Results from the questionnaires and semi-structured interviews show that end-users rated positively towards the experiences provided by the SocialVR platform, which enabled them to sense emotions and communicate effortlessly. End-users perceived the photo-realistic experience of SocialVR similar to face-to-face scenarios and appreciated this new creative medium. From a commercial perspective, professionals confirmed the potential of this communication medium and encourage further research for the adoption of the platform in the commercial landscape
HCJan 13, 2021
Proxemics and Social Interactions in an Instrumented Virtual Reality WorkshopJulie Williamson, Jie Li, Vinoba Vinayagamoorthy et al.
Virtual environments (VEs) can create collaborative and social spaces, which are increasingly important in the face of remote work and travel reduction. Recent advances, such as more open and widely available platforms, create new possibilities to observe and analyse interaction in VEs. Using a custom instrumented build of Mozilla Hubs to measure position and orientation, we conducted an academic workshop to facilitate a range of typical workshop activities. We analysed social interactions during a keynote, small group breakouts, and informal networking/hallway conversations. Our mixed-methods approach combined environment logging, observations, and semi-structured interviews. The results demonstrate how small and large spaces influenced group formation, shared attention, and personal space, where smaller rooms facilitated more cohesive groups while larger rooms made small group formation challenging but personal space more flexible. Beyond our findings, we show how the combination of data and insights can fuel collaborative spaces' design and deliver more effective virtual workshops.
MMSep 13, 2012
Surveying the Social, Smart and Converged TV Landscape: Where is Television Research Headed?Marie-Jose Montpetit, Pablo Cesar, Maja Matijasevic et al.
The TV is dead motto of just a few years ago has been replaced by the prospect of Internet Protocol (IP) television experiences over converged networks to become one of the great technology opportunities in the next few years. As an introduction to the Special Issue on Smart, Social and Converged Television, this extended editorial intends to review the current IP television landscape in its many realizations: operator-based, over-the-top, and user generated. We will address new services like social TV and recommendation engines, dissemination including new paradigms built on peer to peer and content centric networks, as well as the all important quality of experience that challenges services and networks alike. But we intend to go further than just review the existing work by proposing areas for the future of television research. These include strategies to provide services that are more efficient in network and energy usage while being socially engaging, novel services that will provide consumers with a broader choice of content and devices, and metrics that will enable operators and users alike to define the level of service they require or that they are ready to provide. These topics are addressed in this survey paper that attempts to create a unifying framework to link them all together. Not only is television not dead, it is well alive, thriving and fostering innovation and this paper will hopefully prove it.