S R Das

h-index: 3

6 papers

37 citations

6 Papers

5.5 · CV · Jul 16

Unsupervised Keypoints for Real-Time Fall Detection: Comparative Analysis Under Real-world Conditions with Predictive Bandwidth Reduction

Tasmiah Haque, Jacob Kosinski, Sumit Mohan et al.

Falls among older adults are a major safety challenge, but continuous monitoring is difficult to sustain. Video captures fall-related posture and motion, yet deployment is limited by privacy, computation, and bandwidth. Supervised pose estimation is anatomically interpretable but vulnerable to occlusion and partial body visibility. We propose a privacy-preserving framework that replaces RGB transmission with compact motion representations based on unsupervised keypoints and predictive temporal modeling. Local processing performs segmentation and keypoint extraction; variational recurrent prediction and sequence classification then detect falls from observed and forecasted motion. We evaluate the framework on the UR Fall Detection and Human Fall datasets using random, subject-disjoint, and occlusion-based splits. Under random splits, neither representation consistently dominates, suggesting that standard protocols may hide meaningful differences. Under subject-disjoint evaluation, supervised keypoints show a statistically significant advantage, but performance varies by subject: they perform better when anatomical landmarks are visible, whereas unsupervised keypoints are more robust to occlusion and partial visibility, though they produce more false positives for complex activities. Under occlusion-based evaluation, supervised keypoints miss nearly half of all falls, while unsupervised keypoints retain strong sensitivity and substantially outperform them. Their anatomical independence allows spatial anchors to adapt to visible body structure rather than fail on absent landmarks. The gap widens under bandwidth constraints, where supervised localization errors compound through the temporal model. These findings show that representation choice should reflect expected visual conditions and that unsupervised keypoints offer an advantage when body visibility is compromised.
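The difference between the random and subject-disjoint protocols mentioned in the abstract can be illustrated with a minimal sketch. The clip records, field names (`subject`, `clip_id`, `fall`), and the 30% test fraction below are hypothetical and are not taken from the paper or from the UR Fall Detection or Human Fall datasets; the point is only that a subject-disjoint split holds out whole people, while a random split can place the same person in both train and test.

```python
import random
from collections import defaultdict


def random_split(clips, test_fraction=0.3, seed=0):
    """Shuffle clips and split them, ignoring which subject appears in each."""
    rng = random.Random(seed)
    shuffled = clips[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_disjoint_split(clips, test_fraction=0.3, seed=0):
    """Hold out whole subjects so no person appears in both train and test."""
    by_subject = defaultdict(list)
    for clip in clips:
        by_subject[clip["subject"]].append(clip)

    subjects = sorted(by_subject)
    rng = random.Random(seed)
    rng.shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects, train_subjects = subjects[:n_test], subjects[n_test:]

    train = [c for s in train_subjects for c in by_subject[s]]
    test = [c for s in test_subjects for c in by_subject[s]]
    return train, test


# Synthetic records (hypothetical): 5 subjects, 4 clips each.
clips = [{"subject": f"s{i % 5}", "clip_id": i, "fall": i % 2 == 0} for i in range(20)]

train, test = subject_disjoint_split(clips)
overlap = {c["subject"] for c in train} & {c["subject"] for c in test}
print(len(train), len(test), overlap)
```

An occlusion-based split would follow the same pattern, grouping by an occlusion label instead of by subject, assuming such a label is available.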

7.0 · CV · Jul 15 · Code

XCT-SAM: Sequential Parameter-Efficient Domain Adaptation of SAM for Industrial XCT Defect Segmentation

Md Mahedi Hasan, Md Mushfiqur Rahaman, Alan Pachkovskiy et al.

Defect segmentation in additive manufacturing (AM) X-ray computed tomography (XCT) images remains challenging due to severe class imbalance and large distribution shifts across scan conditions. Although recent foundation models such as the Segment Anything Model (SAM) provide strong general-purpose segmentation priors, their natural-image pre-training transfers poorly to the AM XCT domain, where defects appear as subtle non-semantic microstructural anomalies. Moreover, adapting SAM to the AM domain is further limited by the large domain gap and scarcity of labeled real XCT data. We present XCT-SAM, a sequential parameter-efficient adaptation framework for AM XCT defect segmentation. Instead of adapting SAM directly from natural images to XCT data, we first fine-tune Conv-LoRA adapters on an alloy-microstructure dataset and subsequently transfer the adapted model to XCT images, progressively bridging the domain gap. Using Conv-LoRA adapters with rank r=2, the framework injects convolutional spatial inductive bias into SAM's backbone while training approximately 4.15M parameters and keeping over 99% of the model frozen. We evaluate XCT-SAM on out-of-distribution CycleGAN-XCT benchmarks and real-world NIST XCT scans. Across both settings, XCT-SAM consistently outperforms zero-shot SAM and other domain-adapted SAM baselines, achieving the best overall IoU and Dice scores. These results demonstrate the effectiveness of intermediate domain adaptation with parameter-efficient adapters for industrial XCT defect segmentation. The source code is publicly available at https://github.com/Mahedi-61/XCT-SAM.git
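To make the "low rank, mostly frozen" idea concrete, here is a minimal sketch of a plain LoRA update on a single linear layer, written in pure Python. It is not the XCT-SAM implementation: Conv-LoRA additionally adds convolutional structure to the adapter (the abstract's "convolutional spatial inductive bias"), which this sketch omits. The class name, the `alpha` scaling, and the initialization are assumptions chosen for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A (hypothetical sketch)."""

    def __init__(self, W, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in = len(W), len(W[0])
        self.rank = rank
        self.W = W  # frozen: never updated during adaptation
        # A gets a small random init and B starts at zero, so the adapter
        # contributes nothing until training updates B.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) are trained.
        return self.rank * (self.d_in + self.d_out)

    def frozen_params(self):
        return self.d_in * self.d_out


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
layer = LoRALinear(W, rank=2)
print(layer.forward([1.0, 2.0, 3.0]), layer.trainable_params(), layer.frozen_params())
```

For a toy layer this small the adapter is not smaller than the frozen weight. The saving appears at realistic widths, where `rank * (d_in + d_out)` is far below `d_in * d_out`.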

12.1 · CV · Jul 9 · Code

Mixture of Probes: Learning from Privileged Modalities in Multimodal LLMs Through Probing

Dominick Reilly, Qiyu Wu, Hiromi Wakaki et al.

Multimodal Large Language Models (MLLMs) are typically designed under the assumption that all modalities available during training will also be accessible at inference. However, many real-world settings violate this assumption, requiring models to operate under a privileged modality setting, where auxiliary modalities are available only during training. While these modalities contain valuable information, existing MLLMs largely fail to leverage them effectively, as they treat modalities as interchangeable inputs rather than sources of complementary supervision. We propose Mixture of Probes (MoP), a novel framework that disentangles modality-specific and modality-general signals within the MLLM, allowing the model to preserve modality-dependent structure while learning transferable representations across modalities. At its core, MoP achieves this through a structured probing mechanism that extracts and organizes information from intermediate representations of a shared modality encoder, rather than relying only on final-layer alignment as done in existing MLLMs. To support this disentanglement, we further introduce MoP Cross-modal Training (MoP-X), a training strategy for MoP centered around a probe disentanglement loss that prevents probe collapse and encourages cross-modal learning. We evaluate MoP across two domains spanning eight tasks and four modalities under a comprehensive evaluation protocol tailored to the privileged modality setting, where each modality is independently treated as the sole input at inference time. MoP consistently outperforms strong MLLM baselines, achieving up to 65% relative improvement, demonstrating that auxiliary modalities, even when unavailable at inference, can provide substantial gains when effectively leveraged during training. Code, model checkpoints, and evaluation protocols will be made available at https://github.com/Sony/MoP.

23.3 · AI · Jul 8

The Harness Effect: How Orchestration Design Sets the Token Economics of Enterprise Agentic AI

Muayad Sayed Ali, Aliaksandra Novik, Anji Boddupally et al.

Agentic AI development today runs on token maxing: buying capability with tokens -- longer reasoning traces, more turns, wider tool payloads, bigger replayed contexts -- so tokens per task grow faster than task value. Falling per-token prices mask the pattern; total spend rises anyway. We argue the decisive lever against token maxing is the harness: the orchestration layer that assembles context, exposes tools, sequences turns, delegates work, and carries enterprise observability and governance. We isolate it with a controlled swap: 22 locked evaluation tasks, six foundation models (Claude Sonnet 4.6, Gemini 3.1, Gemini Flash 3.5, Qwen 3.6, GLM 5.1, Palmyra X6), changing only the orchestration layer -- a frozen conventional production loop versus the Writer Agent Harness. Holding models constant, the harness cuts blended cost per task 41% ($0.21->$0.12), median wall-clock 44% (48s->27s), and tokens per task 38% (14.2k->8.8k), with task-completion quality at parity (0.78->0.81, directional at this sample size). Efficiency is model-invariant -- every model gets cheaper (33-61%) -- while quality gains are capability-dependent: a model's gain correlates almost perfectly with its baseline strength (r=0.99, n=6), a phenomenon we term harness leverage. Quality per dollar rises 82%; task-completions per million tokens rise from 54.9 to 92.0. On this workload the orchestration layer moved cost per task more than the full spread of the model menu did. We formalize token economics at the orchestration layer (including effective input price under prompt caching), detail the six mechanism families behind the effect -- cache-shape discipline to failure-spend governance -- compare six widely used agent systems on the same axes, and argue the harness is the one component whose efficiency multiplies across every model an organization runs -- present and future.
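The headline ratios in this abstract can be re-derived from the rounded figures it quotes. The sketch below assumes that "quality per dollar" means the task-completion quality score divided by cost per task, and that "task-completions per million tokens" means the quality score divided by tokens per task, scaled to one million; neither definition is stated in the abstract, and the dictionary keys and function names are hypothetical.

```python
def pct_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0


# Rounded per-task figures quoted in the abstract.
baseline = {"cost_usd": 0.21, "latency_s": 48.0, "tokens": 14_200, "quality": 0.78}
harness = {"cost_usd": 0.12, "latency_s": 27.0, "tokens": 8_800, "quality": 0.81}


def quality_per_dollar(m):
    # Assumed definition: quality score per dollar spent on one task.
    return m["quality"] / m["cost_usd"]


def completions_per_million_tokens(m):
    # Assumed definition: quality score per token, scaled to one million tokens.
    return m["quality"] / m["tokens"] * 1_000_000


for key in ("cost_usd", "latency_s", "tokens"):
    print(key, round(pct_change(baseline[key], harness[key]), 1))

print("quality per dollar change (%):",
      round(pct_change(quality_per_dollar(baseline), quality_per_dollar(harness)), 1))
print("completions per 1M tokens:",
      round(completions_per_million_tokens(baseline), 1),
      round(completions_per_million_tokens(harness), 1))
```

Under these assumed definitions the rounded inputs reproduce the reported 38% token reduction, the 44% wall-clock reduction, the 82% rise in quality per dollar, and the 54.9 to 92.0 completions per million tokens. The cost reduction comes out near 43% rather than the reported 41%, which is presumably an effect of rounding the dollar figures to two decimals.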

7.2 · CV · Jul 9

3D FaceShell: Attribute Transfer in 3D Face Avatars as a VLM Defense Mechanism

Weston Bondurant, Srijan Das, Hieu Le et al.

Photorealistic 3D face avatars are increasingly deployed as reusable digital assets across applications such as telepresence, animation, and personalized media. At the same time, vision-language models (VLMs) can infer sensitive attributes from rendered images with open-ended semantic reasoning without any fine-tuning. This creates a new privacy challenge: once a 3D face avatar is shared, any of its renderings can be analyzed to extract high-level facial attributes. Existing defenses largely operate in 2D image space and do not address identity-preserving semantic manipulation of 3D facial representations. We propose 3D FaceShell, a framework for steering VLM interpretations of faces rendered from 3D models while preserving geometric fidelity and facial identity. 3D FaceShell augments the original 3D representation with a learnable Gaussian shell that produces subtle, spatially distributed perturbations optimized through multi-view embedding alignment. The perturbations are designed to be visually inconspicuous yet sufficient to redirect VLM-based attribute inference in a view-consistent manner. Extensive experiments on reconstructed celebrity face avatars and multiple black-box VLMs demonstrate that 3D FaceShell significantly increases attribute injection and mismatch rates while maintaining high perceptual similarity and identity consistency. Our results show that it is possible to manipulate VLM-level semantic interpretation of 3D faces without compromising their human-recognizable appearance.

1.8 · LG · Jan 30, 2022

Machine learning based modelling and optimization in hard turning of AISI D6 steel with newly developed AlTiSiN coated carbide tool

A Das, S R Das, J P Panda et al.

Mechanical and production industries face increasing challenges in the shift toward sustainable manufacturing. In this article, machining was performed under dry cutting conditions with a newly developed AlTiSiN-coated carbide insert, coated through a scalable pulsed power plasma technique, and a dataset was generated for different machining parameters and output responses. The machining parameters are speed, feed, and depth of cut; the output responses are surface roughness, cutting force, crater wear length, crater wear width, and flank wear. The data collected from the machining operation are used to develop machine learning (ML) based surrogate models to test, evaluate, and optimize the input machining parameters. Different ML approaches, such as polynomial regression (PR), random forest (RF) regression, gradient boosted (GB) trees, and adaptive boosting (AB) based regression, are used to model the output responses in the hard machining of AISI D6 steel. The surrogate models for the different output responses are then combined into a complex objective function for germinal center algorithm-based optimization of the machining parameters of the hard turning operation.