AIMar 6
DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical SettingsZiwen Chen, Zhendong Wang, Chongjing Wang et al.
Dermatologic diseases impose a large and growing global burden, affecting billions and substantially reducing quality of life. While modern therapies can rapidly control acute symptoms, long-term outcomes are often limited by single-target paradigms, recurrent courses, and insufficient attention to systemic comorbidities. Traditional Chinese medicine (TCM) provides a complementary holistic approach via syndrome differentiation and individualized treatment, but practice is hindered by non-standardized knowledge, incomplete multimodal records, and poor scalability of expert reasoning. We propose DERM-3R, a resource-efficient multimodal agent framework to model TCM dermatologic diagnosis and treatment under limited data and compute. Based on real-world workflows, we reformulate decision-making into three core issues: fine-grained lesion recognition, multi-view lesion representation with specialist-level pathogenesis modeling, and holistic reasoning for syndrome differentiation and treatment planning. DERM-3R comprises three collaborative agents: DERM-Rec, DERM-Rep, and DERM-Reason, each targeting one component of this pipeline. Built on a lightweight multimodal LLM and partially fine-tuned on 103 real-world TCM psoriasis cases, DERM-3R performs strongly across dermatologic reasoning tasks. Evaluations using automatic metrics, LLM-as-a-judge, and physician assessment show that despite minimal data and parameter updates, DERM-3R matches or surpasses large general-purpose multimodal models. These results suggest structured, domain-aware multi-agent modeling can be a practical alternative to brute-force scaling for complex clinical tasks in dermatology and integrative medicine.
CVDec 18, 2024
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized DataHanwen Jiang, Zexiang Xu, Desai Xie et al.
We propose scaling up 3D scene reconstruction by training with synthesized data. At the core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K scenes - over 50 times larger than the prior real dataset DL3DV - dramatically scaling the training data. To enable scalable data generation, our key idea is eliminating semantic information, removing the need to model complex semantic priors such as object affordances and scene composition. Instead, we model scenes with basic spatial structures and geometry primitives, offering scalability. Besides, we control data complexity to facilitate training while loosely aligning it with real-world data distribution to benefit real-world generalization. We explore training LRMs with both MegaSynth and available real data. Experiment results show that joint training or pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB PSNR across diverse image domains. Moreover, models trained solely on MegaSynth perform comparably to those trained on real data, underscoring the low-level nature of 3D reconstruction. Additionally, we provide an in-depth analysis of MegaSynth's properties for enhancing model capability, training stability, and generalization, as well as application to other tasks.
AISep 23, 2025
FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule ReasoningZiwen Chen, Zhong Wang
The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, a 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path towards automated referee assistance in foil fencing and new opportunities for AI applications, such as coaching in the field of fencing.
GRApr 2, 2025
Generating 360° Video is What You Need For a 3D SceneZhaoyang Zhang, Yannick Hold-Geoffroy, Miloš Hašan et al.
Generating 3D scenes is still a challenging task due to the lack of readily available scene data. Most existing methods only produce partial scenes and provide limited navigational freedom. We introduce a practical and scalable solution that uses 360° video as an intermediate scene representation, capturing the full-scene context and ensuring consistent visual content throughout the generation. We propose WorldPrompter, a generative pipeline that synthesizes traversable 3D scenes from text prompts. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model, trained with a mix of image and video data, achieves convincing spatial and temporal consistency for static scenes. This is validated by an average COLMAP matching rate of 94.6\%, allowing for high-quality panoramic Gaussian splat reconstruction and improved navigation throughout the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.