Xiangyu Wu

h-index 12

6 papers

60 citations

Novelty 41%

AI Score 33

Ranked #117,158 of 194,257 authors (top 60%) · #39,066 in CV (top 66%)

6 Papers

9.1 · CV · Sep 5, 2023

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

Taehoon Kim, Pyunghwan Ahn, Sangyun Kim et al. · nvidia, utoronto

In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state of the art in terms of both accuracy and fairness. Through the challenge, the image captioning models were tested using a new evaluation dataset that includes a large variety of visual concepts from many domains. No specific training data was provided for the challenge, so the challenge entries were required to adapt to new types of image descriptions that had not been seen during training. This report includes information on the newly proposed NICE dataset, evaluation methods, challenge results, and technical details of top-ranking entries. We expect that the outcomes of the challenge will contribute to the improvement of AI models on various vision-language tasks.

10.2 · CV · Aug 8, 2025 · Code

Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning

Xiangyu Wu, Feng Yu, Yang Yang et al.

The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.

2.0 · CV · Jul 5, 2024

Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge

Xiangyu Wu, Zhouyang Chi, Yang Yang et al.

In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task, where the input is an image and a question, guiding the model to answer the question and display the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage to the competition task. Finally, we designed a bounding box matching and replacing post-processing strategy to correct the model's prediction results. Our team achieved a score of 76.342 on the final leaderboard, ranking second.
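
The abstract above treats the answer as a bounding box on the image. It does not say how boxes are compared, so the following is only a generic illustration: intersection over union (IoU), a common overlap measure for a predicted and a reference box. The function name `iou` and the `(left, top, right, bottom)` box layout are assumptions made for this sketch, not details from the paper, and this is not the team's post-processing step.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (left, top, right, bottom) tuples; this layout is an
    assumption for the sketch, not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    reference = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {iou(predicted, reference):.3f}")
```

Two identical boxes give an IoU of 1, and boxes that do not overlap give 0.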

3.0 · RO · Aug 9, 2021

Model-free online motion adaptation for energy efficient flights of multicopters

Xiangyu Wu, Jun Zeng, Andrea Tagliabue et al.

Limited flight distance and time are a common problem for multicopters. We propose a method for finding the optimal speed and sideslip angle of a multicopter flying a given path to achieve either the longest flight distance or time. Since flight speed and sideslip are often free variables in multicopter path planning, they can be changed without changing the mission. The proposed method is based on a novel multivariable extremum seeking controller with adaptive step size, which is inspired by recent work from the machine learning community on stochastic optimization. Our method (a) does not require a power consumption model of the vehicle, (b) is computationally efficient and runs on low-cost embedded computers in real time, and (c) converges faster than the standard extremum seeking controller with constant step size. We prove the stability of this approach and validate it through outdoor experiments. The method is shown to converge with different payloads and in the presence of wind. Compared to flying at the maximum achievable speed in the experiments with a uniformly selected random sideslip angle, flying at the optimal range speed and sideslip on average increases the flight range by 14.3% without payload and 19.4% with a box payload. In addition, compared to hovering, flying at the optimal endurance speed and sideslip increases the flight time by 7.5% without payload and 14.4% with a box payload. A video can be found at https://youtu.be/aLds8LVfogk.

7.0 · RO · Mar 6, 2020

A collision-resilient aerial vehicle with icosahedron tensegrity structure

Jiaming Zha, Xiangyu Wu, Joseph Kroeger et al.

Aerial vehicles with collision resilience can operate with more confidence in environments with obstacles that are hard to detect and avoid. This paper presents the methodology used to design a collision-resilient aerial vehicle with an icosahedron tensegrity structure. A simplified stress analysis of the tensegrity frame under impact forces is performed to guide the selection of its components. In addition, an autonomous controller is presented to reorient the vehicle from an arbitrary orientation on the ground to help it take off. Experiments show that the vehicle can successfully reorient itself after landing upside-down and can survive collisions at speeds up to 6.5 m/s.

7.0 · RO · Mar 5, 2020

In-flight range optimization of multicopters using multivariable extremum seeking with adaptive step size

Xiangyu Wu, Mark W. Mueller

Limited flight range is a common problem for multicopters. To alleviate this problem, we propose a method for finding the optimal speed and heading of a multicopter when flying a given path to achieve the longest flight range. Based on a novel multivariable extremum seeking controller with adaptive step size, the method (a) does not require any power consumption model of the vehicle, (b) can adapt to unknown disturbances, (c) can be executed online, and (d) converges faster than the standard extremum seeking controller with constant step size. We conducted indoor experiments to validate the effectiveness of this method under different payloads and initial conditions, and showed that it is able to converge more than 30% faster than the standard extremum seeking controller. This method is especially useful for applications such as package delivery, where the size and weight of the payload differ for different deliveries and the power consumption of the vehicle is hard to model.
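
To make the idea in the abstract above concrete, here is a minimal sketch of model-free extremum seeking over two inputs (speed and heading) with an adaptive step size. It is an illustration under stated assumptions, not the authors' controller. The cost function `power_cost` is a made-up quadratic standing in for measured energy per metre, and the units (m/s, degrees), dither amplitude, and gains are all hypothetical. The abstract does not name the step-size rule, so an AdaGrad-style rule is swapped in as one example of an adaptive step size from stochastic optimization. Each iteration runs one full dither period and demodulates the measured cost to estimate the gradient, which is a simplified batch variant of the usual continuous scheme.

```python
import math


def power_cost(speed, heading):
    """Hypothetical stand-in for measured energy per metre (not from the paper).

    The seeker only calls this function; it never uses its formula.
    """
    return 50.0 + 0.8 * (speed - 12.0) ** 2 + 0.05 * (heading - 20.0) ** 2


def estimate_gradient(cost, theta, amp=0.5, n_samples=20):
    """Estimate the cost gradient by sinusoidal dithering and demodulation.

    Input i is perturbed at (i + 1) cycles per dither period, so the
    perturbations are mutually orthogonal over one full period.
    """
    n = len(theta)
    grad = [0.0] * n
    for k in range(n_samples):
        dither = [math.sin(2.0 * math.pi * (i + 1) * k / n_samples) for i in range(n)]
        measured = cost(*[t + amp * d for t, d in zip(theta, dither)])
        # Correlate the measurement with each dither signal.
        for i in range(n):
            grad[i] += 2.0 * measured * dither[i] / (amp * n_samples)
    return grad


def extremum_seek(cost, theta0, lr=2.0, n_iters=500, eps=1e-8):
    """Minimise `cost` without a model, using AdaGrad-style per-input step sizes."""
    theta = list(theta0)
    grad_sq_sum = [0.0] * len(theta)
    for _ in range(n_iters):
        grad = estimate_gradient(cost, theta)
        for i, g in enumerate(grad):
            # The step for each input shrinks as its squared gradients accumulate.
            grad_sq_sum[i] += g * g
            theta[i] -= lr * g / (math.sqrt(grad_sq_sum[i]) + eps)
    return theta


if __name__ == "__main__":
    speed, heading = extremum_seek(power_cost, [6.0, 0.0])
    print(f"speed ~ {speed:.2f}, heading ~ {heading:.2f}")
```

With this noiseless toy cost, the estimate settles near the minimum at speed 12 and heading 20. A real vehicle would add measurement noise and filtering that this sketch leaves out.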