Xuyang Fang

CV
h-index19
3papers
10citations
Novelty45%
AI Score41

3 Papers

CVJan 14Code
Self-Supervised Animal Identification for Long Videos

Xuyang Fang, Sion Hannuna, Edwin Simpson et al.

Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy ($>$97\%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper are \href{https://huggingface.co/datasets/tonyFang04/8-calves}{here}.

CVMar 17, 2025Code
8-Calves Image dataset

Xuyang Fang, Sion Hannuna, Neill Campbell et al.

Automated livestock monitoring is crucial for precision farming, but robust computer vision models are hindered by a lack of datasets reflecting real-world group challenges. We introduce the 8-Calves dataset, a challenging benchmark for multi-animal detection, tracking, and identification. It features a one-hour video of eight Holstein Friesian calves in a barn, with frequent occlusions, motion blur, and diverse poses. A semi-automated pipeline using a fine-tuned YOLOv8 detector and ByteTrack, followed by manual correction, provides over 537,000 bounding boxes with temporal identity labels. We benchmark 28 object detectors, showing near-perfect performance on a lenient IoU threshold (mAP50: 95.2-98.9%) but significant divergence on stricter metrics (mAP50:95: 56.5-66.4%), highlighting fine-grained localization challenges. Our identification benchmark across 23 models reveals a trade-off: scaling model size improves classification accuracy but compromises retrieval. Smaller architectures like ConvNextV2 Nano achieve the best balance (73.35% accuracy, 50.82% Top-1 KNN). Pre-training focused on semantic learning (e.g., BEiT) yielded superior transferability. For tracking, leading methods achieve high detection accuracy (MOTA > 0.92) but struggle with identity preservation (IDF1 $\approx$ 0.27), underscoring a key challenge in occlusion-heavy scenarios. The 8-Calves dataset bridges a gap by providing temporal richness and realistic challenges, serving as a resource for advancing agricultural vision models. The dataset and code are available at https://huggingface.co/datasets/tonyFang04/8-calves.

NCDec 22, 2019
Recurrent Feedback Improves Feedforward Representations in Deep Neural Networks

Siming Yan, Xuyang Fang, Bowen Xiao et al.

The abundant recurrent horizontal and feedback connections in the primate visual cortex are thought to play an important role in bringing global and semantic contextual information to early visual areas during perceptual inference, helping to resolve local ambiguity and fill in missing details. In this study, we find that introducing feedback loops and horizontal recurrent connections to a deep convolution neural network (VGG16) allows the network to become more robust against noise and occlusion during inference, even in the initial feedforward pass. This suggests that recurrent feedback and contextual modulation transform the feedforward representations of the network in a meaningful and interesting way. We study the population codes of neurons in the network, before and after learning with feedback, and find that learning with feedback yielded an increase in discriminability (measured by d-prime) between the different object classes in the population codes of the neurons in the feedforward path, even at the earliest layer that receives feedback. We find that recurrent feedback, by injecting top-down semantic meaning to the population activities, helps the network learn better feedforward paths to robustly map noisy image patches to the latent representations corresponding to important visual concepts of each object class, resulting in greater robustness of the network against noises and occlusion as well as better fine-grained recognition.