Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
This work addresses a domain-specific problem for ecologists and forest managers by providing a tool to monitor palm distribution in tropical forests. The contribution is incremental, since it builds on existing object detection methods.
The paper tackles the detection and localization of naturally occurring palms in dense tropical forests, a task made difficult by overlapping crowns and heterogeneous landscapes. It presents PRISM, a pipeline that operates on large orthomosaic images and combines state-of-the-art object detectors with segmentation for precise mapping.
Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrate zero-shot SAM 2 as the segmentation backbone, and refine the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1 m).
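To make the processing and mapping stages more concrete, the following is a minimal Python sketch of two steps that a pipeline of this kind typically needs: splitting a very large orthomosaic into overlapping tiles for inference, and converting a detection made in tile-local pixel coordinates into geographic coordinates. It is an illustration under stated assumptions, not the authors' implementation. The function names (`tile_windows`, `pixel_to_geo`, `box_center_to_geo`), the tile size, the overlap, and the geotransform values are all hypothetical. The affine geotransform is assumed to follow the six-coefficient GDAL ordering.

```python
from typing import Iterator, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in orthomosaic pixels


def tile_windows(width: int, height: int, tile: int, overlap: int) -> Iterator[Window]:
    """Yield overlapping pixel windows that together cover the whole raster.

    Hypothetical helper: tile size and overlap are illustrative choices,
    not values taken from the paper.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    for y0 in range(0, max(height - overlap, 1), stride):
        for x0 in range(0, max(width - overlap, 1), stride):
            # Edge tiles are clipped to the raster bounds.
            yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)


def pixel_to_geo(col: float, row: float, gt: Sequence[float]) -> Tuple[float, float]:
    """Apply a six-coefficient affine geotransform (GDAL ordering assumed)."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


def box_center_to_geo(
    box: Tuple[float, float, float, float],
    window_origin: Tuple[int, int],
    gt: Sequence[float],
) -> Tuple[float, float]:
    """Map the center of a tile-local bounding box to geographic coordinates."""
    x0, y0, x1, y1 = box
    ox, oy = window_origin
    # Shift from tile-local pixels to full-orthomosaic pixels first.
    col = ox + (x0 + x1) / 2.0
    row = oy + (y0 + y1) / 2.0
    return pixel_to_geo(col, row, gt)


if __name__ == "__main__":
    # Made-up geotransform: 5 cm pixels, north-up, arbitrary projected origin.
    gt = (500000.0, 0.05, 0.0, 9900000.0, 0.0, -0.05)
    windows = list(tile_windows(width=1000, height=700, tile=400, overlap=100))
    print(len(windows), "tiles")
    print(box_center_to_geo((10, 20, 30, 40), window_origin=(300, 600), gt=gt))
```

The overlap between neighboring tiles is there so that a crown cut by one tile boundary appears whole in an adjacent tile. A real pipeline would also need to merge duplicate detections in the overlap regions, for example with non-maximum suppression, which this sketch leaves out.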