ROJun 16, 2023
Semantics-Aware Next-best-view Planning for Efficient Search and Detection of Task-relevant Plant PartsAkshay K. Burusa, Joost Scholten, David Rapado Rincon et al.
Searching and detecting the task-relevant parts of plants is important to automate harvesting and de-leafing of tomato plants using robots. This is challenging due to high levels of occlusion in tomato plants. Active vision is a promising approach in which the robot strategically plans its camera viewpoints to overcome occlusion and improve perception accuracy. However, current active-vision algorithms cannot differentiate between relevant and irrelevant plant parts and spend time on perceiving irrelevant plant parts. This work proposed a semantics-aware active-vision strategy that uses semantic information to identify the relevant plant parts and prioritise them during view planning. The proposed strategy was evaluated on the task of searching and detecting the relevant plant parts using simulation and real-world experiments. In simulation experiments, the semantics-aware strategy proposed could search and detect 81.8% of the relevant plant parts using nine viewpoints. It was significantly faster and detected more plant parts than predefined, random, and volumetric active-vision strategies that do not use semantic information. The strategy proposed was also robust to uncertainty in plant and plant-part positions, plant complexity, and different viewpoint-sampling strategies. In real-world experiments, the semantics-aware strategy could search and detect 82.7% of the relevant plant parts using seven viewpoints, under complex greenhouse conditions with natural variation and occlusion, natural illumination, sensor noise, and uncertainty in camera poses. The results of this work clearly indicate the advantage of using semantics-aware active vision for targeted perception of plant parts and its applicability in the real world. It can significantly improve the efficiency of automated harvesting and de-leafing in tomato crop production.
RONov 4, 2022
Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object trackingDavid Rapado Rincon, Eldert J. van Henten, Gert Kootstra
The ability to accurately represent and localise relevant objects is essential for robots to carry out tasks effectively. Traditional approaches, where robots simply capture an image, process that image to take an action, and then forget the information, have proven to struggle in the presence of occlusions. Methods using multi-view perception, which have the potential to address some of these problems, require a world model that guides the collection, integration and extraction of information from multiple viewpoints. Furthermore, constructing a generic representation that can be applied in various environments and tasks is a difficult challenge. In this paper, a novel approach for building generic representations in occluded agro-food environments using multi-view perception and 3D multi-object tracking is introduced. The method is based on a detection algorithm that generates partial point clouds for each detected object, followed by a 3D multi-object tracking algorithm that updates the representation over time. The accuracy of the representation was evaluated in a real-world environment, where successful representation and localisation of tomatoes in tomato plants were achieved, despite high levels of occlusion, with the total count of tomatoes estimated with a maximum error of 5.08% and the tomatoes tracked with an accuracy up to 71.47%. Novel tracking metrics were introduced, demonstrating that valuable insight into the errors in localising and representing the fruits can be provided by their use. This approach presents a novel solution for building representations in occluded agro-food environments, demonstrating potential to enable robots to perform tasks effectively in these challenging environments.
ROJun 21, 2022
Attention-driven Next-best-view Planning for Efficient Reconstruction of Plants and Targeted Plant PartsAkshay K. Burusa, Eldert J. van Henten, Gert Kootstra
Robots in tomato greenhouses need to perceive the plant and plant parts accurately to automate monitoring, harvesting, and de-leafing tasks. Existing perception systems struggle with the high levels of occlusion in plants and often result in poor perception accuracy. One reason for this is because they use fixed cameras or predefined camera movements. Next-best-view (NBV) planning presents an alternate approach, in which the camera viewpoints are reasoned and strategically planned such that the perception accuracy is improved. However, existing NBV-planning algorithms are agnostic to the task-at-hand and give equal importance to all the plant parts. This strategy is inefficient for greenhouse tasks that require targeted perception of specific plant parts, such as the perception of leaf nodes for de-leafing. To improve targeted perception in complex greenhouse environments, NBV planning algorithms need an attention mechanism to focus on the task-relevant plant parts. In this paper, the role of attention in improving targeted perception using an attention-driven NBV planning strategy was investigated. Through simulation experiments using plants with high levels of occlusion and structural complexity, it was shown that focusing attention on task-relevant plant parts can significantly improve the speed and accuracy of 3D reconstruction. Further, with real-world experiments, it was shown that these benefits extend to complex greenhouse conditions with natural variation and occlusion, natural illumination, sensor noise, and uncertainty in camera poses. The results clearly indicate that using attention-driven NBV planning in greenhouses can significantly improve the efficiency of perception and enhance the performance of robotic systems in greenhouse crop production.
RONov 28, 2023
Gradient-based Local Next-best-view Planning for Improved Perception of Targeted Plant NodesAkshay K. Burusa, Eldert J. van Henten, Gert Kootstra
Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task where the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which could suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes equally well as a sampling-based NBV planner, while taking ten times less computation and generating 28% more efficient trajectories.
CVDec 23, 2025
Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)Robert van de Ven, Trim Bresilla, Bram Nelissen et al.
Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are $\leq95\%$ occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.
CVDec 13, 2021Code
Active learning with MaskAL reduces annotation effort for training Mask R-CNNPieter M. Blok, Gert Kootstra, Hakim Elchaoui Elghor et al.
The generalisation performance of a convolutional neural network (CNN) is influenced by the quantity, quality, and variety of the training images. Training images must be annotated, and this is time consuming and expensive. The goal of our work was to reduce the number of annotated images needed to train a CNN while maintaining its performance. We hypothesised that the performance of a CNN can be improved faster by ensuring that the set of training images contains a large fraction of hard-to-classify images. The objective of our study was to test this hypothesis with an active learning method that can automatically select the hard-to-classify images. We developed an active learning method for Mask Region-based CNN (Mask R-CNN) and named this method MaskAL. MaskAL involved the iterative training of Mask R-CNN, after which the trained model was used to select a set of unlabelled images about which the model was most uncertain. The selected images were then annotated and used to retrain Mask R-CNN, and this was repeated for a number of sampling iterations. In our study, MaskAL was compared to a random sampling method on a broccoli dataset with five visually similar classes. MaskAL performed significantly better than the random sampling. In addition, MaskAL had the same performance after sampling 900 images as the random sampling had after 2300 images. Compared to a Mask R-CNN model that was trained on the entire training set (14,000 images), MaskAL achieved 93.9% of that model's performance with 17.9% of its training data. The random sampling achieved 81.9% of that model's performance with 16.4% of its training data. We conclude that by using MaskAL, the annotation effort can be reduced for training Mask R-CNN on a broccoli dataset with visually similar classes. Our software is available on https://github.com/pieterblok/maskal.
CVJan 10, 2024
Video-based automatic lameness detection of dairy cows using pose estimation and multiple locomotion traitsHelena Russello, Rik van der Tol, Menno Holzhauer et al.
This study presents an automated lameness detection system that uses deep-learning image processing techniques to extract multiple locomotion traits associated with lameness. Using the T-LEAP pose estimation model, the motion of nine keypoints was extracted from videos of walking cows. The videos were recorded outdoors, with varying illumination conditions, and T-LEAP extracted 99.6% of correct keypoints. The trajectories of the keypoints were then used to compute six locomotion traits: back posture measurement, head bobbing, tracking distance, stride length, stance duration, and swing duration. The three most important traits were back posture measurement, head bobbing, and tracking distance. For the ground truth, we showed that a thoughtful merging of the scores of the observers could improve intra-observer reliability and agreement. We showed that including multiple locomotion traits improves the classification accuracy from 76.6% with only one trait to 79.9% with the three most important traits and to 80.1% with all six locomotion traits.
CVAug 14, 2025
Lameness detection in dairy cows using pose estimation and bidirectional LSTMsHelena Russello, Rik van der Tol, Eldert J. van Henten et al.
This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose-estimation and BLSTMs classifier offers the following advantages: markerless pose-estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and working with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows' hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.
CVMar 16, 2018
Improved Part Segmentation Performance by Optimising Realism of Synthetic Images using Cycle Generative Adversarial NetworksRuud Barth, Jochen Hemming, Eldert J. van Henten
In this paper we report on improved part segmentation performance using convolutional neural networks to reduce the dependency on the large amount of manually annotated empirical images. This was achieved by optimising the visual realism of synthetic agricultural images.In Part I, a cycle consistent generative adversarial network was applied to synthetic and empirical images with the objective to generate more realistic synthetic images by translating them to the empirical domain. We first hypothesise and confirm that plant part image features such as color and texture become more similar to the empirical domain after translation of the synthetic images.Results confirm this with an improved mean color distribution correlation with the empirical data prior of 0.62 and post translation of 0.90. Furthermore, the mean image features of contrast, homogeneity, energy and entropy moved closer to the empirical mean, post translation. In Part II, 7 experiments were performed using convolutional neural networks with different combinations of synthetic, synthetic translated to empirical and empirical images. We hypothesised that the translated images can be used for (i) improved learning of empirical images, and (ii) that learning without any fine-tuning with empirical images is improved by bootstrapping with translated images over bootstrapping with synthetic images. Results confirm our second and third hypotheses. First a maximum intersection-over-union performance was achieved of 0.52 when bootstrapping with translated images and fine-tuning with empirical images; an 8% increase compared to only using synthetic images. Second, training without any empirical fine-tuning resulted in an average IOU of 0.31; a 55% performance increase over previous methods that only used synthetic images.