CVNov 17, 2025
Tissue Aware Nuclei Detection and Classification Model for Histopathology ImagesKesi Xu, Eleni Chiou, Ali Varamesh et al.
Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
CVOct 14, 2020
Self-Supervised Ranking for Representation LearningAli Varamesh, Ali Diba, Tinne Tuytelaars et al.
We present a new framework for self-supervised representation learning by formulating it as a ranking problem in an image retrieval context on a large number of random views (augmentations) obtained from images. Our work is based on two intuitions: first, a good representation of images must yield a high-quality image ranking in a retrieval task; second, we would expect random views of an image to be ranked closer to a reference view of that image than random views of other images. Hence, we model representation learning as a learning to rank problem for image retrieval. We train a representation encoder by maximizing average precision (AP) for ranking, where random views of an image are considered positively related, and that of the other images considered negatives. The new framework, dubbed S2R2, enables computing a global objective on multiple views, compared to the local objective in the popular contrastive learning framework, which is calculated on pairs of views. In principle, by using a ranking criterion, we eliminate reliance on object-centric curated datasets. When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and the clustering-based contrastive learning model, SwAV, while being much simpler both conceptually and at implementation. On MS-COCO, S2R2 outperforms both SwAV and SimCLR with a larger margin than on STl10. This indicates that S2R2 is more effective on diverse scenes and could eliminate the need for an object-centric large training dataset for self-supervised representation learning.
CVJul 18, 2020
MIX'EM: Unsupervised Image Classification using a Mixture of EmbeddingsAli Varamesh, Tinne Tuytelaars
We present MIX'EM, a novel solution for unsupervised image classification. MIX'EM generates representations that by themselves are sufficient to drive a general-purpose clustering algorithm to deliver high-quality classification. This is accomplished by building a mixture of embeddings module into a contrastive visual representation learning framework in order to disentangle representations at the category level. It first generates a set of embedding and mixing coefficients from a given visual representation, and then combines them into a single embedding. We introduce three techniques to successfully train MIX'EM and avoid degenerate solutions; (i) diversify mixture components by maximizing entropy, (ii) minimize instance conditioned component entropy to enforce a clustered embedding space, and (iii) use an associative embedding loss to enforce semantic separability. By applying (i) and (ii), semantic categories emerge through the mixture coefficients, making it possible to apply (iii). Subsequently, we run K-means on the representations to acquire semantic classification. We conduct extensive experiments and analyses on STL10, CIFAR10, and CIFAR100-20 datasets, achieving state-of-the-art classification accuracy of 78\%, 82\%, and 44\%, respectively. To achieve robust and high accuracy, it is essential to use the mixture components to initialize K-means. Finally, we report competitive baselines (70\% on STL10) obtained by applying K-means to the "normalized" representations learned using the contrastive loss.
CVDec 2, 2019
Mixture Dense Regression for Object Detection and Human Pose EstimationAli Varamesh, Tinne Tuytelaars
Mixture models are well-established learning approaches that, in computer vision, have mostly been applied to inverse or ill-defined problems. However, they are general-purpose divide-and-conquer techniques, splitting the input space into relatively homogeneous subsets in a data-driven manner. Not only ill-defined but also well-defined complex problems should benefit from them. To this end, we devise a framework for spatial regression using mixture density networks. We realize the framework for object detection and human pose estimation. For both tasks, a mixture model yields higher accuracy and divides the input space into interpretable modes. For object detection, mixture components focus on object scale, with the distribution of components closely following that of ground truth the object scale. This practically alleviates the need for multi-scale testing, providing a superior speed-accuracy trade-off. For human pose estimation, a mixture model divides the data based on viewpoint and uncertainty -- namely, front and back views, with back view imposing higher uncertainty. We conduct experiments on the MS COCO dataset and do not face any mode collapse.