97.3 FA May 29
Empirical Approximation of $L_p$ Norms
Feng Dai, Egor Kosov, Noel Murasko
We study empirical $L_p$ moments of a random vector $\pmb{φ}$ based on its i.i.d. copies $\pmb{φ}^1,\ldots,\pmb{φ}^m$, that is, $\frac1m\sum_{j=1}^m |\langle \pmb{φ}^j,y\rangle|^p$. Our main result is a new estimate for the expected uniform deviation \[ \mathbb{E}\sup_{y\in D}\biggl| \frac1m\sum_{j=1}^m |\langle \pmb{φ}^j,y\rangle|^p -\mathbb{E}|\langle \pmb{φ},y\rangle|^p \biggr| \] over an arbitrary index set $D$. The proof is based on a new bound for Talagrand's $γ$-functional, sharper than the standard Dudley-type entropy estimate. We then apply this estimate to the following two problems. First, for $p>2$, we study Marcinkiewicz-type discretization of $L_p$ norms on an $N$-dimensional subspace $X_N\subset B(Ω)$ of bounded functions on a probability space $(Ω,μ)$. We obtain bounds in terms of the norm of the embedding $(X_N,\|\cdot\|_{L_p(μ)})\hookrightarrow B(Ω)$. In particular, we prove that when this norm is of order $N^{1/p}$ and \[ m \ge C(p)\, N\log N\,(\log\log N)^{p-1}, \] then $m$ random samples suffice to approximate the $L_p(μ)$ norm uniformly on $X_N$ by the sampled discrete $L_p$ norm. This substantially improves the previously known bound in this setting, $m \ge C(p)\, N(\log N)^{\min\{p,3\}}$, and is optimal up to the factor $(\log\log N)^{p-1}$ in the random-sampling setting. Second, for $1\le p<2$, we obtain an $L_p$ analogue of the restricted isometry property via random sampling for bounded orthogonal systems and, more generally, for $N$-element systems $\mathcal D_N$ satisfying a Riesz-type condition. We prove that when \[ m \ge C(p)\, s\log N\,(\log s)^2\,\log\log s, \] then $m$ random samples suffice to guarantee an $L_p$ restricted isometry-type property uniformly over the class of all $s$-sparse functions generated by $\mathcal D_N$.
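The uniform deviation studied in this abstract can be illustrated numerically with a minimal sketch. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the random vector has independent ±1 (Rademacher) coordinates, $p=4$, and the index set $D$ is a small finite list of vectors. For this choice the exact moment has the closed form $3(\sum_i y_i^2)^2 - 2\sum_i y_i^4$, so the empirical average can be compared against it directly.

```python
import random

def empirical_lp_moment(samples, y, p):
    """Return (1/m) * sum_j |<phi^j, y>|^p over the given sample vectors."""
    total = 0.0
    for phi in samples:
        inner = sum(a * b for a, b in zip(phi, y))
        total += abs(inner) ** p
    return total / len(samples)

def rademacher_fourth_moment(y):
    """Exact E|<phi, y>|^4 when phi has independent +/-1 coordinates (illustrative assumption)."""
    s2 = sum(v * v for v in y)
    s4 = sum(v ** 4 for v in y)
    return 3.0 * s2 * s2 - 2.0 * s4

def uniform_deviation(samples, index_set):
    """Largest gap between empirical and exact 4th moments over a finite index set."""
    return max(
        abs(empirical_lp_moment(samples, y, 4) - rademacher_fourth_moment(y))
        for y in index_set
    )

rng = random.Random(0)  # fixed seed so the run is reproducible
m, n = 20000, 2         # sample count and dimension (hypothetical values)
samples = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]
D = [[0.6, 0.8], [1.0, 0.0], [0.8, -0.6]]  # hypothetical finite index set
print(uniform_deviation(samples, D))
```

The printed value shrinks as `m` grows; the abstract's results concern how large `m` must be for such a deviation to be small uniformly over much richer index sets.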
77.3 CV May 12 Code
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
Hao Zhu, Shuo Jin, Wenbin Liao et al.
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. To this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available at https://github.com/MiSsU-HH/VIP.
CV Jul 12, 2022
Cycle Self-Training for Semi-Supervised Object Detection with Distribution Consistency Reweighting
Hao Liu, Bin Chen, Bo Wang et al.
Recently, many semi-supervised object detection (SSOD) methods have adopted the teacher-student framework and achieved state-of-the-art results. However, the teacher network is tightly coupled with the student network since the teacher is an exponential moving average (EMA) of the student, which causes a performance bottleneck. To address the coupling problem, we propose a Cycle Self-Training (CST) framework for SSOD, which consists of two teachers, T1 and T2, and two students, S1 and S2. Based on these networks, a cycle self-training mechanism is built, i.e., S1 $\rightarrow$ T1 $\rightarrow$ S2 $\rightarrow$ T2 $\rightarrow$ S1. For S $\rightarrow$ T, we also utilize the EMA weights of the students to update the teachers. For T $\rightarrow$ S, instead of providing supervision for its own student S1 (S2) directly, the teacher T1 (T2) generates pseudo-labels for the student S2 (S1), which loosens the coupling effect. Moreover, owing to the property of EMA, the teacher is likely to accumulate the biases of the student and make the mistakes irreversible. To mitigate this problem, we also propose a distribution consistency reweighting strategy, where pseudo-labels are reweighted based on distribution consistency across the teachers T1 and T2. With this strategy, the two students S2 and S1 can be trained robustly with noisy pseudo-labels to avoid confirmation bias. Extensive experiments demonstrate the superiority of CST, which consistently improves the AP over the baseline and outperforms state-of-the-art methods by 2.1% absolute AP with scarce labeled data.
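The cycle described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: weights are plain lists of floats, the decay value `0.99` is a hypothetical choice, and the names `ema_update`, `EMA_SOURCE`, `PSEUDO_LABEL_TARGET`, and `cycle_order` are invented here rather than taken from the authors' code. The reweighting strategy is not modeled.

```python
def ema_update(teacher, student, decay=0.99):
    """Exponential moving average: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Which student feeds each teacher through EMA (the S -> T step).
EMA_SOURCE = {"T1": "S1", "T2": "S2"}
# Which student each teacher supervises with pseudo-labels (the T -> S step).
# Each teacher labels the *other* student, which is what loosens the coupling.
PSEUDO_LABEL_TARGET = {"T1": "S2", "T2": "S1"}

def cycle_order(start="S1", steps=4):
    """Walk the S1 -> T1 -> S2 -> T2 -> S1 cycle for a given number of steps."""
    order = [start]
    for _ in range(steps):
        node = order[-1]
        if node.startswith("S"):
            # A student updates the teacher that takes its EMA weights.
            nxt = next(t for t, s in EMA_SOURCE.items() if s == node)
        else:
            # A teacher produces pseudo-labels for the other student.
            nxt = PSEUDO_LABEL_TARGET[node]
        order.append(nxt)
    return order

print(cycle_order())
print(ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9))
```

In a conventional single teacher-student pair, `PSEUDO_LABEL_TARGET` would map each teacher back to its own EMA source; swapping the targets is the structural change the abstract describes.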
47.4 CV Mar 31
TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding
Jingbin You, Zehao Li, Hao Jiang et al.
3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.
CV May 23, 2024 Code
TopoLogic: An Interpretable Pipeline for Lane Topology Reasoning on Driving Scenes
Yanping Fu, Wenbin Liao, Xinyuan Liu et al.
As an emerging task that integrates perception and reasoning, topology reasoning in autonomous driving scenes has recently garnered widespread attention. However, existing work often emphasizes "perception over reasoning": it typically boosts reasoning performance by enhancing the perception of lanes and directly adopts an MLP to learn lane topology from lane queries. This paradigm overlooks the geometric features intrinsic to the lanes themselves and is prone to being influenced by inherent endpoint shifts in lane detection. To tackle this issue, we propose an interpretable method for lane topology reasoning based on lane geometric distance and lane query similarity, named TopoLogic. This method mitigates the impact of endpoint shifts in geometric space and introduces explicit similarity calculation in semantic space as a complement. By integrating results from both spaces, our method provides more comprehensive information for lane topology. Ultimately, our approach significantly outperforms the existing state-of-the-art methods on the mainstream benchmark OpenLane-V2 (23.9 vs. 10.9 in TOP$_{ll}$ and 44.1 vs. 39.8 in OLS on subset_A). Additionally, our proposed geometric distance topology reasoning method can be incorporated into well-trained models without re-training, significantly boosting the performance of lane topology reasoning. The code is released at https://github.com/Franpin/TopoLogic.
CV May 23, 2025 Code
TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving
Yanping Fu, Xinyuan Liu, Tianyu Li et al.
Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.
CV Jan 8
TEA: Temporal Adaptive Satellite Image Semantic Segmentation
Juyuan Kang, Hao Zhu, Yan Zhu et al.
Crop mapping based on satellite image time series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlook the generalization capability of models across scenarios with varying temporal lengths, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method that enhances the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, the teacher shapes the student's feature space from the perspectives of intermediate embeddings, prototypes, and soft labels to realize knowledge transfer, while dynamically aggregating the student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.
CV Dec 5, 2024 Code
Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
Hao Zhu, Yan Zhu, Jiayu Xiao et al.
Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. In addition, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.
CV Jul 31, 2025 Code
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Dongming Wu, Yanping Fu, Saike Huang et al.
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies lack reasoning-based, large-scale affordance prediction data, which raises considerable concern about their open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.
CV Jun 12, 2025 Code
Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu, Hang Xu, Yike Ma et al.
Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, yet progress is hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
CV May 17, 2023 Code
Rethinking Boundary Discontinuity Problem for Oriented Object Detection
Hang Xu, Xinyuan Liu, Haonan Xu et al.
Oriented object detection has developed rapidly in the past few years, and rotation equivariance is crucial for detectors to predict rotated boxes. The prediction is expected to maintain the corresponding rotation when objects rotate, but a severe mutation in angular prediction is sometimes observed when objects rotate near the boundary angle, which is the well-known boundary discontinuity problem. The problem has long been believed to be caused by the sharp loss increase at the angular boundary, and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However, we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis, we find that the key to the solution lies in the encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods, the model essentially attempts to fit the angular relationship between box and object, where the break point at the angular boundary makes the predictions highly unstable. To deal with this issue, we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from a single smoothing function into two distinct entities, which for the first time achieves the objectives of both correcting the angular boundary and blending the angle with other parameters. Extensive experiments on multiple datasets show that the boundary discontinuity problem is well-addressed. Moreover, typical IoU-like methods are improved to the same level without an obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm.
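The "sharp loss increase at the angular boundary" mentioned in this abstract can be seen in a small numeric sketch. It assumes, purely for illustration, that a box angle is defined modulo π (a rectangle rotated by 180° is the same rectangle); the function names are hypothetical. This shows only the symptom, not the paper's dual-optimization method.

```python
import math

def naive_angle_gap(a, b):
    """Plain absolute difference, as a naive regression loss would see it."""
    return abs(a - b)

def periodic_angle_gap(a, b, period=math.pi):
    """Smallest gap between two angles that are equivalent modulo `period`."""
    d = abs(a - b) % period
    return min(d, period - d)

# Two nearly identical box orientations on either side of the boundary angle.
a = math.radians(89.0)
b = math.radians(-89.0)

print(math.degrees(naive_angle_gap(a, b)))     # large gap despite similar boxes
print(math.degrees(periodic_angle_gap(a, b)))  # small gap once periodicity is respected
```

The naive gap is close to 178° while the two boxes differ by only about 2°, which is the kind of jump a detector faces when an object rotates across the boundary.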
CV Feb 19, 2017 Code
DR2-Net: Deep Residual Reconstruction Network for Image Compressive Sensing
Hantao Yao, Feng Dai, Dongming Zhang et al.
Most traditional algorithms for compressive sensing image reconstruction suffer from intensive computation. Recently, deep learning-based reconstruction algorithms have been reported, which dramatically reduce the time complexity compared with iterative reconstruction algorithms. In this paper, we propose a novel Deep Residual Reconstruction Network (DR$^{2}$-Net) to reconstruct the image from its Compressively Sensed (CS) measurement. The DR$^{2}$-Net is proposed based on two observations: 1) linear mapping could reconstruct a high-quality preliminary image, and 2) residual learning could further improve the reconstruction quality. Accordingly, DR$^{2}$-Net consists of two components, i.e., a linear mapping network and a residual network. Specifically, a fully-connected layer implements the linear mapping network. We then expand the linear mapping network to DR$^{2}$-Net by adding several residual learning blocks to enhance the preliminary image. Extensive experiments demonstrate that the DR$^{2}$-Net outperforms traditional iterative methods and recent deep learning-based methods by large margins at measurement rates 0.01, 0.04, 0.1, and 0.25. The code of DR$^{2}$-Net has been released at https://github.com/coldrainyht/caffe_dr2.
9.0 RO Mar 29
Probe-to-Grasp Manipulation Using Self-Sensing Pneumatic Variable-Stiffness Joints
Ngoc Duy Tran, Yeman Fan, Feng Dai et al.
Grasping deformable objects with varying stiffness remains a significant challenge in robotics. Estimating the local stiffness of a target object is important for determining an optimal grasp pose that enables stable pickup without damaging the object. This paper presents a probe-to-grasp manipulation framework for estimating the relative stiffness of objects using a passive soft-rigid two-finger hybrid gripper equipped with self-sensing pneumatic variable-stiffness joints. Each finger of the gripper consists of two rigid links connected by a soft pneumatic ring placed at the joint, enabling both compliant interaction and controllable joint stiffness via internal pressurization. By measuring the pressure inside the pneumatic ring, we can estimate the interaction force during contact. Building on this, we propose a practical probing strategy to infer relative object stiffness by correlating the estimated normal force with known gripper closing displacement. We validate the self-sensing model through stiffness characterization experiments across bending angles and pressure ranges, and demonstrate stiffness-aware probing-and-grasping in real-life applications: selecting grasp locations on fruits with spatially varying stiffness. The proposed system offers a minimal, low-cost sensing approach for stiffness-aware soft manipulation while retaining probing and grasping capability.
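The probing idea in this abstract, relating estimated normal force to known closing displacement, can be sketched as a least-squares slope. This is an assumption-laden illustration, not the authors' estimator: it assumes a roughly linear force-displacement relation, the function name `relative_stiffness` is hypothetical, and the readings are made-up numbers in arbitrary units.

```python
def relative_stiffness(displacements, forces):
    """Least-squares slope of force versus displacement (larger means stiffer)."""
    n = len(displacements)
    mean_x = sum(displacements) / n
    mean_f = sum(forces) / n
    sxx = sum((x - mean_x) ** 2 for x in displacements)
    sxf = sum((x - mean_x) * (f - mean_f) for x, f in zip(displacements, forces))
    return sxf / sxx

# Hypothetical probe readings at two locations on the same object.
closing = [0.0, 1.0, 2.0, 3.0]     # gripper closing displacement
force_firm = [0.0, 2.0, 4.0, 6.0]  # estimated normal force at a firm spot
force_soft = [0.0, 0.5, 1.0, 1.5]  # estimated normal force at a soft spot

firm = relative_stiffness(closing, force_firm)
soft = relative_stiffness(closing, force_soft)
print(firm, soft)
```

Only the ratio or ordering of the two slopes matters for choosing a grasp location, which is why a relative rather than absolute stiffness estimate can suffice.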
LG Jan 12
Max-Min Neural Network Operators For Approximation of Multivariate Functions
Abhishek Yadav, Uaday Singh, Feng Dai
In this paper, we develop a multivariate framework for approximation by max-min neural network operators. Building on recent advances in approximation theory by neural network operators, particularly the univariate max-min operators, we propose and analyze new multivariate operators activated by sigmoidal functions. We establish pointwise and uniform convergence theorems and derive quantitative estimates for the order of approximation via the modulus of continuity and the multivariate generalized absolute moment. Our results demonstrate that the multivariate max-min structure of these operators, besides its algebraic elegance, provides efficient and stable approximation tools in both theoretical and applied settings.
CV Nov 25, 2025
TReFT: Taming Rectified Flow Models For One-Step Image Translation
Shengqian Li, Ming Gao, Yi Liu et al.
Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by the pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from the origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.
CV Nov 24, 2025
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Kehua Chen, Tianlu Mao, Zhuxin Ma et al.
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation, which serves as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy and rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.
CV Aug 18, 2021
Unbiased IoU for Spherical Image Object Detection
Qiang Zhao, Bin Chen, Hang Xu et al.
As one of the most fundamental and challenging problems in computer vision, object detection tries to locate object instances and find their categories in natural images. The most important step in the evaluation of object detection algorithms is calculating the intersection-over-union (IoU) between the predicted bounding box and the ground truth one. Although this procedure is well-defined and solved for planar images, it is not easy for spherical image object detection. Existing methods either compute the IoUs based on biased bounding box representations or make excessive approximations, and thus give incorrect results. In this paper, we first identify that spherical rectangles are unbiased bounding boxes for objects in spherical images, and then propose an analytical method for IoU calculation without any approximations. Based on the unbiased representation and calculation, we also present an anchor-free object detection algorithm for spherical images. The experiments on two spherical object detection datasets show that the proposed method can achieve better performance than existing methods.
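For reference, the planar case that this abstract calls "well-defined and solved" can be written as a short sketch. It covers only axis-aligned boxes in the plane, given as `(x1, y1, x2, y2)` corners, which is an assumed representation; it is not the spherical-rectangle computation the paper proposes.

```python
def planar_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(planar_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of area 1 over union of area 7
print(planar_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

On a sphere, box edges are arcs and areas depend on position, so this corner arithmetic no longer applies directly; that gap is what the abstract addresses.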
CV Jun 24, 2019
Dense Scale Network for Crowd Counting
Feng Dai, Hao Liu, Yike Ma et al.
Crowd counting has been widely studied by the computer vision community in recent years. Due to the large scale variation, it remains a challenging task. Previous methods adopt either a multi-column CNN or a single-column CNN with multiple branches to deal with this problem. However, restricted by the number of columns or branches, these methods can only capture a few different scales and have limited capability. In this paper, we propose a simple but effective network called DSNet for crowd counting, which can be easily trained in an end-to-end fashion. The key component of our network is the dense dilated convolution block, in which each dilation layer is densely connected with the others to preserve information from continuously varied scales. The dilation rates in dilation layers are carefully selected to prevent the block from gridding artifacts. To further enlarge the range of scales covered by the network, we cascade three blocks and link them with dense residual connections. We also introduce a novel multi-scale density level consistency loss for performance improvement. To evaluate our method, we compare it with state-of-the-art algorithms on four crowd counting datasets (ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD). Experimental results demonstrate that DSNet can achieve the best performance and make significant improvements on all four datasets (30% on UCF-QNRF and UCF_CC_50, and 20% on the others).
CA Mar 26, 2007
Positive Cubature formulas and Marcinkiewicz-Zygmund inequalities on spherical caps
Feng Dai, Heping Wang
Let $Π_n^d$ denote the space of all spherical polynomials of degree at most $n$ on the unit sphere $\mathbb{S}^d$ of $\mathbb{R}^{d+1}$, and let $d(x, y)$ denote the usual geodesic distance $\arccos (x\cdot y)$ between $x, y\in \mathbb{S}^d$. Given a spherical cap $$ B(e,α)=\{x\in\mathbb{S}^d: d(x, e) \leq α\}, \quad (e\in\mathbb{S}^d, \ \text{$α\in (0,π)$ is bounded away from $π$}),$$ we define the metric $$ρ(x,y):=\frac 1{α} \sqrt{(d(x, y))^2+α\bigl(\sqrt{α-d(x, e)}-\sqrt{α-d(y,e)}\bigr)^2}, $$ where $x, y\in B(e,α)$. It is shown that given any $β\ge 1$, $1\leq p<\infty$ and any finite subset $Λ$ of $B(e,α)$ satisfying the condition $\min_{\substack{ξ,η\in Λ \\ ξ\neq η}} ρ(ξ,η) \ge \frac{δ}{n}$ with $δ\in (0,1]$, there exists a positive constant $C$, independent of $α$, $n$, $Λ$ and $δ$, such that, for any $f\in Π_{n}^d$, \begin{equation*} \sum_{ω\in Λ} \Bigl(\max_{x,y\in B_ρ(ω, βδ/n)}|f(x)-f(y)|^p\Bigr) |B_ρ(ω, δ/n)| \le (C δ)^p \int_{B(e,α)} |f(x)|^p \, dσ(x),\end{equation*} where $dσ(x)$ denotes the usual Lebesgue measure on $\mathbb{S}^d$, $$B_ρ(x, r)=\Bigl\{y\in B(e,α): ρ(y,x)\leq r\Bigr\}, \quad (r>0),$$ and $$\Bigl|B_ρ\bigl(x, \tfrac{δ}{n}\bigr)\Bigr|=\int_{B_ρ(x, δ/n)} dσ(y) \sim α^{d}\Bigl[ \Bigl(\frac{δ}{n}\Bigr)^{d+1}+ \Bigl(\frac{δ}{n}\Bigr)^{d} \sqrt{1-\frac{d(x, e)}{α}}\Bigr].$$ As a consequence, we establish positive cubature formulas and Marcinkiewicz-Zygmund inequalities on the spherical cap $B(e,α)$.
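The metric on the spherical cap defined in this abstract can be evaluated numerically with a short sketch. The assumptions are that points are unit vectors given as tuples, that the cap has center `e` and angular radius `alpha`, and that small negative values from rounding are clamped to zero before taking square roots; the function names are hypothetical.

```python
import math

def geodesic(x, y):
    """Geodesic distance arccos(x . y) between unit vectors, with clamping for rounding."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))

def cap_metric(x, y, e, alpha):
    """The cap metric: (1/alpha) * sqrt(d(x,y)^2 + alpha*(sqrt(alpha-d(x,e)) - sqrt(alpha-d(y,e)))^2)."""
    dxy = geodesic(x, y)
    gx = math.sqrt(max(0.0, alpha - geodesic(x, e)))
    gy = math.sqrt(max(0.0, alpha - geodesic(y, e)))
    return math.sqrt(dxy ** 2 + alpha * (gx - gy) ** 2) / alpha

# Cap of angular radius 0.5 around the north pole of the 2-sphere (illustrative values).
alpha = 0.5
e = (0.0, 0.0, 1.0)
boundary_point = (math.sin(alpha), 0.0, math.cos(alpha))

# Center to boundary: d = alpha and the square-root terms differ by sqrt(alpha),
# so the value is sqrt(alpha^2 + alpha^2) / alpha = sqrt(2), independent of alpha.
print(cap_metric(e, boundary_point, e, alpha))
```

The square-root terms stretch distances near the boundary of the cap, so a point set that is well separated in this metric is denser near the boundary than one separated in plain geodesic distance.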