Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis
This work addresses traffic congestion classification for intelligent transportation systems and urban traffic management. It is an incremental improvement achieved through multimodal integration.
The paper tackles traffic congestion classification on an ordinal scale from 1 to 5 with a multimodal framework that combines vision-language reasoning, object detection, and motion analysis. It reports 76.7% accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa of 0.684.
Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework that combines open-vocabulary vision-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show that the model achieves 76.7% accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework's effectiveness in preserving ordinal structure and in leveraging the vision-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.
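Because QWK is the metric used here to assess how well the ordinal structure is preserved, a minimal sketch of its standard definition may help. The function name, its parameters, and the toy labels below are illustrative assumptions, not the paper's evaluation code. The sketch uses the usual quadratic weights w_ij = (i - j)^2 / (K - 1)^2 over K = 5 levels, so a prediction that is off by one level is penalized far less than one that is off by four.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_level=1, max_level=5):
    """Cohen's kappa with quadratic weights for ordinal labels.

    Hypothetical helper for illustration; labels are integers in
    [min_level, max_level], matching the 1-5 congestion scale.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    k = max_level - min_level + 1
    n = len(y_true)

    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_level][p - min_level] += 1

    # Marginal counts, used to build the chance-agreement matrix.
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Degenerate case: every label and every prediction is the same level.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy labels (not data from the paper): one error, near versus far.
truth = [1, 2, 3, 4, 5, 1]
near_miss = [1, 2, 3, 4, 5, 2]  # last sample off by one level
far_miss = [1, 2, 3, 4, 5, 5]   # last sample off by four levels

print(quadratic_weighted_kappa(truth, near_miss))
print(quadratic_weighted_kappa(truth, far_miss))
```

Both toy predictions have the same accuracy (5 of 6 correct), yet the far miss yields a noticeably lower QWK. This is why QWK is reported alongside accuracy and F1 for an ordinal task.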