Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction
This work addresses traffic prediction for urban planning and management. Its contribution is incremental: it builds on existing Conv-LSTM ideas and adds specific enhancements for scalability.
The paper tackles traffic video prediction by modeling both fine-grained and coarse spatial structure together with temporal relationships. It introduces a tile-aware, cascaded-memory Conv-LSTM with cross-frame attention and a memory-flexible training scheme, which improves scalability and yields competitive forecasting performance on large-scale traffic heatmaps.
This extended abstract describes our solution for the Traffic4Cast Challenge 2019. The task requires modeling both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long sequences. Building on Conv-LSTM ideas, we introduce a tile-aware, cascaded-memory Conv-LSTM augmented with cross-frame additive attention and a memory-flexible training scheme. Frames are sampled per spatial tile, so the model learns tile-local dynamics, and per-tile memory cells can be updated sparsely, paged, or compressed to scale to large maps. We provide a compact theoretical analysis (a tight Lipschitz bound for softmax/attention and a lower bound on the tiling error) that explains stability and the memory-accuracy tradeoffs, and we empirically demonstrate improved scalability and competitive forecasting performance on large-scale traffic heatmaps.
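To make the two central ideas concrete, the following is a minimal sketch of per-tile sampling and additive attention across frames. It is not the architecture from the paper: the Conv-LSTM, the cascaded memory, and the paging/compression scheme are omitted. The function names (`split_into_tiles`, `additive_attention`), the 4x4 frame size, the 2x2 tile size, the hidden size, and the random weights are all assumptions chosen for illustration, and the attention uses the standard additive form score_t = v . tanh(W_q q + W_k k_t).

```python
import math
import random


def split_into_tiles(frame, tile_h, tile_w):
    """Split a 2D grid (list of rows) into non-overlapping tiles.

    Returns a dict mapping (tile_row, tile_col) to a tile (list of rows).
    Assumes the frame height and width are multiples of the tile size.
    """
    height, width = len(frame), len(frame[0])
    if height % tile_h or width % tile_w:
        raise ValueError("frame size must be a multiple of the tile size")
    tiles = {}
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            tiles[(r // tile_h, c // tile_w)] = [
                row[c:c + tile_w] for row in frame[r:r + tile_h]
            ]
    return tiles


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, w_q, w_k, v):
    """Additive attention: score_t = v . tanh(W_q q + W_k k_t).

    Returns (weights, context), where context is the weighted sum of keys.
    """
    projected_query = matvec(w_q, query)
    scores = []
    for key in keys:
        projected_key = matvec(w_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(projected_query, projected_key)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return weights, context


# Hypothetical toy data: three 4x4 "heatmap" frames, attended over one tile.
random.seed(0)
frames = [[[float(t + 4 * r + c) for c in range(4)] for r in range(4)] for t in range(3)]
tile_id = (0, 1)  # assumed tile of interest
keys = [
    [x for row in split_into_tiles(f, 2, 2)[tile_id] for x in row]  # flatten the tile
    for f in frames
]
query = keys[-1]  # use the latest frame's tile as the query
hidden_size = 3   # assumed projection size
w_q = [[random.uniform(-0.1, 0.1) for _ in query] for _ in range(hidden_size)]
w_k = [[random.uniform(-0.1, 0.1) for _ in query] for _ in range(hidden_size)]
v = [random.uniform(-0.1, 0.1) for _ in range(hidden_size)]
weights, context = additive_attention(query, keys, w_q, w_k, v)
print("attention weights over frames:", weights)
print("context vector for tile", tile_id, ":", context)
```

Because each tile is processed independently here, only the tiles being sampled need to be held in memory at once, which is the intuition behind trading memory for accuracy via tiling; how the actual model shares or compresses per-tile memory cells is not specified by this sketch.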