CVJul 25, 2022
Towards Complex Document Understanding By Discrete ReasoningFengbin Zhu, Wenqiang Lei, Fuli Feng et al.
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language, which is an emerging research topic for both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured table(s) and unstructured text as well as 16,558 question-answer pairs by extending the TAT-QA dataset. These documents are sampled from real-world financial reports and contain lots of numbers, which means discrete reasoning capability is demanded to answer questions on this dataset. Based on TAT-DQA, we further develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, the performance still lags far behind that of expert humans. We expect that our new TAT-DQA dataset would facilitate the research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. Also, we hope the proposed model would inspire researchers to design more advanced Document VQA models in future. Our dataset will be publicly available for non-commercial use at https://nextplusplus.github.io/TAT-DQA/.
57.9LGMay 29
Chain-of-Thought and Compressed Looped Transformers: A Memory-Budget SeparationHaozhou Zhang
Chain-of-thought prompting and looped Transformers both give a fixed model more test-time computation, but they differ in what they remember. Chain-of-thought stores intermediate state in generated tokens that remain in the context, whereas a looped Transformer carries state through recurrent hidden activations. We argue that this persistent mutable memory is a central resource for test-time reasoning. We compare three memory regimes, the compressed latent loop, the full sequence-state loop, and the chain-of-thought scratchpad. Our main result shows that a compressed loop is limited by the size of its recurrent state. Running the loop longer adds computation but does not by itself create a growing scratchpad, so a loop with a small recurrent state remains a small-space reasoner even when run for many steps. Under a standard complexity assumption, such loops cannot decide problems that are P-complete under logspace reductions, whereas polynomial-length chain-of-thought can. The separation is specific to compressed loops, as full sequence-state loops carry state at every input position and live in a memory-rich regime closer to explicit scratchpads. Controlled pointer-chasing and associative-recall sweeps illustrate this memory-budget view, with performance sensitive to whether the persistent-state budget matches the task's working-memory demand.
55.9LGMay 8
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix OptimisationHaozhou Zhang
Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.