FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
This work addresses the challenge of diagnosing visual-mathematical reasoning in AI systems and provides a contamination-resistant benchmark, though the contribution is incremental in that it builds on existing evaluation methods for multimodal models.
The paper evaluates whether multimodal AI systems can abstract mathematical rules from visual patterns. It introduces FractalBench, a benchmark for fractal program synthesis from images, and finds that while 76% of model outputs are syntactically valid code, only 4% capture the mathematical structure. Success rates range from 17-21% for geometric transformations (Koch curves) to less than 2% for branching recursion (trees).
Mathematical reasoning requires abstracting symbolic rules from visual patterns -- inferring the infinite from the finite. We investigate whether multimodal AI systems possess this capability through FractalBench, a benchmark evaluating fractal program synthesis from images. Fractals provide ideal test cases: Iterated Function Systems with only a few contraction maps generate complex self-similar patterns through simple recursive rules, requiring models to bridge visual perception with mathematical abstraction. We evaluate four leading MLLMs -- GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL -- on 12 canonical fractals. Models must generate executable Python code reproducing the fractal, enabling objective evaluation. Results reveal a striking disconnect: 76% generate syntactically valid code but only 4% capture mathematical structure. Success varies systematically -- models handle geometric transformations (Koch curves: 17-21%) but fail at branching recursion (trees: <2%), revealing fundamental gaps in mathematical abstraction. FractalBench provides a contamination-resistant diagnostic for visual-mathematical reasoning and is available at https://github.com/NaiveNeuron/FractalBench
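To make the notion of an Iterated Function System concrete, the following is a minimal Python sketch of how three contraction maps generate a self-similar point set. It is an illustration only: the Sierpinski triangle is chosen as a familiar example (this abstract does not state whether it is among the 12 benchmark fractals), and the names `make_contraction`, `iterate_ifs`, and `sierpinski_points` are hypothetical and not taken from the FractalBench repository.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Map = Callable[[Point], Point]


def make_contraction(scale: float, shift: Point) -> Map:
    """Return the affine map p -> scale * p + shift (a contraction when 0 < scale < 1)."""
    def apply(p: Point) -> Point:
        return (scale * p[0] + shift[0], scale * p[1] + shift[1])
    return apply


# Assumed example: three half-scale maps whose attractor is a Sierpinski
# triangle with vertices (0, 0), (1, 0), and (0.5, 1).
SIERPINSKI_MAPS: List[Map] = [
    make_contraction(0.5, (0.0, 0.0)),
    make_contraction(0.5, (0.5, 0.0)),
    make_contraction(0.5, (0.25, 0.5)),
]


def iterate_ifs(maps: List[Map], points: List[Point], depth: int) -> List[Point]:
    """Apply every map to every point, `depth` times in succession."""
    for _ in range(depth):
        points = [f(p) for p in points for f in maps]
    return points


def sierpinski_points(depth: int) -> List[Point]:
    """Approximate the attractor starting from the single seed point (0, 0)."""
    return iterate_ifs(SIERPINSKI_MAPS, [(0.0, 0.0)], depth)
```

Each iteration triples the number of points, so a depth of `n` yields `3 ** n` points; plotting them (for example with a scatter plot) shows the self-similar structure emerging from the same three rules applied recursively.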