Hindsight Preference Optimization for Financial Time Series Advisory
For practitioners in financial time series advisory, this method makes it possible to train a small model that outperforms a much larger teacher without human annotation, though the approach is domain-specific and incremental.
The paper tackles the challenge of training language models for financial advisory, where quality depends on future outcomes. It proposes Hindsight Preference Optimization, which uses observed outcomes to generate preference pairs for Direct Preference Optimization (DPO) without human annotation, and reports a 4B model outperforming its 235B teacher on accuracy and advisory quality.
Time series models predict numbers; decision-makers need advisory -- directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning -- using information unavailable during execution to retrospectively generate training signal, and preference alignment -- and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on S&P 500 equity time series, where a 4B model outperforms its 235B teacher on both accuracy and advisory quality.
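The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Advisory` type, the `toy_hindsight_judge` function (a rule-based stand-in for the LLM judge), the best-versus-worst pairing rule, and the value of `beta` are all hypothetical. Only the general shape, in which candidates are scored with the realized outcome in view and then turned into a chosen/rejected pair, and the standard per-pair DPO loss are taken as given.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Advisory:
    direction: str  # "up" or "down" (hypothetical schema)
    text: str       # reasoning, suggestions, risk notes


def toy_hindsight_judge(advisory: Advisory, realized_return: float) -> float:
    """Hypothetical stand-in for the LLM judge.

    It sees the realized outcome, which was unknown when the advisory
    was written, and returns a scalar quality score.
    """
    realized_direction = "up" if realized_return > 0 else "down"
    score = 1.0 if advisory.direction == realized_direction else 0.0
    # Crude proxy for "risk management": reward mentioning a stop level.
    if "stop" in advisory.text.lower():
        score += 0.5
    return score


def build_preference_pair(
    candidates: List[Advisory],
    realized_return: float,
    judge: Callable[[Advisory, float], float],
) -> Optional[Tuple[Advisory, Advisory]]:
    """Return (chosen, rejected), or None if the judge cannot separate them."""
    scored = [(judge(c, realized_return), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    worst_score, worst = min(scored, key=lambda pair: pair[0])
    if best_score == worst_score:
        return None  # no preference signal in this sample
    return best, worst


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not from the source
) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


candidates = [
    Advisory("up", "Buy; set a stop-loss 3% below entry."),
    Advisory("down", "Sell into strength."),
    Advisory("up", "Buy."),
]
pair = build_preference_pair(candidates, realized_return=0.02, judge=toy_hindsight_judge)
```

In practice the sequence log-probabilities passed to `dpo_loss` would come from the policy and a frozen reference model, and the judge would be an LLM prompted with the advisory and the observed outcome; both are outside the scope of this sketch.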