IT Jul 3, 2022
Scalable Polar Code Construction for Successive Cancellation List Decoding: A Graph Neural Network-Based Approach
Yun Liao, Seyyed Ali Hashemi, Hengjie Yang et al.
While polar codes for successive-cancellation decoding can be constructed efficiently by sorting the bit-channels, finding optimal polar codes for cyclic-redundancy-check-aided successive-cancellation list (CA-SCL) decoding in an efficient and scalable manner remains an open problem. This paper first maps a polar code to a unique heterogeneous graph called the polar-code-construction message-passing (PCCMP) graph. Next, a heterogeneous graph-neural-network-based iterative message-passing (IMP) algorithm is proposed, which aims to find a PCCMP graph that corresponds to the polar code with minimum frame error rate under CA-SCL decoding. The major advantage of this new IMP algorithm is its scalability: the model complexity is independent of the blocklength and code rate, and an IMP model trained on a short polar code can be readily applied to the construction of a long polar code. Numerical experiments show that IMP-based polar-code constructions outperform classical constructions under CA-SCL decoding. In addition, when an IMP model trained on a length-128 polar code is applied directly to the construction of polar codes with different code rates and blocklengths, simulations show that these polar code constructions deliver performance comparable to that of the 5G polar codes.
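To make the "sorting the bit-channels" baseline concrete, here is a minimal sketch of the classical construction for successive-cancellation decoding. It is not the IMP method of the paper. It assumes a binary erasure channel with erasure probability `eps`, for which the bit-channel erasure probabilities follow Arıkan's exact recursion; the function names and the choice `eps=0.5` are illustrative assumptions.

```python
def bec_bit_channel_erasure(n_levels, eps):
    """Erasure probabilities of the N = 2**n_levels synthesized bit-channels.

    Assumes a binary erasure channel BEC(eps); indices follow the natural
    (Arikan) ordering of the bit-channels.
    """
    z = [eps]
    for _ in range(n_levels):
        nxt = []
        for p in z:
            nxt.append(2 * p - p * p)  # degraded ("minus") bit-channel
            nxt.append(p * p)          # upgraded ("plus") bit-channel
        z = nxt
    return z


def construct_polar_code(n_levels, k, eps=0.5):
    """Return the sorted indices of the k most reliable bit-channels."""
    z = bec_bit_channel_erasure(n_levels, eps)
    # Sort bit-channels from most to least reliable (smallest erasure prob first).
    order = sorted(range(len(z)), key=lambda i: z[i])
    return sorted(order[:k])


# Example: length-8, rate-1/2 code designed for BEC(0.5).
info_set = construct_polar_code(n_levels=3, k=4)
print(info_set)
```

The remaining indices form the frozen set. This ranking targets successive-cancellation decoding only; as the abstract notes, it is not necessarily optimal under CA-SCL decoding, which is the gap the paper addresses.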
CL Mar 6
MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su et al.
Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
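The abstract does not give the estimator's formula, so the following is only a hedged sketch of how Monte Carlo returns over per-turn judge rewards might be combined with two normalization levels. The grouping used for turn-level normalization (same turn index across trajectories in the batch), the convex mixing weight `alpha`, the discount `gamma`, and all function names are assumptions made for illustration, not the authors' definitions.

```python
from statistics import mean, pstdev


def mc_returns(rewards, gamma=1.0):
    """Discounted Monte Carlo return at each turn of one dialogue."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization (eps guards against zero std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def mixed_advantages(batch_rewards, gamma=1.0, alpha=0.5):
    """Mix turn-level and batch-level normalized returns (assumed form)."""
    returns = [mc_returns(r, gamma) for r in batch_rewards]

    # Batch-level: normalize over every turn of every trajectory.
    flat = iter(normalize([g for traj in returns for g in traj]))
    batch_adv = [[next(flat) for _ in traj] for traj in returns]

    # Turn-level (assumption): normalize across trajectories at the same turn index.
    turn_adv = [[0.0] * len(traj) for traj in returns]
    for t in range(max(len(traj) for traj in returns)):
        idx = [i for i, traj in enumerate(returns) if len(traj) > t]
        for i, a in zip(idx, normalize([returns[i][t] for i in idx])):
            turn_adv[i][t] = a

    # Assumed convex combination of the two normalized signals.
    return [
        [alpha * ta + (1 - alpha) * ba for ta, ba in zip(tr, br)]
        for tr, br in zip(turn_adv, batch_adv)
    ]


# Two dialogues with hypothetical per-turn judge scores.
adv = mixed_advantages([[1.0, 0.0], [0.0, 1.0]])
```

No value network appears anywhere in this sketch, which is consistent with the critic-free description: the advantages come only from judge rewards and normalization statistics computed over the sampled batch.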