72.1 IR May 29
An Industrial-Scale Sequential Recommender for LinkedIn Feed Ranking
Lars Hertel, Gaurav Srivastava, Syed Ali Naqvi et al.
LinkedIn Feed enables professionals worldwide to discover relevant content, build connections, and share knowledge at scale. We present Feed Sequential Recommender (Feed SR), a transformer-based sequential ranking model for LinkedIn Feed that replaces a DCNv2-based ranker and meets strict production constraints. We detail the modeling choices, training techniques, and serving optimizations that enable deployment at a scale of 1.2 billion members. Feed SR has been serving the majority of LinkedIn's Feed traffic for over three months and shows significant improvements in member engagement (+2.10% time spent, +3.52% likes, comments, or reshares) in online A/B tests compared to the existing production model. We also describe our deployment experience with alternative sequential and LLM-based ranking architectures and why Feed SR provided the best combination of online metrics and production efficiency.
41.5 SE Apr 20
Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation
Donglin Li, Daming Li, Hanyuan Shi et al.
Block-based programming environments such as Scratch are widely used in introductory computing education, yet scalable and reliable automated assessment remains elusive. Scratch programs are highly heterogeneous, event-driven, and visually grounded, which makes traditional assertion-based or test-based grading brittle and difficult to scale. As a result, assessment in real Scratch classrooms still relies heavily on manual inspection and delayed feedback, introducing inconsistency across instructors and limiting scalability. We present Raven, an automated assessment framework for Scratch that replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all student submissions. Raven integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading criteria, without requiring explicit test cases or predefined outputs. This design enables consistent evaluation despite substantial diversity in implementation strategies and interaction sequences. We evaluate Raven on 13 real Scratch assignments comprising over 140 student submissions with ground-truth labels from human graders. The results show that Raven significantly outperforms prior automated assessment tools in both grading accuracy and robustness across diverse programming styles. A classroom study with 30 students and 10 instructors further demonstrates strong user acceptance and practical applicability. Together, these findings highlight the effectiveness of task-level behavioral abstractions for scalable assessment of open-ended, event-driven programs.
28.7 SE Mar 31
EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
Yuan Si, Ming Wang, Daming Li et al.
Scratch is the most popular programming environment for novices, with over 1.15 billion projects created worldwide. Unlike traditional languages, correctness in Scratch is defined by visible behavior on the stage rather than by code structure alone, so programs that appear correct in the workspace can still fail at runtime due to timing, event ordering, or cross-sprite interactions. Visual execution evidence such as gameplay videos can therefore be essential for diagnosis and repair. However, capturing and processing this evidence inside an automated repair loop introduces substantial overhead. Probing execution, recording stage behavior, rebuilding executable .sb3 projects, and verifying candidate fixes consume time, monetary cost, and resources across an entire repair trajectory rather than a single model call. We present EcoScratch, a repair pipeline that uses lightweight runtime signals to decide whether the next attempt stays text-only or escalates to multimodal prompting. The controller also sets the JSON Patch budget and verification effort, so evidence choice and repair budget are coupled inside the same decision. EcoScratch rebuilds candidate fixes into executable .sb3 projects and records per-trajectory traces, monetary cost, and local-runtime energy. We evaluate 12 models on 100 executable Scratch repair projects under four controller settings, yielding 4800 repair trajectories. In this matrix, a selective multimodal policy gives the strongest observed success-cost-energy tradeoff. It reaches the highest generation success (30.3%) while using less average cost and local-runtime energy than the two non-adaptive multimodal baselines under the same bounded trajectory budget; text-only remains the lowest-cost floor. Across the evaluated matrix, multimodal evidence helps most when it is used to control escalation within a bounded trajectory budget rather than applied uniformly.
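The coupling of evidence choice and repair budget described in the EcoScratch abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the signal fields (`runtime_error`, `failed_attempts`), the escalation rule, and the budget numbers are all hypothetical. The sketch only shows the shape of a controller that returns the prompting mode, patch budget, and verification effort in a single decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSignals:
    """Hypothetical lightweight signals gathered from a probe run."""
    runtime_error: bool   # the probe run raised an explicit error
    failed_attempts: int  # repair attempts already spent in this trajectory


@dataclass(frozen=True)
class RepairPlan:
    """One coupled decision: evidence mode plus repair budget."""
    multimodal: bool   # True: attach visual evidence to the prompt
    patch_budget: int  # max JSON Patch operations allowed (illustrative)
    verify_runs: int   # number of verification executions (illustrative)


def plan_next_attempt(signals: RuntimeSignals, attempts_left: int) -> RepairPlan:
    """Decide whether the next attempt stays text-only or escalates.

    Assumed rule: an explicit runtime error is usually diagnosable from
    text, so stay text-only. A program that runs cleanly but has already
    failed a repair attempt likely has a behavioral bug, so escalate.
    """
    if attempts_left <= 0:
        raise ValueError("trajectory budget exhausted")
    escalate = (not signals.runtime_error) and signals.failed_attempts >= 1
    if escalate:
        return RepairPlan(multimodal=True, patch_budget=4, verify_runs=2)
    return RepairPlan(multimodal=False, patch_budget=2, verify_runs=1)


# The evaluation matrix size stated in the abstract: models x projects x settings.
TRAJECTORIES = 12 * 100 * 4
```

Under these assumed rules, a first attempt on a project with an explicit runtime error stays text-only with the smaller budget, while a cleanly running project that has already failed one repair escalates to multimodal prompting with a larger patch and verification budget.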