Jinsong Wu

h-index: 51

5 papers

271 citations

Novelty: 38%

AI Score: 42

Ranked #60,401 of 194,257 authors (top 31%)

#584 in SE (top 19%)

5 Papers

13.0 | CL | Aug 29, 2025 | Code

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Zinan Tang, Xin Gao, Qizhi Pei et al.

Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often rely on static dataset curation that fails to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.

5.2 | SE | Oct 31, 2017 | Code

A Prediction Model of the Project Life-span in Open Source Software Ecosystem

Zhifang Liao, Benhong Zhao, Shengzong Liu et al.

In natural ecosystems, animal life-spans are determined by genes and other biological characteristics. Similarly, software project life-spans are related to internal and external characteristics. Analyzing the relations between these characteristics and the project life-span may help developers, investors, and contributors control the development cycle of a software project. The paper provides insight into the project life-span in a free open source software ecosystem. A statistical analysis of project characteristics in GitHub is presented, and we find that the choice of programming languages, the number of files, the label format of the project, and the relevant membership expressions can impact the life-span of a project. Based on these characteristics, we also propose a prediction model to estimate the project life-span in open source software ecosystems. These results may help developers reschedule projects in open source software ecosystems.

5.2 | SE | Oct 28, 2017 | Code

Topic-based Integrator Matching for Pull Request

Zhifang Liao, Yanbing Li, Jinsong Wu et al.

Pull Requests (PRs) are the main method for code contributions from external contributors in GitHub. PR review is an essential part of open source software development for maintaining software quality. Matching a new PR to an appropriate integrator makes PR review more effective. However, PRs and integrators are currently matched manually in GitHub. To make this process more efficient, we propose a Topic-based Integrator Matching Algorithm (TIMA) to predict highly relevant collaborators (the core developers) as the integrators for incoming PRs. TIMA takes full advantage of the textual semantics of PRs. To define the relationships between topics and collaborators, TIMA builds a relation matrix over topics and collaborators. According to the relevance between topics and collaborators, TIMA matches suitable collaborators as the PR integrator.
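The topic-to-collaborator relation matrix described in this abstract can be illustrated with a minimal sketch. This is not the authors' TIMA implementation: the function names, the input format (per-PR topic weights that some topic model is assumed to have produced already), and the additive scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict


def build_relation_matrix(history):
    """Accumulate topic weights per integrator from past PRs.

    history: list of (topic_weights, integrator) pairs, where topic_weights
    maps a topic name to its weight in that PR (hypothetical input format).
    """
    relation = defaultdict(lambda: defaultdict(float))
    for topic_weights, integrator in history:
        for topic, weight in topic_weights.items():
            relation[topic][integrator] += weight
    return relation


def rank_integrators(relation, pr_topics, top_k=2):
    """Score each collaborator by topic overlap with a new PR (assumed rule)."""
    scores = defaultdict(float)
    for topic, weight in pr_topics.items():
        for integrator, strength in relation.get(topic, {}).items():
            scores[integrator] += weight * strength
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_k]


history = [
    ({"docs": 1.0}, "alice"),
    ({"build": 0.8, "docs": 0.2}, "bob"),
    ({"build": 1.0}, "bob"),
]
relation = build_relation_matrix(history)
print(rank_integrators(relation, {"docs": 0.9, "build": 0.1}))
```

In this toy history, a documentation-heavy PR is matched to the collaborator who previously integrated documentation PRs, while a build-only PR would be matched to the other.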

4.7 | LG | Oct 29, 2018

Big Data Meet Cyber-Physical Systems: A Panoramic Survey

Rachad Atat, Lingjia Liu, Jinsong Wu et al.

The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world via creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The information and communication technology (ICT) sector is experiencing a significant growth in data traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth of sensors deployments that are anticipated in the near future. It is expected to outstandingly increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy via providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS require cybersecurity to protect them against malicious attacks and unauthorized intrusion, which become a challenge with the enormous amount of data that is continuously being generated in the network. Thus, we also provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the contexts of CPS.

3.3 | SI | Oct 28, 2017

DevRank: Mining Influential Developers In Github

Zhifang Liao, Haozhi Jin, Yifan Li et al.

As social coding becomes increasingly popular, understanding the influence of developers can benefit various applications, such as advertising new projects and innovations. However, most existing works have focused only on ranking influential nodes in non-weighted and homogeneous networks, and are not able to transfer proper importance scores to the truly important nodes. To rank developers in GitHub, we define a developer's influence as the capacity to attract attention, which can be measured by the number of followers obtained in the future. We further define a new method, DevRank, which ranks developers by influence propagation through a heterogeneous network constructed according to user behaviors, including "commit" and "follow". Our experiment compares the performance of DevRank with other link analysis algorithms; the results show that DevRank can improve ranking accuracy.
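Influence propagation over a weighted network built from "commit" and "follow" behaviors can be sketched with a generic weighted PageRank. This is a stand-in technique, not the DevRank algorithm itself, whose details the abstract does not give. The edge weights, the edge direction (influence flows from the acting user to the target developer), and all function names are assumptions.

```python
# Assumed relative importance of each behavior; not taken from the paper.
EDGE_WEIGHTS = {"follow": 1.0, "commit": 2.0}


def build_edges(events):
    """Turn (source, target, kind) events into weighted directed edges."""
    return [(src, dst, EDGE_WEIGHTS[kind]) for src, dst, kind in events]


def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Power iteration for PageRank with weighted out-edges."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    out_weight = {n: 0.0 for n in nodes}
    for src, _, weight in edges:
        out_weight[src] += weight
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src, dst, weight in edges:
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


events = [("a", "c", "follow"), ("b", "c", "follow"), ("c", "a", "commit")]
ranks = weighted_pagerank(build_edges(events))
print(max(ranks, key=ranks.get))
```

In this toy graph, the developer followed by two others ends up with the highest score, and the scores sum to one across all nodes.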