Large-Scale Multidimensional Knowledge Profiling of Scientific Literature
This work provides an evidence-based resource that researchers and policymakers can use to track how AI research evolves and to identify emerging trends. The contribution is incremental, since it builds on existing bibliometric and semantic analysis methods.
The authors address the difficulty of synthesizing the rapidly expanding research in machine learning, vision, and language. They compile a corpus of more than 100,000 papers from 22 major conferences (2020-2025) and build a multidimensional profiling pipeline on top of it. The resulting analysis highlights shifts such as growth in safety and multimodal reasoning, and stabilization in areas such as neural machine translation.
The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: https://github.com/xzc-zju/Profiling_Scientific_Literature
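To make the idea of a topic lifecycle concrete, the following is a minimal sketch of how per-year topic shares could be computed once each paper has a topic label. It is not the authors' pipeline: the function names `topic_shares` and `share_change`, the topic labels, and the toy records are all assumptions made for illustration, and the records are invented rather than taken from the corpus.

```python
from collections import Counter, defaultdict


def topic_shares(records):
    """Return {year: {topic: fraction of that year's papers}}.

    `records` is an iterable of (year, topic) pairs, one per paper.
    """
    per_year = defaultdict(Counter)
    for year, topic in records:
        per_year[year][topic] += 1

    shares = {}
    for year, counts in per_year.items():
        total = sum(counts.values())
        shares[year] = {topic: n / total for topic, n in counts.items()}
    return shares


def share_change(shares, topic):
    """Share in the latest year minus share in the earliest year."""
    years = sorted(shares)
    first = shares[years[0]].get(topic, 0.0)
    last = shares[years[-1]].get(topic, 0.0)
    return last - first


# Toy, made-up (year, topic) records; NOT data from the paper.
toy_records = [
    (2020, "machine translation"), (2020, "machine translation"),
    (2020, "graph methods"), (2020, "safety"),
    (2025, "safety"), (2025, "safety"),
    (2025, "machine translation"), (2025, "graph methods"),
]

shares = topic_shares(toy_records)
for topic in ("safety", "machine translation", "graph methods"):
    print(topic, share_change(shares, topic))
```

A positive change suggests a growing topic, a negative one a declining topic, and a value near zero a stable one. Normalizing by the yearly total matters because the number of published papers itself changes from year to year, so raw counts alone would be misleading.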