LG | Mar 31, 2019 | Code
SysML'19 demo: customizable and reusable Collective Knowledge pipelines to automate and reproduce machine learning experiments
Grigori Fursin
Reproducing, comparing and reusing results from machine learning and systems papers is a very tedious, ad hoc and time-consuming process. I will demonstrate how to automate this process using open-source, portable, customizable and CLI-based Collective Knowledge workflows and pipelines developed by the community. I will help participants run several real-world non-virtualized CK workflows from the SysML'19 conference, companies (General Motors, Arm) and the MLPerf benchmark to automate benchmarking and co-design of efficient software/hardware stacks for machine learning workloads. I hope that our approach will help authors reduce their effort when sharing reusable and extensible research artifacts while enabling artifact evaluators to automatically validate experimental results from published papers in a standard and portable way.
ML | Jan 19, 2018 | Code
Introducing ReQuEST: an Open Platform for Reproducible and Quality-Efficient Systems-ML Tournaments
Thierry Moreau, Anton Lokhmotov, Grigori Fursin
Co-designing efficient machine learning based systems across the whole hardware/software stack to trade off speed, accuracy, energy and costs is becoming extremely complex and time consuming. Researchers often struggle to evaluate and compare different published works across rapidly evolving software frameworks, heterogeneous hardware platforms, compilers, libraries, algorithms, data sets, models, and environments. We present our community effort to develop an open co-design tournament platform with an online public scoreboard. It will gradually incorporate best research practices while providing a common way for multidisciplinary researchers to optimize and compare the quality vs. efficiency Pareto optimality of various workloads on diverse and complete hardware/software systems. We want to leverage the open-source Collective Knowledge framework and the ACM artifact evaluation methodology to validate and share the complete machine learning system implementations in a standardized, portable, and reproducible fashion. We plan to hold regular multi-objective optimization and co-design tournaments for emerging workloads such as deep learning, starting with ASPLOS'18 (ACM conference on Architectural Support for Programming Languages and Operating Systems - the premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking) to build a public repository of the most efficient machine learning algorithms and systems which can be easily customized, reused and built upon.
SE | Aug 11, 2013 | Code
Collective Mind: cleaning up the research and experimentation mess in computer engineering using crowdsourcing, big data and machine learning
Grigori Fursin
Software and hardware co-design and optimization of HPC systems has become intolerably complex, ad hoc, time-consuming and error-prone due to the enormous number of available design and optimization choices, complex interactions between all software and hardware components, and multiple strict requirements placed on performance, power consumption, size, reliability and cost. We present our novel long-term holistic and practical solution to this problem based on a customizable, plugin-based, schema-free, heterogeneous, open-source Collective Mind repository and infrastructure with unified web interfaces and an online advice system. This collaborative framework distributes analysis and multi-objective off-line and on-line auto-tuning of computer systems among many participants while utilizing any available smart phone, tablet, laptop, cluster or data center, and continuously observing, classifying and modeling their realistic behavior. Any unexpected behavior is analyzed using shared data mining and predictive modeling plugins or exposed to the community at cTuning.org for collaborative explanation, top-down complexity reduction, incremental problem decomposition and detection of correlating program, architecture or run-time properties (features). Gradually increasing optimization knowledge helps to continuously improve optimization heuristics of any compiler, predict optimizations for new programs or suggest efficient run-time (online) tuning and adaptation strategies depending on end-user requirements.
We decided to share all our past research artifacts including hundreds of codelets, numerical applications, data sets, models, universal experimental analysis and auto-tuning pipelines, self-tuning machine learning based meta compiler, and unified statistical analysis and machine learning plugins in a public repository to initiate systematic, reproducible and collaborative research, development and experimentation with a new publication model where experiments and techniques are validated, ranked and improved by the community.
AR | Oct 15, 2024
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI
Arya Tschand, Arun Tejusve Raghunath Rajan, Sachin Idgunji et al.
Rapid adoption of machine learning (ML) technologies has led to a surge in power consumption across diverse systems, from tiny IoT devices to massive datacenter clusters. Benchmarking the energy efficiency of these systems is crucial for optimization, but presents novel challenges due to the variety of hardware platforms, workload characteristics, and system-level interactions. This paper introduces MLPerf Power, a comprehensive benchmarking methodology with capabilities to evaluate the energy efficiency of ML systems at power levels ranging from microwatts to megawatts. Developed by a consortium of industry professionals from more than 20 organizations, MLPerf Power establishes rules and best practices to ensure comparability across diverse architectures. We use representative workloads from the MLPerf benchmark suite to collect 1,841 reproducible measurements from 60 systems across the entire range of ML deployment scales. Our analysis reveals trade-offs between performance, complexity, and energy efficiency across this wide range of systems, providing actionable insights for designing optimized ML solutions from the smallest edge devices to the largest cloud infrastructures. This work emphasizes the importance of energy efficiency as a key metric in the evaluation and comparison of ML systems, laying the foundation for future research in this critical area. We discuss the implications for developing sustainable AI solutions and standardizing energy efficiency benchmarking for ML systems.
LG | Sep 14, 2025
Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset
Grigori Fursin, Daniel Altunay
Existing AI system benchmarks such as MLPerf often struggle to keep pace with the rapidly evolving AI landscape, making it difficult to support informed deployment, optimization, and co-design decisions for AI systems. We suggest that benchmarking itself can be framed as an AI task - one in which models are continuously evaluated and optimized across diverse datasets, software, and hardware, using key metrics such as accuracy, latency, throughput, energy consumption, and cost. To support this perspective, we present FlexBench: a modular extension of the MLPerf LLM inference benchmark, integrated with HuggingFace and designed to provide relevant and actionable insights. Benchmarking results and metadata are collected into an Open MLPerf Dataset, which can be collaboratively curated, extended, and leveraged for predictive modeling and feature engineering. We successfully validated the FlexBench concept through MLPerf Inference submissions, including evaluations of DeepSeek R1 and LLaMA 3.3 on commodity servers. The broader objective is to enable practitioners to make cost-effective AI deployment decisions that reflect their available resources, requirements, and constraints.
LG | Jun 24, 2024
Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments
Grigori Fursin
This white paper introduces my educational community initiative to learn how to run AI, ML and other emerging workloads in the most efficient and cost-effective way across diverse models, data sets, software and hardware. This project leverages Collective Mind (CM), virtualized MLOps and DevOps (CM4MLOps), MLPerf benchmarks, and the Collective Knowledge playground (CK), which I have developed in collaboration with the community and MLCommons. I created Collective Mind as a small and portable Python package with minimal dependencies, a unified CLI and Python API to help researchers and engineers automate repetitive, tedious, and time-consuming tasks. I also designed CM as a distributed framework, continuously enhanced by the community through the CM4* repositories, which function as the unified interface for organizing and managing various collections of automations and artifacts. For example, CM4MLOps repository includes many automations, also known as CM scripts, to streamline the process of building, running, benchmarking, and optimizing AI, ML, and other workflows across ever-evolving models, data, and systems. I donated CK, CM and CM4MLOps to MLCommons to foster collaboration between academia and industry to learn how to co-design more efficient and cost-effective AI systems while capturing and encoding knowledge within Collective Mind, protecting intellectual property, enabling portable skills, and accelerating the transition of the state-of-the-art research into production. My ultimate goal is to collaborate with the community to complete my two-decade journey toward creating self-optimizing software and hardware that can automatically learn how to run any workload in the most efficient and cost-effective manner based on user requirements and constraints such as cost, latency, throughput, accuracy, power consumption, size, and other critical factors.
LG | Sep 29, 2021
MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation
Alexandros Karargyris, Renato Umeton, Micah J. Sheller et al.
Medical AI has tremendous potential to advance healthcare by supporting the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving provider and patient experience. We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data. To meet this need, we are building MedPerf, an open framework for benchmarking machine learning in the medical domain. MedPerf will enable federated evaluation in which models are securely distributed to different facilities for evaluation, thereby empowering healthcare organizations to assess and verify the performance of AI models in an efficient and human-supervised process, while prioritizing privacy. We describe the current challenges healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status, and our roadmap. We call for researchers and organizations to join us in creating the MedPerf open benchmarking platform.
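The federated evaluation idea described above, where the model travels to each facility and only aggregate results come back, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MedPerf's implementation or API: the site names, the toy `threshold_model`, and the helpers `evaluate_locally` and `federated_accuracy` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a model maps one feature to a class label,
# and each site privately holds (feature, label) records.
Model = Callable[[float], int]
Records = List[Tuple[float, int]]


def evaluate_locally(model: Model, records: Records) -> Dict[str, int]:
    """Runs at the facility: raw records never leave this function,
    only the sample count and the number of correct predictions do."""
    correct = sum(1 for x, label in records if model(x) == label)
    return {"n": len(records), "correct": correct}


def federated_accuracy(model: Model, sites: Dict[str, Records]) -> float:
    """Runs at the coordinator: combines per-site reports into one
    sample-weighted accuracy without seeing any site's data."""
    reports = [evaluate_locally(model, records) for records in sites.values()]
    total = sum(r["n"] for r in reports)
    return sum(r["correct"] for r in reports) / total


def threshold_model(x: float) -> int:
    """Toy stand-in for a medical AI model (illustrative only)."""
    return 1 if x >= 0.5 else 0


# Made-up data for two hypothetical facilities.
sites = {
    "hospital_a": [(0.9, 1), (0.2, 0), (0.6, 0)],
    "hospital_b": [(0.1, 0), (0.7, 1)],
}
accuracy = federated_accuracy(threshold_model, sites)
print(f"federated accuracy: {accuracy:.2f}")
```

In a real deployment the local step would run inside each facility's own infrastructure; the sketch only shows which information crosses the boundary.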
LG | Nov 2, 2020
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs
Grigori Fursin
This article provides the motivation and overview of the Collective Knowledge framework (CK or cKnowledge). The CK concept is to decompose research projects into reusable components that encapsulate research artifacts and provide unified application programming interfaces (APIs), command-line interfaces (CLIs), meta descriptions and common automation actions for related artifacts. The CK framework is used to organize and manage research projects as a database of such components. Inspired by the USB "plug and play" approach for hardware, CK also helps to assemble portable workflows that can automatically plug in compatible components from different users and vendors (models, datasets, frameworks, compilers, tools). Such workflows can build and run algorithms on different platforms and environments in a unified way using the universal CK program pipeline with software detection plugins and the automatic installation of missing packages. This article presents a number of industrial projects in which the modular CK approach was successfully validated in order to automate benchmarking, auto-tuning and co-design of efficient software and hardware for machine learning (ML) and artificial intelligence (AI) in terms of speed, accuracy, energy, size and various costs. The CK framework also helped to automate the artifact evaluation process at several computer science conferences as well as to make it easier to reproduce, compare and reuse research techniques from published papers, deploy them in production, and automatically adapt them to continuously changing datasets, models and systems. The long-term goal is to accelerate innovation by connecting researchers and practitioners to share and reuse all their knowledge, best practices, artifacts, workflows and experimental results in a common, portable and reproducible format at https://cKnowledge.io .
LG | Jun 12, 2020
The Collective Knowledge project: making ML models more portable and reproducible with open APIs, reusable best practices and MLOps
Grigori Fursin
This article provides an overview of the Collective Knowledge technology (CK or cKnowledge). CK attempts to make it easier to reproduce ML&systems research, deploy ML models in production, and adapt them to continuously changing data sets, models, research techniques, software, and hardware. The CK concept is to decompose complex systems and ad-hoc research projects into reusable sub-components with unified APIs, CLI, and JSON meta description. Such components can be connected into portable workflows using DevOps principles combined with reusable automation actions, software detection plugins, meta packages, and exposed optimization parameters. CK workflows can automatically plug in different models, data and tools from different vendors while building, running and benchmarking research code in a unified way across diverse platforms and environments. Such workflows also help to perform whole system optimization, reproduce results, and compare them using public or private scoreboards on the CK platform (https://cKnowledge.io). For example, the modular CK approach was successfully validated with industrial partners to automatically co-design and optimize software, hardware, and machine learning models for reproducible and efficient object detection in terms of speed, accuracy, energy, size, and other characteristics. The long-term goal is to simplify and accelerate the development and deployment of ML models and systems by helping researchers and practitioners to share and reuse their knowledge, experience, best practices, artifacts, and techniques using open CK APIs.
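The notion of components carrying a JSON meta description that a workflow plugs together can be sketched as follows. This is a minimal illustration only and does not use the real CK package or its API: the component names, the `tags` and `needs` keys, and the `find_by_tag` and `resolve` helpers are all hypothetical.

```python
import json
from typing import Dict, List, Optional

# Hypothetical JSON meta descriptions: each component advertises tags
# and lists the tags of other components it depends on.
COMPONENTS_JSON = """
[
  {"name": "model-a", "tags": ["model", "object-detection"], "needs": ["framework"]},
  {"name": "framework-x", "tags": ["framework"], "needs": []},
  {"name": "dataset-y", "tags": ["dataset", "object-detection"], "needs": []}
]
"""


def find_by_tag(components: List[Dict], tag: str) -> Dict:
    """Return the first component that advertises the requested tag."""
    for component in components:
        if tag in component["tags"]:
            return component
    raise LookupError(f"no component provides tag {tag!r}")


def resolve(components: List[Dict], tag: str,
            order: Optional[List[str]] = None) -> List[str]:
    """Depth-first resolution: dependencies are listed before the
    component that needs them. Dependency cycles are not handled."""
    order = [] if order is None else order
    component = find_by_tag(components, tag)
    for needed in component["needs"]:
        resolve(components, needed, order)
    if component["name"] not in order:
        order.append(component["name"])
    return order


components = json.loads(COMPONENTS_JSON)
plan = resolve(components, "model")
print(plan)
```

Swapping in a different framework would then only require another entry with the `framework` tag, which is the "plug in" behaviour the abstract describes at a much larger scale.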
LG | Jan 22, 2020
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking
Grigori Fursin, Herve Guillou, Nicolas Essayan
We present CodeReef - an open platform to share all the components necessary to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML models across diverse systems in the most efficient way. We also introduce the CodeReef solution - a way to package and share models as non-virtualized, portable, customizable and reproducible archive files. Such ML packages include JSON meta description of models with all dependencies, Python APIs, CLI actions and portable workflows necessary to automatically build, benchmark, test and customize models across diverse platforms, AI frameworks, libraries, compilers and datasets. We demonstrate several CodeReef solutions to automatically build, run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO dataset from the latest MLPerf inference benchmark across a wide range of platforms from Raspberry Pi, Android phones and IoT devices to data centers. Our long-term goal is to help researchers share their new techniques as production-ready packages along with research papers to participate in collaborative and reproducible benchmarking, compare the different ML/software/hardware stacks and select the most efficient ones on a Pareto frontier using online CodeReef dashboards.
LG | Mar 29, 2019
MLSys: The New Frontier of Machine Learning Systems
Alexander Ratner, Dan Alistarh, Gustavo Alonso et al.
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
PF | Jun 19, 2018
A model-driven approach for a new generation of adaptive libraries
Marco Cianfriglia, Flavio Vella, Cedric Nugteren et al.
Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, thus traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate its effectiveness on a BLAS library, specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.
HC | Jan 19, 2018
A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques
Grigori Fursin, Anton Lokhmotov, Dmitry Savenko et al.
Developing efficient software and hardware has never been harder whether it is for a tiny IoT device or an Exascale supercomputer. Apart from the ever-growing design and optimization complexity, there exist even more fundamental problems such as the lack of interdisciplinary knowledge required for effective software/hardware co-design, and a growing technology transfer gap between academia and industry. We introduce our new educational initiative to tackle these problems by developing Collective Knowledge (CK), a unified experimental framework for computer systems research and development. We use CK to teach the community how to make their research artifacts and experimental workflows portable, reproducible, customizable and reusable while enabling sustainable R&D and facilitating technology transfer. We also demonstrate how to redesign multi-objective autotuning and machine learning as a portable and extensible CK workflow. Such workflows enable researchers to experiment with different applications, data sets and tools; crowdsource experimentation across diverse platforms; share experimental results, models, visualizations; gradually expose more design and optimization choices using a simple JSON API; and ultimately build upon each other's findings. As the first practical step, we have implemented customizable compiler autotuning, crowdsourced optimization of diverse workloads across Raspberry Pi 3 devices, reduced the execution time and code size by up to 40%, and applied machine learning to predict optimizations. We hope such an approach will help teach students how to build upon each other's work to enable efficient and self-optimizing software/hardware/model stacks for emerging workloads.
SE | Jun 20, 2015
Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science
Grigori Fursin, Abdul Memon, Christophe Guillon et al.
Nowadays, engineers have to develop software often without even knowing which hardware it will eventually run on in numerous mobile phones, tablets, desktops, laptops, data centers, supercomputers and cloud services. Unfortunately, optimizing compilers are no longer keeping pace with the ever-increasing complexity of computer systems and may produce severely underperforming executable code while wasting expensive resources and energy. We present our practical and collaborative solution to this problem via lightweight wrappers around any software piece when more than one implementation or optimization choice is available. These wrappers are connected with a public Collective Mind autotuning infrastructure and repository of knowledge (c-mind.org/repo) to continuously monitor various important characteristics of these pieces (computational species) across numerous existing hardware configurations together with randomly selected optimizations. Similar to natural sciences, we can now continuously track winning solutions (optimizations for a given hardware) that minimize all costs of a computation (execution time, energy spent, code size, failures, memory and storage footprint, optimization time, faults, contentions, inaccuracy and so on) of a given species on a Pareto frontier along with any unexpected behavior. The community can then collaboratively classify solutions, prune redundant ones, and correlate them with various features of software, its inputs (data sets) and used hardware either manually or using powerful predictive analytics techniques. Our approach can then help create a large, realistic, diverse, representative, and continuously evolving benchmark with related optimization knowledge while gradually covering all possible software and hardware to be able to predict the best optimizations and improve compilers and hardware depending on usage scenarios and requirements.
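Tracking "winning solutions" on a Pareto frontier, as described above, amounts to keeping every solution that no other solution beats on all costs at once. A minimal sketch follows, assuming only two costs for brevity; the optimization names and numbers are invented for illustration and the helper names are hypothetical, not part of Collective Mind.

```python
from typing import Dict, List

# Hypothetical measurements: every cost is "lower is better".
Solution = Dict[str, float]


def dominates(a: Solution, b: Solution, costs: List[str]) -> bool:
    """True if `a` is no worse than `b` on every cost and strictly
    better on at least one of them."""
    no_worse = all(a[c] <= b[c] for c in costs)
    strictly_better = any(a[c] < b[c] for c in costs)
    return no_worse and strictly_better


def pareto_frontier(solutions: Dict[str, Solution], costs: List[str]) -> List[str]:
    """Return the sorted names of solutions not dominated by any other."""
    return sorted(
        name
        for name, sol in solutions.items()
        if not any(
            dominates(other, sol, costs)
            for other_name, other in solutions.items()
            if other_name != name
        )
    )


# Made-up results for four compiler flag choices on one machine.
measurements = {
    "-O0": {"time_s": 10.0, "size_kb": 120.0},
    "-O2": {"time_s": 6.0, "size_kb": 100.0},
    "-O3": {"time_s": 5.0, "size_kb": 140.0},
    "-Os": {"time_s": 7.0, "size_kb": 80.0},
}
frontier = pareto_frontier(measurements, ["time_s", "size_kb"])
print(frontier)  # ['-O2', '-O3', '-Os']
```

Here `-O0` drops out because `-O2` is both faster and smaller, while the other three each win on some trade-off between time and size. The pairwise check is quadratic in the number of solutions, which is acceptable for a sketch but would need a smarter structure for a large repository.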
PL | Jul 14, 2014
Finding representative sets of optimizations for adaptive multiversioning applications
Lianjie Luo, Yang Chen, Chengyong Wu et al.
Iterative compilation is a widely adopted technique to optimize programs for different constraints such as performance, code size and power consumption in rapidly evolving hardware and software environments. However, in the case of statically compiled programs, it is often restricted to optimizations for a specific dataset and may not be applicable to applications that exhibit different run-time behavior across program phases, multiple datasets or when executed in heterogeneous, reconfigurable and virtual environments. Several frameworks have been recently introduced to tackle these problems and enable run-time optimization and adaptation for statically compiled programs based on static function multiversioning and monitoring of online program behavior. In this article, we present a novel technique to select a minimal set of representative optimization variants (function versions) for such frameworks while avoiding performance loss across available datasets and code-size explosion. We developed a novel mapping mechanism using popular decision tree or rule induction based machine learning techniques to rapidly select the best code versions at run-time based on dataset features and minimize selection overhead. These techniques enable creation of self-tuning static binaries or libraries adaptable to changing behavior and environments at run-time using staged compilation that does not require complex recompilation frameworks while effectively outperforming traditional single-version non-adaptable code.