CLDec 11, 2025
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model FactualityAileen Cheng, Alon Jacovi, Amir Globerson et al.
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
CLDec 19, 2023
Gemini: A Family of Highly Capable Multimodal ModelsGemini Team, Rohan Anil, Sebastian Borgeaud et al.
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
ROMay 29, 2017
Effective Metrics for Multi-Robot Motion-PlanningAviel Atias, Kiril Solovey, Oren Salzman et al.
We study the effectiveness of metrics for Multi-Robot Motion-Planning (MRMP) when using RRT-style sampling-based planners. These metrics play the crucial role of determining the nearest neighbors of configurations and in that they regulate the connectivity of the underlying roadmaps produced by the planners and other properties like the quality of solution paths. After screening over a dozen different metrics we focus on the five most promising ones- two more traditional metrics, and three novel ones which we propose here, adapted from the domain of shape-matching. In addition to the novel multi-robot metrics, a central contribution of this work are tools to analyze and predict the effectiveness of metrics in the MRMP context. We identify a suite of possible substructures in the configuration space, for which it is fairly easy (i) to define a so-called natural distance which allows us to predict the performance of a metric. This is done by comparing the distribution of its values for sampled pairs of configurations to the distribution induced by the natural distance; (ii) to define equivalence classes of configurations and test how well a metric covers the different classes. We provide experiments that attest to the ability of our tools to predict the effectiveness of metrics: those metrics that qualify in the analysis yield higher success rate of the planner with fewer vertices in the roadmap. We also show how combining several metrics together leads to better results (success rate and size of roadmap) than using a single metric.