Philipp Singer

CL
h-index: 7
Papers: 15
Citations: 404
Novelty: 28%
AI Score: 38

15 Papers

CL Jun 13, 2023 Code
h2oGPT: Democratizing Large Language Models

Arno Candel, Jon McKinney, Philipp Singer et al.

Applications built on top of Large Language Models (LLMs) such as GPT-4 represent a revolution in AI due to their human-level capabilities in natural language processing. However, they also pose many significant risks such as the presence of biased, private, or harmful text, and the unauthorized inclusion of copyrighted material. We introduce h2oGPT, a suite of open-source code repositories for the creation and use of LLMs based on Generative Pretrained Transformers (GPTs). The goal of this project is to create the world's best truly open-source alternative to closed-source approaches. In collaboration with and as part of the incredible and unstoppable open-source community, we open-source several fine-tuned h2oGPT models from 7 to 40 billion parameters, ready for commercial use under fully permissive Apache 2.0 licenses. Included in our release is 100% private document search using natural language. Open-source language models help boost AI development and make it more accessible and trustworthy. They lower entry hurdles, allowing people and groups to tailor these models to their needs. This openness increases innovation, transparency, and fairness. An open-source strategy is needed to share AI benefits fairly, and H2O.ai will continue to democratize AI and LLMs.

CL Oct 17, 2023 Code
H2O Open Ecosystem for State-of-the-art Large Language Models

Arno Candel, Jon McKinney, Philipp Singer et al.

Large Language Models (LLMs) represent a revolution in AI. However, they also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text. For this reason we need open, transparent and safe solutions. We introduce a complete open-source ecosystem for developing and testing LLMs. The goal of this project is to boost open alternatives to closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of diverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI designed for efficient fine-tuning, evaluation, and deployment of LLMs using the most recent state-of-the-art techniques. Our code and models are fully open-source. We believe this work helps to boost AI development and make it more accessible, efficient and trustworthy. The demo is available at: https://gpt.h2o.ai/

CL Jul 12, 2024
H2O-Danube3 Technical Report

Pascal Pfeiffer, Philipp Singer, Yauhen Babakhin et al.

We present H2O-Danube3, a series of small language models consisting of H2O-Danube3-4B, trained on 6T tokens, and H2O-Danube3-500M, trained on 4T tokens. Our models are pre-trained on high-quality Web data consisting primarily of English tokens, in three stages with different data mixes, before final supervised tuning for the chat version. The models exhibit highly competitive metrics across a multitude of academic, chat, and fine-tuning benchmarks. Thanks to its compact architecture, H2O-Danube3 can be efficiently run on a modern smartphone, enabling local inference and rapid processing capabilities even on mobile devices. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.

99.7 LG May 13
TabPFN-3: Technical Report

Léo Grinsztajn, Klemens Flöge, Oscar Key et al.

Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.

CL Jan 30, 2024
H2O-Danube-1.8B Technical Report

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin et al.

We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incrementally improved H2O-Danube2-1.8B, trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on the Open LLM Leaderboard for all models below the 2B parameter range. The models follow core principles of Llama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.

SD Jul 16, 2021
Recognizing bird species in diverse soundscapes under weak supervision

Christof Henkel, Pascal Pfeiffer, Philipp Singer

We present a robust classification approach for avian vocalization in complex and diverse soundscapes, achieving second place in the BirdCLEF2021 challenge. We illustrate how to make full use of pre-trained convolutional neural networks, by using an efficient modeling and training routine supplemented by novel augmentation methods. Thereby, we improve the generalization of weakly labeled crowd-sourced data to productive data collected by autonomous recording units. As such, we illustrate how to progress towards an accurate automated assessment of avian population which would enable global biodiversity monitoring at scale, impossible by manual annotation.

CV Oct 4, 2020
Supporting large-scale image recognition with out-of-domain samples

Christof Henkel, Philipp Singer

This article presents an efficient end-to-end method for instance-level recognition, applied to the task of labeling and ranking landmark images. In a first step, we embed images in a high-dimensional feature space using convolutional neural networks trained with an additive angular margin loss and classify images using visual similarity. We then efficiently re-rank predictions and filter noise utilizing similarity to out-of-domain images. Using this approach we achieved 1st place in the 2020 edition of the Google Landmark Recognition challenge.
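The additive angular margin idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which the margin is added to the angle between an embedding and its target class center before scaling; the function name `arcface_logits` and the default `margin` and `scale` values are illustrative assumptions, not values taken from the paper.

```python
import math

def arcface_logits(cosines, target, margin=0.5, scale=30.0):
    """Apply an additive angular margin to the target class only.

    cosines: cosine similarities between one embedding and each class center.
    target:  index of the ground-truth class.
    margin, scale: hypothetical hyperparameters chosen for illustration.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding drift
        if j == target:
            # Penalize the target class: cos(theta + margin) instead of cos(theta).
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits

# Hypothetical similarities for a three-class toy example.
print(arcface_logits([0.8, 0.1, -0.2], target=0))
```

The resulting logits would normally be passed to a softmax cross-entropy loss; the margin makes the target class harder to satisfy, which tends to pull embeddings of the same class closer together.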

SOC-PH Jul 22, 2020
Backtesting the predictability of COVID-19

Dmitry Gordeev, Philipp Singer, Marios Michailidis et al.

The advent of the COVID-19 pandemic has instigated unprecedented changes in many countries around the globe, putting a significant burden on the health sectors, affecting the macro economic conditions, and altering social interactions amongst the population. In response, the academic community has produced multiple forecasting models, approaches and algorithms to best predict the different indicators of COVID-19, such as the number of confirmed infected cases. Yet, researchers had little to no historical information about the pandemic at their disposal in order to inform their forecasting methods. Our work studies the predictive performance of models at various stages of the pandemic to better understand their fundamental uncertainty and the impact of data availability on such forecasts. We use historical data of COVID-19 infections from 253 regions from the period of 22nd January 2020 until 22nd June 2020 to predict, through a rolling window backtesting framework, the cumulative number of infected cases for the next 7 and 28 days. We implement three simple models to track the root mean squared logarithmic error in this 6-month span, a baseline model that always predicts the last known value of the cumulative confirmed cases, a power growth model and an epidemiological model called SEIRD. Prediction errors are substantially higher in early stages of the pandemic, resulting from limited data. Throughout the course of the pandemic, errors regress slowly, but steadily. The more confirmed cases a country exhibits at any point in time, the lower the error in forecasting future confirmed cases. We emphasize the significance of having a rigorous backtesting framework to accurately assess the predictive power of such models at any point in time during the outbreak which in turn can be used to assign the right level of certainty to these forecasts and facilitate better planning.
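A rolling-window backtest with the last-value baseline and the root mean squared logarithmic error, as described in the abstract, might be sketched as follows. The function names, the `min_history` parameter, and the toy series are assumptions made for illustration; the power growth and SEIRD models are not reproduced here.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error over paired sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def last_value_forecast(history, horizon):
    """Baseline: repeat the last known cumulative count for every future day."""
    return [history[-1]] * horizon

def rolling_backtest(series, horizon, min_history=1, forecast=last_value_forecast):
    """Return (cutoff, error) pairs, one per forecast origin.

    At each cutoff the model only sees series[:cutoff] and is scored
    on the next `horizon` observations.
    """
    results = []
    for cutoff in range(min_history, len(series) - horizon + 1):
        history = series[:cutoff]
        actual = series[cutoff:cutoff + horizon]
        results.append((cutoff, rmsle(actual, forecast(history, horizon))))
    return results

# Hypothetical cumulative case counts for one region (not real data).
cases = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
for cutoff, err in rolling_backtest(cases, horizon=7, min_history=2):
    print(cutoff, round(err, 3))
```

Any other model can be plugged in through the `forecast` argument, as long as it maps a history and a horizon to a list of predictions.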

SI Feb 17, 2017
Why We Read Wikipedia

Philipp Singer, Florian Lemmerich, Robert West et al.

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users' motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia. Then, we quantify the prevalence of these use cases via a large-scale user survey conducted on live Wikipedia with almost 30,000 responses. Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom. Finally, we match survey responses to the respondents' digital traces in Wikipedia's server logs, enabling the discovery of behavioral patterns associated with specific use cases. For instance, we observe long and fast-paced page sequences across topics for users who are bored or exploring randomly, whereas those using Wikipedia for work or school spend more time on individual articles focused on topics such as science. Our findings advance our understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia's user experience, editors striving to cater to their readers' needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines.

SI Apr 23, 2016
Evidence of Online Performance Deterioration in User Sessions on Reddit

Philipp Singer, Emilio Ferrara, Farshad Kooti et al.

This article presents evidence of performance deterioration in online user sessions quantified by studying a massive dataset containing over 55 million comments posted on Reddit in April 2015. After segmenting the sessions (i.e., periods of activity without a prolonged break) depending on their intensity (i.e., how many posts users produced during sessions), we observe a general decrease in the quality of comments produced by users over the course of sessions. We propose mixed-effects models that capture the impact of session intensity on comments, including their length, quality, and the responses they generate from the community. Our findings suggest performance deterioration: Sessions of increasing intensity are associated with the production of shorter, progressively less complex comments, which receive declining quality scores (as rated by other users), and are less and less engaging (i.e., they attract fewer responses). Our contribution evokes a connection between cognitive and attention dynamics and the usage of online social peer production platforms, specifically the effects of deterioration of user performance.
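The session segmentation described above (periods of activity without a prolonged break) could be sketched as below. The one-hour threshold and the function name `split_sessions` are assumptions for illustration; the abstract does not state the break length used in the study.

```python
def split_sessions(timestamps, max_gap_seconds=3600):
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous event
    exceeds max_gap_seconds (an assumed threshold).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical comment times for one user, in seconds.
comment_times = [0, 100, 5000, 5100, 20000]
for session in split_sessions(comment_times):
    # Session intensity, in the abstract's sense, is the number of posts.
    print(len(session), session)
```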

SI Jan 20, 2016
Discovering and Characterizing Mobility Patterns in Urban Spaces: A Study of Manhattan Taxi Data

Lisette Espín-Noboa, Florian Lemmerich, Philipp Singer et al.

Nowadays, human movement in urban spaces can be traced digitally in many cases. It can be observed that movement patterns are not constant, but vary across time and space. In this work, we characterize such spatio-temporal patterns with an innovative combination of two separate approaches that have been utilized for studying human mobility in the past. First, by using non-negative tensor factorization (NTF), we are able to cluster human behavior based on spatio-temporal dimensions. Second, for understanding these clusters, we propose to use HypTrails, a Bayesian approach for expressing and comparing hypotheses about human trails. To formalize hypotheses we utilize data that is publicly available on the Web, namely Foursquare data and census data provided by an open data platform. By applying this combination of approaches to taxi data in Manhattan, we can discover and explain different patterns in human mobility that cannot be identified in a collective analysis. As one example, we can find a group of taxi rides that end at locations with a high number of party venues (according to Foursquare) on weekend nights. Overall, our work demonstrates that human mobility is not one-dimensional but rather contains different facets both in time and space, which we explain by utilizing online data. The findings of this paper argue for a more fine-grained analysis of human mobility in order to make more informed decisions for, e.g., enhancing urban structures, tailored traffic control and location-based recommender systems.

SI Jul 8, 2014
Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains

Simon Walk, Philipp Singer, Markus Strohmaier et al.

Biomedical taxonomies, thesauri and ontologies in the form of the International Classification of Diseases (ICD) as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the ICD, which is currently under active development by the WHO contains nearly 50,000 classes representing a vast variety of different diseases and causes of death. This evolution in terms of size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding how these stakeholders collaborate will enable us to improve editing environments that support such collaborations. We uncover how large ontology-engineering projects, such as the ICD in its 11th revision, unfold by analyzing usage logs of five different biomedical ontology-engineering projects of varying sizes and scopes using Markov chains. We discover intriguing interaction patterns (e.g., which properties users subsequently change) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. From our analysis, we identify commonalities and differences between different projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.

HC Mar 5, 2014
How to Apply Markov Chains for Modeling Sequential Edit Patterns in Collaborative Ontology-Engineering Projects

Simon Walk, Philipp Singer, Markus Strohmaier et al.

With the growing popularity of large-scale collaborative ontology-engineering projects, such as the creation of the 11th revision of the International Classification of Diseases, we need new methods and insights to help project and community managers cope with the constantly growing complexity of such projects. In this paper, we present a novel application of Markov chains to model sequential usage patterns that can be found in the change-logs of collaborative ontology-engineering projects. We provide a detailed presentation of the analysis process, describing all the steps required to apply and determine the best-fitting Markov chain model. Among other things, the model and results allow us to identify structural properties and regularities as well as predict future actions based on usage sequences. We are specifically interested in determining the appropriate Markov chain order, that is, how many previous actions future ones depend on. To demonstrate the practical usefulness of the extracted Markov chains we conduct sequential pattern analyses on a large-scale collaborative ontology-engineering dataset, the International Classification of Diseases in its 11th revision. To further expand on the usefulness of the presented analysis, we show that the collected sequential patterns provide potentially actionable information for user-interface designers, ontology-engineering tool developers and project managers to monitor, coordinate and dynamically adapt to the natural development processes that occur when collaboratively engineering an ontology. We hope that the presented work will spur a new line of ontology-development tools, evaluation techniques and new insights, further taking the interactive nature of the collaborative ontology-engineering process into consideration.
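Fitting a Markov chain of a given order to change-log sequences can be sketched with maximum-likelihood transition estimates, as below. This is only an illustration under assumed names (`fit_markov_chain`, `log_likelihood`) and toy action labels; it does not reproduce the model selection procedure the paper uses to choose the order.

```python
from collections import Counter, defaultdict
import math

def fit_markov_chain(sequences, order=1):
    """Estimate transition probabilities P(next | previous `order` actions)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1
    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {s: c / total for s, c in next_counts.items()}
    return model

def log_likelihood(model, sequences, order=1):
    """Log-likelihood of the sequences under a fitted model.

    Returns -inf if a transition was never seen during fitting.
    """
    ll = 0.0
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            p = model.get(context, {}).get(seq[i], 0.0)
            if p == 0.0:
                return float("-inf")
            ll += math.log(p)
    return ll

# Hypothetical edit sequences; the action names are made up for illustration.
logs = [["title", "definition", "title", "definition"],
        ["title", "definition", "synonym"]]
for k in (1, 2):
    fitted = fit_markov_chain(logs, order=k)
    print(k, round(log_likelihood(fitted, logs, order=k), 3))
```

Higher orders always fit the training data at least as well, so comparing raw likelihoods is not enough to pick an order; a penalized criterion or held-out data would be needed for that step.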

IR Jan 3, 2014
Of course we share! Testing Assumptions about Social Tagging Systems

Stephan Doerfel, Daniel Zoller, Philipp Singer et al.

Social tagging systems have established themselves as an important part in today's web and have attracted the interest from our research community in a variety of investigations. The overall vision of our community is that simply through interactions with the system, i.e., through tagging and sharing of resources, users would contribute to building useful semantic structures as well as resource indexes using uncontrolled vocabulary not only due to the easy-to-use mechanics. Henceforth, a variety of assumptions about social tagging systems have emerged, yet testing them has been difficult due to the absence of suitable data. In this work we thoroughly investigate three available assumptions - e.g., is a tagging system really social? - by examining live log data gathered from the real-world public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, other assumptions need to be reflected and viewed in a very critical light. Our observations have implications for the design of future search and other algorithms to better reflect the actual user behavior.

CY Nov 5, 2013
Semantic Stability in Social Tagging Streams

Claudia Wagner, Philipp Singer, Markus Strohmaier et al.

One potential disadvantage of social tagging systems is that due to the lack of a centralized vocabulary, a crowd of users may never manage to reach a consensus on the description of resources (e.g., books, users or songs) on the Web. Yet, previous research has provided interesting evidence that the tag distributions of resources may become semantically stable over time as more and more users tag them. At the same time, previous work has raised an array of new questions such as: (i) How can we assess the semantic stability of social tagging systems in a robust and methodical way? (ii) Does semantic stabilization of tags vary across different social tagging systems and ultimately, (iii) what are the factors that can explain semantic stabilization in such systems? In this work we tackle these questions by (i) presenting a novel and robust method which overcomes a number of limitations in existing methods, (ii) empirically investigating semantic stabilization processes in a wide range of social tagging systems with distinct domains and properties and (iii) detecting potential causes for semantic stabilization, specifically imitation behavior, shared background knowledge and intrinsic properties of natural language. Our results show that tagging streams which are generated by a combination of imitation dynamics and shared background knowledge exhibit faster and higher semantic stability than tagging streams which are generated via imitation dynamics or natural language streams alone.