Julian Eggert

AI
h-index16
17papers
107citations
Novelty44%
AI Score48

17 Papers

AIJul 31, 2024Code
Tulip Agent -- Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries

Felix Ocker, Daniel Tanneberg, Julian Eggert et al.

We introduce tulip agent, an architecture for autonomous LLM-based agents with Create, Read, Update, and Delete access to a tool library containing a potentially large number of tools. In contrast to state-of-the-art implementations, tulip agent does not encode the descriptions of all available tools in the system prompt, which counts against the model's context window, or embed the entire prompt for retrieving suitable tools. Instead, the tulip agent can recursively search for suitable tools in its extensible tool library, implemented exemplarily as a vector store. The tulip agent architecture significantly reduces inference costs, allows using even large tool libraries, and enables the agent to adapt and extend its set of tools. We evaluate the architecture with several ablation studies in a mathematics context and demonstrate its generalizability with an application to robotics. A reference implementation and the benchmark are available at github.com/HRI-EU/tulip_agent.

CVNov 30, 2025Code
CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

Simon Kohaut, Daniel Ochs, Shun Zhang et al.

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

AIMar 13, 2023
Probabilistic Uncertainty-Aware Risk Spot Detector for Naturalistic Driving

Tim Puphal, Malte Probst, Julian Eggert

Risk assessment is a central element for the development and validation of Autonomous Vehicles (AV). It comprises a combination of occurrence probability and severity of future critical events. Time Headway (TH) as well as Time-To-Contact (TTC) are commonly used risk metrics and have qualitative relations to occurrence probability. However, they lack theoretical derivations and additionally they are designed to only cover special types of traffic scenarios (e.g. following between single car pairs). In this paper, we present a probabilistic situation risk model based on survival analysis considerations and extend it to naturally incorporate sensory, temporal and behavioral uncertainties as they arise in real-world scenarios. The resulting Risk Spot Detector (RSD) is applied and tested on naturalistic driving data of a multi-lane boulevard with several intersections, enabling the visualization of road criticality maps. Compared to TH and TTC, our approach is more selective and specific in predicting risk. RSD concentrates on driving sections of high vehicle density where large accelerations and decelerations or approaches with high velocity occur.

AIMar 13, 2023
Optimization of Velocity Ramps with Survival Analysis for Intersection Merge-Ins

Tim Puphal, Malte Probst, Yiyang Li et al.

We consider the problem of correct motion planning for T-intersection merge-ins of arbitrary geometry and vehicle density. A merge-in support system has to estimate the chances that a gap between two consecutive vehicles can be taken successfully. In contrast to previous models based on heuristic gap size rules, we present an approach which optimizes the integral risk of the situation using parametrized velocity ramps. It accounts for the risks from curves and all involved vehicles (front and rear on all paths) with a so-called survival analysis. For comparison, we also introduce a specially designed extension of the Intelligent Driver Model (IDM) for entering intersections. We show in a quantitative statistical evaluation that the survival method provides advantages in terms of lower absolute risk (i.e., no crash happens) and better risk-utility tradeoff (i.e., making better use of appearing gaps). Furthermore, our approach generalizes to more complex situations with additional risk sources.

ROMar 13, 2023
Intersection Warning System for Occlusion Risks using Relational Local Dynamic Maps

Florian Damerow, Yuda Li, Tim Puphal et al.

This work addresses the task of risk evaluation in traffic scenarios with limited observability due to restricted sensorial coverage. Here, we concentrate on intersection scenarios that are difficult to access visually. To identify the area of sight, we employ ray casting on a local dynamic map providing geometrical information and road infrastructure. Based on the area with reduced visibility, we first model scene entities that pose a potential risk without being visually perceivable yet. Then, we predict a worst-case trajectory in the survival analysis for collision risk estimation. Resulting risk indicators are utilized to evaluate the driver's current behavior, to warn the driver in critical situations, to give suggestions on how to act safely or to plan safe trajectories. We validate our approach by applying the resulting intersection warning system on real world scenarios. The proposed system's behavior reveals to mimic the general behavior of a correctly acting human driver.

ROMar 14, 2023
Continuous Risk Measures for Driving Support

Julian Eggert, Tim Puphal

In this paper, we compare three different model-based risk measures by evaluating their stengths and weaknesses qualitatively and testing them quantitatively on a set of real longitudinal and intersection scenarios. We start with the traditional heuristic Time-To-Collision (TTC), which we extend towards 2D operation and non-crash cases to retrieve the Time-To-Closest-Encounter (TTCE). The second risk measure models position uncertainty with a Gaussian distribution and uses spatial occupancy probabilities for collision risks. We then derive a novel risk measure based on the statistics of sparse critical events and so-called survival conditions. The resulting survival analysis shows to have an earlier detection time of crashes and less false positive detections in near-crash and non-crash cases supported by its solid theoretical grounding. It can be seen as a generalization of TTCE and the Gaussian method which is suitable for the validation of ADAS and AD.

ROMar 13, 2023
Importance Filtering with Risk Models for Complex Driving Situations

Tim Puphal, Raphael Wenzel, Benedict Flade et al.

Self-driving cars face complex driving situations with a large amount of agents when moving in crowded cities. However, some of the agents are actually not influencing the behavior of the self-driving car. Filtering out unimportant agents would inherently simplify the behavior or motion planning task for the system. The planning system can then focus on fewer agents to find optimal behavior solutions for the ego~agent. This is helpful especially in terms of computational efficiency. In this paper, therefore, the research topic of importance filtering with driving risk models is introduced. We give an overview of state-of-the-art risk models and present newly adapted risk models for filtering. Their capability to filter out surrounding unimportant agents is compared in a large-scale experiment. As it turns out, the novel trajectory distance balances performance, robustness and efficiency well. Based on the results, we can further derive a novel filter architecture with multiple filter steps, for which risk models are recommended for each step, to further improve the robustness. We are confident that this will enable current behavior planning systems to better solve complex situations in everyday driving.

SYJul 20, 2023
Introducing Risk Shadowing For Decisive and Comfortable Behavior Planning

Tim Puphal, Julian Eggert

We consider the problem of group interactions in urban driving. State-of-the-art behavior planners for self-driving cars mostly consider each single agent-to-agent interaction separately in a cost function in order to find an optimal behavior for the ego agent, such as not colliding with any of the other agents. In this paper, we develop risk shadowing, a situation understanding method that allows us to go beyond single interactions by analyzing group interactions between three agents. Concretely, the presented method can find out which first other agent does not need to be considered in the behavior planner of an ego agent, because this first other agent cannot reach the ego agent due to a second other agent obstructing its way. In experiments, we show that using risk shadowing as an upstream filter module for a behavior planner allows to plan more decisive and comfortable driving strategies than state of the art, given that safety is ensured in these cases. The usability of the approach is demonstrated for different intersection scenarios and longitudinal driving.

ROOct 19, 2023
Exploring Large Language Models as a Source of Common-Sense Knowledge for Robots

Felix Ocker, Jörg Deigmöller, Julian Eggert

Service robots need common-sense knowledge to help humans in everyday situations as it enables them to understand the context of their actions. However, approaches that use ontologies face a challenge because common-sense knowledge is often implicit, i.e., it is obvious to humans but not explicitly stated. This paper investigates if Large Language Models (LLMs) can fill this gap. Our experiments reveal limited effectiveness in the selective extraction of contextual action knowledge, suggesting that LLMs may not be sufficient on their own. However, the large-scale extraction of general, actionable knowledge shows potential, indicating that LLMs can be a suitable tool for efficiently creating ontologies for robots. This paper shows that the technique used for knowledge extraction can be applied to populate a minimalist ontology, showcasing the potential of LLMs in synergy with formal knowledge representation.

AIFeb 5
Reactive Knowledge Representation and Asynchronous Reasoning

Simon Kohaut, Benedict Flade, Julian Eggert et al.

Exact inference in complex probabilistic models often incurs prohibitive computational costs. This challenge is particularly acute for autonomous agents in dynamic environments that require frequent, real-time belief updates. Existing methods are often inefficient for ongoing reasoning, as they re-evaluate the entire model upon any change, failing to exploit that real-world information streams have heterogeneous update rates. To address this, we approach the problem from a reactive, asynchronous, probabilistic reasoning perspective. We first introduce Resin (Reactive Signal Inference), a probabilistic programming language that merges probabilistic logic with reactive programming. Furthermore, to provide efficient and exact semantics for Resin, we propose Reactive Circuits (RCs). Formulated as a meta-structure over Algebraic Circuits and asynchronous data streams, RCs are time-dynamic Directed Acyclic Graphs that autonomously adapt themselves based on the volatility of input signals. In high-fidelity drone swarm simulations, our approach achieves several orders of magnitude of speedup over frequency-agnostic inference. We demonstrate that RCs' structural adaptations successfully capture environmental dynamics, significantly reducing latency and facilitating reactive real-time reasoning. By partitioning computations based on the estimated Frequency of Change in the asynchronous inputs, large inference tasks can be decomposed into individually memoized sub-problems. This ensures that only the specific components of a model affected by new information are re-evaluated, drastically reducing redundant computation in streaming contexts.

ROMar 4
Right in Time: Reactive Reasoning in Regulated Traffic Spaces

Simon Kohaut, Benedict Flade, Julian Eggert et al.

Exact inference in probabilistic First-Order Logic offers a promising yet computationally costly approach for regulating the behavior of autonomous agents in shared traffic spaces. While prior methods have combined logical and probabilistic data into decision-making frameworks, their application is often limited to pre-flight checks due to the complexity of reasoning across vast numbers of possible universes. In this work, we propose a reactive mission design framework that jointly considers uncertain environmental data and declarative, logical traffic regulations. By synthesizing Probabilistic Mission Design (ProMis) with reactive reasoning facilitated by Reactive Circuits (RC), we enable online, exact probabilistic inference over hybrid domains. Our approach leverages the Frequency of Change inherent in heterogeneous data streams to subdivide inference formulas into memoized, isolated tasks, ensuring that only the specific components affected by new sensor data are re-evaluated. In experiments involving both real-world vessel data and simulated drone traffic in dense urban scenarios, we demonstrate that our approach provides orders of magnitude in speedup over ProMis without reactive paradigms. This allows intelligent transportation systems, such as Unmanned Aircraft Systems (UAS), to actively assert safety and legal compliance during operations rather than relying solely on preparation procedures.

CLMay 2, 2025
TRAVELER: A Benchmark for Evaluating Temporal Reasoning across Vague, Implicit and Explicit References

Svenja Kenneweg, Jörg Deigmöller, Philipp Cimiano et al.

Understanding and resolving temporal references is essential in Natural Language Understanding as we often refer to the past or future in daily communication. Although existing benchmarks address a system's ability to reason about and resolve temporal references, systematic evaluation of specific temporal references remains limited. Towards closing this gap, we introduce TRAVELER, a novel synthetic benchmark dataset that follows a Question Answering paradigm and consists of questions involving temporal references with the corresponding correct answers. TRAVELER assesses models' abilities to resolve explicit, implicit relative to speech time, and vague temporal references. Beyond investigating the performance of state-of-the-art LLMs depending on the type of temporal reference, our benchmark also allows evaluation of performance in relation to the length of the set of events. For the category of vague temporal references, ground-truth answers were established via human surveys on Prolific, following a procedure similar to the one from Kenneweg et al. To demonstrate the benchmark's applicability, we evaluate four state-of-the-art LLMs using a question-answering task encompassing 3,300 questions. Our findings show that while the benchmarked LLMs can answer questions over event sets with a handful of events and explicit temporal references successfully, performance clearly deteriorates with larger event set length and when temporal references get less explicit. Notably, the vague question category exhibits the lowest performance across all models. The benchmark is publicly available at: https://gitlab.ub.uni-bielefeld.de/s.kenneweg/TRAVELER

AIMay 9, 2025
A Grounded Memory System For Smart Personal Assistants

Felix Ocker, Jörg Deigmöller, Pavel Smirnov et al.

A wide variety of agentic AI applications - ranging from cognitive assistants for dementia patients to robotics - demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system's working and potential using a real-world example.

AIDec 25, 2024
Probabilistic Mission Design in Neuro-Symbolic Systems

Simon Kohaut, Benedict Flade, Daniel Ochs et al.

Advanced Air Mobility (AAM) is a growing field that demands accurate modeling of legal concepts and restrictions in navigating intelligent vehicles. In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of Unmanned Aircraft Systems (UAS) beyond visual line of sight (BVLOS) is an endearing task that promises to enhance significantly today's logistics and emergency response capabilities. To tackle these challenges, we present a probabilistic and neuro-symbolic architecture to encode legal frameworks and expert knowledge over uncertain spatial relations and noisy perception in an interpretable and adaptable fashion. More specifically, we demonstrate Probabilistic Mission Design (ProMis), a system architecture that links geospatial and sensory data with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. As a result, ProMis generates Probabilistic Mission Landscapes (PML), which quantify the agent's belief that a set of mission conditions is satisfied across its navigation space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many important AAM scenarios.

CLMay 2, 2025
A Factorized Probabilistic Model of the Semantics of Vague Temporal Adverbials Relative to Different Event Types

Svenja Kenneweg, Jörg Deigmöller, Julian Eggert et al.

Vague temporal adverbials, such as recently, just, and a long time ago, describe the temporal distance between a past event and the utterance time but leave the exact duration underspecified. In this paper, we introduce a factorized model that captures the semantics of these adverbials as probabilistic distributions. These distributions are composed with event-specific distributions to yield a contextualized meaning for an adverbial applied to a specific event. We fit the model's parameters using existing data capturing judgments of native speakers regarding the applicability of these vague temporal adverbials to events that took place a given time ago. Comparing our approach to a non-factorized model based on a single Gaussian distribution for each pair of event and temporal adverbial, we find that while both models have similar predictive power, our model is preferable in terms of Occam's razor, as it is simpler and has better extendability.

ROJul 21, 2025
The Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

Simon Kohaut, Felix Divo, Navid Hamid et al.

Ensuring reliable and rule-compliant behavior of autonomous agents in uncertain environments remains a fundamental challenge in modern robotics. Our work shows how neuro-symbolic systems, which integrate probabilistic, symbolic white-box reasoning models with deep learning methods, offer a powerful solution to this challenge. This enables the simultaneous consideration of explicit rules and neural models trained on noisy data, combining the strength of structured reasoning with flexible representations. To this end, we introduce the Constitutional Controller (CoCo), a novel framework designed to enhance the safety and reliability of agents by reasoning over deep probabilistic logic programs representing constraints such as those found in shared traffic spaces. Furthermore, we propose the concept of self-doubt, implemented as a probability density conditioned on doubt features such as travel velocity, employed sensors, or health factors. In a real-world aerial mobility study, we demonstrate CoCo's advantages for intelligent autonomous systems to learn appropriate doubts and navigate complex and uncertain environments safely and compliantly.

AIMay 10, 2023
A Glimpse in ChatGPT Capabilities and its impact for AI research

Frank Joublin, Antonello Ceravola, Joerg Deigmoeller et al.

Large language models (LLMs) have recently become a popular topic in the field of Artificial Intelligence (AI) research, with companies such as Google, Amazon, Facebook, Amazon, Tesla, and Apple (GAFA) investing heavily in their development. These models are trained on massive amounts of data and can be used for a wide range of tasks, including language translation, text generation, and question answering. However, the computational resources required to train and run these models are substantial, and the cost of hardware and electricity can be prohibitive for research labs that do not have the funding and resources of the GAFA. In this paper, we will examine the impact of LLMs on AI research. The pace at which such models are generated as well as the range of domains covered is an indication of the trend which not only the public but also the scientific community is currently experiencing. We give some examples on how to use such models in research by focusing on GPT3.5/ChatGPT3.4 and ChatGPT4 at the current state and show that such a range of capabilities in a single system is a strong sign of approaching general intelligence. Innovations integrating such models will also expand along the maturation of such AI systems and exhibit unforeseeable applications that will have important impacts on several aspects of our societies.