Yigit Ihlamur

AI
h-index: 5
12 papers
38 citations
Novelty: 45%
AI Score: 50

12 Papers

LG · Feb 28
From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code

Anirudh Jaidev Mahesh, Ben Griffin, Fuat Alican et al.

Large language models (LLMs) are increasingly used for high-stakes decision-making, yet existing approaches struggle to reconcile scalability, interpretability, and reproducibility. Black-box models obscure their reasoning, while recent LLM-based rule systems rely on per-sample evaluation, causing costs to scale with dataset size and introducing stochastic, hallucination-prone outputs. We propose reframing LLMs as code generators rather than per-instance evaluators. A single LLM call generates executable, human-readable decision logic that runs deterministically over structured data, eliminating per-sample LLM queries while enabling reproducible and auditable predictions. We combine code generation with automated statistical validation using precision lift, binomial significance testing, and coverage filtering, and apply cluster-based gap analysis to iteratively refine decision logic without human annotation. We instantiate this framework in venture capital founder screening, a rare-event prediction task with strong interpretability requirements. On VCBench, a benchmark of 4,500 founders with a 9% base success rate, our approach achieves 37.5% precision and an F0.5 score of 25.0%, exceeding GPT-4o's precision (30.0%) with a comparable F0.5 score (25.7% for GPT-4o) while maintaining full interpretability. Each prediction traces to executable rules over human-readable attributes, demonstrating verifiable and interpretable LLM-based decision-making in practice.
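
The validation step named above (precision lift, binomial significance testing, and coverage filtering) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the thresholds `min_lift`, `alpha`, and `min_coverage` and the function names are hypothetical, and a generated rule is represented only by the boolean mask of rows on which it fires.

```python
from math import comb


def binomial_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the one-sided upper tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def validate_rule(fired, labels, min_lift=1.5, alpha=0.05, min_coverage=0.01):
    """Keep a rule only if it beats the base rate, significantly, on enough rows.

    fired:  list[bool], whether the (hypothetical) generated rule fires on each founder
    labels: list[int], 1 = success, 0 = failure
    Thresholds are illustrative assumptions, not values from the paper.
    """
    n = len(labels)
    base_rate = sum(labels) / n
    hits = [y for f, y in zip(fired, labels) if f]
    coverage = len(hits) / n
    if not hits:
        return {"keep": False, "coverage": 0.0, "lift": 0.0, "p_value": 1.0}
    precision = sum(hits) / len(hits)
    lift = precision / base_rate
    # One-sided test: how likely is this many successes if the rule had no signal?
    p_value = binomial_sf(sum(hits), len(hits), base_rate)
    keep = lift >= min_lift and p_value < alpha and coverage >= min_coverage
    return {"keep": keep, "coverage": coverage, "lift": lift, "p_value": p_value}


# Toy data: 1,000 founders with a 9% base rate; the rule fires on the first 100 rows,
# 30 of which are successes.
labels = [1] * 30 + [0] * 70 + [1] * 60 + [0] * 840
fired = [True] * 100 + [False] * 900
result = validate_rule(fired, labels)
print(result)
```

The binomial tail is summed exactly with `math.comb`, which is adequate at this scale. A rule that fires on 100 rows with only 9 successes would have a lift of 1.0 and be filtered out.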

1.3 · AI · Apr 29
Optimal Stop-Loss and Take-Profit Parameterization for Autonomous Trading Agent Swarm

Nathan Li, Aikins Laryea, Yigit Ihlamur

Autonomous crypto trading systems often spend most of their design effort on finding entries, while exits are left to fixed rules that are rarely tested in a systematic way. This paper examines whether better stop-loss and take-profit settings can improve the performance of an autonomous trading agent swarm. Using more than 900 historical trades, we replay each trade under many alternative exit policies and compare results against the existing production setup. The study finds that exit design matters meaningfully: stronger configurations improve risk-adjusted performance and generally favor tighter loss limits, earlier profit capture, and closer trailing protection. The paper also discusses a key evaluation challenge: a purely chronological split was initially used, but the newest trades fell into an unusual war-driven market period that sharply distorted test results. To reduce the influence of that single episode, the main comparison was run on a randomized split instead, and the drawbacks of doing so are acknowledged explicitly. Overall, the paper presents a practical framework for tuning exit logic in a more disciplined and transparent way.
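
The replay approach (re-running each historical trade under alternative exit rules) can be sketched as a small grid search. Every detail here is an assumption for illustration, not the paper's setup: trades are long-only, an exit fills at the bar price that triggers it, stop checks take priority over take-profit within a bar, and mean return stands in for the risk-adjusted metrics the paper reports. The function names are hypothetical.

```python
from itertools import product


def replay_exit(entry, path, stop_loss, take_profit, trailing):
    """Return the fractional return of one long trade under a given exit policy.

    entry: entry price; path: subsequent prices in time order.
    stop_loss, take_profit, trailing: fractions (0.05 = 5%); trailing=None disables it.
    """
    peak = entry
    for price in path:
        peak = max(peak, price)
        if price <= entry * (1 - stop_loss):  # fixed stop-loss from entry
            return price / entry - 1
        if trailing is not None and price <= peak * (1 - trailing):  # trailing stop from peak
            return price / entry - 1
        if price >= entry * (1 + take_profit):  # take-profit
            return price / entry - 1
    return path[-1] / entry - 1  # no exit triggered: close at the last observed price


def grid_search(trades, stop_losses, take_profits, trailings):
    """Replay every trade under every policy and rank policies by mean return."""
    results = []
    for sl, tp, tr in product(stop_losses, take_profits, trailings):
        rets = [replay_exit(entry, path, sl, tp, tr) for entry, path in trades]
        results.append(((sl, tp, tr), sum(rets) / len(rets)))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Two toy trades entered at 100; compare a 4% trailing stop against no trailing stop.
trades = [(100.0, [102, 108, 103, 96, 90]), (100.0, [97, 94, 99, 112])]
ranking = grid_search(trades, [0.05], [0.10], [None, 0.04])
print(ranking)
```

In the toy data, the trailing stop locks in a 3% gain on the first trade instead of riding it down to the 5% stop, which mirrors the abstract's observation that closer trailing protection tends to help.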

LG · Nov 13, 2024
GPTree: Towards Explainable Decision-Making via LLM-powered Decision Trees

Sichao Xiong, Yigit Ihlamur, Fuat Alican et al.

Traditional decision tree algorithms are explainable but struggle with non-linear, high-dimensional data, limiting their applicability in complex decision-making. Neural networks excel at capturing complex patterns but sacrifice explainability in the process. In this work, we present GPTree, a novel framework combining the explainability of decision trees with the advanced reasoning capabilities of LLMs. GPTree eliminates the need for feature engineering and prompt chaining, requiring only a task-specific prompt and leveraging a tree-based structure to dynamically split samples. We also introduce an expert-in-the-loop feedback mechanism to further enhance performance by enabling human intervention to refine and rebuild decision paths, emphasizing the harmony between human expertise and machine intelligence. Our decision tree achieved a 7.8% precision rate for identifying "unicorn" startups at the inception stage, surpassing GPT-4o with few-shot learning as well as the best human decision-makers (3.1% to 5.6%).

28.4 · AI · Apr 23
CoFEE: Reasoning Control for LLM-Based Feature Discovery

Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin et al.

Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. Ever-improving Large Language Models (LLMs) are well suited to this task because they can process large amounts of information, but unconstrained feature generation can lead to weak features; our method provides a structured way to address this challenge. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors, which have been used successfully in ML models, include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those produced by unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in the quality and efficiency of LLM-based feature discovery.

CL · Dec 19, 2023
Founder-GPT: Self-play to evaluate the Founder-Idea fit

Sichao Xiong, Yigit Ihlamur

This research introduces a method for evaluating "founder-idea" fit in early-stage startups, using large language model techniques to assess founders' profiles against their startup ideas and thereby support investment decisions. Early results with embeddings, self-play, tree-of-thought, and critique-based refinement suggest that each idea has its own success patterns and should be evaluated in the context of the founder's background.

AI · May 27, 2025
Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning

Xianling Mu, Joseph Ternasky, Fuat Alican et al.

Early-stage startup investment is a high-risk endeavor characterized by scarce data and uncertain outcomes. Traditional machine learning approaches often require large, labeled datasets and extensive fine-tuning, yet remain opaque and difficult for domain experts to interpret or improve. In this paper, we propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models (LLMs) using in-context learning (ICL). Central to our method is a natural language policy embedded directly into the LLM prompt, enabling the model to apply explicit reasoning patterns and allowing human experts to easily interpret, audit, and iteratively refine the logic. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop, enabling the LLM to update its decision policy iteratively based on structured feedback. With only minimal supervision and no gradient-based optimization, our system predicts startup success with far higher precision than existing benchmarks. It is over 20x more precise than random chance, whose precision is 1.9%. It is also 7.1x more precise than the typical 5.6% success rate of top-tier venture capital (VC) firms.

AI · May 30, 2025
Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success

Ben Griffin, Diego Vidaurre, Ugur Koyluoglu et al.

Predicting rare outcomes such as startup success is central to venture capital, demanding models that are both accurate and interpretable. We introduce Random Rule Forest (RRF), a lightweight ensemble method that uses a large language model (LLM) to generate simple YES/NO questions in natural language. Each question functions as a weak learner, and their responses are combined using a threshold-based voting rule to form a strong, interpretable predictor. Applied to a dataset of 9,892 founders, RRF achieves a 6.9x improvement over a random baseline on held-out data; adding expert-crafted questions lifts this to 8x and highlights the value of human-LLM collaboration. Compared with zero- and few-shot baselines across three LLM architectures, RRF attains an F0.5 of 0.121, versus 0.086 for the best baseline (+0.035 absolute, +41% relative). By combining the creativity of LLMs with the rigor of ensemble learning, RRF delivers interpretable, high-precision predictions suitable for decision-making in high-stakes domains.
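
A minimal sketch of the threshold-based voting rule and the F0.5 metric it is scored with. As an assumption for illustration, each YES/NO question is stood in for by a plain Python predicate rather than an LLM call; the three questions, the founder fields, and the function names are hypothetical examples, not taken from the paper.

```python
from typing import Callable

Question = Callable[[dict], bool]  # stand-in for an LLM answering one YES/NO question


def rrf_predict(founder: dict, questions: list[Question], threshold: int) -> bool:
    """Predict success when at least `threshold` questions are answered YES."""
    votes = sum(1 for q in questions if q(founder))
    return votes >= threshold


def f_beta(preds, labels, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


questions = [
    lambda f: f["prior_startups"] > 0,  # "Has the founder started a company before?"
    lambda f: f["big_tech"],            # "Has the founder worked at a large tech company?"
    lambda f: f["grad_degree"],         # "Does the founder hold a graduate degree?"
]
founders = [
    {"prior_startups": 1, "big_tech": True, "grad_degree": True},
    {"prior_startups": 0, "big_tech": True, "grad_degree": False},
    {"prior_startups": 2, "big_tech": False, "grad_degree": True},
    {"prior_startups": 0, "big_tech": False, "grad_degree": True},
]
labels = [1, 0, 0, 1]
preds = [rrf_predict(f, questions, threshold=2) for f in founders]
print(preds, f_beta(preds, labels))
```

Raising the vote threshold makes the ensemble more conservative, trading recall for precision, which is the trade-off F0.5 rewards in rare-event screening.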

AI · Apr 16, 2025
Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework

Jack Preuveneers, Joseph Ternasky, Fuat Alican et al.

We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success. Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules. The pipeline integrates multiple enhancements (efficient data ingestion, a two-step refinement process, ensemble candidate sampling, simulated reinforcement learning scoring, and persistent memory) to ensure both stable decision-making and transparent output. Experimental evaluations on curated startup datasets demonstrate that, compared to a standalone OpenAI o3 model, our combined pipeline improves precision by 54% (from 0.225 to 0.346) and accuracy by 50% (from 0.46 to 0.70). Notably, our model achieves over 2x the precision of a random classifier (16%). By combining state-of-the-art AI reasoning with explicit rule-based explanations, our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement. This work lays the foundation for the implementation of interpretable LLM-powered decision frameworks in high-stakes investment environments and other domains that require transparent and data-driven insights.

LG · Jan 23, 2025
GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification

Te Pei, Fuat Alican, Aaron Ontoyin Yin et al.

This paper introduces GPT-HTree, a framework combining hierarchical clustering, decision trees, and large language models (LLMs) to address the challenge of explainable classification. By leveraging hierarchical clustering to segment individuals based on salient features, resampling techniques to balance class distributions, and decision trees to tailor classification paths within each cluster, GPT-HTree ensures both accuracy and interpretability. LLMs enhance the framework by generating human-readable cluster descriptions, bridging quantitative analysis with actionable insights.
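
The cluster-then-classify structure can be sketched with two pieces: average-linkage agglomerative clustering and a separate classifier fitted inside each cluster. As simplifying assumptions, this sketch uses a one-level decision stump in place of a full decision tree, omits the resampling and LLM cluster-description steps, and uses hypothetical function names and toy data.

```python
from math import dist


def agglomerative(points, k):
    """Average-linkage agglomerative clustering down to k clusters (lists of indices)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(pairs)
        clusters[x] += clusters[y]  # merge the closest pair (y > x, so deleting y is safe)
        del clusters[y]
    return clusters


def fit_stump(points, labels, idx):
    """Best single-feature rule `feature >= t` by training accuracy within one cluster.

    Returns (accuracy, feature_index, threshold).
    """
    best = None
    for f in range(len(points[0])):
        for i in idx:
            t = points[i][f]
            preds = [points[j][f] >= t for j in idx]
            acc = sum(p == bool(labels[j]) for p, j in zip(preds, idx)) / len(idx)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    return best


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 1, 0, 1]
clusters = agglomerative(points, 2)
stumps = {tuple(c): fit_stump(points, labels, c) for c in clusters}
print(clusters, stumps)
```

Fitting a separate rule per cluster lets each segment follow its own classification path, which is the property the abstract attributes to per-cluster decision trees.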

AI · Oct 24, 2025
LLM-AR: LLM-powered Automated Reasoning Framework

Rick Chen, Joseph Ternasky, Aaron Ontoyin Yin et al.

Large language models (LLMs) can already identify patterns and reason effectively, yet their variable accuracy hampers adoption in high-stakes decision-making applications. In this paper, we study this issue from a venture capital perspective by predicting idea-stage startup success based on founder traits. (i) To build a reliable prediction model, we introduce LLM-AR, a pipeline inspired by neural-symbolic systems that distils LLM-generated heuristics into probabilistic rules executed by the ProbLog automated-reasoning engine. (ii) An iterative policy-evolution loop incorporates association-rule mining to progressively refine the prediction rules. On unseen folds, LLM-AR achieves 59.5% precision and 8.7% recall, 5.9x the random baseline precision, while exposing every decision path for human inspection. The framework is interpretable and tunable via hyperparameters, showing promise to extend into other domains.
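
The association-rule step can be sketched as mining `traits -> success` rules scored by support, confidence, and lift, then rendering each kept rule as a ProbLog-style probabilistic clause that uses its confidence as the probability. The thresholds, trait names, confidence-as-probability mapping, and function names are illustrative assumptions rather than the LLM-AR pipeline itself, and the generated strings are not executed by ProbLog here.

```python
from itertools import combinations


def mine_rules(rows, labels, max_len=2, min_support=0.05, min_confidence=0.3):
    """Mine rules `traits -> success` from boolean founder traits.

    rows:   list of sets of trait names present for each founder
    labels: list of 0/1 success outcomes
    Returns (antecedent, support, confidence, lift) tuples sorted by lift.
    """
    n = len(rows)
    base_rate = sum(labels) / n
    traits = sorted(set().union(*rows))
    rules = []
    for k in range(1, max_len + 1):
        for ante in combinations(traits, k):
            covered = [y for r, y in zip(rows, labels) if set(ante) <= r]
            support = sum(covered) / n  # P(antecedent and success)
            if not covered or support < min_support:
                continue
            confidence = sum(covered) / len(covered)  # P(success | antecedent)
            if confidence >= min_confidence:
                rules.append((ante, support, confidence, confidence / base_rate))
    return sorted(rules, key=lambda r: r[3], reverse=True)


def to_problog(ante, confidence):
    """Render a rule as a ProbLog-style probabilistic clause (string only)."""
    body = ", ".join(f"{t}(X)" for t in ante)
    return f"{confidence:.2f}::success(X) :- {body}."


rows = [
    {"serial_founder", "technical"}, {"serial_founder"}, {"technical"},
    set(), {"serial_founder", "technical"}, set(),
]
labels = [1, 0, 0, 0, 1, 0]
rules = mine_rules(rows, labels)
for ante, support, confidence, lift in rules:
    print(to_problog(ante, confidence), f"lift={lift:.2f}")
```

Lift divides a rule's confidence by the base success rate, so it reads directly as the "multiple of random baseline precision" figure the abstract reports.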

AI · Sep 17, 2025
VCBench: Benchmarking LLMs in Venture Capital

Rick Chen, Joseph Ternasky, Afriyie Samuel Kwesi et al.

Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.

LG · Sep 9, 2025
From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital

Mihir Kumar, Aaron Ontoyin Yin, Zakari Salifu et al.

This paper presents a framework for predicting rare, high-impact outcomes by integrating large language models (LLMs) with a multi-model machine learning (ML) architecture. The approach combines the predictive strength of black-box models with the interpretability required for reliable decision-making. We use LLM-powered feature engineering to extract and synthesize complex signals from unstructured data, which are then processed within a layered ensemble of models including XGBoost, Random Forest, and Linear Regression. The ensemble first produces a continuous estimate of success likelihood, which is then thresholded to produce a binary rare-event prediction. We apply this framework to the domain of Venture Capital (VC), where investors must evaluate startups with limited and noisy early-stage data. The empirical results show strong performance: the model achieves precision between 9.8x and 11.1x the random classifier baseline in three independent test subsets. Feature sensitivity analysis further reveals interpretable success drivers: the startup's category list accounts for 15.6% of predictive influence, followed by the number of founders, while education level and domain expertise contribute smaller yet consistent effects.
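
The score-then-threshold step, and the "multiple of the random classifier baseline" metric, can be sketched as follows. The equal-weight averaging, the example scores, the threshold, and the function names are assumptions for illustration; the paper's layered ensemble may combine its models differently. A random classifier's precision equals the base rate, so the multiplier is simply precision divided by the base rate.

```python
def ensemble_scores(score_lists, weights=None):
    """Combine per-model success scores (lists aligned by founder) by weighted mean."""
    m = len(score_lists)
    weights = weights or [1.0 / m] * m
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists)) for i in range(n)]


def precision_multiplier(scores, labels, threshold):
    """Precision of the rule `score >= threshold`, as a multiple of the base rate."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    if not flagged:
        return 0.0
    precision = sum(flagged) / len(flagged)
    base_rate = sum(labels) / len(labels)
    return precision / base_rate


# Hypothetical continuous scores from three models for five founders.
xgb = [0.9, 0.2, 0.4, 0.1, 0.7]
rf = [0.8, 0.3, 0.5, 0.2, 0.6]
lin = [0.7, 0.1, 0.3, 0.0, 0.8]
labels = [1, 0, 0, 0, 1]

scores = ensemble_scores([xgb, rf, lin])
print(scores, precision_multiplier(scores, labels, threshold=0.5))
```

Raising the threshold flags fewer founders; in a rare-event setting this usually increases the precision multiplier at the cost of recall.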