Burc Gokden

CL
h-index1
5papers
10citations
Novelty56%
AI Score39

5 Papers

AIMar 12
PLDR-LLMs Reason At Self-Organized Criticality

Burc Gokden

We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality is similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model's deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM is better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation on how reasoning manifests in large language models, and the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need for evaluation of curated benchmark datasets through inductive output for reasoning and comprehension.

CLOct 22, 2024
PLDR-LLM: Large Language Model from Power Law Decoder Representations

Burc Gokden

We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain the PLDR-LLMs of varying layer sizes with a small batch size of 32 and $\sim$8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared to scaled dot-product LLMs of similar model size reported in the literature. We show that deductive outputs of PLDR-LLMs can be used to compare model characteristics or improve the performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting impact on deductive outputs throughout the pretraining. We provide a detailed description of PLDR-LLM architecture, its implementation and the pretraining procedure.

CLFeb 19, 2025
PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

Burc Gokden

We show that Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enable the once-inferred energy-curvature tensor $\mathbf{G}_{LM}$ to replace the deep neural network of power law graph attention (PLGA) generating the deductive outputs at inference. We demonstrate that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in a straightforward manner to improve the inference time. The invariance and generalizable nature of deductive outputs is at a very high fidelity where deductive outputs have same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have distinct loss and accuracy characteristics from models pretrained with transferred, randomly initialized or identity tensors as a constant tensor operator and an LLM with scaled-dot product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ is predefined as identity. The observed invariance characteristic introduces a novel asymmetry between training and inference phases with caching. We outline observed common characteristics of the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.

CLJun 27, 2021
Power Law Graph Transformer for Machine Translation and Representation Learning

Burc Gokden

We present the Power Law Graph Transformer, a transformer model with well defined deductive and inductive tasks for prediction and representation learning. The deductive task learns the dataset level (global) and instance level (local) graph structures in terms of learnable power law distribution parameters. The inductive task outputs the prediction probabilities using the deductive task output, similar to a transductive model. We trained our model with Turkish-English and Portuguese-English datasets from TED talk transcripts for machine translation and compared the model performance and characteristics to a transformer model with scaled dot product attention trained on the same experimental setup. We report BLEU scores of $17.79$ and $28.33$ on the Turkish-English and Portuguese-English translation tasks with our model, respectively. We also show how a duality between a quantization set and N-dimensional manifold representation can be leveraged to transform between local and global deductive-inductive outputs using successive application of linear and non-linear transformations end-to-end.

LGDec 18, 2019
CoulGAT: An Experiment on Interpretability of Graph Attention Networks

Burc Gokden

We present an attention mechanism inspired from definition of screened Coulomb potential. This attention mechanism was used to interpret the Graph Attention (GAT) model layers and training dataset by using a flexible and scalable framework (CoulGAT) developed for this purpose. Using CoulGAT, a forest of plain and resnet models were trained and characterized using this attention mechanism against CHAMPS dataset. The learnable variables of the attention mechanism are used to extract node-node and node-feature interactions to define an empirical standard model for the graph structure and hidden layer. This representation of graph and hidden layers can be used as a tool to compare different models, optimize hidden layers and extract a compact definition of graph structure of the dataset.