Chien-Ping Lu

h-index1

5papers

15citations

Novelty44%

AI Score43

Ranked #81,448 of 201,326 authors (top 40%)#18,299 in LG (top 43%)

5 Papers

74.9CCApr 9

The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump

Chien-Ping Lu

Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capability without clearly distinguishing the underlying computational regimes. This paper gives a formal separation result in classical computability theory that blocks that move under a precise modeling assumption. For an oracle $A$, let $\mathcal{C}(A)=\{B : B \leq_T A\}$ be the corresponding computational layer. We prove that finite internal self-modification remains inside $\mathcal{C}(A)$, while stabilized revision is governed instead by the jump $A'$ via the relativized limit lemma. Together with a local closure versus escape theorem, this yields a clean formal separation between within-layer iteration and ascent to a stronger relative level. The point is not that stronger layers never arise, but that they are not explained by finite repetition inside one already settled layer. The resulting separation gives a computability-theoretic limit on a broad class of recursive-improvement narratives in which repeated internal updating is treated as sufficient for qualitative capability ascent.

23.6LGMar 30

The Unreasonable Effectiveness of Scaling Laws in AI

Chien-Ping Lu

Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.

4.7DCMar 21

Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture

Chien-Ping Lu

Classical Amdahl's Law assumes a fixed decomposition between serial and parallel work and homogeneous replication; historically, it bounds how much parallel speedup is attainable. Modern systems instead combine specialized accelerators with programmable compute, tensor datapaths, and evolving pipelines, while empirical scaling laws shift which stages absorb marginal compute. The central tension is therefore not the serial-versus-parallel split alone, but resource allocation across heterogeneous hardware, given efficiency differences, and workload structures that determine how effectively additional compute can be converted into value. We reformulate Amdahl's Law for modern heterogeneous systems with scalable workloads. The analysis yields a finite collapse threshold: beyond a critical scalable fraction, specialization becomes suboptimal for any efficiency advantage of specialized hardware over programmable compute, and optimal specialized investment falls to zero, a phase transition rather than an asymptotic tail. We use this framework to interpret increasing GPU programmability and why domain-specific AI accelerators have not displaced GPUs.

LGJan 4, 2025

The Race to Efficiency: A New Perspective on AI Scaling Laws

Chien-Ping Lu

As large-scale AI models expand, training becomes costlier and sustaining progress grows harder. Classical scaling laws (e.g., Kaplan et al. (2020), Hoffmann et al. (2022)) predict training loss from a static compute budget yet neglect time and efficiency, prompting the question: how can we balance ballooning GPU fleets with rapidly improving hardware and algorithms? We introduce the relative-loss equation, a time- and efficiency-aware framework that extends classical AI scaling laws. Our model shows that, without ongoing efficiency gains, advanced performance could demand millennia of training or unrealistically large GPU fleets. However, near-exponential progress remains achievable if the "efficiency-doubling rate" parallels Moore's Law. By formalizing this race to efficiency, we offer a quantitative roadmap for balancing front-loaded GPU investments with incremental improvements across the AI stack. Empirical trends suggest that sustained efficiency gains can push AI scaling well into the coming decade, providing a new perspective on the diminishing returns inherent in classical scaling.

AIMay 17, 2017

AI, Native Supercomputing and The Revival of Moore's Law

Chien-Ping Lu

Based on Alan Turing's proposition on AI and computing machinery, which shaped Computing as we know it today, the new AI computing machinery should comprise a universal computer and a universal learning machine. The later should understand linear algebra natively to overcome the slowdown of Moore's law. In such a universal learnig machine, a computing unit does not need to keep the legacy of a universal computing core. The data can be distributed to the computing units, and the results can be collected from them through Collective Streaming, reminiscent of Collective Communication in Supercomputing. It is not necessary to use a GPU-like deep memory hierarchy, nor a TPU-like fine-grain mesh.