Anticipating Innovation Using Large Language Models
For innovation researchers and policymakers, the paper provides a method to forecast new technological combinations from patent text, though the approach is incremental over existing language models.
The paper shows that forthcoming technological combinations leave detectable signals in patent language decades in advance, and introduces TechToken, a transformer model that predicts first combinations by learning the language of patent codes, outperforming state-of-the-art models on patent tasks.
Forecasting innovation, understood as the emergence of new technological combinations, is a fundamental challenge for science and policy. We show that forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance. This signal is not attributable to any single inventor; it emerges as a collective shift in how technologies are described across thousands of patents. To capture it, we introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification (IPC) codes, as words in its vocabulary and learns the language of technologies by embedding these codes during fine-tuning. We define context similarity between code embeddings as a measure of linguistic convergence and show that it accurately predicts first technological combinations. TechToken also improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.
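To make the idea of context similarity concrete, the following minimal sketch ranks pairs of IPC codes that have not yet been combined by the similarity of their embeddings. It rests on assumptions not stated in the abstract: similarity is taken to be cosine similarity, the three-dimensional vectors are invented toy values rather than TechToken outputs, and the names `cosine_similarity`, `rank_candidate_pairs`, and `toy_embeddings` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_u * norm_v)


def rank_candidate_pairs(embeddings, already_combined):
    """Rank code pairs that have never co-occurred, most similar first.

    embeddings: dict mapping an IPC code to its embedding vector.
    already_combined: set of frozensets, each holding a pair of codes
        that already appear together on some patent.
    """
    scores = []
    for a, b in combinations(sorted(embeddings), 2):
        if frozenset((a, b)) in already_combined:
            continue  # only first combinations are of interest
        scores.append(((a, b), cosine_similarity(embeddings[a], embeddings[b])))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy values for illustration only; real embeddings would come from the model.
toy_embeddings = {
    "G06N": [0.9, 0.1, 0.0],
    "A61B": [0.8, 0.3, 0.1],
    "C07D": [0.0, 0.2, 0.9],
}
already_combined = {frozenset(("A61B", "C07D"))}

ranking = rank_candidate_pairs(toy_embeddings, already_combined)
for pair, score in ranking:
    print(pair, round(score, 3))
```

In this toy setup the pair at the top of the ranking would be read as the most likely future combination; how the score is thresholded or evaluated against observed first combinations is left open here.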