Bayesian Inference of Regular Expressions from Human-Generated Example Strings
The paper addresses a domain-specific problem in programming by example: synthesizing regexes for users who can supply only a few example strings. The contribution is incremental, since it builds on existing regex induction methods.
The paper tackles the problem of learning regular expressions from a small set of positive and negative example strings. The problem is hard for two reasons: the examples may be too uninformative to constrain the hypothesis space, and the search space is very large. The paper proposes a Bayesian inference approach built on a stochastic process recognition model that incrementally grows a grammar, and it reports results competitive with human ability to learn regexes from examples.
In programming by example, users "write" programs by generating a small number of input-output examples and asking the computer to synthesize consistent programs. We consider a challenging problem in this domain: learning regular expressions (regexes) from positive and negative example strings. The difficulty is twofold: (1) user-generated examples may not be informative enough to constrain the hypothesis space, and (2) even if user-generated examples are in principle informative, there is still a massive search space to examine. We frame regex induction as the problem of inferring a probabilistic regular grammar and propose an efficient inference approach that uses a novel stochastic process recognition model. This model incrementally "grows" a grammar using positive examples as a scaffold. We show that this approach is competitive with human ability to learn regexes from examples.
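
To make the Bayesian framing concrete, here is a minimal sketch in Python. It is not the paper's stochastic process recognition model, and it does not grow a grammar; it only scores a fixed, hand-written list of candidate regexes against positive and negative examples. The description-length prior, the `noise` parameter, and the function names `log_prior`, `log_likelihood`, and `posterior` are assumptions made for illustration.

```python
import math
import re


def log_prior(pattern: str) -> float:
    # Assumed description-length prior: each extra character halves the prior mass.
    return -len(pattern) * math.log(2.0)


def log_likelihood(pattern, positives, negatives, noise=0.01):
    # Assumed noisy-label likelihood: each example is labeled correctly
    # with probability 1 - noise, independently of the others.
    compiled = re.compile(pattern)
    total = 0.0
    for s in positives:
        matched = compiled.fullmatch(s) is not None
        total += math.log(1.0 - noise if matched else noise)
    for s in negatives:
        matched = compiled.fullmatch(s) is not None
        total += math.log(noise if matched else 1.0 - noise)
    return total


def posterior(candidates, positives, negatives, noise=0.01):
    # Normalize prior * likelihood over the finite candidate set.
    scores = {
        p: log_prior(p) + log_likelihood(p, positives, negatives, noise)
        for p in candidates
    }
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - top) for v in scores.values())
    return {p: math.exp(v - top) / z for p, v in scores.items()}


if __name__ == "__main__":
    candidates = [r"\d+", r"[a-z]+", r".*", r"\d{3}"]
    post = posterior(candidates, positives=["123", "456"], negatives=["abc", "12"])
    for pattern, prob in sorted(post.items(), key=lambda kv: -kv[1]):
        print(f"{pattern!r}: {prob:.4f}")
```

With these toy examples, the highest posterior goes to `\d{3}`, the only candidate consistent with every example. The shorter `\d+` loses because it accepts the negative example "12". This also illustrates the first difficulty named above: without that negative example, the examples would not rule out `\d+`, and the prior would favor it.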