CoSQA: 20,000+ Web Queries for Code Search and Question Answering
The work addresses the need for richer supervised training resources in code search and code question answering, primarily benefiting software developers. It is an incremental contribution that builds on existing datasets and methods.
The paper tackles semantic matching between natural language queries and code. It introduces the CoSQA dataset, with 20,604 labeled query-code pairs, and a contrastive learning method called CoCLR. Evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves code question answering accuracy by 5.1%, and adding CoCLR brings a further 10.5% improvement.
Finding code given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and code, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.
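The abstract describes CoCLR only as a data augmenter that produces additional, artificially generated training instances; it does not spell out the procedure. As a rough illustration of what such augmentation could look like, the sketch below assumes an in-batch scheme in which each query is paired with the code of every other example in the batch to form negative instances. The function name `in_batch_augment`, the label convention (1 = match, 0 = no match), and the toy queries are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Pair = Tuple[str, str]          # (query, code) that are assumed to match
Labeled = Tuple[str, str, int]  # (query, code, label), label 1 = match, 0 = no match


def in_batch_augment(batch: List[Pair]) -> List[Labeled]:
    """Hypothetical in-batch augmentation (an assumption, not the paper's exact method).

    Each original pair is kept as a positive instance. Each query is also
    paired with the code of every other example in the batch, and those
    mismatched pairs are labeled as negatives.
    """
    examples: List[Labeled] = []
    for i, (query, code) in enumerate(batch):
        examples.append((query, code, 1))
        for j, (_, other_code) in enumerate(batch):
            if j != i:
                examples.append((query, other_code, 0))
    return examples


# Toy batch of matched pairs, invented for illustration only.
batch = [
    ("python reverse a list", "def rev(xs): return xs[::-1]"),
    ("python read file lines", "def lines(p): return open(p).read().splitlines()"),
    ("python sum of squares", "def sq(xs): return sum(x * x for x in xs)"),
]
augmented = in_batch_augment(batch)
```

With a batch of n matched pairs, this scheme yields n positive and n(n - 1) negative instances, which suggests how an augmenter can enlarge the training set without additional human annotation.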