CL · Nov 30, 2020

Procode: the Swiss Multilingual Solution for Automatic Coding and Recoding of Occupations and Economic Activities

Nenad Savic, Nicolas Bovio, Fabian Gilbert, Irina Guseva Canu

arXiv:2012.07521v1 · 2 citations

Originality: Incremental advance

AI Analysis

This tool addresses the problem of time-consuming and error-prone manual coding of occupations and economic activities for epidemiological studies, offering a more efficient solution for researchers and public health professionals.

This paper introduces Procode, a web tool that automatically codes free-text descriptions into occupational and economic activity classifications and recodes between different classification systems. Using a Complement Naive Bayes (CNB) classifier, Procode achieved 57-81% accuracy for occupational codes and 63-83% for economic activity codes. Coding 10,000 records took about one minute, and recoding them took 5-10 seconds.
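Recoding through a crosswalk amounts to a table lookup, which can be sketched as follows. This is a minimal illustration, not Procode's implementation: the function name `recode`, the `TOY_CROSSWALK` table, and every code in it are hypothetical placeholders rather than entries from a real classification. The sketch assumes that one source code may map to several target codes and that unknown codes map to nothing.

```python
# Hypothetical crosswalk: source code -> list of candidate target codes.
# None of these identifiers come from a real classification.
TOY_CROSSWALK = {
    "SRC-001": ["TGT-10"],
    "SRC-002": ["TGT-20", "TGT-21"],  # one-to-many mapping
}


def recode(codes, crosswalk):
    """Return, for each input code, the list of target codes it maps to.

    Codes missing from the crosswalk yield an empty list so that the
    caller can flag them for manual review.
    """
    return [list(crosswalk.get(code.strip(), [])) for code in codes]


recoded = recode(["SRC-001", "SRC-002", "SRC-999"], TOY_CROSSWALK)
```

Because each record needs only one dictionary lookup, the cost grows linearly with the number of records, which is consistent with recoding being much faster than coding.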

Objective. Epidemiological studies require data that align with the classifications established for occupations or economic activities. These classifications usually include hundreds of codes and titles. Manual coding of raw data is time consuming and may result in misclassification. The goal was to develop and test a web tool, named Procode, for coding free-text descriptions against classifications and for recoding between different classifications.

Methods. Three text classifiers, i.e. Complement Naive Bayes (CNB), Support Vector Machine (SVM) and Random Forest Classifier (RFC), were investigated using k-fold cross-validation. The data comprised 30 000 free-text descriptions with manually assigned codes from the French classification of occupations (PCS) and the French classification of activities (NAF). For recoding, Procode integrated a workflow that converts codes of one classification to another according to existing crosswalks. Since this is a straightforward operation, only the recoding time was measured.

Results. Among the three investigated text classifiers, CNB performed best, correctly predicting 57-81% of the classification codes for PCS and 63-83% for NAF. SVM led to somewhat lower results (by 1-2%), while RFC correctly coded at most 30% of the data. The coding operation required one minute per 10 000 records, while recoding was faster, taking 5-10 seconds.

Conclusion. The algorithm integrated in Procode showed satisfactory performance, given that the tool had to assign the right code from among 500-700 candidate codes. Based on the results, the authors decided to implement CNB in Procode. In the future, if another classifier shows superior performance, an update will include the required modifications.
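To make the evaluation setup more concrete, the following sketch implements a small Complement Naive Bayes classifier and a k-fold accuracy estimate in plain Python. It is an illustration under stated assumptions, not the authors' code: the whitespace tokenizer, the smoothing value `alpha=1.0`, the interleaved fold assignment, the toy descriptions, and the `CODE_A`/`CODE_B`/`CODE_C` labels are all hypothetical, and the weight normalization sometimes used with CNB is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class ComplementNB:
    """Minimal Complement Naive Bayes over bag-of-words token counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing, assumed value
        self.weights = {}   # label -> {token: log complement probability}

    def fit(self, texts, labels):
        per_class = defaultdict(Counter)  # token counts inside each class
        total = Counter()                 # token counts over all classes
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            per_class[label].update(tokens)
            total.update(tokens)
        grand_total = sum(total.values())
        self.weights = {}
        for label, counts in per_class.items():
            # Complement statistics: token occurrences OUTSIDE this class.
            denom = grand_total - sum(counts.values()) + self.alpha * len(total)
            self.weights[label] = {
                tok: math.log((total[tok] - counts[tok] + self.alpha) / denom)
                for tok in total
            }
        return self

    def predict(self, text):
        scores = {}
        for label, w in self.weights.items():
            # Tokens never seen in training are ignored.
            scores[label] = sum(w[t] for t in tokenize(text) if t in w)
        # The predicted class is the one whose complement fits the text worst.
        return min(scores, key=scores.get)


def k_fold_accuracy(texts, labels, k=3):
    """Mean accuracy over k folds (fold i tests on every k-th record)."""
    n = len(texts)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in test_idx]
        model = ComplementNB().fit(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        hits = sum(model.predict(texts[i]) == labels[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / k


# Toy data: invented descriptions with placeholder labels (not real PCS/NAF codes).
TOY_TEXTS = ["nurse hospital", "nurse clinic", "baker bread",
             "baker pastry shop", "truck driver", "bus driver"]
TOY_LABELS = ["CODE_A", "CODE_A", "CODE_B", "CODE_B", "CODE_C", "CODE_C"]
```

A model fitted with `ComplementNB().fit(TOY_TEXTS, TOY_LABELS)` can then be queried with `predict("hospital nurse")`, and `k_fold_accuracy(TOY_TEXTS, TOY_LABELS, k=3)` returns the mean held-out accuracy. With only six records that figure is not meaningful; the point is the shape of the procedure, in which every record is scored by a model that never saw it during training.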
