SE · CL · Feb 5

ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

Yulong He, Artem Ermakov, Sergey Kovalchuk, Artem Aliev, Dmitry Shalymov

arXiv:2602.05550v2 · 2.9 · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

This work provides a foundational dataset and benchmark for researchers and developers working on code intelligence for the ArkTS programming language, which is crucial for the OpenHarmony ecosystem.

This paper introduces ArkTS-CodeSearch, the first large-scale ArkTS dataset for code retrieval, addressing the lack of public datasets for ArkTS code intelligence. The authors extract comment-function pairs from open-source repositories, evaluate existing code embedding models on them, and fine-tune on ArkTS and TypeScript data to obtain a high-performing model for ArkTS code understanding.

ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Our fine-tuned model and the dataset are available at https://huggingface.co/hreyulog/embedinggemma_arkts and https://huggingface.co/datasets/hreyulog/arkts-code-docstring, respectively.
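To make the single-search task concrete, the sketch below scores a comment-to-function retrieval run with mean reciprocal rank (MRR). The abstract does not say which metric the authors report, so MRR is an assumption here, chosen because it is commonly used for code search. The function names and the two-dimensional toy vectors are hypothetical; in practice the vectors would come from an embedding model applied to the comments and the ArkTS functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(query_vecs, code_vecs):
    """MRR for paired data: query_vecs[i] should retrieve code_vecs[i].

    Assumption: every comment has exactly one correct function, and all
    functions in the pool act as candidates for every comment.
    """
    total = 0.0
    for i, query in enumerate(query_vecs):
        scores = [cosine(query, code) for code in code_vecs]
        # Rank of the correct function: 1 + number of candidates that
        # score strictly higher than it.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(query_vecs)


# Toy embeddings (hypothetical): three comments and their three functions.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codes = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

mrr = mean_reciprocal_rank(queries, codes)
print(f"MRR = {mrr:.4f}")
```

In this toy run the first and third comments rank their own function first, while the second comment ranks it second, so the reciprocal ranks are 1, 1/2, and 1.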
