CLJul 29, 2024
APE: Active Learning-based Tooling for Finding Informative Few-shot Examples for LLM-based Entity MatchingKun Qian, Yisi Sang, Farima Fatahi Bayat et al.
Prompt engineering is an iterative procedure often requiring extensive manual effort to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and effective approach to providing LLMs with precise instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase a human-in-the-loop tool called APE (Active Prompt Engineering) designed for refining prompts through active learning. Drawing inspiration from active learning, APE iteratively selects the most ambiguous examples for human feedback, which will be transformed into few-shot examples within the prompt. The demo recording can be found with the submission or be viewed at https://youtu.be/OwQ6MQx53-Y.
CLOct 30, 2023
Open Domain Knowledge Extraction for Knowledge GraphsKun Qian, Anton Belyi, Fei Wu et al.
The quality of a knowledge graph directly impacts the quality of downstream applications (e.g. the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from open web at scale. ODKE utilizes a wide range of extraction models and supports both streaming and batch processing at different latency. We reflect on the challenges and design decisions made and share lessons learned when building and deploying ODKE to grow an industry-scale open domain knowledge graph.
IRJan 30, 2019
Learning Fast Matching Models from Weak AnnotationsXue Li, Zhipeng Luo, Hao Sun et al.
This paper proposes a novel training scheme for fast matching models in Search Ads, which is motivated by the real challenges in model training. The first challenge stems from the pursuit of high throughput, which prohibits the deployment of inseparable architectures, and hence greatly limits the model accuracy. The second problem arises from the heavy dependency on human provided labels, which are expensive and time-consuming to collect, yet how to leverage unlabeled search log data is rarely studied. The proposed training framework targets on mitigating both issues, by treating the stronger but undeployable models as annotators, and learning a deployable model from both human provided relevance labels and weakly annotated search log data. Specifically, we first construct multiple auxiliary tasks from the enumerated relevance labels, and train the annotators by jointly learning from those related tasks. The annotation models are then used to assign scores to both labeled and unlabeled training samples. The deployable model is firstly learnt on the scored unlabeled data, and then fine-tuned on scored labeled data, by leveraging both labels and scores via minimizing the proposed label-aware weighted loss. According to our experiments, compared with the baseline that directly learns from relevance labels, training by the proposed framework outperforms it by a large margin, and improves data efficiency substantially by dispensing with 80% labeled samples. The proposed framework allows us to improve the fast matching model by learning from stronger annotators while keeping its architecture unchanged. Meanwhile, our training framework offers a principled manner to leverage search log data in the training phase, which could effectively alleviate our dependency on human provided labels.