Chao-Lin Liu

CL
h-index1
24papers
263citations
Novelty20%
AI Score24

24 Papers

AIApr 30, 2022
Using Wordle for Learning to Design and Compare Strategies

Chao-Lin Liu

Wordle is a very popular word game that is owned by the New York Times. We can design parameterized strategies for solving Wordle, based on probabilistic, statistical, and information-theoretical information about the games. The strategies can handle a reasonably large family of Wordle-like games both systematically and dynamically, meaning that we do not rely on precomputations that may work for non-fixed games. More specifically, the answer set can be arbitrary, not confining to the current 2315 words. The answer words may include any specific number of letters (does not have to be five), and the set of symbols that form the words does not have to be limited to only the English alphabet. Exploring possible strategies for solving the Wordle-like games offers an attractive learning challenges for students who are learning to design computer games. This paper will provide the results of using two families of parameterized strategies to solve the current Wordle, using the simulator that abides by the hard-mode rules as the baseline. The baseline simulator used an average of 4.078 guesses to find the 2315 answers, and needed more than six trials to solve the game 1.77% of the time. The best performing strategy of ours used an average of 3.674 guesses to find the 2315 answers, and failed 0.65% of the time.

CVDec 10, 2020Code
HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents

Chia-Wei Tang, Chao-Lin Liu, Po-Sen Chiu

The information provided by historical documents has always been indispensable in the transmission of human civilization, but it has also made these books susceptible to damage due to various factors. Thanks to recent technology, the automatic digitization of these documents are one of the quickest and most effective means of preservation. The main steps of automatic text digitization can be divided into two stages, mainly: character segmentation and character recognition, where the recognition results depend largely on the accuracy of segmentation. Therefore, in this study, we will only focus on the character segmentation of historical Chinese documents. In this research, we propose a model named HRCenterNet, which is combined with an anchorless object detection method and parallelized architecture. The MTHv2 dataset consists of over 3000 Chinese historical document images and over 1 million individual Chinese characters; with these enormous data, the segmentation capability of our model achieves IoU 0.81 on average with the best speed-accuracy trade-off compared to the others. Our source code is available at https://github.com/Tverous/HRCenterNet.

CLSep 14, 2024
An empirical evaluation of using ChatGPT to summarize disputes for recommending similar labor and employment cases in Chinese

Po-Hsien Wu, Chao-Lin Liu, Wei-Jie Li

We present a hybrid mechanism for recommending similar cases of labor and employment litigations. The classifier determines the similarity based on the itemized disputes of the two cases, that the courts prepared. We cluster the disputes, compute the cosine similarity between the disputes, and use the results as the features for the classification tasks. Experimental results indicate that this hybrid approach outperformed our previous system, which considered only the information about the clusters of the disputes. We replaced the disputes that were prepared by the courts with the itemized disputes that were generated by GPT-3.5 and GPT-4, and repeated the same experiments. Using the disputes generated by GPT-4 led to better results. Although our classifier did not perform as well when using the disputes that the ChatGPT generated, the results were satisfactory. Hence, we hope that the future large-language models will become practically useful.

CLApr 29, 2025
Labeling Case Similarity based on Co-Citation of Legal Articles in Judgment Documents with Empirical Dispute-Based Evaluation

Chao-Lin Liu, Po-Hsien Wu, Yi-Ting Yu

This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains like labor disputes. We propose a new approach leveraging the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, utilizing cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs' accusations, defendants' rebuttals, and points of disputes. The evaluation demonstrates that the recommender, with finetuned text embedding models and a reasonable BiLSTM module can recommend labor cases whose similarity was measured by the co-citation of the legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.

CLOct 11, 2024
Similar Phrases for Cause of Actions of Civil Cases

Ho-Chien Huang, Chao-Lin Liu

In the Taiwanese judicial system, Cause of Actions (COAs) are essential for identifying relevant legal judgments. However, the lack of standardized COA labeling creates challenges in filtering cases using basic methods. This research addresses this issue by leveraging embedding and clustering techniques to analyze the similarity between COAs based on cited legal articles. The study implements various similarity measures, including Dice coefficient and Pearson's correlation coefficient. An ensemble model combines rankings, and social network analysis identifies clusters of related COAs. This approach enhances legal analysis by revealing inconspicuous connections between COAs, offering potential applications in legal research beyond civil law.

CLJul 22, 2020
When Classical Chinese Meets Machine Learning: Explaining the Relative Performances of Word and Sentence Segmentation Tasks

Chao-Lin Liu, Chang-Ting Chu, Wei-Ting Chang et al.

We consider three major text sources about the Tang Dynasty of China in our experiments that aim to segment text written in classical Chinese. These corpora include a collection of Tang Tomb Biographies, the New Tang Book, and the Old Tang Book. We show that it is possible to achieve satisfactory segmentation results with the deep learning approach. More interestingly, we found that some of the relative superiority that we observed among different designs of experiments may be explainable. The relative relevance among the training corpora provides hints/explanation for the observed differences in segmentation results that were achieved when we employed different combinations of corpora to train the classifiers.

CLAug 28, 2019
Onto Word Segmentation of the Complete Tang Poems

Chao-Lin Liu

We aim at segmenting words in the Complete Tang Poems (CTP). Although it is possible to do some research about CTP without doing full-scale word segmentation, we must move forward to word-level analysis of CTP for conducting advanced research topics. In November 2018 when we submitted the manuscript for DH 2019 (ADHO), we collected only 2433 poems that were segmented by trained experts, and used the segmented poems to evaluate the segmenter that considered domain knowledge of Chinese poetry. We trained pointwise mutual information (PMI) between Chinese characters based on the CTP poems (excluding the 2433 poems, which were used exclusively only for testing) and the domain knowledge. The segmenter relied on the PMI information to the recover 85.7% of words in the test poems. We could segment a poem completely correct only 17.8% of the time, however. When we presented our work at DH 2019, we have annotated more than 20000 poems. With a much larger amount of data, we were able to apply biLSTM models for this word segmentation task, and we segmented a poem completely correct above 20% of the time. In contrast, human annotators completely agreed on their annotations about 40% of the time.

CLAug 28, 2019
Classical Chinese Sentence Segmentation for Tomb Biographies of Tang Dynasty

Chao-Lin Liu, Yi Chang

Tomb biographies of the Tang dynasty provide invaluable information about Chinese history. The original biographies are classical Chinese texts which contain neither word boundaries nor sentence boundaries. Relying on three published books of tomb biographies of the Tang dynasty, we investigated the effectiveness of employing machine-learning methods for algorithmically identifying the pauses and terminals of sentences in the biographies. We consider the segmentation task as a classification problem. Chinese characters that are and are not followed by a punctuation mark are classified into two categories. We applied a machine-learning-based mechanism, the conditional random fields (CRF), to classify the characters (and words) in the texts, and we studied the contributions of selected types of lexical information to the resulting quality of the segmentation recommendations. This proposal presented at the DH 2018 conference discussed some of the basic experiments and their evaluations. By considering the contextual information and employing the heuristics provided by experts of Chinese literature, we achieved F1 measures that were better than 80%. More complex experiments that employ deep neural networks helped us further improve the results in recent work.

CLSep 18, 2017
Flexible Computing Services for Comparisons and Analyses of Classical Chinese Poetry

Chao-Lin Liu

We collect nine corpora of representative Chinese poetry for the time span of 1046 BCE and 1644 CE for studying the history of Chinese words, collocations, and patterns. By flexibly integrating our own tools, we are able to provide new perspectives for approaching our goals. We illustrate the ideas with two examples. The first example show a new way to compare word preferences of poets, and the second example demonstrates how we can utilize our corpora in historical studies of the Chinese words. We show the viability of the tools for academic research, and we wish to make it helpful for enriching existing Chinese dictionary as well.

CLSep 17, 2017
Character Distributions of Classical Chinese Literary Texts: Zipf's Law, Genres, and Epochs

Chao-Lin Liu, Shuhua Zhang, Yuanli Geng et al.

We collect 14 representative corpora for major periods in Chinese history in this study. These corpora include poetic works produced in several dynasties, novels of the Ming and Qing dynasties, and essays and news reports written in modern Chinese. The time span of these corpora ranges between 1046 BCE and 2007 CE. We analyze their character and word distributions from the viewpoint of the Zipf's law, and look for factors that affect the deviations and similarities between their Zipfian curves. Genres and epochs demonstrated their influences in our analyses. Specifically, the character distributions for poetic works of between 618 CE and 1644 CE exhibit striking similarity. In addition, although texts of the same dynasty may tend to use the same set of characters, their character distributions still deviate from each other.

DLSep 9, 2017
Matrix and Graph Operations for Relationship Inference: An Illustration with the Kinship Inference in the China Biographical Database

Chao-Lin Liu, Hongsu Wang

Biographical databases contain diverse information about individuals. Person names, birth information, career, friends, family and special achievements are some possible items in the record for an individual. The relationships between individuals, such as kinship and friendship, provide invaluable insights about hidden communities which are not directly recorded in databases. We show that some simple matrix and graph-based operations are effective for inferring relationships among individuals, and illustrate the main ideas with the China Biographical Database (CBDB).

CLNov 19, 2016
Tracking Words in Chinese Poetry of Tang and Song Dynasties with the China Biographical Database

Chao-Lin Liu, Kuo-Feng Luo

Large-scale comparisons between the poetry of Tang and Song dynasties shed light on how words, collocations, and expressions were used and shared among the poets. That some words were used only in the Tang poetry and some only in the Song poetry could lead to interesting research in linguistics. That the most frequent colors are different in the Tang and Song poetry provides a trace of the changing social circumstances in the dynasties. Results of the current work link to research topics of lexicography, semantics, and social transitions. We discuss our findings and present our algorithms for efficient comparisons among the poems, which are crucial for completing billion times of comparisons within acceptable time.

CLAug 28, 2016
Quantitative Analyses of Chinese Poetry of Tang and Song Dynasties: Using Changing Colors and Innovative Terms as Examples

Chao-Lin Liu

Tang (618-907 AD) and Song (960-1279) dynasties are two very important periods in the development of Chinese literary. The most influential forms of the poetry in Tang and Song were Shi and Ci, respectively. Tang Shi and Song Ci established crucial foundations of the Chinese literature, and their influences in both literary works and daily lives of the Chinese communities last until today. We can analyze and compare the Complete Tang Shi and the Complete Song Ci from various viewpoints. In this presentation, we report our findings about the differences in their vocabularies. Interesting new words that started to appear in Song Ci and continue to be used in modern Chinese were identified. Colors are an important ingredient of the imagery in poetry, and we discuss the most frequent color words that appeared in Tang Shi and Song Ci.

CLNov 5, 2015
Color Aesthetics and Social Networks in Complete Tang Poems: Explorations and Discoveries

Chao-Lin Liu, Hongsu Wang, Wen-Huei Cheng et al.

The Complete Tang Poems (CTP) is the most important source to study Tang poems. We look into CTP with computational tools from specific linguistic perspectives, including distributional semantics and collocational analysis. From such quantitative viewpoints, we compare the usage of "wind" and "moon" in the poems of Li Bai and Du Fu. Colors in poems function like sounds in movies, and play a crucial role in the imageries of poems. Thus, words for colors are studied, and "white" is the main focus because it is the most frequent color in CTP. We also explore some cases of using colored words in antithesis pairs that were central for fostering the imageries of the poems. CTP also contains useful historical information, and we extract person names in CTP to study the social networks of the Tang poets. Such information can then be integrated with the China Biographical Database of Harvard University.

CLNov 4, 2015
Mining Local Gazetteers of Literary Chinese with CRF and Pattern based Methods for Biographical Information in Chinese History

Chao-Lin Liu, Chih-Kai Huang, Hongsu Wang et al.

Person names and location names are essential building blocks for identifying events and social networks in historical documents that were written in literary Chinese. We take the lead to explore the research on algorithmically recognizing named entities in literary Chinese for historical studies with language-model based and conditional-random-field based methods, and extend our work to mining the document structures in historical documents. Practical evaluations were conducted with texts that were extracted from more than 220 volumes of local gazetteers (Difangzhi). Difangzhi is a huge and the single most important collection that contains information about officers who served in local government in Chinese history. Our methods performed very well on these realistic tests. Thousands of names and addresses were identified from the texts. A good portion of the extracted names match the biographical information currently recorded in the China Biographical Database (CBDB) of Harvard University, and many others can be verified by historians and will become as new additions to CBDB.

CLOct 11, 2015
Textual Analysis for Studying Chinese Historical Documents and Literary Novels

Chao-Lin Liu, Guan-Tao Jin, Hongsu Wang et al.

We analyzed historical and literary documents in Chinese to gain insights into research issues, and overview our studies which utilized four different sources of text materials in this paper. We investigated the history of concepts and transliterated words in China with the Database for the Study of Modern China Thought and Literature, which contains historical documents about China between 1830 and 1930. We also attempted to disambiguate names that were shared by multiple government officers who served between 618 and 1912 and were recorded in Chinese local gazetteers. To showcase the potentials and challenges of computer-assisted analysis of Chinese literatures, we explored some interesting yet non-trivial questions about two of the Four Great Classical Novels of China: (1) Which monsters attempted to consume the Buddhist monk Xuanzang in the Journey to the West (JTTW), which was published in the 16th century, (2) Which was the most powerful monster in JTTW, and (3) Which major role smiled the most in the Dream of the Red Chamber, which was published in the 18th century. Similar approaches can be applied to the analysis and study of modern documents, such as the newspaper articles published about the 228 incident that occurred in 1947 in Taiwan.

CLApr 8, 2015
Exploring Lexical, Syntactic, and Semantic Features for Chinese Textual Entailment in NTCIR RITE Evaluation Tasks

Wei-Jie Huang, Chao-Lin Liu

We computed linguistic information at the lexical, syntactic, and semantic levels for Recognizing Inference in Text (RITE) tasks for both traditional and simplified Chinese in NTCIR-9 and NTCIR-10. Techniques for syntactic parsing, named-entity recognition, and near synonym recognition were employed, and features like counts of common words, statement lengths, negation words, and antonyms were considered to judge the entailment relationships of two statements, while we explored both heuristics-based functions and machine-learning approaches. The reported systems showed robustness by simultaneously achieving second positions in the binary-classification subtasks for both simplified and traditional Chinese in NTCIR-10 RITE-2. We conducted more experiments with the test data of NTCIR-9 RITE, with good results. We also extended our work to search for better configurations of our classifiers and investigated contributions of individual features. This extended work showed interesting results and should encourage further discussion.

CLApr 8, 2015
Mining and discovering biographical information in Difangzhi with a language-model-based approach

Peter K. Bol, Chao-Lin Liu, Hongsu Wang

We present results of expanding the contents of the China Biographical Database by text mining historical local gazetteers, difangzhi. The goal of the database is to see how people are connected together, through kinship, social connections, and the places and offices in which they served. The gazetteers are the single most important collection of names and offices covering the Song through Qing periods. Although we begin with local officials we shall eventually include lists of local examination candidates, people from the locality who served in government, and notable local figures with biographies. The more data we collect the more connections emerge. The value of doing systematic text mining work is that we can identify relevant connections that are either directly informative or can become useful without deep historical research. Academia Sinica is developing a name database for officials in the central governments of the Ming and Qing dynasties.

AIFeb 27, 2013
State-space Abstraction for Anytime Evaluation of Probabilistic Networks

Michael P. Wellman, Chao-Lin Liu

One important factor determining the computational complexity of evaluating a probabilistic network is the cardinality of the state spaces of the nodes. By varying the granularity of the state spaces, one can trade off accuracy in the result for computational efficiency. We present an anytime procedure for approximate evaluation of probabilistic networks based on this idea. On application to some simple networks, the procedure exhibits a smooth improvement in approximation quality as computation time increases. This suggests that state-space abstraction is one more useful control parameter for designing real-time probabilistic reasoners.

AIJan 30, 2013
Using Qualitative Relationships for Bounding Probability Distributions

Chao-Lin Liu, Michael P. Wellman

We exploit qualitative probabilistic relationships among variables for computing bounds of conditional probability distributions of interest in Bayesian networks. Using the signs of qualitative relationships, we can implement abstraction operations that are guaranteed to bound the distributions of interest in the desired direction. By evaluating incrementally improved approximate networks, our algorithm obtains monotonically tightening bounds that converge to exact distributions. For supermodular utility functions, the tightening bounds monotonically reduce the set of admissible decision alternatives as well.

AIJan 30, 2013
Incremental Tradeoff Resolution in Qualitative Probabilistic Networks

Chao-Lin Liu, Michael P. Wellman

Qualitative probabilistic reasoning in a Bayesian network often reveals tradeoffs: relationships that are ambiguous due to competing qualitative influences. We present two techniques that combine qualitative and numeric probabilistic reasoning to resolve such tradeoffs, inferring the qualitative relationship between nodes in a Bayesian network. The first approach incrementally marginalizes nodes that contribute to the ambiguous qualitative relationships. The second approach evaluates approximate Bayesian networks for bounds of probability distributions, and uses these bounds to determinate qualitative relationships in question. This approach is also incremental in that the algorithm refines the state spaces of random variables for tighter bounds until the qualitative relationships are resolved. Both approaches provide systematic methods for tradeoff resolution at potentially lower computational cost than application of purely numeric methods.

CLOct 22, 2012
Some Chances and Challenges in Applying Language Technologies to Historical Studies in Chinese

Chao-Lin Liu, Guantao Jin, Qingfeng Liu et al.

We report applications of language technology to analyzing historical documents in the Database for the Study of Modern Chinese Thoughts and Literature (DSMCTL). We studied two historical issues with the reported techniques: the conceptualization of "huaren" (Chinese people) and the attempt to institute constitutional monarchy in the late Qing dynasty. We also discuss research challenges for supporting sophisticated issues using our experience with DSMCTL, the Database of Government Officials of the Republic of China, and the Dream of the Red Chamber. Advanced techniques and tools for lexical, syntactic, semantic, and pragmatic processing of language information, along with more thorough data collection, are needed to strengthen the collaboration between historians and computer scientists.

CLOct 20, 2012
Hidden Trends in 90 Years of Harvard Business Review

Chia-Chi Tsai, Chao-Lin Liu, Wei-Jie Huang et al.

In this paper, we demonstrate and discuss results of our mining the abstracts of the publications in Harvard Business Review between 1922 and 2012. Techniques for computing n-grams, collocations, basic sentiment analysis, and named-entity recognition were employed to uncover trends hidden in the abstracts. We present findings about international relationships, sentiment in HBR's abstracts, important international companies, influential technological inventions, renown researchers in management theories, US presidents via chronological analyses.

CLOct 15, 2012
Opinion Mining for Relating Subjective Expressions and Annual Earnings in US Financial Statements

Chien-Liang Chen, Chao-Lin Liu, Yuan-Chen Chang et al.

Financial statements contain quantitative information and manager's subjective evaluation of firm's financial status. Using information released in U.S. 10-K filings. Both qualitative and quantitative appraisals are crucial for quality financial decisions. To extract such opinioned statements from the reports, we built tagging models based on the conditional random field (CRF) techniques, considering a variety of combinations of linguistic factors including morphology, orthography, predicate-argument structure, syntax, and simple semantics. Our results show that the CRF models are reasonably effective to find opinion holders in experiments when we adopted the popular MPQA corpus for training and testing. The contribution of our paper is to identify opinion patterns in multiword expressions (MWEs) forms rather than in single word forms. We find that the managers of corporations attempt to use more optimistic words to obfuscate negative financial performance and to accentuate the positive financial performance. Our results also show that decreasing earnings were often accompanied by ambiguous and mild statements in the reporting year and that increasing earnings were stated in assertive and positive way.