# Lagarrue's GoT Cantonese Online Corpus

Lagarrue's GoT Cantonese Online Corpus Website: got.jyutdict.org

This website is an open-source project. Issues, stars, and forks for building other online corpora are welcome.

This paper was presented at the 27th International Conference on Yue Dialects at Ohio State University on December 1, 2023 (Eastern Time). The related conference paper is published in Buckeye East Asian Linguistics 9 (BEAL 9) by Ohio State University.

Presenter: @以成

This paper is a supplementary study to "The Affiliation of Cantonese at the Sino-Vietnamese Border in the Late 19th Century". The main study was presented at the 25th International Conference on Yue Dialects at The Chinese University of Hong Kong on December 18, 2021. The related conference paper is published in Volume 102, Issue 2 of Current Research in Chinese Linguistics (CrCL) by The Chinese University of Hong Kong.

# 1. Abstract

The linguistic diversity of the Gulf of Tonkin (GoT) is documented in detail in late Qing materials, notably Lagarrue’s (1900) textbook, which writes Cantonese in the Vietnamese alphabet, departing significantly from the standard use of the Latin alphabet. This valuable resource contains over 2,400 vocabulary items, 2,500 unique characters with pronunciation, pronunciation guides, dialogues, and classical Chinese pleadings with Cantonese phonetics written in the Vietnamese alphabet. The corpus also includes trilingual vocabulary, idioms translated into French, and a comparison with late 19th-century Guangzhou Cantonese.

The study develops a comprehensive pre-processing workflow for Lagarrue’s corpus in three stages: technology-enhanced text organization (manual organization, optical character recognition (OCR), and machine translation), conversion of Lagarrue’s text to Jyutping++, and extraction of linguistic insights through statistical analysis. The methodology includes a Jyutping++ transcription scheme designed for enhanced reversibility and frequency priority, a Vietnamese alphabet decomposing algorithm, useful regular expression patterns for Jyutping++, and an open-access online corpus with search capabilities for researchers worldwide (got.jyutdict.org).
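To give a rough idea of what decomposing the Vietnamese alphabet can involve, here is a minimal sketch in Python. It is not the algorithm used in the paper, whose details are not given here; it only assumes that a useful first step is to separate the tone diacritic from the rest of a syllable by means of Unicode normalization. The function name `decompose_syllable` and the tone labels are hypothetical and chosen for illustration.

```python
import unicodedata
from typing import Optional, Tuple

# Combining marks that Vietnamese orthography uses for tones.
# The English labels are illustrative, not taken from the paper.
TONE_MARKS = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0303": "tilde",
    "\u0309": "hook above",
    "\u0323": "dot below",
}


def decompose_syllable(syllable: str) -> Tuple[str, Optional[str]]:
    """Split a syllable into (letters without the tone mark, tone label)."""
    # NFD breaks precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC rejoins vowel-quality marks (circumflex, breve, horn) with their
    # base vowels, so letters such as ê and ơ survive as single characters.
    base = unicodedata.normalize("NFC", "".join(kept))
    return base, tone


if __name__ == "__main__":
    for word in ["tiếng", "việt", "đường", "ma"]:
        print(word, decompose_syllable(word))
```

Keeping the tone apart from the segmental letters makes it easier to map each part to a target transcription separately, and to rebuild the original spelling afterwards. The letter đ has no canonical Unicode decomposition, so it passes through unchanged in this sketch.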

Preliminary linguistic findings (Lai et al., 2023) include the merger of the rhymes 豪 and 侯, the merger of the 陽 rhyme with the colloquial reading of the 梗 class, and noticeable instances of the historical rising tone (古上聲). These findings highlight significant phonological characteristics of the Cantonese dialect at the Sino-Vietnamese border in the late 19th century. They also underscore the importance of the pre-processing workflow, which facilitates deeper dialectal exploration, and the value of digitization and open-source efforts in linguistic research.

Keywords: Éléments de Langue Chinoise: Dialecte Cantonais, late Qing Cantonese, Vietnamese alphabet, historical corpus, pre-processing

# 2. Citation

You are welcome to cite this project. If the cited content relates to the pre-processing and corpus linguistics of this project, please cite the supplementary study; if it relates to the historical linguistics, please cite the main study; if it relates to the website code, please cite the GitHub repository.

# 2.1 The supplementary study

MLA 8th:

Huang, Junxin, and Joeng-zit Lai. "Evolving Pre-processing of Raw Corpus: The Digitization Initiative of Cantonese Material at the Sino-Vietnamese Border in the Late 19th Century." Buckeye East Asian Linguistics, vol. 9, Nov. 2024, pp. 32–51.

APA:

Huang, J., & Lai, J. (2024, November). Evolving Pre-processing of Raw Corpus: The Digitization Initiative of Cantonese Material at the Sino-Vietnamese Border in the Late 19th Century. Buckeye East Asian Linguistics, 9, 32–51.

# 2.2 The main study

MLA 8th:

Lai, Joeng-zit, et al. "The Affiliation of Cantonese at the Sino-Vietnamese Border in the Late 19th Century." Current Research in Chinese Linguistics, vol. 102, no. 2, July 2023, DOI: 10.29499/CrCL.202307_102(2).0004.

APA:

Lai, J., Wòng, P., Huang, J., & Ng, G.-O. (2023, July). The Affiliation of Cantonese at the Sino-Vietnamese Border in the Late 19th Century. Current Research in Chinese Linguistics, 102(2). https://doi.org/10.29499/CrCL.202307_102(2).0004

# 2.3 GitHub repository

MLA 8th:

Jyutdict Editorial Board IT Workgroup of Lingnaam Jyutjam. Lagarrue's GoT Cantonese Online Corpus. Version v0.1.0, GitHub, 27 Nov. 2024, https://github.com/JyutdictEB/GoTCorpus. Accessed [YOUR ACCESS DATE].

APA:

Jyutdict Editorial Board IT Workgroup of Lingnaam Jyutjam. (2024, November 27). Lagarrue's GoT Cantonese Online Corpus (Version v0.1.0). GitHub. https://github.com/JyutdictEB/GoTCorpus

Last Updated: 11/28/2024, 5:10:37 AM