Seyi Olojo

CL Sep 8, 2024

Socially Responsible Data for Large Multilingual Language Models

Andrew Smart, Ben Hutchinson, Lameck Mbangula Amugongo et al.

Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been termed "low resource languages" or "long-tail languages", and LLM performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data on languages spoken by previously colonized people and indigenous people, and on non-Western languages more broadly, raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship and our own work to outline several relevant social, cultural, and ethical considerations and potential ways to address them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.

73.7 CY Apr 1

Translating With Feeling: Centering Translator Perspectives within Translation Technologies

Daniel Chechelnitsky, Sireesh Gururaja, Seyi Olojo et al.

Rapid development of Large Language Models (LLMs) and similar automated approaches for translation tasks is increasingly affecting the landscape of translation technologies. As concerns about the outsourcing of translator work to these automated translation tools grow, it becomes crucial to gather insights from the translation community directly. To this end, we conduct an interview study with 19 professional translators working across 11 languages and 11 domains to understand their perspectives, experiences, and concerns with using translation technologies in their work. We find that translators are cautious when incorporating new tools into their workflow. Several express concern that machine translation (MT) and LLMs are infringing on the necessary human aspects and verification steps of translation, and that these tools could have harmful downstream effects by compromising the human aspect of translation work. These findings demonstrate the need to develop translation technologies that directly serve translators' needs rather than replacing human translation. This can be done by focusing more on the assistive, rather than the automating, aspects of these tools.

2 Papers