Week 1: Community Bonding at Apertium @ GSoC ’21

Gourab Chakraborty
4 min read · May 28, 2021
@ https://github.com/apertium/

We joined an (unofficial) WhatsApp group and Discord server for all the Indian GSoC-ers. We got familiar with everyone's experience of the onboarding process and shared information on the various organisations, their work culture, setting up payments, and previous years' experiences.

Officially, we were added to a GSoC mailing list by the GSoC program administrators. It contains GSoC-ers from all countries and all previous years.

I got in touch with my mentor and other members/developers at Apertium via the IRC channel.

This week I focused on getting familiar with the work I was about to do. Apertium has a very well documented wiki at https://wiki.apertium.org/wiki/Main_Page. The project has moved from SourceForge to GitHub, the migration is complete, and all the repositories are well documented on GitHub too. Our IRC channel, #apertium, also shifted from Freenode to OFTC this week, and the developers have already migrated all the links on the site.

I had to go through a few papers on NLP and machine translation for my project. Most of the papers specific to Apertium aren't released yet and I'm not allowed to share them as of now, but a useful paper I found on lexical selection is: https://core.ac.uk/download/pdf/40024362.pdf

A very useful tool for Apertium language-pair developers is the Apertium Viewer, available as either a Java or a C++ build. It is a GUI-based program that shows every step in the Apertium RBMT pipeline while translating between a language pair, and it lets you edit the stream at any step. It saves the hassle of running a terminal command for every stage of the pipeline and instead shows all of them at once in a neat interface. For further details check out: https://wiki.apertium.org/wiki/Apertium-viewer

To elaborate further, Apertium consists of both the management of the pipeline (the main apertium executable) and all the stages in this pipeline, except where outside tools (such as HFST for morphological analysis and generation, or CG for morphological disambiguation) are used. Each stage consists of a general processor which modifies the stream based on hand-crafted “rules” (coded linguistic generalisations) for a given language or language pair. The steps in the Apertium RBMT architecture that you can see and edit in the Apertium Viewer are:

1. Input text

2. Morphological analyzer

3. Morphological disambiguator

4. Discontiguous multiword processing

5. Lexical transfer

6. Lexical selection

7. Morphological generator

8. Post-generator

9. Output text

[Figure: The architecture of Apertium, a transfer-based machine translation system. Each rounded box is a module available for language-specific or pair-specific development; broken lines show optional modules, and lines with arrows represent the flow of data through the pipeline.]

The pipeline is quite well documented at https://wiki.apertium.org/wiki/Apertium_system_architecture.
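
To get a feel for what the viewer automates, here is a rough command-line walk-through of those stages. Treat it as a sketch: the pair name (eng-spa) and file names follow the usual Apertium naming conventions but are hypothetical here, and the exact commands and their order come from each pair's modes.xml, so they differ between pairs.

```sh
# Hypothetical eng-spa pair; a real pair defines this pipeline in its modes.xml.
echo "a test sentence" |
  lt-proc eng-spa.automorf.bin |            # morphological analysis
  apertium-tagger -g eng-spa.prob |         # morphological disambiguation
  apertium-pretransfer |                    # multiword handling
  lt-proc -b eng-spa.autobil.bin |          # lexical transfer (bidix lookup)
  lrx-proc eng-spa.autolex.bin |            # lexical selection
  apertium-transfer -b eng-spa.t1x eng-spa.t1x.bin |  # structural transfer
  lt-proc -g eng-spa.autogen.bin |          # morphological generation
  lt-proc -p eng-spa.autopgen.bin           # post-generation
```

Pairs that use constraint grammar for disambiguation put a cg-proc step before (or instead of) the statistical tagger, and pairs with chunking add interchunk and postchunk stages after the structural transfer.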

I'll be using VISL CG3 for morphological disambiguation. An installation guide (for Mac, Windows, and Linux) is available at https://mikalikes.men/how-to-install-visl-cg3-on-mac-windows-and-linux/
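
To give a flavour of what CG3 rules look like, here is a minimal disambiguation grammar. The lemma lists and tags are illustrative, not taken from a real pair:

```
DELIMITERS = "<.>" "<!>" "<?>" ;

LIST Det = det ;
LIST Noun = n ;
LIST Verb = vblex ;

# If the previous word is a determiner, keep the noun reading
# and drop any competing verb reading.
SELECT Noun IF (-1 Det) ;
REMOVE Verb IF (-1 Det) ;
```

A grammar like this is compiled with cg-comp and applied to the Apertium stream with cg-proc.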

My main objective this week at Apertium was to study the tools I'd be using, like CG3, in depth, along with the rules for lexical selection, transfer syntax, and a few extra topics like Hidden Markov Models (HMMs) for POS tagging. Once again, all of it is well documented at https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules, https://wiki.apertium.org/wiki/A_long_introduction_to_transfer_rules and https://wiki.apertium.org/wiki/Transfer_rules_examples
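
As a taste of the formalism from the first link, here is the shape of a rule in Apertium's lrx lexical-selection format. The lemmas are invented for illustration; a rule matches source-language context and selects one of the translations offered by the bilingual dictionary:

```xml
<rules>
  <rule>
    <!-- if "bank" appears right after the noun "river"... -->
    <match lemma="river" tags="n.*"/>
    <match lemma="bank" tags="n.*">
      <!-- ...select the (hypothetical) "riba" translation for it -->
      <select lemma="riba"/>
    </match>
  </rule>
</rules>
```

Rules like this are compiled with lrx-comp and applied in the pipeline with lrx-proc, as in the sketch above.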

We have a few crash courses scheduled by the mentors to help us GSoC students get familiar with various modules, like the recursive transfer module. It's a recently developed alternative to the shallow structural transfer module (chunker, interchunk, and postchunk). Its linguistic data is specified as context-free grammars (CFGs), and it uses a generalized LR (GLR) parser rather than finite-state chunking to implement long-distance reordering more effectively.
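
From what I've read so far, the rules for this module live in .rtx files, where each rule is roughly a CFG production paired with an output pattern. A toy sketch (my own illustration; see the apertium-recursive documentation for the actual syntax of a full rule file):

```
! Build an NP from an adjective and a noun (% marks the head),
! and output them in the opposite order with a blank in between.
NP -> adj %n { 2 _ 1 } ;
```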

We will figure out a more detailed, week-by-week schedule with our mentor around this weekend. As of now, my key task at hand is identifying the constructions that require a transfer operation (i.e. reordering, dropping, or adding words, or changing morphological tags, e.g. changing the verbal tense or the case). This is very important because we have to decide how we will handle the verbal paradigms. Without solving this issue we won't be able to tell how many morphological tags we'll have to change, add, or remove (right now, a huge number of them would have to be changed in the transfer).
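
The classic example of such a construction is adjective-noun reordering between languages that place adjectives on different sides of the noun. Abbreviated, this is what the corresponding rule looks like in a shallow transfer (.t1x) file; a complete file also declares attributes and variables, and the category names here are my own:

```xml
<section-def-cats>
  <def-cat n="adj"><cat-item tags="adj"/></def-cat>
  <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
</section-def-cats>

<section-rules>
  <rule comment="ADJ NOM -> NOM ADJ">
    <pattern>
      <pattern-item n="adj"/>
      <pattern-item n="nom"/>
    </pattern>
    <action>
      <out>
        <!-- output the noun first, then a blank, then the adjective -->
        <lu><clip pos="2" side="tl" part="whole"/></lu>
        <b/>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
      </out>
    </action>
  </rule>
</section-rules>
```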

As of now the work is going smoothly and an exciting summer lies ahead. I'm not allowed to share the internal details, but I'll try to document my work as much as I can, since very few resources on this are available to everyone.
