Language Flashcards from Songs and Movies - Part 4: Handling Russian Verbal Aspect
- Derek Ferguson
- Oct 20, 2019
- 4 min read
I must begin this article with a bit of an aside about the use of verbal aspect in the Russian language. I am no language scholar, but this has a specific touch point with my use of technology here, and it is an excellent illustration of the value of splitting even a relatively simple system like this one into multiple, loosely coupled components that can easily be augmented, removed, or replaced as needed.
In a nutshell, I obtained a spreadsheet of the 10,000 most common Russian words - in order of frequency - from this Reddit thread. Looking at it, I am reminded (not that any student of the Russian language could ever forget) that verbs have multiple "aspects" in Russian. Put simply, you can think of aspect as telling you whether a given action has been completed or is ongoing - not unlike the perfect and imperfect tenses in English, except that Russian... sort of takes this to the next level.
For example, in English, if I said "I will be going to school," there is some ambiguity there - do I mean that I will shortly be getting up to go to a school for the day, or that I have plans to embark upon an ongoing course of study? If I said "last night I read my book," do I mean that I read the entire thing, or simply that I read it for a while? In Russian, the verbs for these actions have prefixes - their aspects - that tell you conclusively which meaning is intended. Almost all of the verbs have these forms.
Now that you've heard my (overly simplistic) explanation of verbal aspect in Russian - what relevance does this have to this build? Well, the question arose as to whether the lemmatizer we finished in the previous part of this series would consider a verb used in two different aspects to be two lemmas or one. I added a unit test to my test suite that runs the Russian sentence "Я пил и она выпила" ("I was drinking and she drank (to completion)") through and verified that the lemmatizer provided by Yandex considers different aspects to equate to different lemmas (['выпивать', 'пить']).
This means that I need to take my source data, which treats separate aspects as the same verb, and perform some preprocessing to split these rows into separate ones. I load everything into a new Python script via the csv library's DictReader class and lose 30 minutes right off the bat because there is garbage at the start of the file that gets loaded as unprintable characters in my first column heading. After I sort this out, I'm able to read all 10,000 rows easily enough.
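That "garbage at the start of the file" is the classic signature of a UTF-8 byte-order mark, which some spreadsheet exports prepend. A small sketch of the symptom and the usual fix (the `rank`/`word` column names here are illustrative, not the real spreadsheet's):

```python
import csv
import io

# Simulate a CSV exported with a UTF-8 byte-order mark (BOM) at the start.
raw_bytes = "rank,word\n1,быть\n2,я\n".encode("utf-8-sig")

# Decoding with plain utf-8 leaves the BOM glued to the first heading...
bad_reader = csv.DictReader(io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8"))
assert "\ufeffrank" in bad_reader.fieldnames  # unprintable junk in the heading

# ...while the 'utf-8-sig' codec strips it. With a real file this would be:
# open("words.csv", encoding="utf-8-sig", newline="")
good_reader = csv.DictReader(io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8-sig"))
rows = list(good_reader)
assert rows[0]["rank"] == "1" and rows[0]["word"] == "быть"
```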
Fixing up the rows to remove the notes on parts of speech and generate separate rows in a brand new dictionary that is keyed on the Russian verbs (since that is what will be fed from the lemmatizer) turns out to be a straightforward Python string parsing exercise. The whole thing is coded in less than 30 minutes, once the garbage character hurdle is cleared. I save the dictionary to a Pickle file and move on to the task of building a Lambda that will drain the SQS queue of lemmas and output flashcard suggestions.
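The splitting step can be pictured roughly like this. The real spreadsheet's column layout isn't shown in the article, so the row format below (an aspect pair sharing one cell, plus a part-of-speech note) is a hypothetical stand-in; the shape of the output - one dictionary entry per verb form, keyed on what the lemmatizer will emit - is the point.

```python
import pickle

# Hypothetical input rows: the real spreadsheet's columns differ, but the idea
# is the same - one row may carry both aspects of a verb in a single cell.
rows = [
    {"rank": "74", "word": "пить/выпить (verb)"},
    {"rank": "1", "word": "быть (verb)"},
]

flashcard_data = {}
for row in rows:
    # Strip the part-of-speech note: "пить/выпить (verb)" -> "пить/выпить"
    word = row["word"].split(" (")[0]
    # One dictionary entry per aspect, keyed on the form the lemmatizer emits
    for form in word.split("/"):
        flashcard_data[form] = {"rank": int(row["rank"])}

# Persist the dictionary for the Lambda to load later
with open("flashcards.pkl", "wb") as f:
    pickle.dump(flashcard_data, f)

assert set(flashcard_data) == {"пить", "выпить", "быть"}
```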
This function performs exactly the same architectural role as the previous one: reading from one queue, doing some processing, and writing to another queue. So, I create a new repo, copy the previous function's files over, and begin hacking out the pieces that don't apply - like any references to the Yandex lemmatizer.
I add a single test to verify that my stripped-down function returns exactly the string that is passed into it. This won't last; I just want to have something in my tests before I deploy. The initial deployment fails because I forgot to set my AWS credentials under the Production deployment section of my Bitbucket configuration. I fix that, redeploy, and am presented with a surprising new error. Because I don't need any pre-set environment variables in config.yaml this time, I had removed the variables... it turns out that I need at least one in order to be able to add the name of the flashcard queue. So I add a temporary one and redeploy.
It fails this time because the installer.py script expects the function to already exist when it runs. So, I change the bitbucket-pipelines.yml file to invert the execution order of the lambda and queue creation scripts. Redeploying results in a successful deployment of both the function and our new queue!
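The reordering can be pictured as a fragment of bitbucket-pipelines.yml along these lines - the deploy script name below is illustrative (only installer.py is named in the article), and the real pipeline's step layout may differ:

```yaml
pipelines:
  branches:
    master:
      - step:
          script:
            # Deploy the Lambda first, so that installer.py finds an
            # existing function when it runs (order inverted from before)
            - python deploy_lambda.py   # illustrative name
            - python installer.py
```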


Unfortunately, as you can see - our messages are still stuck in the lemmas queue. :-(
The issue ultimately turned out to be in the code: it was still parsing messages in the format used by the previous version of the function, rather than the format actually arriving on this queue.
The finished code starts by loading the pickle of the formatted Python dictionary described above. It loads it into a variable defined outside the actual Lambda handler function, so it lives as long as the container instance. This will help if and when the function becomes busier, by avoiding the need to read the data in on every invocation.
The actual handler function reads in the passed event and uses string parsing to find the body portion, where the Russian lemmas are. It loads this list into a true Python array and passes it to a helper function to process the lemmas into proper flashcard items.
The actual processing is tremendously easy, as it just requires looking up each lemma in the dictionary retrieved from the Pickle file and saving the associated data points to a new dictionary of flashcard items.
The only catch is that, judging from my small Russian sample, a lot of the words are not among the most common 20,000 lemmas! So it is important to catch Python's KeyError exception and create a "limited knowledge" flashcard entry in these cases. Later, this might become a separate path onto a separate queue for additional processing - like calling a translation service or something similar.
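Putting those pieces together, the finished function can be sketched roughly as follows. This is a simplified stand-in, not the project's actual code: the sample pickle at the top replaces the real preprocessing output, the ranks are made up, and I'm using json.loads where the article says the real code does its own string parsing of the SQS record body.

```python
import json
import pickle

# Stand-in for the preprocessing step's output (illustrative entries only)
with open("flashcards.pkl", "wb") as f:
    pickle.dump({"пить": {"rank": 74}, "выпивать": {"rank": 1500}}, f)

# Loaded at module scope, outside the handler, so the dictionary survives for
# the lifetime of the Lambda container instead of being re-read per invocation.
with open("flashcards.pkl", "rb") as f:
    FLASHCARD_DATA = pickle.load(f)

def build_flashcards(lemmas):
    """Look up each lemma; fall back to a 'limited knowledge' entry on a miss."""
    cards = {}
    for lemma in lemmas:
        try:
            cards[lemma] = FLASHCARD_DATA[lemma]
        except KeyError:
            # Not among the most common lemmas - could later be routed to a
            # separate queue for translation-service processing instead.
            cards[lemma] = {"limited_knowledge": True}
    return cards

def handler(event, context):
    cards = {}
    for record in event["Records"]:
        # The SQS record body carries the list of Russian lemmas
        lemmas = json.loads(record["body"])
        cards.update(build_flashcards(lemmas))
    # The real function writes these to the flashcard SQS queue; returning
    # them here keeps the sketch self-contained.
    return cards

# Simulated SQS event with one known lemma and one unknown one
event = {"Records": [{"body": json.dumps(["пить", "страдать"])}]}
cards = handler(event, None)
assert cards["пить"] == {"rank": 74}
assert cards["страдать"] == {"limited_knowledge": True}
```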
Once this is deployed, I am able to drop test verbiage into the first queue and get flashcard suggestions dropped into the second queue.

The next step is to pick up the flashcard objects from the new queue and use them to actually generate flashcards in one or more flashcard services - like Quizlet, Memrise, etc.