Language Flashcards from Songs and Movies - Part 1: Bitbucket Pipelines and Python Unit Tests
- Derek Ferguson
- Sep 17, 2019
- 4 min read
For a while now, I have had a "grand vision" of something I want. I want to be able to point at a set of movie subtitles or song lyrics in Russian and get a set of flashcards that I can practice to learn the relevant vocabulary prior to watching the movie, listening to the song, etc. Architected correctly, this should be easily extensible to any other human language... English, for example. It should also be extensible to any of the myriad flashcard generators available out on the Internet (Quizlet, for example).
Step #1 in this process has to be taking all the words in the body of text and returning them to "dictionary form" (lemmatization, as those in the know call it), otherwise one would wind up with separate flashcards for "running", "ran" and "run" (or бежать, побежал and бегу, as the case may be), for example. Also, some items will be named entities (people, places, etc.) - so we won't need flashcards for those at all.
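To make that concrete, here is a toy sketch of why lemmatization matters for card counts. The hand-made lemma table below is just a stand-in for a real lemmatizer, and the word list is invented for illustration:

```python
# Toy illustration: without lemmatization, each inflected form becomes its
# own flashcard. The lemma table here is a hand-made stand-in for a real
# lemmatizer (NLTK, pymystem3, etc.).
lemmas = {"running": "run", "ran": "run", "run": "run",
          "бегу": "бежать", "побежал": "бежать"}

words = ["running", "ran", "бегу", "побежал", "run"]

# Raw words: five candidate cards. Lemmatized: only two.
cards = {lemmas.get(w, w) for w in words}
print(sorted(cards))
```

Five surface forms collapse down to two dictionary forms, which is exactly the deduplication we want before generating cards.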
For any of this, we will need a language parser. I've heard good things about the Stanford NLP library, so I started off looking at that. But, first I found this article that says lemmatization via the Java API will only work in English, and this other post - actually written in Russian - that talks about how primitive the author found the lemmatization in general. So - I return to the "oracle of Google" and find that NLTK appears to deliver exactly that which I desire.
Now, the thing with this is that it is going to run as a single-execution process rather than a constantly-running server, so it will be a good fit for Lambda if-and-only-if I can excise from it everything that is going to make for a slow startup. Line 9 in the example seems pretty deadly to this...
nltk.download("stopwords")
Somehow I have to avoid this downloading its runtime data every time. I realize that I am high in the running for an award for simply copy/pasting code off the Internet at this point, but this article seems to already be proposing a solution whereby I can pre-download the data, bundle it with my Lambda function on deployment and set an environment variable telling NLTK to get the data from there.
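The bundling idea can be sketched roughly like this. The `nltk_data` directory name and the build-time download command in the comment are assumptions on my part, not taken from the article:

```python
import os

# Sketch: point NLTK at pre-bundled data instead of downloading at runtime.
# Assumption: the corpora were fetched at build time into ./nltk_data
# (e.g. `python -m nltk.downloader -d ./nltk_data stopwords`) and shipped
# inside the Lambda deployment package.
bundled_data = os.path.join(os.getcwd(), "nltk_data")
os.environ["NLTK_DATA"] = bundled_data

# The nltk import must happen *after* NLTK_DATA is set, so the library
# reads from the bundled path instead of reaching out over the network:
# import nltk
# from nltk.corpus import stopwords
```

With that in place, the slow `nltk.download("stopwords")` call at startup can go away entirely.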
Let's give it a whirl - first on my desktop and then on Lambda.
First, I run "python3" in a terminal window, because I can't remember if I have Python on this laptop or not. I get the Python prompt, so - clearly I'm OK to at least start there.
Part of my motivation in starting this process is to get a little experience with Bitbucket Pipelines and Python unit testing, so I'm going to do this on a Bitbucket repo instead of my usual JavaDerek repo on GitHub. I name my new Bitbucket identity... PythonDerek. :-) I create a repo called "lemmatizor" and run the suggested git command to initialize it in the directory I created for this on my laptop.
I copy the script as-is from the page above and try running it with "python3 lemmatize.py" and I'm told that I need the nltk library. That makes sense, but I want this whole thing to be auto-deployed via Bitbucket Pipelines, so I won't run pip manually - I'll set up the right configuration for that.
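For pip-driven builds, that configuration is just a requirements.txt at the repo root. Based on what ends up being needed later, a minimal version would look something like this (unpinned here for illustration; real projects should pin versions):

```text
# requirements.txt
nltk
pymystem3
```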
Comparing this article with this article, I see that I need a bitbucket-pipelines.yml file. They simply disagree a little on the required content for that file. I like what I see on the lambci Docker repo site, in terms of this being particularly suited for Lambda use - so I decide to try this approach.
I write a simple pipeline that is intended just to run unit tests (which I haven't even written yet) and try committing and pushing it.

After doing this, I go straight to the Bitbucket UI and see the Pipelines link on the left-hand nav bar. Clicking it tells me I need to enable Pipelines by clicking "Enable", so I do - and then I see that my pipeline appears to be building.
It ends with a red exclamation mark, so I figure that can't be good - but there's no immediate explanation visible. I click around until I see under "Deployments" that it didn't find a step named "deployment" and was unhappy about that. I'd seen that in one of the two articles and wasn't sure if I needed it - so let's go back and add that, push and see what happens.
I see another failure, but no great explanation as to why. I removed all the Node references in the example from article #2 above, but surely need a Python-based replacement for them. I will have to look at this component next.
After a certain amount of struggling, trying to figure out which bit of this was disliked, I narrowed down on the image itself. So, in order to get this running, I will have to proceed with a slightly more vanilla Python image and follow instructions intended for that. I go with the "python:3.6" image and all suddenly starts working!
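For anyone following along, a minimal working pipeline along these lines might look like the sketch below. The step name and the tests/ directory layout are my assumptions, not a copy of the actual file:

```yaml
# bitbucket-pipelines.yml - minimal sketch, assuming tests live in tests/
image: python:3.6

pipelines:
  default:
    - step:
        name: Run unit tests
        script:
          - pip install -r requirements.txt
          - cd tests && python -m unittest discover
```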
Once I get the initial build working, I add a couple of Python tests. The only bits worth noting here are that I have to do a trick with running "sys.path.append('../')" in order to let my test scripts access the lemmatize script that I actually want to unit test. Also, in requirements.txt, I have to add "pymystem3" and "nltk" for everything to build and run properly. Both tests pass. :-)
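The path trick looks roughly like this in a test file. The tests/ subdirectory layout and the test name are assumptions; the import of the module under test is left commented out so the sketch runs standalone:

```python
import sys
import unittest

# The tests live in a subdirectory (tests/, by assumption), so the parent
# directory holding lemmatize.py has to be appended to the module search
# path before the import can resolve.
sys.path.append('../')
# import lemmatize  # the script under test

class LemmatizeTest(unittest.TestCase):
    def test_parent_dir_is_on_path(self):
        # Sanity check that the path trick took effect.
        self.assertIn('../', sys.path)
```

Because the append happens at module import time, it has to sit above the `import lemmatize` line, not inside a test method.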

Now to try hooking this up to Lambda!
