Building a System to Generate Russian Language Flashcards from Arbitrary Text - Part One
- Derek Ferguson
- Sep 2, 2018
For a long time, I have wanted something very simple: the ability to paste a bunch of Russian text into an app (song lyrics, movie subtitles, etc.) and have it produce a set of appropriate flashcards for me on the other side. This being Labor Day weekend in the US, I believe I may have time to make significant progress on such a tool. As I do so, I will make relevant technical notes here.
To begin with, we need something that will parse the Russian text and reduce every word in it to its "dictionary form". Why? Well, if we have 1000 words of Russian, for example, we don't want separate flashcards for every possible tense, number, gender and case of the same words. Let's just get flashcards for the "dictionary forms" that require separate memorization. This will also let us de-dupe the list more effectively later.
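To make the idea concrete, here is a minimal sketch of the reduce-then-de-dupe step. It is not the implementation this series builds: the `LEMMAS` table and the `dictionary_forms` function are hypothetical names, and the hand-written lookup table merely stands in for whatever a real morphological analyzer would return.

```python
# Hypothetical, hand-written lookup table for illustration only. In the real
# system, a morphological analyzer would supply the dictionary form.
LEMMAS = {
    "книги": "книга",    # genitive singular / nominative plural of "book"
    "книгу": "книга",    # accusative singular of "book"
    "читал": "читать",   # past tense, masculine, of "to read"
    "читала": "читать",  # past tense, feminine, of "to read"
    "читаю": "читать",   # present tense, first person singular
}


def dictionary_forms(words):
    """Return each distinct dictionary form once, in first-seen order."""
    seen = {}  # dicts preserve insertion order in Python 3.7+
    for word in words:
        key = word.lower()
        # Words missing from the table fall back to their lowercased form.
        lemma = LEMMAS.get(key, key)
        seen.setdefault(lemma, None)
    return list(seen)


text = "Читаю книгу читал книги читала"
print(dictionary_forms(text.split()))  # ['читать', 'книга']
```

Five inflected words collapse to two flashcard candidates, which is exactly the saving we are after.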
The technology I've chosen for this is something called FreeLing. There is a live demo available online and, based on that, it has the Russian abilities I want. For flexibility of deployment, I'd like to start off running this tool at home and then, in the unlikely event it becomes popular, move it to AWS. To ease that move, let's bundle the whole thing up in Docker. Happily, there are already Dockerfiles for this, but they are for version 4.0 of FreeLing instead of the latest (4.1) and, for reasons unknown, limited to Spanish. So we'll deal with those two issues first.
The first thing to understand about the Docker files for FreeLing is that the Makefile builds the actual Dockerfile at run time from all the .config files in this directory. So, as you'll see in the commit history, the way I changed this to use 4.1 was to change the .config files, not the Docker files. Specifically, Dockerfile.m4 needs to be amended to use the "stretch" image of Debian instead of the "jessie" image, which simply means moving from Debian 8 to the newer Debian 9. The same change needs to be made in the freeling.docker file.
The second thing to understand is that the Docker build, as published, removes support for everything except Spanish. We undo this limitation simply by removing the clause "--define=fl-es" from the m4 command in the Makefile. While we're at it, we also remove "--define=py-dv" to strip out the last bits of the Python API.
Side note: after multiple iterations of this Dockerfile build, I began getting an "out of space on device" error. This seemed to be an internal limit on the amount of space allowed to hold images. The quickest way to set the entire Docker system back to empty turned out to be "docker system prune". It is dangerous if you have stuff you may want to keep in your live Docker system, but very quick and powerful.
Next, we need to add in the Java SDK. Step #1 is to get Java itself onto the image. To that end, I changed the "FROM" in Dockerfile.m4 to use the "library/openjdk:8u181-jdk-stretch" image as its base.
At this point, we need the code for the Java APIs. At first I tried using Git to clone the repo into the Docker image, but that took way too much space. Then my attention was drawn to the "svn export" code in the main Makefile. I'd noticed it before, but couldn't figure out what it was doing. Apparently, you can use "svn export" to grab just a portion of a Git repository, provided the service hosting it also speaks Subversion. Consult the oracle of Google for full details, or see the Makefile in my repo.
Finishing the Java SDK build is fairly simple. I've added a file called javafreeling.docker to the mix and, in it, I have added code to copy the downloaded API source into the Docker image and set up a symbolic link from the Java installation to the location where the build script expects to find it. The build itself requires make, swig and build-essential, so I have added those to the main dependencies of the Docker image.
Having said all that: while I was assembling the above, I discovered the analyzer server, which lets you run FreeLing as a network service that provides analysis across the wire. I will be using that for this project from here on.