Language Flashcards from Songs and Movies - Part 2: Deploying the Lemmatizer to Lambda
- Derek Ferguson
- Sep 21, 2019
- 10 min read
So, we left off with our lemmatizer building on Bitbucket Pipelines and passing its tests, but not actually deploying anywhere. As a distinct function, this will be perfect for Lambda - so let's start with that. This will involve two distinct steps. First, we'll change our code to actually *be* a proper Lambda function. Then, we'll modify our Bitbucket setup to perform automatic deployments via our pipeline.
So, the first thing we will try to do is get rid of the need to download the data every time. We'll do this by downloading the data up front and putting it in a specially-named data directory, as per the article cited in the first part of this blog series.
I like to see stuff like this break first, to assure myself that my new approach has fixed it. So, I start by killing this line...
nltk.download("stopwords")
Re-running my 2 tests results in a breakage and a warning that I need to have nltk_data/corpora/stopwords.zip in my home directory. So I create this, re-run, and both tests pass again.
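For reference, the relevant bit of the lemmatizer now just reads the stopwords straight from NLTK's data path - a minimal sketch, and the variable name is mine rather than necessarily what's in the repo:

from nltk.corpus import stopwords

# No nltk.download() call any more - NLTK finds the corpus in ~/nltk_data
# (or wherever NLTK_DATA / its default search path points)
russian_stopwords = set(stopwords.words("russian"))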
Promoting this to Bitbucket Cloud - of course I knew it would break. I add an nltk_data directory to my project, as will be needed for Lambda, but I also add some commands to the testing script in the pipeline to copy this data to the place the warning suggests: /root/nltk_data/corpora/stopwords.zip. After making that change, my tests run to completion on the Bitbucket server.

So - now that we've addressed the need to download the data every time at a code level, let's change the code to be a proper Lambda function. We'll follow the instructions in this article.
So, first we get python-lambda on our development desktop using pip3, and also add that library to the requirements.txt, so we'll have it on the build server.
Then we run "lambda init" and, as the article promises, it adds a few new files to our project.
I sort of like the shape of the new files, so let's move our code over to the new service.py, watch our tests fail, and then make the fixes needed to see them pass again.
One of the first things we realize is that Lambda functions take a different input style than stand-alone functions. So, we can no longer expect "text" to be passed to us as a plain argument.

That's fine, we'll change it to event.get('text') and work with that.
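Roughly, the handler ends up shaped something like this - a sketch rather than the exact contents of my service.py, and the stopword-filtering detail is just illustrative:

from nltk.corpus import stopwords
from pymystem3 import Mystem

# Created at module level so warm Lambda invocations can reuse them
mystem = Mystem()
russian_stopwords = set(stopwords.words("russian"))

def handler(event, context):
    # Lambda hands the request in as a dict, so pull the text out of the event
    text = event.get("text", "")
    lemmas = mystem.lemmatize(text)
    # Drop whitespace tokens and stopwords, keep the interesting lemmas
    return [lemma for lemma in lemmas if lemma.strip() and lemma not in russian_stopwords]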
I have completely gutted "lemmatize.py" to populate my new Lambda service file, so the tests are definitely broken. I, therefore, completely delete "lemmatize.py" and start re-working my test file to point to the new service.py and pass in the JSON structure that it needs to run.
The python-lambda library provided event.json for this purpose, so I redo it to have a single string property called "text" which contains a sample of Russian text.
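Mine ends up about as simple as an event can get - a single property, with the sample sentence being arbitrary:

{"text": "я вижу"}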

Running "lambda invoke" now gives me the lemmatized result that I expect for the text above.
I want to be able to run these as standard tests on the build server, so I change my test script to create very simple event objects and pass them in, just as Lambda will - only I don't bother mocking up anything other than "text", since that is all we will use.
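Something along these lines - a sketch, since the real test file and sample text differ, and the class and test names here are only illustrative:

import unittest
import service

class LemmatizerTests(unittest.TestCase):
    def test_lemmatizes_russian_text(self):
        # A plain dict stands in for the Lambda event; "text" is the only key the handler reads
        event = {"text": "я вижу"}
        lemmas = service.handler(event, None)  # the context argument isn't used, so None is fine
        self.assertIn("видеть", lemmas)

if __name__ == "__main__":
    unittest.main()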

Running the tests now gets a pass on the development machine. Promoting this to the build server works fine there, also. :-)
So, now we have to get this deploying to Lambda from Bitbucket Cloud. The challenge here will be scripting the promotion - and, more specifically, doing so without inadvertently exposing our AWS credentials!
So, I look in the config.yaml file that was created when we ran "lambda init" and I see places for aws_access_key_id and aws_secret_access_key, but a note says that if I leave these blank, the script will take the values from a config file in my home directory. Clearly, I don't want to put my credentials in a file I'm going to check in - but a little more digging on the Internet suggests that setting environment variables with the right names for these two values will cause them to be picked up from there instead.
So, I'll start by trying this approach from my desktop. As per my usual process, first I run the script ("lambda deploy") with no environment variables set and, sure enough - after running for a few minutes it errors out with "Unable to locate credentials." So now I set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to my credential values and re-try. It gets past that.
Unfortunately, it bombs with "Request must be smaller than 69905067 bytes for the CreateFunction operation." So... wow - somehow or another this bundles up more than 69 MB of content? That's unexpected. I look in my IDE and find that a "dist" folder has been added, and it contains a bunch of ZIP files... these must be the artifacts it is trying to upload to Lambda. They are 57.7 MB each. 57 certainly isn't 69 (presumably the ZIP gets base64-encoded into the request, which inflates it), but it is a lot more than I would have thought for such a basic function, so I'll head down this path for a bit to see what might be saved. First, though, I add that new "dist" folder to my .gitignore, so it won't get checked into Bitbucket. :-)
The Internet is such a great resource. :-) Turns out that - whatever this message may say - we're really dealing with a limit of about 50 MB for a directly-uploaded package. So... yeah... I gotta find some stuff I can delete out of this ZIP. So, I pop it open and look.
This tells me that several libraries are made available by Lambda itself. It references this list, which shows me that botocore is one of them and, according to the ZIP folder sizes, removing it alone will buy me over 40 MB. So, I can try just deleting this folder. The question is: how can I do this in a scripted manner? I really liked just using "lambda deploy", but that tries to do the ZIP generation and the deploy in one fell swoop. I need it to build the ZIP, let me tear some stuff out of it, and then do the deploy.
After churning on this for a bit, I reach the conclusion that it would be much simpler and more maintainable to just head down the path of using the S3 version of this command to publish to S3 and let Lambda consume the code from there. So, step #1, I create a new S3 bucket - just taking all of the defaults. I call it "lemmatizer". Yes, I'm aware I keep going back and forth between that spelling and "lemmatizor"... I haven't made up my mind yet. I know this is going to cause me at least one heinous bug in the future, though - will have to standardize that sooner rather than later. (Watch this series for when that bug finally nails me and I spend 3 days trying to figure out what I'm doing wrong because I didn't fix this sooner. :-) )
I set the name of this bucket as my S3 bucket in the config.yaml file. The S3 key prefix mentioned in that file is an arbitrary string that will be put at the start of the name of each of your Lambda functions in the S3 bucket - so, I guess, using this you could have a single bucket shared by multiple Lambda functions. I then try to redeploy using "lambda deploy-s3". The code goes into S3, but I get an error that the role defined for the function can't be assumed by Lambda.
To fix this, I first uncomment the line in config.yaml that tells it to use the role "lambda_basic_execution". I then go to the IAM window in the AWS console and look for a role called "lambda_basic_execution". I find it there, but inspecting the attached policies - they are quite messed up. I don't know how that happened (some previous project, I'm sure), so I delete the role and re-create it, attaching only the "AWSLambdaBasicExecutionRole" policy, and save it. Running "lambda deploy-s3" now says it succeeds in creating my function! I wonder how it behaves when the function already exists, so I re-run it. Thankfully, it says that it updated the function. So, I appear to have everything working locally. Time to push to Bitbucket Cloud and get it working as a pipeline!
Step 1 - I add "lambda deploy-s3" to the bottom of the script in "bitbucket-pipelines.yml". I know it will fail without environment variables, though. So, I go into the Bitbucket Cloud interface and choose Settings, then Deployments. I see places for 3 kinds of deployments, with the ability to add more. I think "Production" must be what we want, because that is how the deployment step in bitbucket-pipelines.yml is labeled. I expand that and see a place to add environment variables. I add my 2 AWS variables with "Secured" checked, so no one can extract those values from the UI, and then check in my latest code to kick off a build and see what I forgot. :-)
Well, everything deployed on the first attempt... I'm shocked!

Let's just verify that it actually works, and then this installment in the series will be done.
So, to test this out, I go to the Lambda portion of the AWS console. I see that it is set to use Python 2 - that can't be right, since we built against a Python 3 image. Let me change that and re-deploy. This is the "runtime" line in "config.yaml". I change it and re-push my changes to Bitbucket.
Refreshing the AWS console page shows the runtime is now 3.6. So, I select my function and choose "Test" from the "Actions" drop down. I create a new test event which is exactly the JSON I used when I got unit tests running in the previous installment in this series. I then run the test with that JSON.
It blows up, but for a reason I never would have anticipated. Apparently the pymystem3 library I am using is just a Python wrapper for a native Russian lemmatizer that is built and distributed by Yandex. As a result, it looks like the first thing it tries to do is download that binary into a directory it wants to create. That directory location is read-only for Lambda functions but, beyond that, we *really* don't want it downloading software on every execution. So, we need to fetch the binary ourselves, once, and point the library at it. I download the binary from the location I see in the library's own download code and use a parameter in the constructor (mystem_bin) to tell it where I've put it in the file system. I push this up to Bitbucket and watch the rebuild. I feel like if the unit tests pass, I'll probably get past this on Lambda, also.
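The change itself is tiny - something like this, where the path just has to match wherever the downloaded binary sits in the project (the path shown is illustrative):

from pymystem3 import Mystem

# Use the binary we committed alongside the code instead of letting
# pymystem3 try to download one at runtime
mystem = Mystem(mystem_bin="./mystem")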
To my pleasant surprise, the unit tests pass and the Python code gets past that issue, too... and breaks on the next line. :-) So now it is complaining about the actual stop words collection. I see that I forgot to add the environment variable referenced by this article. That's easily done from the console - but how do I script it, so the whole thing is CI/CD? There's an "environment_variables" section right in "config.yaml", so I'll put it in there.
This still doesn't work, but I see it has changed the error message so that it IS looking in the right place now. I am perplexed for a couple of minutes, then I realize that the Lambda bundler that created the ZIP probably wasn't smart enough to know that I needed that data directory. I pop open the zip and, sure enough, it isn't in there. This is a little disappointing, as it is one way in which my unit tests just HAPPENED to work, even though the real deployment wasn't going to.
So, I look through "config.yaml" again and find the "source_directories" setting under "build", which appears to be exactly for such a circumstance. I add my nltk_data folder and try again. I just run "lambda build" locally after changing config.yaml, though, so I can peek in the ZIP locally and see if it is fixed before going through the whole deploy - which is now taking about 5 minutes end-to-end for each test.
Popping open the ZIP shows that my data folder is, indeed, now included - so I proceed with a commit to Bitbucket, so CI can try the same thing. I have to make a slight additional adjustment to my deployment script to create a corpora directory in which to put the zip file. After I do that and redeploy, my Lambda function now gets... 3 lines further before failing. :-/

So, it looks like a simple permissions issue. I add a "chmod 777 mystem" line to my "bitbucket-pipelines.yml" file and try again. No joy.
I need a better way to troubleshoot this than waiting for the CI/CD process to promote to actual Lambda every time. Let's look at the Docker image that is supposed to be almost identical to the Lambda execution environment.
Reading the docs, it looks to me like I'm going to need to get the image, unzip one of the Lambda deployment zip files, write up an escaped-JSON version of my custom event, and then invoke the Docker container using the appropriate command line for Python 3.7, as shown in the link above - with the addition of my environment variable for the NLTK data. Piece of cake!
* Getting the image = easy!
* Unzipping the ZIP = do a "lambda build", then unzip, but don't forget to create a sub-folder called "corpora" under the nltk_data folder and copy the zip over there.
* Custom event = the code only really looks for the text property, so it is as simple as '{"text":"я вижу"}'
And with that, I have a line that exactly reproduces my issue locally!
docker run --rm -e "NLTK_DATA=./nltk_data" -v "$PWD":/var/task lambci/lambda:python3.7 service.handler '{"text":"я вижу"}'
So... what to try now to actually fix the issue? :-( One thing that is cool is that the actual Python script ("service.py") is available down in the dist folder, so I first try changing the line that gives the location of mystem to something nonsensical, just to see if the error changes. It does - now it says the file doesn't exist. So, clearly the system was able to find the file, but not execute it.
docker run -v "$PWD":/var/task -it --entrypoint /bin/bash lambci/lambda:python3.7
This trick yields a nice insight! After logging in, going to /var/task and trying to run ./mystem yields the same permissions error. Running chmod 777 on it resolves the error.
So, I follow a variation on the approach in this article and add some Python code that copies and chmods the mystem executable. Running that in my local Docker container results in success! So, I push that up to Bitbucket Cloud and wait for the deployment to try it. And... it succeeds! :-) The first execution takes almost 4 seconds, but each subsequent execution only takes 100 ms, so I am confident that the right things are being cached.
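For reference, the copy-and-chmod logic is roughly this - a sketch of the idea rather than my exact code, assuming the binary ships in the deployment package and /tmp is used as the writable scratch space Lambda provides:

import os
import shutil
from pymystem3 import Mystem

BUNDLED_MYSTEM = os.path.join(os.path.dirname(__file__), "mystem")
RUNTIME_MYSTEM = "/tmp/mystem"

# /var/task is read-only in Lambda, so copy the binary somewhere writable
# and make it executable; on a warm container the copy is already in place
if not os.path.exists(RUNTIME_MYSTEM):
    shutil.copyfile(BUNDLED_MYSTEM, RUNTIME_MYSTEM)
    os.chmod(RUNTIME_MYSTEM, 0o755)

mystem = Mystem(mystem_bin=RUNTIME_MYSTEM)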
And with this, we now have a fully-operational Lambda function that takes in Russian text and returns just the lemmas for it.