Location, Location, Location
For my work with CrossCut, I’ve needed to get some R scripts running on big EC2 instances to do some geospatial processing. This involves loading in a bunch of files of various formats (CSVs, TIFs, and also SHP files). The R scripts were written by a colleague; they naturally are tested first on his laptop and assume that data is going to be read from the local filesystem.
As expected, we’re storing the inputs needed in S3. At first, I thought this meant I’d replace the local filesystem load function with S3 equivalents (R has several s3 libraries to choose from: paws and aws.s3). There’s a nice helper function in the aws.s3 library that lets you save the file to local disk and load it in one go. So I thought I’d go through the code, consolidate the external resource loading into one place, and then would swap in cloud equivalents. Basically, it would look like this:
#before: local load
This proved to be a pain because Esri SHP files contain references to other files that also need to be pulled to run locally. For example,
my-shape-file.shp might contain references to
my-shape-file.dbf. If all of these are not present, loading the SHP file fails.
At this point, I thought I’d have to write a function in R that would take a shapefile, download it, and be smart enough to pull in any other files with the same name but different extensions. Not terribly complicated, but not fun to do in a language you don’t know that well. And it would have made the split between running locally and running in the cloud even greater.
When describing the problem to my wife, who’s also a software developer, she asked a simple but insightful question: “do you have to do that in R?” I responded at once with a firm “yes!” but then thought about it a bit. “Actually, I can do this in the entrypoint bash script…”
This proved to be a much simpler and elegant solution. When the docker container is launched, its entrypoint script fetches the data needed from S3 using the
aws-cli tools. Instead of iterating over files and being clever, we simply pull down all the data for a given country locally, and put it where the R script expects to find it. The relevant bits of the entrypoint look something like this: