Curious if someone has a Dockerfile example for building a container to run simple ETL scripts. I imagine it would install the no-dependency version of the arcgis API along with the minimum required dependencies listed in the documentation, as well as other libraries (e.g. pyshp, boto3, sqlalchemy).
I'm new to Docker and am having trouble creating a conda env in Docker from a yml file and then installing the various dependencies. Curious whether folks generally use pip for this, or whether conda works. Some preliminary research suggests conda behaves somewhat differently inside a container than outside one.
Esri has a Docker image for the Python API, if you wanted to skip past all the configuration and just get a container that's ready to go.
Conda is great when you need to keep your envs separate from one another. With Docker, that's less of an issue, since you can just spawn new containers for additional envs. I would just stick to pip in Docker.
What does your yml file look like? And what kinds of issues are you running into?
Thanks @jcarlson
My understanding is the Docker image provided by Esri is designed for running Jupyter in a container. I'm more interested in the stripped down version of the API (i.e. without Jupyter) for doing ETL and other automation scripts for our Portal.
It's a good point about containerization w/r/t conda vs pip. With conda, I was having trouble creating an environment from a basic YML file, activating it, then installing other dependencies - it seemed to be editing the base environment instead (which to your point I guess would be fine).
My YML file only covers the libraries that get installed along with their dependencies. I need to install arcgis separately because I need to pass the --no-deps flag.
My YML file looks like:
I see your env.yml file in there. I usually use files like that too, but with arcgis I find it's best to mostly let it choose its own dependencies. It's fussy about which version of Python it wants, for one. So for arcgis I tend to do something like this and let it install its own python, pandas, numpy, etc.:
conda create --name=new-arcgis-env -c esri arcgis
Then stuff gets piled on later as needed, for example:
conda install --name=new-arcgis-env pillow autopep8 requests
Until something breaks and I start over 🙂 "Breaks" means the new thing demands a different version of something already installed, and if I let it have its way, the arcgis module stops working.
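One way to see what a new install changed is to snapshot the installed package versions before and after, then diff the two. This is only a sketch: the function names are made up for illustration, and it only sees Python packages that ship standard metadata, so non-Python libraries managed by conda would still need "conda list".

```python
import json
from importlib import metadata


def snapshot_versions():
    """Return {package name: version} for the Python packages in this environment."""
    versions = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that carry no name
            versions[name.lower()] = meta.get("Version")
    return versions


def diff_versions(before, after):
    """List (name, old, new) for packages added, removed, or changed."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes


if __name__ == "__main__":
    # Hypothetical workflow: redirect this output to a file before an install,
    # run it again afterwards, then load both files and call diff_versions().
    print(json.dumps(snapshot_versions(), indent=2, sort_keys=True))
```

A None in the old slot means the package is new, and a None in the new slot means it was removed.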
Sometimes I have to have more than one environment for the same project because the ArcGIS module demands some crufty old version of gdal or proj or pandas. Really that's the beauty of environments, you can have many and use each one for different things.
Outside Docker, I generally use conda first and then fall back to pip. They work together, after all: I create the conda environment and activate it, and then any pip commands I run install into the active environment instead of the base.
Inside Docker, the whole point is that one image does one thing only, so there is no need for a separate environment; you just ignore them and work in the base. (In this case "one thing only" means I will only use the image to run Python with ArcGIS, so I don't need any other environments in the image.)
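If you are not sure which environment your installs are landing in, a small check like the one below can help. It is a sketch, and describe_environment is just an illustrative name. CONDA_DEFAULT_ENV and CONDA_PREFIX are set by "conda activate". Inside a Dockerfile each RUN starts a fresh shell, so an activation done in one RUN does not carry over to the next. That is a likely reason installs end up in base.

```python
import os
import sys


def describe_environment():
    """Collect the facts that show which environment this interpreter belongs to."""
    return {
        "executable": sys.executable,  # path of the running python binary
        "prefix": sys.prefix,          # root directory of the environment in use
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
        "conda_prefix": os.environ.get("CONDA_PREFIX", "(not set)"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it with the same python that pip belongs to (python -m pip) shows where a pip install would go.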
I put the project up on GitHub, it's here: https://github.com/Wildsong/docker-conda-arcgis
My Dockerfile looks like this:
FROM continuumio/miniconda3:latest
RUN conda update -c defaults conda -y && \
    conda install -c esri arcgis -y
ARG CACHEBUST=1
RUN git clone https://github.com/Wildsong/docker-conda-arcgis.git /source
WORKDIR /source
CMD [ "python", "versions.py" ]
Doing the git clone is interesting as an exercise, but in practice I would keep my Python files in a folder and mount that folder as a volume onto /source. Maybe you want them set in concrete though, for example to run from a scheduler or as a simple command?
Comments on the Dockerfile: I used a Continuum image because it already has conda and git in it.
I left off "--no-deps" in the conda install step, because I want arcgis to load and work and it won't unless its dependencies are installed.
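The CACHEBUST line lets you do repeated builds without the git step being served from cache. The value has to change between builds for this to work: pass something like --build-arg CACHEBUST=$(date +%s) to docker build, and the instructions after the ARG line (here, the git clone) run again, so you can update the repo and actually get fresh code each time you build.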
The CMD means it runs versions.py by default, but you can override it.
Build command: docker build -t conda-arcgis .
Running: docker run --rm conda-arcgis
This should just print the current arcgis module version number, which is 2.0.1 for me.
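For reference, a script along these lines would do the job. This is not the actual versions.py from the repo, just a sketch of what such a file could contain. The package list is an assumption, and the lookup relies on each package shipping standard metadata (arcgis also exposes arcgis.__version__ if you would rather import it).

```python
from importlib import metadata

# Packages to report on; this list is an assumption, adjust it to your own stack.
PACKAGES = ["arcgis", "pandas", "numpy", "requests"]


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report(packages=PACKAGES):
    """Build one 'name: version' line per package."""
    lines = []
    for name in packages:
        version = installed_version(name)
        lines.append(f"{name}: {version or 'not installed'}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Because missing packages are reported instead of raising, the same script also works as a quick smoke test for a freshly built image.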
If you want to run something else, you can override CMD on the command line, for example with bash so you can look around in the container:
docker run --rm -it conda-arcgis bash
In this trivial case you could skip the git command and directly ADD the Python file:
FROM continuumio/miniconda3:latest
RUN conda update -c defaults conda -y && \
    conda install -c esri arcgis -y
WORKDIR /source
ADD versions.py .
CMD [ "python", "versions.py" ]
But the git thing works too.
PS The ADD command can also read a file directly from a URL, and it knows how to unpack local tar.gz files (archives fetched from a URL are not unpacked).
PPS If you still want to build from an Ubuntu image, there is a conda "git" package, so you don't need to do "apt install git" and all that. Do "conda install git" instead.