How to schedule data update - CSV file (not in AGOL)

mdun · ‎10-05-2022

Hello,

I have a very simple need, but I can't find in the documentation an overall description of how to achieve it: we need to update, once a day, the data that feeds a service, and this data is in a CSV file hosted on our ArcGIS Server. The first import of the CSV (it contains lat & lon fields) is done manually, using the "Add CSV and create a hosted feature layer" function in Portal.

I'm a beginner with Python scripts, API use and all this stuff... What I have found so far is that I need to create a View Layer from the Hosted Feature layer (to avoid possible lock issues when the data is overwritten), create a Python script, and schedule it (using Windows scheduler) on the machine hosting my ArcGIS Server.

So basically, what I want to script and schedule is the "Update Data" > "Overwrite Entire Layer" Portal function:

Does the machine hosting ArcGIS Server & Portal need a specific environment/set up? It currently has Python 2.7 (folder C:\python27\ArcGISx6410.9)
Where can I find tutorial and/or samples of Python script for the "Update Data" > "Overwrite Entire Layer" Portal function? Not bits of code, but a full script that helps to understand the full logic (where to put this script, how to run it, etc). For example this link (https://developers.arcgis.com/python/samples/overwriting-feature-layers/) seems to match, but when I try to run the first lines, it blocks at "ImportError: No module named arcgis.gis". Then when I search for this issue, I find that this code is for ArcGIS API for Python, which is installed along with ArcGIS Pro, but what I need is to use the Python script on the server...

Context: ArcGIS Enterprise server (ArcGIS Server site federated with a portal) 10.9.1 (Windows).

Many thanks!

Martin

Brian_Wilson · ‎10-05-2022

Breaking it down and selecting only the easiest part. 🙂 You won't get anywhere without the module, so install it. Personally I have adopted "conda" everywhere and I suggest you do the same, because it lets you install Python + modules on any machine without disrupting what is already there. But that's another wrinkle, and maybe you already have it?

If you do then on the server you could create an environment and install the modules you need in it,

conda create --name=load_csv
conda activate load_csv
conda install -c esri arcgis

When you do the first command it will create an empty environment, not even any Python (unless your "base" environment has Python in it.) The second command activates the new empty environment, and the third installs about 100 packages, including arcgis and the version of Python that arcgis prefers.

At that point you are in an activated environment. It has Python and the arcgis 2.x module installed. You can start an interpreter and test it, these are commands I use in a Linux bash shell,

python --version
Python 3.8.13

python
import arcgis
arcgis.__version__
'2.0.1'

The trick now is getting the right python to run when you invoke it from a scheduled task. With the environment activated, I can find out where the Python I am using lives with "which python" in Linux or "get-command python" in PowerShell (or "where python" in a plain Windows command prompt). Then you can invoke that Python by its complete path in your scheduled task. Invoking the full path should also drag along the modules installed in the conda environment.
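Another way to get that path is to ask Python itself. This is only a small sketch: run it once with the environment activated and copy the printed interpreter path into the scheduled task. The function name interpreter_info is made up for illustration.

```python
import os
import sys


def interpreter_info():
    """Return the running interpreter's path and its environment folder."""
    return {
        "executable": os.path.abspath(sys.executable),  # full path to python / python.exe
        "prefix": sys.prefix,  # root folder of the active environment
    }


if __name__ == "__main__":
    info = interpreter_info()
    print("Interpreter:", info["executable"])
    print("Environment:", info["prefix"])
```

If the environment folder printed is not the load_csv one, the wrong Python is being picked up.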

If you don't have conda installed on your server I suggest you start there, it's really going to save pain down the road. To ignore this advice and press on you could try doing "pip install arcgis". You probably need to use elevated permissions.

Brian_Wilson · ‎10-05-2022

BTW you can install the "arcpy" module the same way but it won't work without solving the licensing issues. By contrast you can install "arcgis" anywhere and it will work.

Brian_Wilson · ‎10-05-2022

"I've a very simple need"... ha ha ha ha -- oh sorry. My projects always start that way too. I usually say "This should be easy".

It happens that I need a way to automate transfer of CSV data too but have to work on more pressing projects right now. If you give me feedback on what I wrote above you can probably keep me going here and it will help both of us.

mdun · ‎10-06-2022

Thank you very much Brian, so let's first try to create an environment and install the modules with Conda. I'll try to run a test script then and post here my progress!

mdun · ‎10-06-2022

So I managed to set up the environment I think. But when I run the script, the first error I get is:

ImportError: No module named arcgis.gis

However the module is there:

And here is the script I try to run (created from bits and pieces, there will be surely other errors after...):

# Import libraries
from arcgis.gis import GIS
from arcgis import features
import pandas as pd

# Connect to the GIS
gis = GIS(url='https://XXX.XXX.org/portal')

csv_file = r'C:\CSV_TIR\POI.csv'
csv_item = gis.content.get('XXX')
# csv_item = target.content.get(csv_item['XXX'])

from arcgis.features import FeatureLayerCollection
csv_featurelayercoll = FeatureLayerCollection.fromitem(csv_item)

#call the overwrite() method which can be accessed using the manager property
csv_featurelayercoll.manager.overwrite(csv_file)

mdun · ‎10-06-2022

In fact I'm not sure I run the script correctly (when I said I'm new with Python...).

If I do instead:

(load_csv) C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\Scripts>python "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\Scripts\overwrite_feature_layer.py"

I get this error:

Traceback (most recent call last):
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\Scripts\overwrite_feature_layer.py", line 10, in <module>
    csv_item = gis.content.get('XXX')
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\__init__.py", line 5880, in get
    raise e
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\__init__.py", line 5870, in get
    item = self._portal.get_item(itemid)
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\_impl\_portalpy.py", line 1416, in get_item
    return self.con.post("content/items/" + itemid, self._postdata())
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\_impl\_con\_connection.py", line 1412, in post
    force_bytes=kwargs.pop("force_bytes", False),
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\_impl\_con\_connection.py", line 900, in _handle_response
    self._handle_json_error(data["error"], errorcode)
  File "C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\envs\load_csv\lib\site-packages\arcgis\gis\_impl\_con\_connection.py", line 923, in _handle_json_error
    raise Exception(errormessage)
Exception: You do not have permissions to access this resource or perform this operation.
(Error Code: 403)

(where "XXX" is the id of the layer of course).
Is this second way to run the script more correct?

Brian_Wilson · ‎10-07-2022

From a "best practices" perspective you should store your python scripts somewhere not privileged; C:\Program Files is probably the worst place. Also, personally I hate paths with spaces in them, they always cause problems for me eventually somewhere. (So, Microsoft blew it when they named it "Program Files"! 🙂)

Anyway -- just don't do that, put them someplace like for example C:\Users\YOURACCOUNT\scripts

I usually keep them with the project I am working on so I can find them later. C:\Users\MYACCOUNT\Documents\MYPROJECTNAME\scripts\

Since your command line prompt says (load_csv) on it, I assume you got conda going, activated the environment, and added the modules. That means that in that command window python works from anywhere; you don't need to be in the folder C:\Program Files\ArcGIS\Server\framework\runtime\ArcGIS\bin\Python\Scripts, and that is probably the wrong place anyway.

If you do "conda info" it will tell you everything it knows about the active set up, the first two lines will show the environment which is probably "load_csv" for you and then the location, for me right now it's

$ conda info
    active environment : arcgispro29
    active env location : C:\Users\bwilson\.conda\envs\arcgispro29
.
.
.

If I run python while the conda environment is activated, then it will point at the one in "C:\Users\bwilson\.conda\envs\arcgispro29"

It's probably running the right python in your test (the one in the load_csv environment). To get it to run the wrong one (the one in Program Files) you would need to put a dot in front like ".\python" to say "use the python right here in this folder not the one on my PATH".

I think the problem now is that you are not logging in and you are asking for access to a restricted item in the portal. Try adding a username and password, for example "gis = GIS(url, username, password)"

The error code you see (403) is an HTTP "Forbidden" error, so the script is running, it is reaching the web server, and the URL is correct, but there are permissions issues.

You could wrap the GIS call in a try / except block to get a friendlier message,

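from arcgis.gis import GIS

url = "https:///////etc/etc/etc"
username = "joe"
password = "secret"  # "pass" is a reserved word in Python, so it can't be a variable name
try:
    gis = GIS(url, username, password)  # the class is GIS (upper case)
except Exception as e:
    print("Login failed,", e)
    raise SystemExit(1)  # stop here with a non-zero exit code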
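If you would rather not type the password into the script itself, one option is to read it from environment variables. This is only a sketch under assumptions: the variable names PORTAL_URL, PORTAL_USER and PORTAL_PASSWORD are invented here, and they would have to be set for the account that runs the scheduled task.

```python
import os
import sys


def read_credentials(env=os.environ):
    """Read portal URL, user name and password from environment variables.

    The variable names are made up for this sketch; use whatever suits you.
    """
    names = ("PORTAL_URL", "PORTAL_USER", "PORTAL_PASSWORD")
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[n] for n in names)


def connect(env=os.environ):
    """Log in to the portal, exiting with a readable message on failure."""
    # Imported here so read_credentials() can be tried without arcgis installed.
    from arcgis.gis import GIS

    url, username, password = read_credentials(env)
    try:
        return GIS(url, username, password)
    except Exception as e:
        print("Login failed:", e)
        sys.exit(1)
```

In the script you would then call gis = connect() instead of building the GIS object directly.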

jcarlson · ‎10-14-2022

I've got a handful of datasets that do this. Personally, I would avoid the overwrite method. Unless your schema is changing drastically, you don't need to do it.

You can easily accomplish this with truncate/append, or by using a source/destination comparison to identify rows which actually need updating. Here's a basic truncate/append:

from arcgis.gis import GIS
from arcgis.features import GeoAccessor
import pandas as pd

# connect to portal, get destination layer
gis = GIS(profile='your profile')
dest = gis.content.get('itemid').layers[0] # or whatever index it is

# load CSV as dataframe, convert to spatial dataframe
csv_df = pd.read_csv('your-file.csv')
# from_xy takes the x column (longitude) first, then the y column (latitude)
csv_sdf = GeoAccessor.from_xy(csv_df, 'lon', 'lat') # or whatever the fields are called for your coordinates

# truncate the destination layer
print(f'Truncating destination layer: {dest.manager.truncate()}')

# load new rows
adds = dest.edit_features(adds=csv_sdf.spatial.to_featureset())
print(f"Adds: {len(adds['addResults'])}")
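For the source/destination comparison mentioned above, here is a minimal sketch in plain Python. It assumes both sides have been read into lists of dicts and share a unique key column; the name poi_id is hypothetical. Only the fields present in the source row are compared, so extra fields on the destination side (such as OBJECTID) are ignored.

```python
def diff_rows(source_rows, dest_rows, key):
    """Split the work into rows to add, rows to update and rows to delete."""
    source = {row[key]: row for row in source_rows}
    dest = {row[key]: row for row in dest_rows}

    # In the source but not in the destination.
    adds = [row for k, row in source.items() if k not in dest]
    # In both, but at least one source field differs.
    updates = [
        row for k, row in source.items()
        if k in dest and any(dest[k].get(f) != v for f, v in row.items())
    ]
    # In the destination but no longer in the source.
    deletes = [row for k, row in dest.items() if k not in source]
    return adds, updates, deletes


csv_rows = [
    {"poi_id": 1, "name": "A"},
    {"poi_id": 2, "name": "B2"},
    {"poi_id": 4, "name": "D"},
]
layer_rows = [
    {"poi_id": 1, "name": "A", "OBJECTID": 10},
    {"poi_id": 2, "name": "B", "OBJECTID": 11},
    {"poi_id": 3, "name": "C", "OBJECTID": 12},
]
adds, updates, deletes = diff_rows(csv_rows, layer_rows, "poi_id")
```

Here row 4 is an add, row 2 is an update and row 3 is a delete; row 1 is left alone. The three lists would then go to edit_features as adds, updates and deletes, after being converted to features.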

By the way, if your situation permits, take a look at using persistent profiles. Makes automating this stuff nicer, as you don't need to store any credentials in plain text anywhere.

Edited post to add:

As the API docs note, submitting lots of rows through edit_features can itself be trouble. While the docs recommend using append, it's not a great alternative, as the source data needs more wrangling, and it never likes to work as simply and directly as edit_features. You could just use a while loop and pull rows from your source a couple hundred rows at a time and submit each batch separately. You can still populate a layer pretty quickly this way. One of our layers that goes through a truncate/append process this way has about 50k point features, and it fills in in 1-2 minutes with a not-great internet connection and sub-par processor.
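A rough sketch of that batching idea, using a for loop over slices rather than a while loop. The helper names chunked and add_in_batches are made up, the batch size of 200 is just a starting guess, and the code assumes edit_features returns a dict with an 'addResults' list whose entries carry a 'success' flag.

```python
def chunked(items, size):
    """Yield successive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_in_batches(layer, features, batch_size=200):
    """Submit features through layer.edit_features() one batch at a time.

    Returns the number of features the service reported as added.
    """
    added = 0
    for batch in chunked(features, batch_size):
        result = layer.edit_features(adds=batch)
        added += sum(1 for r in result.get("addResults", []) if r.get("success"))
    return added
```

With the truncate/append script above, the call would look something like add_in_batches(dest, csv_sdf.spatial.to_featureset().features).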

- Josh Carlson
Kendall County GIS

mdun · ‎10-17-2022

Thank you Josh for your help and hints. I don't understand why you would avoid the overwrite method, though. To me it seems the most straightforward, not only as a concept but also to code. Isn't it?