Mass data download from a portal site

03-15-2018 05:54 PM
forestknutsen1
MVP Regular Contributor

I've been asked to set up an automated mass download of a nearby borough's (kind of like a county) GIS data. They are hosting their data on Portal. For the other target boroughs we are using a Python web-scraping pattern to grab the data links off their HTML pages, but that does not look practical for the Portal page.

I was wondering if anyone has any suggestions on how to set this up.

One of my co-workers suggested that I look into Jupyter Notebook to list all of their content and then grab it. But I am guessing that I will not be able to use Jupyter with their Portal as an anonymous user.

JonathanQuinn
Esri Notable Contributor

Would collaboration make sense for you?

forestknutsen1
MVP Regular Contributor

Okay, that is very cool. I did not know about that at all (no surprise, as I know little about the Portal environment).

It may make sense for everyone in the long run, but it would require buy-in from a lot of people. We do have Portal set up on our end, so that part is in place. I will run it by everyone for sure.

For now I guess I can just grab a list of URLs by hand and use that to drive a Python download/import process. Are the zip URLs stable? This is a sample one:

https://opendata.arcgis.com/datasets/5881a8933a264ab98df6aface0b7a678_0.zip?outSR=%7B%22wkid%22%3A10... 
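Something like this sketch is what I have in mind (Python 2 to match the rest of the thread; the download folder and the URL list are placeholders, and it assumes the zip URLs stay stable):

import urllib, os

# hand-collected zip URLs (hypothetical list; the entry below is the sample above)
zip_urls = [
    'https://opendata.arcgis.com/datasets/5881a8933a264ab98df6aface0b7a678_0.zip',
]

out_dir = r'C:\temp\borough_downloads'  # hypothetical target folder
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

for url in zip_urls:
    # derive a file name from the URL, dropping any query string
    name = url.split('/')[-1].split('?')[0]
    urllib.urlretrieve(url, os.path.join(out_dir, name))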

It sure would be sweet if one could use an API to grab public data from Portal sites without having to set up a collaboration. After all, the organization is public, they have data that is public, and it's exposed to the public for anyone. Why would Esri not provide an efficient way for other organizations to grab the data? Just a thought....

JonathanQuinn
Esri Notable Contributor

You can do a lot with Python and the Sharing API. For example, use the urllib and json modules to search within the Sharing API for items based on tags, titles, or item types. From there, you'll get the item ID so you can construct the URL to download the item, and urlretrieve to actually download it:

# Python 2 example (urllib.urlopen/urlencode/urlretrieve); in Python 3 use
# urllib.request and urllib.parse instead
import urllib, json, traceback, os
from pprint import pprint

def openURL(url, params=None):
    """POST a request to the Sharing API and return the parsed JSON response."""
    try:
        request_params = {'f': 'pjson'}  # ask for a JSON response
        if params:
            request_params.update(params)
        encodedParams = urllib.urlencode(request_params)
        request = urllib.urlopen(url, encodedParams)
        response = request.read()
        return json.loads(response)
    except:
        print(traceback.format_exc())


sharingAPIURL = 'https://portal.domain.com/portal/sharing/rest'
searchURL = '{}/search'.format(sharingAPIURL)

# search for items matching the query
searchParams = {'q': 'Administrative_Assembly_Districts'}

response = openURL(searchURL, searchParams)

# use the first result's item ID to build the item URL
itemURL = '{}/content/items/{}'.format(sharingAPIURL, response['results'][0]['id'])

itemInfo = openURL(itemURL)
pprint(itemInfo)

# the /data endpoint returns the item's file; save it next to this script
downloadURL = '{}/data'.format(itemURL)
result = urllib.urlretrieve(downloadURL, os.path.join(os.path.dirname(__file__), '{}.zip'.format(itemInfo['title'])))
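
If you need to narrow things down, the q parameter also accepts fielded queries (the tag and type values here are just examples; see the Search reference below for the full syntax):

# hypothetical fielded query: shapefile items carrying a given tag
searchParams = {'q': 'tags:"boundaries" AND type:"Shapefile"'}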

Working with users, groups, and items—ArcGIS REST API: Users, groups, and content | ArcGIS for Devel... 

Search—ArcGIS REST API: Users, groups, and content | ArcGIS for Developers 

Search reference—ArcGIS REST API: Users, groups, and content | ArcGIS for Developers 

forestknutsen1
MVP Regular Contributor

Sweet! Thanks! What about an organization that hosts their data on ArcGIS Online's Open Data? Can I get at that with Python as well?

Matanuska-Susitna Borough Open Data 

JonathanQuinn
Esri Notable Contributor

Sure, you'll just need to know the requests to make and what to expect from the responses. For example, the following link returns JSON that contains the information for each item displayed when you go to ArcGIS Hub:

https://opendata.arcgis.com/api/v2/datasets?filter%5Bcontent%5D=spatial+dataset&filter%5Bcatalogs%5D... 

From here, parse the response for whatever information or data you need. Use your browser's developer tools or Fiddler to capture the network traffic so you know what requests need to be made.
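
As a rough sketch (the query parameter and the response fields here are assumptions, so verify the real shape with the dev tools/Fiddler capture described above):

# Python 2 sketch; the response layout is assumed, not confirmed - inspect it first
import urllib, json

url = 'https://opendata.arcgis.com/api/v2/datasets?q=assembly+districts'  # hypothetical query
response = json.loads(urllib.urlopen(url).read())

for dataset in response.get('data', []):
    # each record is assumed to expose an id plus a human-readable name
    attrs = dataset.get('attributes', {})
    print(dataset['id'], attrs.get('name'))
    # the id could then feed a .zip download URL like the sample earlier in the thread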
