Suggestions for monitoring ArcGIS Server logs

berniejconnors · ‎02-05-2024

Hello,

We are not in a position to purchase ArcGIS Monitor - we are considering other monitoring software options. In the meantime I am looking for suggestions on using python scripts to monitor ArcGIS Server.

I am working on a python script that will routinely query the ArcGIS Server logs for the most recent 1000 log messages. What patterns can I detect in the logs that will indicate a problem?

Our enviroment is open to the public and it runs on virtual machines (Windows Server 2016) in our own data centre. We do not use an enterprise geodatabase. All of our data is read-only and served from File Geodatabases - no editing is permitted. Our site typically receives 1,000,000 requests per day. We have 3 ArcGIS Servers sharing the load in a single site. Occassionally, one of the three servers will stop responding to requests and flood the logs with messages. The web adaptors don't detect the problem so the effect is that 1 out of 3 requests will fail. We often do not notice the problem until a user reports an error.

Currently, my script will look at the balance of the messages coming from the 3 servers. They should each contribute ~333 messages out of 1000. However, this varies by a few 100 messages. I know from examining the logs during past occurrences that one server will contribute nearly all 1000 messages when it is malfunctioning. I am planning to use 800 out of 1000 mesages generated by a single machine as a threshold to indicate there is a problem.

Another condition I am monitoring is the duration in seconds between the 1st log message and the 1000'th log message. Current monitoring shows the duration can range from 5000 to 70,000 seconds. During the two most recent occurrences the duration was reduced to 40 and 90 seconds. I am considering a threshold of 1000 seconds duration to indicate there might be a problem.

If either of these conditions are met email alerts will be sent. In my experience a simple stop / start of the misbehaving server will quickly return performance back to normal levels.

In your experience, what are some other conditions that can be detected in the ArcGIS Server logs?

Thanks,

Bernie.

P.S. Here are some smaple stats (in CSV) from the python script I am building:

DATE, TIME, DURATION, NODEA_MESSSAGES, NODEB_MESSSAGES, NODEC_MESSSAGES, MAX_DRAW_TIME, AVG_DRAW_TIME, MIN_DRAW_TIME, MAP_DRAWS
2024-02-01, 15:45, 9887, 326, 379, 295, 0.98988, 0.14948588888888886, 0.01422, 90
2024-02-01, 15:50, 8132, 295, 398, 307, 0.97084, 0.14198474137931036, 0.00576, 116
2024-02-01, 15:50, 10425, 463, 333, 204, 0.82525, 0.23391200000000012, 0.02613, 95
2024-02-01, 16:36, 16022, 351, 322, 327, 2.56564, 0.21938160919540237, 0.01393, 87
2024-02-01, 16:40, 9383, 378, 352, 270, 0.97073, 0.13698309734513278, 0.01551, 113
2024-02-01, 22:17, 17915, 269, 383, 348, 0.37303, 0.10854188235294118, 0.01472, 85
2024-02-01, 22:30, 26976, 221, 417, 362, 0.39142, 0.11690624999999995, 0.02177, 112
2024-02-01, 22:33, 14758, 254, 372, 374, 0.52373, 0.10919143939393938, 0.01306, 132
2024-02-01, 22:45, 26553, 251, 376, 373, 0.77659, 0.09833545454545461, 0.01493, 121
2024-02-01, 23:00, 51258, 216, 367, 417, 0.88067, 0.11203641509433963, 0.01489, 106
2024-02-01, 23:15, 72957, 236, 376, 388, 1.8946, 0.1751447115384615, 0.01602, 104
2024-02-01, 23:30, 62913, 258, 360, 382, 0.41343, 0.07835200000000005, 0.01341, 120

Brian_Wilson · ‎02-05-2024

If your server really stops serving then can't you hit it with a request and react when it fails to respond?

I run a script once every 10 minutes that hits a list of 70 URLs and sends me email if any of them fail. Once a day it lists out every test it runs (verbose mode) so that I know it's still running.

If you are dealing with the server slowing down rather than failing you could probably just set its timeout shorter.

I run the script from cron on a Linux machine but it should run in Windows too.

berniejconnors · ‎02-05-2024

Brian,

We have a similar ASP.NET page that we display in a simple ExB app that refreshes every 7 minutes and gives us a visual of map services that are failing. It's just a list of map services in a table. When a service fails it appears in red text in the list. Using a python script like you would probably be more useful. Can you share a copy of your script?

Bernie.

Brian_Wilson · ‎02-06-2024

But it's not beautiful code! It was written to test my servers so I wrote functions for each set of tests so that I could just run one test set at a time to test it out. It handles special cases for specific services this way. It's not generic code but works well for me. I am leaving all the server names in there because they are all public anyway.

"""
unittest.py

Test all services that are proxied by Varnish.
"""
import sys
import requests
from string import Template
import xml.etree.ElementTree as ET

# TODO: test each layer, both demo and wms
# TODO: allow testing JPEG type
# TODO: collect a checksum and use it to make sure the image returned is correct

verbose = False
tests = 0
errors = 0
msgBuffer = ''
errorBuffer = ''

if len(sys.argv)>1 and sys.argv[1] == '-v':
    verbose = True

# Some tests can't run when Varnish is not in use.
VARNISH = True
CHALLENGE_SERVER = False

# Type identifiers
HTML='text/html'
HTMLUTF='text/html; charset=utf-8'

# Server identifiers
MAPPROXY = 'Undefined'
MATOMO = 'Apache/2.4.53 (Debian)'

def addMessage(msg : str) -> None:
    global msgBuffer
    msgBuffer += msg + "\n"
    return

def addError(msg : str) -> None:
    global errorBuffer, errors
    errorBuffer += msg + "\n"
    errors += 1
    return

mapproxy_services = [
    {'service': "bulletin78_79", 'layers': ['astoria_quad', 'cathlamet_quad', 'cannonbeach_quad', 'svensen_quad'],
        'wms': 'bulletin78_79/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=$layer&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-13752738.810054881%2C5782328.275296917%2C-13726415.862597933%2C5794434.949956801',
    },
    {'service': "city-aerials", 'layers': ['astoria2021'],
        'wms': 'city-aerials/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=warrenton2011&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-30079294.132968683%2C-13834325.683412343%2C30079294.132968683%2C13834325.683412343',
    },
    {'service': "county-aerials", 'layers': ['osip2018'],
        'wms': 'county-aerials/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=osip2018&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-13787639.235713175%2C5807384.717645408%2C-13783675.71958535%2C5809207.651811394',
    },
    {'service': "county-aerials-brief", 'layers': ['osip2018'],
        'wms': 'county-aerials-brief/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=osip2018&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-13787495.573015662%2C5807181.223440446%2C-13783636.475556312%2C5808956.132481217',
    },
    {'service': "lidar", 'layers': ['digital_terrain_slope_model_layer', 'digital_surface_model_layer_hs', 'digital_terrain_model_layer_hs'],
        'wms': 'lidar/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=$layer&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-13800548.570441948%2C5751370.182628403%2C-13765310.402330594%2C5767577.221662774',
    },
    {'service': "usgs-nhd", 'layers': ['usgs_wbd_layer', 'usgs_nhd_hr_layer', 'usgs_nhd_layer'],
        'wms': 'usgs-nhd/service?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=true&LAYERS=$layer&CRS=EPSG%3A3857&STYLES=&WIDTH=1959&HEIGHT=901&BBOX=-13800548.570441948%2C5751370.182628403%2C-13765310.402330594%2C5767577.221662774',
    },
]

def testGET(url : str, expected_type='text/html') -> dict:
    global tests
    addMessage(url)
    tests += 1
    r = None
    try:
        r = requests.get(url)
        if r.status_code != 200:
            addError(f"FAIL: Status={r.status_code} {r.url}")
    except Exception as e:
        addError(f"FAIL: req_failed {url}")
    return r

def testGETtype(url: str, expected_type='text/html', retcode=200, expected_server=None, expected_x_service=None) -> dict:
    global tests
    tests += 1
    r = None
    addMessage(url)
    try:
        r = requests.get(url)
    except Exception as e:
        addError(f"FAIL: req_failed {url}")
    if r:
        if r.status_code != retcode:
            addError(f"FAIL: Status={r.status_code} {r.url}")

        #print(r.headers)
        t = r.headers['Content-type']

        if t.lower() != expected_type.lower():
            # When MapProxy can't find a tile it sends status 200 but
            # also back an XML doc with the error. If you only check status, it looks fine.
            if t.startswith('text/xml'):
                root = ET.fromstring(r.text)
        #        for n in root.iter():
        #            addError('n=',n)
                error = root.findall('{http://www.opengis.net/ogc}ServiceException')[0]
                addError(f"FAIL: Service Exception: \"{error.text}\"")
            else:
                addError(f"FAIL: Content type is wrong. expected {expected_type} but got {t} on {url}")

        if expected_server:
            try:
                s = r.headers['Server']
            except KeyError:
                # Mapproxy does not return a server string!
                s = 'Undefined'
            if s != expected_server:
                # The perils of proxies, sometimes the proxy points the wrong
                # direction and the response code (200) lulls us into a sense of security.
                addError(f'FAIL: wrong server? {url} Expected:"{expected_server}\" but got "{s}"')

        if expected_x_service:
            try:
                s = r.headers['X-SERVICE']
            except KeyError:
                # Only services we've modified have this header.
                s = 'Undefined'
            if s != expected_x_service:
                # The perils of proxies, sometimes the proxy points the wrong
                # direction and the response code (200) lulls us into a sense of security.
                addError(f'FAIL: wrong server? Expected:"{expected_x_service}\" but got "{s}"')
    return r


def test_web_apps(server):
    web_apps_services = [
        "/apps/ClatsopCounty/",
        "/apps/ATApp/",
        "/apps/PlanningApp/"
    ]
    for s in web_apps_services:
        url = server + s
        r = testGET(url, expected_type=HTML)


def test_dev_apps() -> None:
    return
    server = 'cc-testmaps.clatsop.co.clatsop.or.us'
    testGETtype(f"http://{server}:3344/", retcode=200, expected_type=HTMLUTF) # WABDE

    # service is running, don't know why it fails here but right now I don't care
#    testGETtype(f"http://{server}:3000/", retcode=301, expected_type='') # EXB
#    testGETtype(f"https://{server}:3001/", retcode=301, expected_type='') # EXB
    return

def test_mapproxy(server) -> None:

    # Test all the Get Capabilities docs first,
    # this confirms that each service is running.
    for s in mapproxy_services:
        testGET(f"{server}/{s['service']}/wms?REQUEST=GETCAPABILITIES")

    for s in mapproxy_services:
        for layer in s['layers']:

       # The demo mode pages are not required.
#            url = f"{server}/{s['service']}/demo/"
#            testGETtype(url, expected_type=HTML)

            route = Template(s['wms']).substitute(layer=layer)
            url = f"{server}/{route}"
            if url:
                testGETtype(url, expected_type='image/png')
    return


def test_property_photos(server) -> None:
    property_photos = [

        # Thumbnails
        ("property/api/?tn=1&t=80907DB13400", 'image/jpeg'),
        ('property/api/?tn=1&t=80918DA06900', 'image/jpeg'),
        ('property/api/?tn=1&f=url&t=80822B002100', 'image/png'),
        ('property/api/?tn=1&t=80822B002100', 'image/png'),

        ('property/api/?tn=1&t=80918DA06900', 'image/jpeg'),
        ('property/api/?tn=1&f=url&t=80918DA06900', 'image/jpeg'),

        ('property/api/?f=url&t=80822B002100', 'image/png'),
        ('property/api/?t=80822B002100', 'image/png'),
        ('property/api/?t=80918DA06900', 'image/jpeg'),

        ('property/api/?f=url&t=80918DA06900', 'image/jpeg'),

        ('property/api/?t=80822B002100', 'image/png'),
        ('property/api/?t=80918DA06900', 'image/jpeg'),
    ]
    for ph in property_photos:
        url = f"{server}/{ph[0]}"
        r = testGETtype(url, ph[1])

    return


def test_survey_docs(server: str) -> None :
    url = server + '/surveys/10000-10999/CS%2010425B.pdf'
    testGETtype(url, 'application/pdf')

    url = server + '/surveys/10000-10999/CS%2010425C.pdf'
    testGETtype(url, 'application/pdf')

    return

def test_septic_documents(server: str) -> None :
    url = server + '/septic/41007B000101'
    testGETtype(url, 'application/json')

    return

def test_photos(server: str) -> None :
    url = server + '/'
    testGETtype(url, 'text/html') # photo landing page
    photos = [
        ('/photos/bridges/1002A.jpg', 'image/jpeg'),
        ('/photos/tn/bridges/1002A.jpg', 'image/jpeg'),

        ('/photos/tn/waterway/ph5428.png', 'image/png'),
        ('/photos/waterway/ph5428.png', 'image/png'),

        # These short urls redirect to the long ones
        ('/photos/tn/waterway/5428', 'image/png'),
        ('/photos/waterway/5428', 'image/png'),
        ('/photos/tn/waterway/5372', 'image/png'),
        ('/photos/waterway/5372', 'image/png'),
    ]
    for p in photos:
        url = server + p[0]
        r = testGETtype(url, p[1], retcode=200)
    return


if __name__ == "__main__":

    port = ':444' # testing on port 444
    port = ''
    if VARNISH:
        this_server = f'https://giscache.clatsopcounty.gov{port}' # Varnish
    else:
        this_server = f'https://giscache.co.clatsop.or.us{port}' # Caddy

    test_web_apps('https://delta.co.clatsop.or.us')

    # The challenge server only needs to be accessible via HTTP
    if CHALLENGE_SERVER:
        file = "/.well-known/acme-challenge/test.html"
        s = f'http://giscache.co.clatsop.or.us{port}'
        r = testGETtype(s + file, expected_type=HTML, expected_x_service='ACME')
        s = f'http://giscache.clatsopcounty.gov{port}'
        r = testGETtype(s + file, expected_type=HTML, expected_x_service='ACME')

    # The basic web server that serves the landing page
    ws = this_server
    testGETtype(ws + '/', expected_type=HTML)
    ws = f'https://giscache.co.clatsop.or.us{port}'
    testGETtype(ws + '/', expected_type=HTML)

    test_survey_docs("https://delta.co.clatsop.or.us")

    mapproxy_caches = ["https://giscache.co.clatsop.or.us"]
    if VARNISH:
        mapproxy_caches.append("https://giscache.clatsopcounty.gov")
    for server in mapproxy_caches:
        test_mapproxy(server) # Tests map image delivery

    # TODO - I should test actual databases here, not just the server
    mapproxy_couchdb = 'http://cc-giscache:5984/_utils'
    r = testGET(mapproxy_couchdb, expected_type=HTML)

    photo_servers = ['https://giscache.co.clatsop.or.us']
    if VARNISH:
        photo_servers.append("https://giscache.clatsopcounty.gov")
    for server in [this_server]:
        test_photos(server) # Tests www content delivery

    # These are on a totally different server, but we need to test them anyway.
    test_property_photos('https://apps.clatsopcounty.gov')

    # Matomo
    testGETtype(f'https://echo.clatsopcounty.gov{port}/', 'text/html; charset=utf-8')#, expected_server=MATOMO)
    testGETtype(f'https://echo.clatsopcounty.gov{port}/matomo.js', 'application/javascript')#, expected_server=MATOMO)
    # I don't care about the old URL anymore.
#    if VARNISH:
#        # These only work when Varnish is running.
#        testGETtype(f'https://echo.co.clatsop.or.us{port}/', 'text/html; charset=utf-8', expected_server=MATOMO)
#        testGETtype(f'https://echo.co.clatsop.or.us{port}/matomo.js', 'application/javascript', expected_server=MATOMO)

    # I don't care if these are proxied right now. Work on it later.
    test_dev_apps()

    # Test the weird "property api" microservice for Septic Documents
    test_septic_documents('http://giscache.clatsopcounty.gov:5002')

    # === TESTS ARE DONE, NOW REPORT ===

    # ALWAYS report errors
    if errors>0:
        print(f"SKY IS FALLING! Error count: {errors}")
        print("");
        print(errorBuffer)
        print("\n")
        passfail = f"\nFAILED {errors}.\n"
    else:
        passfail = f"\nPASSED {tests} tests.\n"

    # ONLY report status if requested
    if verbose:
        print(passfail, msgBuffer, passfail)

    exit(errors)

This is the crontab entry, if you use the Windows scheduler, that's on you. I do the least amount of work for me. 🙂

# Test everything, once a day, and tell me about it.
50 6 * * * cd $HOME/docker/www && python3 unittest.py -v
#
# Test everything all day long (business hours, weekdays)
*/20 7-17 * * 1-5 cd $HOME/docker/www && python3 unittest.py

The script started out as a unit test when I was writing tests for a generic web server and the name stuck. Hence "unittest.py"

Another way to do this would be to write a script that sleeps for 10 minutes and just have it loop. Then you can just run it once instead of using a scheduler.

Brian_Wilson · ‎02-06-2024

For people who don't know, if you run a script in crontab anything it generates as output will be emailed to the owner of the crontab account. If you do it some other way you'd need to generate output to email or a pager or something.

I said "pager" 🙂 but you know. I had it go to Telegram for a while so that output would appear on my phone.

berniejconnors · ‎02-06-2024

Thanks @Brian_Wilson ! Hopefully some more people will add suggestions like yours.

Bernie.

Brian_Wilson · ‎02-07-2024

On Linux systems the normal practice is to tail log files, you read the entire file until the process blocks. Then whenever new entries are written the process unblocks and reads only the new input. This works on Windows too, but Windows people like to create new files with new names periodically so it's much harder. I can't understand Windows people. They make simple things so hard sometimes for no apparent reason. You end up watching the wrong file.

Now ask me what I think of Powershell! 🙂 🙂 🙂

I think it's mostly just NIH syndrome.

I use a bash shell and the "tail -f" command all the time to watch logs for example on Portal when restarting it so that I can see what's going on. I hate staring at a blank screen wondering if anything has failed.

Brian_Wilson · ‎02-07-2024

I should say "I use a bash shell on WINDOWS" to make it clear. Usually I install "git for windows" and it comes with a nice bash shell and most of the normal tools.