
Application Performance Monitoring for Complex ETLs

Blog Post created by aaron.lee on Jun 2, 2017

New Relic's APM (Application Performance Monitoring) Python agent is a powerful tool for monitoring and instrumenting the performance of your complex GIS ETL processes and custom Python-based services. Deploying APM to your existing Python processes is easy, and even easier if your code is already organized into functions, because the agent instruments function execution via a Python decorator.

 

It's important to understand that short processes, those lasting a minute or less, are not ideal candidates for instrumentation. Think instead of your large, mission-critical data sync jobs that cycle services, copy data, add columns, perform spatial analysis, calculate fields, and so on: the ones that tend to run for several minutes and execute multiple steps. With instrumentation we gain insights beyond what we write to log files. We can see how resource-intensive each step is, the response times of the service calls our ETLs make, and which services are slowing our ETLs down, and we can monitor performance before and after changes to code, data, or services.

 

Check out some sample executions, and then I'll explain how to get the Python agent installed and instrumenting.

DISCLAIMER:  Things like corporate firewalls, security settings, and proxy servers can make configuring any tools difficult.  The steps outlined below assume those items have been negotiated already.

 

Assumptions:  You are running your ETLs on a Microsoft Windows machine. The New Relic agent also runs natively on Linux, but the steps below are for Windows.

 

1.  New Relic Account

To get started instrumenting your own processes, you'll first need a New Relic account.  You can sign up for one at https://newrelic.com/.  

 

2.  Install "pip"
Next, you'll need the Python package manager "pip" installed on the machine that you wish to run the instrumented ETL from.  You can learn more about installing pip here:  https://pip.pypa.io/en/stable/installing/.

 

Tip:  If you have multiple Python installations or do not have Python's directory included in your path, you may have to call Python explicitly from the location where the ArcGIS installer placed it, e.g. "c:\Python27\ArcGIS10.4\python.exe c:\path\get-pip.py"

 

3.  Install the New Relic Python Agent

With pip installed, we need to install the New Relic agent. As with installing pip, run some variation of the following from a command line:

 

python -m pip install newrelic

or

c:\Python27\ArcGIS10.4\python.exe -m pip install newrelic

 

Tip:  On Windows, the pip command is often not on the path, so we pass the "-m" parameter to have Python run the pip module directly. On Linux the pip command is usually available by itself, although "python -m pip" works there too.
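
Because a machine with ArcGIS often has more than one Python installation, it can help to confirm which interpreter is running and whether the agent actually landed in it. The following is a minimal sketch using only the standard library. The helper names pip_install_command and module_available are hypothetical and are not part of pip or New Relic.

```python
import importlib
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    # sys.executable is the full path of the current python.exe, so the
    # package is installed into this interpreter and not another one.
    return [sys.executable, "-m", "pip", "install", package]


def module_available(name):
    """Return True if `name` can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("Interpreter: " + sys.executable)
    print("Install with: " + " ".join(pip_install_command("newrelic")))
    print("newrelic importable: " + str(module_available("newrelic")))
```

Run it with the same python.exe that runs your ETL. If the last line reports False, the agent was most likely installed into a different Python installation.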

 

4.  Create a New Relic agent configuration file

Your agent will need a configuration file created for each individual ETL process.  The configuration contains settings such as your license key, application name, logging settings, and proxy settings.

 

Here is a basic ini file to get you started:

[newrelic]
license_key = <your key goes here>
app_name = <your ETL name goes here>
monitor_mode = true
log_file = C:\PATH\newrelic.log
log_level = error
ssl = true
high_security = false
#proxy_scheme = http
#proxy_host =
#proxy_port =
#proxy_user =
transaction_tracer.enabled = true
transaction_tracer.transaction_threshold = apdex_f
transaction_tracer.record_sql = obfuscated
transaction_tracer.stack_trace_threshold = 0.5
transaction_tracer.explain_enabled = true
transaction_tracer.explain_threshold = 0.5
transaction_tracer.function_trace =
error_collector.enabled = true
error_collector.ignore_errors =
browser_monitoring.auto_instrument = true
thread_profiler.enabled = true
[newrelic:development]
monitor_mode = false
[newrelic:test]
monitor_mode = false
[newrelic:staging]
monitor_mode = true
[newrelic:production]
monitor_mode = true

 

Place the ini somewhere central to your ETL scripts as we'll need to reference it from the code.

 

The important parts are the license key, which you'll get from New Relic, and the app name, which you choose. I included the proxy settings, commented out, in case you have that obstacle too.

 

For more information on the agent configuration file, see the New Relic documentation.
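
Since each ETL process gets its own configuration file, you may prefer to generate the files rather than copy and edit them by hand. This is a rough sketch that writes only a handful of the settings from the ini above, using Python's standard configparser module. The function name write_agent_config and its parameters are my own invention for illustration, so extend the settings list to match your needs.

```python
try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2.7


def write_agent_config(path, license_key, app_name, log_file):
    """Write a minimal [newrelic] section for one ETL process."""
    # RawConfigParser does no value interpolation, so Windows paths
    # and other special characters are written exactly as given.
    parser = configparser.RawConfigParser()
    parser.add_section("newrelic")
    settings = [
        ("license_key", license_key),
        ("app_name", app_name),
        ("monitor_mode", "true"),
        ("log_file", log_file),
        ("log_level", "error"),
    ]
    for option, value in settings:
        parser.set("newrelic", option, value)
    with open(path, "w") as handle:
        parser.write(handle)
    return path
```

A call such as write_agent_config(r"C:\PATH\nightly_sync.ini", "<your key goes here>", "Nightly Sync", r"C:\PATH\nightly_sync.log") would produce one file per ETL. The file name and app name here are placeholders.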

 

5.  Add the agent to your ETL script

Adding the agent is as simple as importing and initializing the module.

 

Import:

import newrelic.agent

 

Initialize:

newrelic.agent.initialize('C:\\PATH\\newrelic.ini')
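
Windows paths are easy to get wrong in Python because the backslash is an escape character. The sketch below shows the raw-string spelling of the same path and adds a small existence check so a mistyped path fails with a clear message. The helper name require_config is hypothetical, and the check is my own suggestion, not something the agent requires.

```python
import os

# A raw string avoids doubling every backslash in a Windows path.
# This is the same value as 'C:\\PATH\\newrelic.ini'.
CONFIG_FILE = r"C:\PATH\newrelic.ini"


def require_config(path):
    """Return `path` if the ini file exists, otherwise raise IOError."""
    if not os.path.isfile(path):
        raise IOError("New Relic config file not found: " + path)
    return path
```

You could then initialize with newrelic.agent.initialize(require_config(CONFIG_FILE)).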

 

6.  Add the Python decorator to your functions

With the agent initialized, we need to identify the functions we want to measure the performance of.  We do so by wrapping the function in a Python decorator.

 

The decorator code looks like:

@newrelic.agent.background_task()

 

Adding it to a function is as easy as placing it on the line directly before your function definition.  See the following example.

 

Without:

def setworkspace():
    # Set the workspace
    env.workspace = "C:/Temp/GISData.gdb"
    # Set a variable for the workspace
    workspace = env.workspace
    outputGDB = "C:/Temp/OutputData.gdb"
    print "Workspace set"

 

With:

@newrelic.agent.background_task()
def setworkspace():
    # Set the workspace
    env.workspace = "C:/Temp/GISData.gdb"
    # Set a variable for the workspace
    workspace = env.workspace
    outputGDB = "C:/Temp/OutputData.gdb"
    print "Workspace set"
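
If decorators are new to you, the following toy example may help show why one extra line is enough to measure a function. It is not New Relic's implementation. It is a simplified stand-in that only records how long each call took, and the names timed, TIMINGS, and add_columns are made up for this illustration. Note that this toy decorator is applied without parentheses, unlike the agent's background_task().

```python
import functools
import time

# Collected (function name, seconds) pairs; a stand-in for what an APM
# agent would report to its backend.
TIMINGS = []


def timed(func):
    """Record the wall-clock duration of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even when func raises, so failed steps are timed too.
            TIMINGS.append((func.__name__, time.time() - start))
    return wrapper


@timed
def add_columns(count):
    # Hypothetical ETL step used only for illustration.
    return ["field_%d" % i for i in range(count)]
```

The body of add_columns is untouched. The decorator replaces the function with a wrapper that does its bookkeeping before and after the original runs, which is the same general idea the agent's decorator relies on.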

 

Here is a complete sample with basic functions to demonstrate how it works.  (I know, it's quick and ugly.)

 

# Instrumentation test
# New Relic Python ETL Test

import newrelic.agent
import os
import shutil

import arcpy
import arcpy.da
from arcpy import env

# Output geodatabase, defined at module level so copydata() can see it
outputGDB = "C:/Temp/GISOutput.gdb"

@newrelic.agent.background_task()
def copyfiles(src, dst):
    # Copy the source geodatabase folder; dst must not already exist
    shutil.copytree(src, dst)

@newrelic.agent.background_task()
def setworkspace():
    # Set the workspace
    env.workspace = "C:/Temp/GISData2.gdb"

@newrelic.agent.background_task()
def setenvironment():
    # Allow geoprocessing tools to overwrite existing outputs
    arcpy.env.overwriteOutput = True

@newrelic.agent.background_task()
def copydata():
    # Function grabbed here https://community.esri.com/thread/168802
    for gdb, datasets, features in arcpy.da.Walk(env.workspace):
        for dataset in datasets:
            for feature in arcpy.ListFeatureClasses("*", "POLYLINE", dataset):
                arcpy.CopyFeatures_management(feature, os.path.join(outputGDB, dataset))

if __name__ == "__main__":
    newrelic.agent.initialize('C:\\PATH\\newrelic.ini')
    copyfiles('C:/Temp/GISData1.gdb', 'C:/Temp/GISData2.gdb')
    setworkspace()
    setenvironment()
    copydata()

 

Check out my other how-to articles in this series showing you how to:

 

Instrument ArcGIS Server

Application Performance Monitoring for Complex ETLs

Instrument Portal for ArcGIS Server

Instrument ArcGIS Data Store
