Data lost when using the "Add a feature" output

08-16-2013 09:13 AM
MarkBramer
Occasional Contributor II
GeoEventers,

A colleague and I are seeing behavior where messages seem to be getting "dropped" when using the "Add a feature" output.  Whether that's what's really going on is unknown, as we don't know where the "droppage" happens (GEP not sending all data to AGS? AGS not committing all data sent to it? SQL Server? etc.).  We do not see any errors in either GEP's logs or ArcGIS Server's logs.  Here are the details:

We're using a TCP/text input to receive CSV data.  The geometry in the messages is POLYGON data in JSON format.  We have one GeoEvent Service with this input that flows through a filter to allow only one message type through, then outputs to a feature service.  We have another service that's identical, but with a processor between the filter and the output.  The feature service is on an ArcGIS Server instance on the same machine, and it points to a feature layer hosted in SQL Server 2008 R2 (64-bit) on a different machine.  The layer is POLYGON, not point - very important to note (as I suspect 99.99% of GEP instances are working with points).
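
For illustration, the geometry field carries Esri polygon JSON shaped roughly like the sample below (the coordinates and spatial reference here are made up, not our actual data):

```
{
  "rings": [[[-77.05, 38.87], [-77.04, 38.87], [-77.04, 38.88], [-77.05, 38.87]]],
  "spatialReference": {"wkid": 4326}
}
```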

We are using the simulator packaged with GEP to send the same record to GEP over and over, just to test throughput.  The message definition is pretty ordinary - about 20 fields, all string except for a date field and a geometry field.  One of the string fields is tagged TrackID, the date field is tagged with both TIME_START and TIME_END, and the geometry field is tagged GEOMETRY.

For each test, we truncated the SQL Server table, reset the statistics on GEP Manager's Monitor page, and started the simulator(s).  In the table below I list how many simulators were used and at what rate messages were sent ("Simulator Settings"), the data flow ("GEP Settings"), how many features Monitor displayed as processed ("GEP Monitor"), the number of records written to the database after the simulators were stopped ("DB Records"), and the number of messages not successfully written to the database table ("LOSS" - the difference between GEP Monitor and DB Records).  Incidentally, Monitor reported the same number of messages processed for the TCP input, the GeoEvent Service, and the feature service output, so the "GEP Monitor" number is the same for the input count, the service in/out counts, and the output count.
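
(In case anyone wants to reproduce the count check without direct SQL access: you can also ask the feature service itself for its row count through the standard ArcGIS REST API query endpoint.  A minimal sketch - the layer URL is a placeholder, not our real service:)

```python
# Count features via the layer's REST query endpoint instead of SQL.
# returnCountOnly is a standard ArcGIS REST API query parameter;
# the layer URL below is a placeholder.
import json
import urllib.parse
import urllib.request

LAYER = "http://gisserver.example.com:6080/arcgis/rest/services/Test/FeatureServer/0"

params = urllib.parse.urlencode({"where": "1=1", "returnCountOnly": "true", "f": "json"})
with urllib.request.urlopen(f"{LAYER}/query?{params}") as resp:
    print(json.load(resp)["count"])  # compare this against the GEP Monitor count
```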

Here are the specs and results on each test:


| Simulator Settings | GEP Settings | GEP Monitor | DB Records | LOSS |
|---|---|---|---|---|
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 1 sec insert | 2070 | 2065 | 5 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 1 sec insert | 2690 | 2678 | 12 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 1 sec insert | 3395 | 3386 | 9 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 3 sec insert | 2125 | 2123 | 2 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 3 sec insert | 2655 | 2651 | 4 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 3 sec insert | 3475 | 3469 | 6 |
| 2 simulators, 5 msgs/sec | tcp->filter->processor->fs output, 6 sec insert | 2045 | 2040 | 5 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 6 sec insert | 2225 | 2220 | 5 |
| 2 simulators, 5 msgs/sec | tcp->filter->fs output, 6 sec insert | 2690 | 2688 | 2 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 12 sec insert | 2250 | 2235 | 15 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 12 sec insert | 2425 | 2419 | 6 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 3 sec insert | 2300 | 2293 | 7 |
| 1 simulator, 25 msgs/sec | tcp->filter->fs output, 1 sec insert | 2075 | 2065 | 10 |

Based on these numbers, there's no easily noticeable pattern.  One might deduce that a 3 to 6 second feature service update interval is best, but (a) we can't say that definitively, and more importantly, (b) we're still losing records in those scenarios.  As far as GEP itself is concerned, based on what Monitor reports, it has received, processed, and output everything just fine.

So...

Is there a definitive way to confirm GEP is successfully receiving/processing/outputting everything, other than what Monitor is reporting? 
Or is Monitor a very trustworthy tool for determining that GEP has received/processed/output everything sent to it? 
What ways exist to track down where the message "dropping" is occurring (especially given there are no errors in GEP or AGS logs)?
Is it possible that GEP is successfully passing on everything, but ArcGIS Server is not successfully writing all records?
Could this be a SQL Server thing?
What other places in the entire cosmos of this data flow (not just within GEP) are potential suspects?
Is there any flaw in our methodology that could explain the discrepancy between what GEP reports and what's written to the DB?
6 Replies
RJSunderman
Esri Regular Contributor
Mark --

I can comment on a couple of the questions you asked in your post - but perhaps we should schedule a follow-up meeting to look in more depth at what you're seeing.

The Monitor page in GeoEvent Processor Manager is your first indication of what GEP has successfully ingested through its input component, processed in/out of the GeoEvent Service, and output through its output component. If there are no errors in the logs, and all of the counts match on the Monitor page, you are usually assured that all of the received event data is being processed and output.

Some exceptions:

  • The SMS output connector ("Send a Text Message") successfully hands its message to the SMS carrier, but the carrier rejects it. You won't see any indication in GEP that the event data never reached the intended recipient.


  • Similarly, when updating a published feature service: if ArcGIS for Server is unable to handle the request to update the feature service, you might find error messages in the ArcGIS for Server log - but you won't see any indication in the GEP Monitor that features are not being written. You might see error messages in the GEP log. (I recently had to work around an RSS provider which had HTML in one of their tags that ArcGIS for Server wouldn't accept; GEP was posting the JSON without error, but the features were not being added to the target feature service.)

So ... I'd say that GEP's Monitor is trustworthy and reliable - but not definitive. You would need to monitor both the GEP logs and the AGS logs if you suspect that event data is being dropped. Since you are not seeing anything in the GEP logs, you might try the following recommendations (passed along from the Server team):

  • From the Server Manager, take a look at each feature service's configuration. On the "Pooling" tab there are settings for "The maximum time a client can use a service" and "The maximum time a client will wait to get a service" ... reduce these from their defaults down to 5 seconds.


Reducing these configurable settings on your target feature service(s) should reduce the server's queue sizes and help it more efficiently handle the continuous stream of requests coming from GeoEvent Processor, which can stress Server by sending more requests than it can process. These settings should NOT be used if you plan on using the REST interface to clean out (delete) features. For example, if you accumulate several tens of thousands of rows, it will take longer than 5 seconds to delete the features through the REST interface, and such severe time limits on how long a service can be used will force a timeout before the operation completes.

/RJ
RyanElliott
New Contributor III
Is there a definitive way to confirm GEP is successfully receiving/processing/outputting everything, other than what Monitor is reporting?
  The Monitor counts and the log files are the only ways for you to get information out of the GEP system at 10.2.  The one exception is if you develop your own Adapter/Transport/Processor, in which case you can connect a debugger and step through your own code.  Note that setting the log level to DEBUG for specific components within GEP is possible.


Or is Monitor a very trustworthy tool for determining that GEP has received/processed/output everything sent to it?
  I would say that the Monitor page is very useful for determining whether a GeoEvent got dropped by the Input, Service, or Output.  In your case, it seems pretty clear that the messages are being dropped in the Output or in the transaction between the Output and the feature service.  But the Monitor page doesn't give you details on what takes place inside the Output.  The problem with getting much more from Monitor here is that it is generic to all Outputs, and you need details on the inner workings of a specific Adapter & Transport.


What ways exist to track down where the message "dropping" is occurring (especially given there are no errors in GEP or AGS logs)?
  There are four possible culprits I can identify from a GEP perspective: the Adapter, the Transport, the Output, and the Feature Service (I'm lumping everything downstream of the feature service, like SQL Server, in with the Feature Service).  As mentioned, there's no definitive way for you to determine whether messages get passed successfully from Adapter to Transport, but you do have a way to intercept the data that goes from the Transport to the Feature Service: all of it is transferred over HTTP, and can be captured with Wireshark.  However, this would take a lot of patience, as you would have to figure out which message got dropped, and it sounds like you are only losing about 0.1% of your data - that's a lot of traffic to sift through looking for one lost GeoEvent.  You can also set the log level of your Transport to DEBUG and see the content of every feature sent to the feature service, and the server's response to each of those inserts.  But that's still a lot of data to sift through.
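
  A middle ground (my own suggestion, not a built-in GEP facility) is to slip a tiny counting proxy between the Output and the feature service, so you only compare totals instead of hunting for individual messages.  A rough sketch, assuming the default form-encoded posts; the host, port, and upstream URL are placeholders:

```python
# Minimal counting reverse proxy (a sketch, not production code).
# Point the GEP feature service output at http://localhost:8088/... and this
# forwards each POST to the real server while tallying addFeatures payloads.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

UPSTREAM = "http://gisserver.example.com:6080"  # placeholder ArcGIS Server
total = 0

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        global total
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # addFeatures requests are form-encoded with a 'features' JSON array
        params = parse_qs(body.decode("utf-8", errors="replace"))
        if self.path.endswith("/addFeatures") and "features" in params:
            total += len(json.loads(params["features"][0]))
            print(f"addFeatures batch received; running total = {total}")
        # Forward the request unchanged and relay the server's response.
        req = urllib.request.Request(
            UPSTREAM + self.path, data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "")})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Type", resp.headers.get("Content-Type", ""))
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

HTTPServer(("localhost", 8088), Proxy).serve_forever()
```

  Comparing the proxy's running total against the Monitor count and the DB row count should tell you which hop is losing data.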


Is it possible that GEP is successfully passing on everything, but ArcGIS Server is not successfully writing all records?
  That is possible.  I would look at the memory and CPU usage of the GEP and ArcGIS Server processes on your machine to determine whether one of them is behaving abnormally.  The ArcGIS Server logs may contain something useful as well.


Could this be a SQL Server thing?
  This seems possible as well, but I wouldn't know how to systematically diagnose an error here.


What other places in the entire cosmos of this data flow (not just within GEP) are potential suspects?
  That's difficult to say from the GEP perspective.  There may be settings used when publishing the feature service that affect its performance, as well as a plethora of database settings, network settings, etc.


Is there any flaw in our methodology that could explain the discrepancy between what GEP reports and what's written to the DB?
  The only possible logical error I can think of would be if you were using the "Update Feature Service" connector.  That connector will only update each track ID once per buffer interval.  For example, if you set it to buffer GeoEvents for 3 seconds before sending an update to the feature service, then send in two GeoEvents with the same track ID, only one of those GeoEvents will get written to the feature service.
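
  To make that concrete, the buffering behaves roughly like the sketch below (my paraphrase of the behavior, not actual GEP code): within one flush interval the buffer is keyed by track ID, so a later event for the same track replaces the earlier one.

```python
# Sketch of per-interval buffering keyed by track ID (illustrative only).
buffer = {}

def on_geoevent(event):
    # Same track id within one interval -> the earlier event is replaced.
    buffer[event["track_id"]] = event

def flush():
    # Called once per buffer interval by the output; one update per track id.
    updates = list(buffer.values())
    buffer.clear()
    return updates

on_geoevent({"track_id": "A", "speed": 10})
on_geoevent({"track_id": "A", "speed": 12})  # overwrites the first event
assert len(flush()) == 1  # only one of the two GeoEvents reaches the service
```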
MarkBramer
Occasional Contributor II
Ok, so we changed the settings for those two ArcGIS Server properties ("The maximum time a client can use a service" and "The maximum time a client will wait to get a service") to 5 seconds, re-ran some of the same tests and got the following:


| Simulator Settings | GEP Settings | GEP Monitor | DB Records | LOSS |
|---|---|---|---|---|
| 1 simulator, 25 msgs/sec | tcp->filter->fs output, 1 sec insert | 2075 | 2068 | 7 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 3 sec insert | 2300 | 2277 | 23 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 12 sec insert | 2400 | 2386 | 14 |
| 2 simulators, 25 msgs/sec | tcp->filter->fs output, 6 sec insert | 4700 | 4664 | 36 |


So the change appears to have actually worsened the loss.  We also bumped the max number of instances for the service from 2 to 8, but that doesn't appear to have helped.

No new info/errors in either the GEP or AGS logs.

I guess I will keep trying to hunt down the culprit.  Any more input gratefully accepted.

Mark
JordanGottlieb
New Contributor
I'm having the same issue with data generated by the simulator being 'lost' in transport, TCP in -> Feature Service out.
RJSunderman
Esri Regular Contributor
Hello Jordan -

How many events per second is the GeoEvent Simulator you are using sending to GeoEvent Processor? Are you using the Update a Feature or the Add a feature outbound connector to update the feature service?

Updating features can be expensive if the target feature service contains many thousands of features. This is because GeoEvent Processor must first identify the feature to be updated: it queries the layer for the feature's unique identifier (effectively a full table scan) to obtain the ObjectID of the table row before it can POST the update.
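
To make that cost concrete, an update amounts to two REST calls like the sketch below (the layer URL and field names are placeholders; the query and updateFeatures endpoints are standard ArcGIS REST API):

```python
# Two-step update flow: query for the OBJECTID, then POST the update.
import json
import urllib.parse
import urllib.request

LAYER = "http://gisserver.example.com:6080/arcgis/rest/services/Flights/FeatureServer/0"

def post(url, params):
    data = urllib.parse.urlencode(params).encode()
    with urllib.request.urlopen(url, data=data) as resp:
        return json.load(resp)

# Step 1: map the track's unique identifier to its OBJECTID. On a large,
# unindexed table this query is the expensive scan described above.
found = post(LAYER + "/query", {
    "where": "FlightNumber IN ('SWA1568')",  # placeholder TRACK_ID field
    "outFields": "OBJECTID,FlightNumber",
    "f": "json",
})
oid = found["features"][0]["attributes"]["OBJECTID"]

# Step 2: POST the actual update now that the OBJECTID is known.
result = post(LAYER + "/updateFeatures", {
    "features": json.dumps([{"attributes": {"OBJECTID": oid, "Status": "Landed"}}]),
    "f": "json",
})
print(result)
```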

There are a couple of things you might try to help narrow the scope of the problem:

  • Reduce the number of events being sent to see if the problem is with data rate.

  • Switch to use the Add a feature outbound connector to see if the problem is with adding records vs. updating records.

  • Specify a smaller number (50, 10, or even 1) for the outbound connector's Maximum Features Per Transaction parameter to see if updating the feature service with smaller batches helps (see the sketch after this list).

  • Check to make sure that there are no errors with the event data, such as a string which exceeds the maximum length allowed by the feature service or an integer value being sent which exceeds the range of the feature service's target field.
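
Here is the batching sketch referenced above - a pure illustration (not GEP code) of what a smaller Maximum Features Per Transaction means on the wire: the same events split into more, smaller addFeatures posts.

```python
# Illustration: splitting buffered features into per-transaction batches.
def batches(features, max_per_txn):
    for i in range(0, len(features), max_per_txn):
        yield features[i:i + max_per_txn]

events = list(range(1200))
print([len(b) for b in batches(events, 500)])  # [500, 500, 200] default-sized posts
print(len(list(batches(events, 10))))          # 120 smaller transactions
```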

You might also use the GeoEvent Processor's log to see how many milliseconds (or seconds) elapse between GeoEvent Processor's POST to retrieve information and its POST to update the features. If there are any errors in these transactions, you might also find them in the log.

First, from GEP Manager, navigate to the Logs page and click "Settings". Select 'Debug' and enter the following logger item to debug: com.esri.ges.transport.featureService.FeatureServiceOutboundTransport. Don't forget to set this logger item back to ERROR, especially if you have thousands of events being sent to GeoEvent Processor. The logging statements generated at this level have the potential to produce log files which are quite large.

If you edit the C:\Program Files\ArcGIS\Server\GeoEventProcessor\etc\org.ops4j.pax.logging.cfg configuration file you will see this logging setting persisted. If you have trouble using the GeoEvent Processor's logging utility, you can always edit the configuration file directly (as an admin, since it's under Program Files) to reset the logging level. Also, you might find it easier to view the log in a text editor. Look for the log file in the C:\Program Files\ArcGIS\Server\GeoEventProcessor\data\log folder.
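
For reference, the persisted setting is a standard log4j-style property; on a 10.2 install I would expect the line to look roughly like this (exact syntax may vary by version):

```
log4j.logger.com.esri.ges.transport.featureService.FeatureServiceOutboundTransport=DEBUG
```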

When reviewing the logs, you are looking for lines with the following logger entry:  com.esri.ges.framework.transport.featureservice-transport. Key phrases you will be looking for include:
  • querying for missing track id
  • Posting to url
  • Response was
At the debug level you should get a copy of exactly what is being sent/posted to the server. The query to get the object identifiers, for example, will look something like:   Posting to url : http:// ... FeatureServer/0/query with parameters ... FlightNumber IN ('SWA1568','SWA510' ...

You should be able to see the JSON being sent to Server by each GEP transaction. You may need to copy/paste it into something like http://jsonlint.com/ to format it so you can read it. You can look at the timestamps on each log entry to get a feel for how long the different transactions with Server are taking.
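
If sifting the log by hand gets tedious, a short script can pull out just those key phrases and compute the gap between each "Posting to url" and the "Response was" that follows it. A sketch - the log filename and timestamp format are assumptions about your log layout, so adjust them to match your files:

```python
# Extract the key transport phrases from a GEP log and report POST latency.
from datetime import datetime

PHRASES = ("querying for missing track id", "Posting to url", "Response was")
FMT = "%Y-%m-%d %H:%M:%S,%f"  # assumed log4j-style timestamp prefix

posted = None
with open("geoevent.log", encoding="utf-8", errors="replace") as log:  # placeholder name
    for line in log:
        if not any(phrase in line for phrase in PHRASES):
            continue
        try:
            stamp = datetime.strptime(line[:23], FMT)  # assumes a leading timestamp
        except ValueError:
            continue  # wrapped/continuation line without a timestamp
        if "Posting to url" in line:
            posted = stamp
        elif "Response was" in line and posted is not None:
            delta_ms = (stamp - posted).total_seconds() * 1000
            print(f"{stamp}  round trip: {delta_ms:.0f} ms")
            posted = None
```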

The illustration below is an example of what you want to see in the logfile. Notice the latency between the last two DEBUG statements is about 325 milliseconds. If GeoEvent Processor is being fed a large number of events every second, it may be that transactions are taking too long and Server simply cannot keep up with the data being fed to it from GeoEvent Processor.

[Attached screenshot: example DEBUG log entries showing the POST/response timing]

Hope this information is helpful -
RJ
JordanGottlieb
New Contributor
RJ, thank you for the prompt and thorough response, much appreciated!

Bumping Mark Bramer's post was the result of finally getting stuck on a somewhat involved GEP performance and tuning experiment. I've been hammering on the GEP in order to discover and improve the maximum workloads achievable via configuration using our sandbox hardware. After trying dozens of configuration and simulation rate combinations, I have yet to discover one that doesn't vanish expected output into the ether after 50 or more simulated events per second. 

Some of the configuration variables I've been experimenting with are a feature service's number of instances, service timeouts and process isolation settings, the number of feature services receiving simulated events, and the number of GeoFences my GEP Services have to consider. Before each sample-gathering test run, the relevant feature service DB tables have been truncated, the GEP Manager Monitor stats have been reset, and the relevant GEP Services, inputs, and outputs have been restarted.


How many events per second is the GeoEvent Simulator you are using sending to GeoEvent Processor?

I've executed each test configuration at 1, 5, 10, 25, 50, 100 and 200 simulated events per second for a duration of 5 minutes at each rate while measuring system resources, throughput and various logs. At each rate, all configuration tests were repeated with events being published to a single feature service with no filtering or processing:

TCP -> FS1

...and with the events evenly spread out over multiple feature services:

TCP -> FS1, FS2, FS3, FS4, FS5

In the latter case one simulated event to the TCP input would be output to five feature services, so I'd simulate 20 events per second to get 100 total events per second.

As mentioned, expected output is lost very consistently at or above a rate of 50/sec, and somewhat consistently at or above ~25/sec.


Are you using the Update a Feature or the Add a feature outbound connector to update the feature service?

Add a feature only.


Updating features can be expensive if the target feature service contains many thousands of features.

Either way, the data sources for the feature services have been truncated prior to each test run, though some of my tests have generated tens of thousands of rows... actually one generated 600K... had to restart the ArcGIS server after that one. =]


There are a couple of things you might try to help narrow the scope of the problem:

  • Reduce the number of events being sent to see if the problem is with data rate.

With the best configuration I've found, 10/10 5-minute runs at 20 events/sec to a single feature service ran without dropping expected output.

  • Switch to use the Add a feature outbound connector to see if the problem is with adding records vs. updating records.

That's all we've been using.

  • Specify a smaller number (50, 10, or even 1) for the outbound connector's Maximum Features Per Transaction parameter to see if updating the feature service with smaller batches helps.

I'll try that; the default of 500 may be overwhelming the FS... Good one!

  • Check to make sure that there are no errors with the event data, such as a string which exceeds the maximum length allowed by the feature service or an integer value being sent which exceeds the range of the feature service's target field.

The event data consists of 100 duplicates of a few valid, tested events, differing only by time stamp.


You might also use the GeoEvent Processor's log to see how many milliseconds (or seconds) elapse between GeoEvent Processor's POST to retrieve information and its POST to update the features. If there are any errors in these transactions, you might also find them in the log.

I've been monitoring the Server and GEP logs, but not with your deeper logger setting... I'll try a few more test runs with it.


Hope this information is helpful -RJ


Sure is, thanks! ~Jordan