Question regarding "Incremental Update" workarounds, custom components?

daleward · ‎07-20-2018

This is regarding the Polling feature services for "Incremental Updates" blog post. I'm asking this question here as well since I'm not sure if the blog post is still monitored(?)

If I'm understanding the workarounds correctly - I don't think they'll work for us. We're not handling streams of real-time data, but are trying to respond to changes in the data sources - "Somebody edited a field, which changed the contents of this FeatureData Service". We are potentially looking at millions of records - even retrieving these from the feature data services so that we can apply the workarounds will take a large amount of time.

As a way around this, so that GeoEvents Service remembers the 'last polled' datetime between service restarts, would you recommend that we write a custom component that serializes the datetime?

If so - which component? A custom transport component?

Thank you.

RJSunderman · ‎07-27-2018

Hey Dale -

Following-up on your questions and additional information you sent to Esri Technical Support, hoping the information below will be helpful to someone else in the future.

Please also review "Using a partial GeoEvent Definition to update feature records", which includes comments with illustrations showing how you might go about configuring your input to only poll feature records you have not marked as having already been processed.

>> Some of our data tables are large, > 2.5 million records...so we'd like to avoid the workarounds mentioned in the blog post that involve retrieving the whole data set, filtering it etc.

There’s one clarification I’d like to make, and an option you might consider. The workaround I suggest isn’t receiving the whole feature record set every poll and filtering it down to those records which match some specified criteria. The GeoEvent Server input is using the criteria you specify as the input’s Query Definition parameter to construct a WHERE clause the underlying RDBMS will use to determine which records to include in the record set returned to the client’s query (GeoEvent Server is acting as a client in this context).

So the input you configure is still using a WHERE clause like it’s doing now, polling for feature records where date/time is greater-than some key value. The difference is that the input will query for feature records whose hasBeenProcessed attribute is clear (e.g. ‘0’) and the GeoEvent Service into which you incorporate this input will be responsible for setting the hasBeenProcessed attribute to ‘1’ for any record it routes through an event processing and notification workflow.

Here's how you configure the solution, from within GeoEvent Server, using only out-of-the-box capability. First, you have to add a new attribute to your data’s schema, something like hasBeenProcessed, and re-publish the feature service. When the feature service is published make sure that new field is configured to apply a default value of zero when feature records are created (so existing input web-forms or scripts don’t need to be modified). Of course you’ll have to explicitly set the new attribute value to zero if you’re importing existing feature records into the new RDBMS table. Then configure a GeoEvent Server input specifying its Query Definition parameter as hasBeenProcessed=0. All existing records will be polled on the first interval, assuming you’ve set the attribute to zero on all unprocessed feature records, but as records have their hasBeenProcessed flag raised, they won’t be included in responses to future queries.
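To make the Query Definition concrete, here is a minimal sketch of the kind of request a client sends to a feature layer's query operation. The service URL and layer index are hypothetical; where, outFields and f are standard ArcGIS REST query parameters, but the exact request GeoEvent Server builds may differ from this.

```python
from urllib.parse import urlencode

# Hypothetical feature layer endpoint; substitute your own service.
LAYER_URL = "https://example.com/arcgis/rest/services/Assets/FeatureServer/0"


def build_poll_url(layer_url, query_definition="hasBeenProcessed=0"):
    """Build a query URL that asks only for records not yet flagged."""
    params = {
        "where": query_definition,  # evaluated by the RDBMS, not by the client
        "outFields": "*",
        "f": "json",
    }
    return f"{layer_url}/query?{urlencode(params)}"


poll_url = build_poll_url(LAYER_URL)
print(poll_url)
```

The point of the sketch is only that the selection criteria travel with the request, so the database decides which rows come back.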

Now, whatever GeoEvent Service you’re using to process the polled feature records, have it implement some business logic needed to verify that a record is complete and ready for processing. That’s an advantage to this approach – if your feature record polling relied only on date/time change and you found that one or more fields were null/empty and the record wasn’t ready for processing, you would have to somehow force an update to the date/time field (probably by editing the feature record) in order for that feature record to be included in a later query / poll request sent by the GeoEvent Server input.

You’ll add a short event processing branch to your GeoEvent Service which includes a field mapper processor and a field calculator processor. The field mapper reduces the event schema to exactly two fields – the TRACK_ID field and the hasBeenProcessed field. The field calculator takes the field mapped record and sets the hasBeenProcessed attribute value to ‘1’. You then send this event record to an Update a Feature output. You’re using GeoEvent Server to update exactly one attribute of the same feature record set it is polling as input. The flagged feature record will now be excluded from the result feature set when the input next requests a set of feature records to process.
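The poll / verify / flag cycle can be sketched with plain Python standing in for the GeoEvent Server elements. Everything here is illustrative: the records, the status field and the is_ready rule are made up, and the functions only mimic what the input, field mapper, field calculator and Update a Feature output would do.

```python
# In-memory stand-in for the feature service table (hypothetical records).
features = [
    {"TRACK_ID": "A1", "status": "open", "hasBeenProcessed": 0},
    {"TRACK_ID": "B2", "status": None, "hasBeenProcessed": 0},  # incomplete
    {"TRACK_ID": "C3", "status": "closed", "hasBeenProcessed": 0},
]


def poll(table):
    """Mimic the input's Query Definition: hasBeenProcessed=0."""
    return [dict(row) for row in table if row["hasBeenProcessed"] == 0]


def is_ready(event):
    """Placeholder business rule: every attribute must be populated."""
    return all(value is not None for value in event.values())


def flag_branch(event):
    """Field mapper (keep two fields), then field calculator (set the flag)."""
    mapped = {
        "TRACK_ID": event["TRACK_ID"],
        "hasBeenProcessed": event["hasBeenProcessed"],
    }
    mapped["hasBeenProcessed"] = 1
    return mapped


def update_feature(table, event):
    """Mimic the Update a Feature output: match on TRACK_ID, write one attribute."""
    for row in table:
        if row["TRACK_ID"] == event["TRACK_ID"]:
            row["hasBeenProcessed"] = event["hasBeenProcessed"]


def run_interval(table):
    """One polling interval: process ready records and flag them."""
    processed = []
    for event in poll(table):
        if is_ready(event):
            processed.append(event["TRACK_ID"])  # notification workflow goes here
            update_feature(table, flag_branch(event))
    return processed


first = run_interval(features)
second = run_interval(features)
print(first, second)
```

Note that the incomplete record is never flagged, so it keeps showing up in each poll until it passes the readiness check, with no need to touch a date/time field.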

The approach of setting a hasBeenProcessed attribute is really quite powerful. It enables you to use GeoEvent Server filtering and processing to QA/QC event records ingested from a set of feature records before proceeding with event record notification.

>> ...we'd really like that to happen in the database instead.

Here’s something you might consider; you could register a spatial view as a feature service and allow GeoEvent Server to always poll for all features. As I indicated above, I do believe that requests made by a GeoEvent Server input are appropriately leveraging the power of the RDBMS to select records for processing. The input is not querying the full record set and filtering to discard records – it is making a request for any feature records which satisfy a WHERE clause, receiving and then processing only those records. But the input *is* limited to what can be expressed in a simple WHERE clause …

The trick when using a spatial view is really subtle. The database view is responsible for selecting database records which match some configured criteria. The view might execute a tabular JOIN or use SELECT statements to retrieve database records with relative date/time values – such as any record updated within the last five minutes. A GeoEvent Server input cannot do this using a simple WHERE clause. Using a spatial view, the RDBMS has the burden of preparing a (possibly highly dynamic) view of its data; GeoEvent is simply saying “give me whatever records you’ve got” when it polls the feature service endpoint it’s been configured to query.
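As a rough illustration of the "updated within the last five minutes" idea, the sketch below uses SQLite from Python as a stand-in database. The table, column and view names are hypothetical, this is an ordinary (non-spatial) view, and the date arithmetic syntax will be different in your enterprise RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assets (
        track_id    TEXT PRIMARY KEY,
        last_edited TEXT NOT NULL      -- UTC timestamp, 'YYYY-MM-DD HH:MM:SS'
    );
    -- The view, not the client, decides what "recent" means.
    CREATE VIEW recently_edited AS
        SELECT track_id, last_edited
        FROM assets
        WHERE last_edited >= datetime('now', '-5 minutes');
    INSERT INTO assets VALUES ('A1', datetime('now', '-1 minute'));
    INSERT INTO assets VALUES ('B2', datetime('now', '-1 hour'));
""")

# The poll is an unconditional "give me whatever records you've got".
recent = [row[0] for row in conn.execute("SELECT track_id FROM recently_edited")]
print(recent)
```

Because the relative time window lives in the view definition, the polling client needs no key value at all and nothing has to survive a restart.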

>> It seems like the only thing that's missing is we'd need to save the "most recent successful poll" date after each poll, and restore it on service restart. Is that correct, or are we missing something?

Nope, there’s no confusion on your part here. It would be ideal if the GeoEvent Server input wrote the key value it was using either to a system file or into the Zookeeper distributed configuration store. We considered this capability but rejected it because we were uncomfortable with how often the GeoEvent Server transport would have to update this key value. At a minimum the updates would introduce latency into the event record ingest workflow. There’s also a consideration as to where exactly the key value should be persisted and how a GeoEvent Server administrator would locate and clear that key value if, indeed, they wanted to clear it so the input would once again poll for all feature records. Administrators would have to understand that every Poll an ArcGIS Server for Features input they configured was potentially persisting a key value that was affecting which feature records were included in a response to a query … and that even something as heavy as restarting the GeoEvent Server would not clear some stubborn setting because persistence of the key value was, by design, rendering the input impervious to system restart.
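For anyone weighing this up in a custom component, here is a language-neutral sketch (in Python, not the GeoEvent SDK) of the three operations such persistence would need: save after a poll, load on restart, and an explicit clear for administrators. The LastPollStore class and the file location are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class LastPollStore:
    """Persist the 'last polled' key value to a small JSON file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        """Return the saved timestamp, or None if nothing has been persisted."""
        try:
            with open(self.path, encoding="utf-8") as fh:
                return datetime.fromisoformat(json.load(fh)["last_poll"])
        except FileNotFoundError:
            return None

    def save(self, when):
        """Write to a temp file and rename, so a crash cannot leave a half-written value."""
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"last_poll": when.isoformat()}, fh)
        os.replace(tmp, self.path)

    def clear(self):
        """Administrative reset: forget the key so the next poll returns everything."""
        try:
            os.remove(self.path)
        except FileNotFoundError:
            pass


store = LastPollStore(os.path.join(tempfile.mkdtemp(), "last_poll.json"))
store.save(datetime(2018, 7, 27, 12, 0, tzinfo=timezone.utc))
restored = LastPollStore(store.path).load()  # simulates reading the value after a restart
print(restored)
```

Even in this toy form, save runs a file write on every poll and clear is something an administrator has to know exists, which are the two concerns described above.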

>> Can we extend from the existing ESRI transports and adaptors, or do we need to work directly off of the bare ESRI base InboundTransport and InboundAdaptor base classes, implementing the rest of the functionality ourselves as well?

Unfortunately, no, the base out-of-the-box transports and adapters cannot be extended. Their source code is not provided, so you will have to re-implement your own FeatureService inbound transport and (probably) your own “Esri Feature” JSON inbound adapter in order to configure your own Poll an ArcGIS Server for Features connector. That is more work than I am generally comfortable suggesting a customer push onto their Java developers.

- - -
I can appreciate that some of your feature record sets are quite large – in excess of 2.5M records. Hopefully their schema is not locked and you have the freedom to add an attribute field which GeoEvent Server can use as a flag to say: “I’ve processed that one, don’t give it to me again.”

- RJ
