Hello Walter --
Up front, I'm going to ask that you open an incident with Esri Tech Support to get a specialist assigned who can help reproduce the issue on our end. This is for traceability in case the behavior you are seeing is a bug which needs to be addressed by the development team (as opposed to something we can configure in your deployment or environment that solves the issue for you).
A couple of things I know about stream services:
- There is only one web socket running for each stream service output. This web socket is run as part of the GeoEvent Server's JVM. When multiple clients subscribe concurrently to the web socket they all suffer a deterioration in performance.
- I do not know what the upper limit is on how many clients can concurrently subscribe to a single stream service - only that there is little you can do to scale out the stream service for many hundreds or thousands of subscribers. You cannot, for example, configure more than one web socket resource to run and support a stream service you expect to experience a high volume of subscribers.
There are four component loggers for stream services which you might try setting to log DEBUG messages to see if you can capture an error or warning around the time the stream service stops broadcasting data.
com.esri.ges.datastore.agsconnection.NewStreamServiceBuilder
com.esri.ges.framework.streamservices.client.StreamServiceClientImpl
com.esri.ges.framework.streamservices.client.AGSConnectionStatusListenerStreamServiceClient
com.esri.ges.transport.streamService.StreamServiceOutboundTransport
I do not know off hand how verbose these component loggers are relative to data ingest volume / velocity. If you request DEBUG logging, monitor your karaf.log found beneath ...\ArcGIS\Server\GeoEvent\data\log to gauge how fast the log file is growing and log file rollover. If the logging is too verbose you could easily miss the data you are trying to capture as the current karaf.log fills-up, rolls over, and older rolled log files are deleted.
As one stream service appears to be working reliability and the other seems to be failing every day or so, is there anything obviously different between the two? Are both stream services broadcasting roughly the same number of Point (my assumption ?) feature records per second? Do both stream services have roughly the same number of concurrent subscribers? You indicate that the problem does not manifest when the same stream service is run on a second machine, but then, there isn't anyone subscribing to that instance of the stream service.
I know that Esri Tech Support has a similar case / incident they are helping the development team reproduce -- not specifically stream service related, but related to event processing stopping after a variable length of time (reportedly 2 - 5 days). To briefly describe some underlying GeoEvent Server architecture for you, there are two Kafka topic queues working with each GeoEvent Service. The first routes event records from an input (the producer) to a GeoEvent Service for processing. The second routes processed event records from the GeoEvent Service to an output (a consumer). If something were to go wrong with the Kafka event broker I might expect an input would continue to ingest and adapt data to create event records, but those event records would not be processed by a GeoEvent Service -- you would see the event count of the input increase as data was received, but the 'In' / 'Out' event count of the GeoEvent Service in which that input was incorporated would not increment. Or, if the second Kafka topic had failed, you might see event data being processed and the 'Out' event count for a GeoEvent Service incrementing, but the stream service output's event count would not increment. Knowing which of these you are observing (if either applies) will help Esri Technical Support better address an incident you submit with them.
To help Esri Technical Support we will need you to capture a sample of the radio device or iPad device data being sent to GeoEvent Server. It doesn't look like there is anything "stateful" in your GeoEvent Service (e.g. I don't see an Incident Detector or any particular filtering logic) which might prevent us from running the same few hundred event records in a loop through GeoEvent Server for 48 hours. If you have reason to believe that we need more temporally consistent / contiguous data, we'll need your help to get a snapshot of that so that we can run the system for 48 hours to see if we can recreate the failure in-house. Please also create an XML snapshot of your configuration using the 'selective' export option to export just the GeoEvent Service whose stream service output is having a problem. This will simplify the XML snapshot by not including inputs, outputs, GeoEvent Definitions, or other configurable elements not shown in the relatively simple GeoEvent Service you illustrated above.
Hope this information helps --
RJ