TCP input buffer overflow

08-08-2013 08:41 AM
MarkBramer
Occasional Contributor II
Hi,

At some point last night, my busiest TCP input stopped working.  I discovered this by noticing on Manager's Monitor page this morning that it hadn't received any data for 14 hours, which would put the failure at around 9:40 PM, and I knew the message stream couldn't possibly have been idle that long.  So I first looked in Manager's Logs tab and saw pages and pages of the following error:

Logger: com.esri.ges.manager.stream.internal.InboundStreamImpl
Message: "Input 'tcp-text-in' experienced a buffer overflow on channel '1'.  It cannot parse the data as fast as the data is coming in.  Some data was lost."

I wanted to find out if the time of the first buffer overflow error correlated with the time the TCP input stopped receiving data (this wouldn't solve my issue, but it would at least establish a cause/effect relationship).  The earliest record in Manager's log display was well after the TCP input stopped working, so I then looked at the karaf logs on disk.  There were 11 log files (karaf.log, and karaf.log.1 through karaf.log.10).  Even then, the earliest record in the oldest karaf log (karaf.log.10) was 1.5 hours after the TCP input stopped working.  So I did not have enough information on hand to determine whether the first instance of the buffer overflow coincided with the TCP input no longer receiving data.

For what it's worth, the full error in the karaf logs is: "<date/time> | ERROR | Socket Listener | InboundStreamImp | 301 - com.esri.ges.manager.internal-streammanager-10.2.0 | Input 'tcp-text-in'... buffer overflow ...."

Manager's Monitor page showed the max rate for this TCP input as 148/sec, but I have seen it on other occasions at 200/sec.  Even so, these are relatively low rates compared to benchmarks I've seen on GEP's throughput capabilities.  Monitor had also shown that 5,206,192 messages had been received.

All messages coming in during this period were of the same message definition (I had thought about possibly setting up separate TCP inputs per message definition, but since everything here uses a single definition, I don't believe that would apply...).

Finally, restarting the TCP input in Manager did not change the state of anything - I had to bounce GEP in the Windows Services console to get everything healthy again.

Questions:

1. Do karaf.log archived files stop at karaf.log.10?  If yes, can this be increased?
2. With these relatively low rates, is the log message *really* what's going on here (the input's not able to keep up)?
3. If this *is* really what's going on, why are other benchmarks so much higher than what I'm seeing?
4. If this is really what's going on, are there any creative ways to "scale" GEP on one machine so that it can handle hundreds of features per second of the same message definition?
5. What, if anything, can I do on my end to troubleshoot this or change my configuration?

Thanks,
Mark
1 Reply
RyanElliott
New Contributor III
1.  It stops at 10 by default.  There appears to be a configuration setting that controls this, called "log4j.appender.out.maxBackupIndex", in the file "org.ops4j.pax.logging.cfg" in the "etc" folder of your GeoEvent Processor install; a few other settings in that file control file size and other details (see the example config lines after this list).  You are actually asking about the file appender, which is just one output of the GEP logging system (the Log panel in Manager is another).  For more details about configuring appenders, see the Log4j manual.

2.  Yes, I suspect that's what is really going on, but it's not just that the Adapter isn't keeping up; it sounds like it got completely stuck (since you said it completely stopped sending out GeoEvents).  From the GEP perspective, it knows that raw bytes were passed into the Adapter, but no GeoEvents are coming out.

3.  The benchmarks are measuring how fast GEP can pass a simple message all the way through from beginning to end.  This assumes minimal parsing of data in the Input, and minimal formatting of data in the Output.

4.  The Text Adapter (which I assume you are using) is single threaded in release 10.2.  So if parsing the incoming text data is the bottleneck, you would have to create two Inputs to get two separate parsing threads.  That only works if you have control over the incoming data and can send half of the text messages to one Input and half to the other (see the sketch after this list).  But again, you said data completely stopped flowing, so I suspect something in the Text Adapter got completely stuck.

5.  You mentioned that you had to restart GEP when stopping and starting the input didn't fix it.  Another intermediate step might be to delete the offending input and recreate it.  When you restart an input, there are data structures in memory that are not deleted and may be in a bad state; deleting and recreating the input causes those structures to be recreated, which might fix the problem.
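
To illustrate #1, here is roughly what the relevant appender lines in etc/org.ops4j.pax.logging.cfg look like in a stock Karaf-based install.  The values shown are only examples; check your own file for the actual settings before changing anything:

    # Rolling file appender that writes karaf.log
    log4j.appender.out=org.apache.log4j.RollingFileAppender
    log4j.appender.out.file=${karaf.data}/log/karaf.log
    # How large each log file can grow before it rolls over (example value)
    log4j.appender.out.maxFileSize=1MB
    # How many archived files to keep: karaf.log.1 ... karaf.log.10 (example value)
    log4j.appender.out.maxBackupIndex=10

Raising maxBackupIndex (and/or maxFileSize) gives you a longer log history to search the next time something like this happens.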
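
And to illustrate the splitting idea in #4, here is a minimal sketch of a sender that alternates messages between two TCP inputs, assuming you control the sending application.  The host name, ports, and newline delimiter are assumptions for the example; substitute whatever your inputs are actually configured with:

    import socket

    # Hypothetical endpoints: two GeoEvent TCP inputs that use the same
    # message definition but listen on different ports.
    INPUTS = [("gep-host", 5565), ("gep-host", 5566)]

    def send_round_robin(messages):
        """Alternate messages between the inputs so that each input's
        (single-threaded) text adapter only has to parse half the feed."""
        connections = [socket.create_connection(addr) for addr in INPUTS]
        try:
            for i, msg in enumerate(messages):
                # Assumes the inputs expect newline-delimited text messages.
                connections[i % len(connections)].sendall((msg + "\n").encode("utf-8"))
        finally:
            for conn in connections:
                conn.close()

As noted above, this only helps if parsing really is the bottleneck; it won't prevent an adapter from getting stuck.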