TCP input buffer overflow

Discussion created by MBramer-esristaff Employee on Aug 8, 2013
Latest reply on Aug 8, 2013 by relliott-esristaff
Hi,

At some point last night, my busiest TCP input stopped working.  I discovered this on Manager's Monitor page this morning, which showed that the input hadn't received any data for 14 hours.  That puts the stoppage at around 9:40pm, and I knew the message stream couldn't possibly have been idle that long.  So I first looked in Manager's Logs tab and saw pages and pages of the following error:

Logger: com.esri.ges.manager.stream.internal.InboundStreamImpl
Message: "Input 'tcp-text-in' experienced a buffer overflow on channel '1'.  It cannot parse the data as fast as the data is coming in.  Some data was lost."

I wanted to find out whether the time of the first buffer overflow error correlated with the time the TCP input stopped receiving data (this wouldn't solve my issue, but it would at least establish a cause/effect relationship).  The earliest record in Manager's log display was well after the TCP input stopped working, so I then went to disk to look at the karaf logs.  There were 11 log files (karaf.log, and then karaf.log.1 through karaf.log.10).  Even then, the earliest record in the oldest karaf log (karaf.log.10) was 1.5 hours after the TCP input stopped working.  So I did not have enough information on hand to determine whether the first buffer overflow occurred at the same time the TCP input stopped receiving data.

For what it's worth, the full error in the karaf logs is: "<date/time> | ERROR | Socket Listener | InboundStreamImp | 301 - com.esri.ges.manager.internal-streammanager-10.2.0 | Input 'tcp-text-in'... buffer overflow ...."
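
The manual search through the rotated karaf logs described above could be scripted roughly as follows. This is only a sketch under stated assumptions: that each log line is pipe-separated with the timestamp as the first field (as in the error quoted above), that a higher numeric suffix means an older file, and that lines within a file are in chronological order. The function names and the directory path are hypothetical, not part of GEP or Karaf.

```python
from pathlib import Path


def rotation_index(path: Path) -> int:
    """karaf.log -> 0, karaf.log.7 -> 7 (assumed: higher index = older file)."""
    suffix = path.name.rsplit(".", 1)[-1]
    return int(suffix) if suffix.isdigit() else 0


def first_overflow(log_dir, input_name="tcp-text-in"):
    """Return (file name, timestamp field, full line) for the earliest
    buffer overflow entry that mentions input_name, or None if none is found."""
    files = sorted(Path(log_dir).glob("karaf.log*"), key=rotation_index, reverse=True)
    for path in files:  # oldest file first
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if "buffer overflow" in line and input_name in line:
                    # Assumed layout: "<date/time> | ERROR | ... | message"
                    timestamp = line.split("|", 1)[0].strip()
                    return path.name, timestamp, line.rstrip("\n")
    return None
```

Calling `first_overflow("path/to/karaf/logs")` (placeholder path) returns the oldest surviving match. It cannot recover entries that have already rotated out of karaf.log.10, which is the limitation described above.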

Manager's Monitor page showed the max rate for this TCP input as 148/sec, but I have seen it on other occasions at 200/sec.  Even so, these are relatively low rates compared to benchmarks I've seen on GEP's throughput capabilities.  Monitor had also shown that 5,206,192 messages had been received.

All messages coming in during this period were of the same message definition (I had thought of possibly setting up a separate TCP input for each message definition, but I don't believe that would apply here...).

Finally, restarting the TCP input in Manager did not change the state of anything - I had to bounce GEP in the Windows Services console to get everything healthy again.

Questions:

1. Do karaf.log archived files stop at karaf.log.10?  If yes, can this be increased?
2. With these relatively low rates, is the log message *really* what's going on here (the input's not able to keep up)?
3. If this *is* really what's going on, why are other benchmarks so much higher than what I'm seeing?
4. If this is really what's going on, are there any creative ways to "scale" GEP on one machine so that it can handle hundreds of features per second of the same message definition?
5. What, if anything, can I do on my end to troubleshoot this and/or change the configuration?

Thanks,
Mark
