GenerateGeodatabaseJob no network always hits addJobDoneListener at 41% download

06-01-2020 10:19 AM
AaronDick
Occasional Contributor

I am wondering why GenerateGeodatabaseJob, when the network is lost in the middle of the job, always continues to count up (addProgressChangedListener keeps reporting values erroneously) until it reaches 41%. I assume the ArcGIS Runtime SDK does this faux count-up in percentage for those with a bad internet connection that goes in and out. However, this is super annoying, as it appears to take around 10 minutes to count up to 41% before it finally reports a failed status on addJobDoneListener with an error of "Illegal State" (generateGeodatabaseJob.getError().getMessage()).

Is this configurable at all, so that GenerateGeodatabaseJob triggers addJobDoneListener after a smaller increase in faux percentage?

As far as I can tell, it seems to disregard RequestConfiguration, which defaults to:

requestConfiguration.setConnectionTimeout(10000);
requestConfiguration.setSocketTimeout(30000);

Any assistance greatly appreciated.  Can provide code sample if what I am describing does not make sense.


Accepted Solutions
Nicholas-Furness
Esri Regular Contributor

Hey Aaron.

So there are a couple of things going on here:

  • What the server is doing for the Job.
  • How the Runtime checks in on that.

When the generate geodatabase job is started in the Runtime, it kicks off a process on the server. Runtime then polls it periodically. We're resilient to (i.e. ignore) network failures because networks could come and go while the service continues to prepare the geodatabase for download and it would be bad to fail the whole job just because we couldn't get through once or twice. If we haven't had a successful connection to the server in 10 minutes though, that's when we'll fail the job. You can see this documented here:

Additionally, jobs using services are designed to handle situations where network connectivity is temporarily lost without needing to be immediately paused. A started job will ignore errors such as network errors for a period of up to 10 minutes. If errors continue for longer, the job will fail and the message will indicate the loss of network connection.

If the Job can get through to the service to check on the job, that 10 minute timer is reset, so it's 10 minutes from the last successful connection.
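
To make the reset behaviour concrete, here is a minimal model of a "10 minutes since the last successful connection" window. It only illustrates the rule described above and is not the Runtime's actual implementation; the class name FailureWindow and its methods are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model only; hypothetical names, not the Runtime's code. */
class FailureWindow {
    private final Duration limit;
    private Instant lastSuccess;

    FailureWindow(Duration limit, Instant startedAt) {
        this.limit = limit;
        this.lastSuccess = startedAt;
    }

    /** A successful status check resets the window. */
    void recordSuccess(Instant now) {
        lastSuccess = now;
    }

    /** True once no check has succeeded for longer than the limit. */
    boolean shouldFail(Instant now) {
        return Duration.between(lastSuccess, now).compareTo(limit) > 0;
    }
}
```

With a 10 minute limit, a job whose last successful check was at minute 9 would not fail at minute 18 in this model, but would at minute 20.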

If you know these calls will fail for 10 minutes, Runtime provides a way to work with Jobs to provide a good experience. Assuming you have some notification of network reachability available to you, a Job has a pause() method. This doesn't actually pause the server-side work, but pauses the client-side polling of the server. You could pause the job when you notice network reachability is gone, and call start() on the job to resume it when it is back. I'm checking with the team to find out whether the 10 minute countdown since the last successful connection includes paused time.
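
A rough sketch of that pause/resume pattern follows. So that it compiles without the SDK, it uses a hypothetical PollingJob interface standing in for the job's pause() and start() methods; in an app you would write a small adapter that delegates to the real job, and call onReachabilityChanged from whatever reachability notification you already have. ReachabilityGate is likewise an illustrative name.

```java
/** Hypothetical stand-in for the job's pause() and start() methods. */
interface PollingJob {
    void pause();
    void start();
}

/** Pauses polling when the network goes away and resumes it when it returns. */
class ReachabilityGate {
    private final PollingJob job;
    private boolean paused = false;

    ReachabilityGate(PollingJob job) {
        this.job = job;
    }

    /** Call this from the app's own reachability callback (assumed to exist). */
    synchronized void onReachabilityChanged(boolean reachable) {
        if (!reachable && !paused) {
            job.pause();   // stops client-side polling only
            paused = true;
        } else if (reachable && paused) {
            job.start();   // resumes polling
            paused = false;
        }
    }

    synchronized boolean isPaused() {
        return paused;
    }
}
```

Tracking the paused flag keeps repeated "still offline" or "still online" notifications from calling pause() or start() more than once.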

A job can also be serialized/deserialized. It's often helpful to make use of that. If you have a serialized copy of the uncompleted/not-failed job, you can later deserialize it and call start() and it will check in with the server and see where to pick up from. Often that will mean polling the server, finding out that while you were gone the job completed, and jumping straight to the download phase. You could, for example, serialize the job and start it again from that serialized state after the 10 minute failure. This is also covered at the above doc link:

To deal with partially-connected workflows, you can serialize and pause a job when your app detects that network connectivity has been lost for a few minutes to avoid job failure purely due to this lack of network connectivity (as failed jobs cannot be restarted). The job can then be deserialized and resumed when connectivity is reinstated.
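
One way to structure the "keep a serialized copy, restore it later" step is sketched below. It is deliberately SDK-agnostic: the serialize and deserialize functions are passed in by the caller, and JobCheckpoint is a hypothetical helper name. With the Android SDK these would presumably be the job's JSON serialization calls, but check the API reference for the exact methods.

```java
import java.util.function.Function;

/** Holds the most recent serialized copy of a job (hypothetical helper). */
class JobCheckpoint<J> {
    private final Function<J, String> serialize;
    private final Function<String, J> deserialize;
    private String saved;

    JobCheckpoint(Function<J, String> serialize, Function<String, J> deserialize) {
        this.serialize = serialize;
        this.deserialize = deserialize;
    }

    /** Take a snapshot, for example just before pausing the job. */
    void save(J job) {
        saved = serialize.apply(job);
    }

    boolean hasCheckpoint() {
        return saved != null;
    }

    /** Build a fresh job object from the snapshot; the caller then starts it. */
    J restore() {
        if (saved == null) {
            throw new IllegalStateException("no checkpoint saved");
        }
        return deserialize.apply(saved);
    }
}
```

In practice the saved string would also be written to persistent storage so it survives the app being closed.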

The iOS Toolkit has a JobManager component that handles all this for you. We don't have a timeframe for the Android Toolkit to get this, but perhaps the iOS one can help you understand some patterns.

So, in short:

  • Yes, you should ideally track network reachability.
  • You can pause and restart the job based off that.
  • When you pause, grab a serialized copy of the job. This will be helpful to reconnect to the server's job processing if you hit the 10 minute timer limit on your job object.
  • If a job failed because it could not connect to the server to get status for 10 minutes, you can create a new job from the serialized version and call start() to reconnect to that job on the server.

Does this help?

Nick.

6 Replies
AaronDick
Occasional Contributor

I guess I should add a little clarification here. I realize I can easily put in a NetworkChangeReceiver as a BroadcastReceiver to immediately determine if the network is lost and cancel everything through this event. The problem is that this does not pick up on the situation where the user is still connected to a 4G hotspot, but the hotspot itself is not receiving data. For my line of work, unfortunately, this is a common problem. Even this can be alleviated by pinging Google with an HttpURLConnection and a read timeout, testing every 30 seconds or so. Having a timed handler run every 30 seconds is a lot of extra overhead to deal with, so although it does work, it is not ideal.
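
For readers unfamiliar with the "connected but no data" check described above, a minimal sketch using only the standard HttpURLConnection class might look like this. The class name DataProbe, the choice of a HEAD request, and the timeout values are assumptions for illustration; the URL to probe is supplied by the caller. On Android this would need to run off the main thread.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper: checks whether data actually flows, not just link state. */
class DataProbe {
    /** Returns true only if the URL answers with a 2xx/3xx status within the timeout. */
    static boolean canReach(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");          // no response body needed
            conn.setConnectTimeout(timeoutMillis);  // limit on establishing the connection
            conn.setReadTimeout(timeoutMillis);     // limit on waiting for the response
            conn.setUseCaches(false);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (IOException | ClassCastException e) {
            // Malformed URL, timeout, refused connection, or a non-HTTP URL.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

The read timeout is what catches the hotspot case: the connection may open, but no response arrives before the limit.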

I would prefer to work through the existing timeout that is built into GenerateGeodatabaseJob.

AaronDick
Occasional Contributor

Nick, thank you for the thorough explanation. I had noticed it was taking about 10 minutes, so it was just a coincidence that it always reached that percentage, since I was switching on airplane mode at the same point each time. I like the added robustness of downloads versus the 10.x way, which was more lacking.

I am leaning toward just allowing things to go the full 10 minutes, rather than trying to ping google every 30 seconds.  The problem in the Android world is that there is no listener that I know of that will detect the scenario where you have a connection to a 4G hotspot but it is not delivering data.  Easy enough to put in a listener that detects the presence of a WiFi/4G connection or not, but the connected but no data being delivered scenario (4G hotspot) is challenging.  

I think the most likely scenario is that the person will be going in and out of network versus completely losing it for 10 minutes, so it makes sense just to use the SDK out of the box.

Regards,

Aaron

Nicholas-Furness
Esri Regular Contributor

Thanks for the extra info, Aaron. Yeah, that sounds like it could work. You could serialize the job too, and offer the user the option to try downloading the job's geodatabase in a few minutes if they know they'll have network later. There are obviously many approaches you could take depending on the user experience you want to provide and how much effort you'd want to put in to automatically detect reachability. Also, I'm sure you've seen this, but this library seems to do the heavy lifting for you.

AaronDick
Occasional Contributor

Nick, no, I was not aware of that reactive library, but I will delve into it in more detail tomorrow. It looks like it can ping a URL to see if it gets a response. I am assuming it runs the ping as a timed task, but I will take a close look. Asynchronous handlers with timed tasks carry a great deal of overhead to manage and to close properly, all while trying to do it right to avoid memory leaks. Although I hate to add more overhead with a third-party library, I will take a look and see if it is easy to use and helpful toward the cause. Thanks for the link.

JoeHershman
MVP Regular Contributor

@Nicholas-Furness thanks for the information; this is really helpful.

Some follow-up questions: in AGOL, how long is a job kept around after it has completed? If I go the route of serializing/deserializing the job and rechecking after the 10 minutes have expired, at some point that job is removed from the server, so I would need to just give up. What is that time limit?

Something else we see in the sync errors: some services give "Illegal state: Job failed because responses have failed for more than 10 minutes.", while other services in the offline map give the "A task was canceled." message. Is that what would be expected and, if so, why? Unfortunately, I am not in a position to know where the edits occurred.

Finally (for now): what about a case where we get a success response from the job, but then lose connectivity during download of the delta file? This almost seems like a more likely case in an area of poor connectivity.

Thanks  -joe

btw: I am using the Xamarin Forms API, but based on everything being built on the same core, I am assuming this applies equally well
