We're fighting the same issues with the ArcGIS Python library.
It appears to be totally random when it occurs.
We can run our script for 24 hours without a problem, or it can throw an exception within 5 minutes.
It generally happens when reading from or writing to a hosted layer in the Data Store.
The data is created by Collector. We read the inspection records, determine the result of each inspection (the status: passed, failed, etc.), and write that value back to the feature's record (the feature is a hydrant) so that the map symbology changes in quasi-real time.
My thought was the same as yours: use nested try/except blocks to give the connection a few attempts to succeed. At some point, raise a fatal exception and stop the process. We run the process on a 15-minute loop, so if it fails, it will start again within 15 minutes at most.
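
For what it's worth, the retry idea can be flattened into a loop instead of nesting try/except blocks. This is only a rough sketch, not anything from the ArcGIS library itself: `with_retries`, `RetriesExhausted`, and the default attempt count and delay are names and values I made up for illustration, and `operation` stands in for whatever call reads from or writes to the hosted layer.

```python
import time


class RetriesExhausted(Exception):
    """Hypothetical fatal error: every attempt failed, so stop this run."""


def with_retries(operation, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, pausing between failures.

    `operation` is any zero-argument callable, for example a lambda that
    wraps the read or write against the hosted layer (assumed, not shown).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # assumption: narrow this to the errors you actually see
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay_seconds * attempt)
    # All attempts failed: raise a fatal error and keep the original cause.
    raise RetriesExhausted(f"gave up after {attempts} attempts") from last_error
```

You would wrap each read or write in a zero-argument callable and pass it in. If `RetriesExhausted` propagates, the run stops and the next scheduled pass of the loop gets a fresh start.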
It's not at all uncommon to see issues like this on the web, right? There are a lot of switches, etc. that messages have to pass through, and they can get bumped around along the way. It would be nice to see the logic for dealing with these issues built into the core code so that we don't have to hack around it.