I ran into a very similar situation with a script I wrote to data-mine all of the MXDs and LYRs in a file system. The goal was to see which data sources users were actually using, and how, instead of asking them and relying on incomplete and usually inaccurate responses. I had something like 10,000 files to look at, and I could never get the script through more than several hundred to a thousand of them before it crashed, and not with a traceback: the Python process would simply vanish. I would find the MXD that was being analyzed when it crashed, and there were never any issues with it in ArcMap, nor when I copied subsets of related MXDs to a different folder and processed only a few hundred at a time.
Since the crashes would bring down Python itself, I could never find a way to catch the errors. No matter how much error trapping and compartmentalization/isolation I did in the code, it would hit some magical number and poof. In the end, because I had to get something working, I used multiprocessing to pool workers so that a given subprocess could crash without killing the rest of the script. Since the subprocesses were tracking their chunks of the list and reporting back, I could recycle the list from a crashed subprocess and get other processes working on it.
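For anyone wanting to try the same idea, here is a minimal sketch of the crash-and-recycle loop described above. It is not the original script. It swaps the multiprocessing pool for plain child processes started with subprocess, which gives the same isolation and is easier to show in a few lines. The worker script name (mxd_worker.py), the rule that the worker prints each path after finishing it, and the chunk_size and max_attempts parameters are all assumptions made for illustration. It is written for Python 3.5+; ArcMap's bundled Python 2.7 has no subprocess.run, so the parent would need adapting there.

```python
import subprocess
import sys
from collections import deque

# Hypothetical worker: a script that opens each MXD/LYR path given on its
# command line and prints the path (flushing stdout) once it is done with it.
WORKER_CMD = [sys.executable, "mxd_worker.py"]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_chunk(worker_cmd, chunk):
    """Run one chunk in a child process; return the paths it reported as done."""
    proc = subprocess.run(worker_cmd + chunk, stdout=subprocess.PIPE,
                          universal_newlines=True)
    wanted = set(chunk)
    reported = (line.strip() for line in proc.stdout.splitlines())
    return [path for path in reported if path in wanted]


def process_all(paths, worker_cmd, chunk_size=100, max_attempts=3):
    """Process every path, recycling whatever a dead worker left unfinished."""
    queue = deque((chunk, 1) for chunk in chunked(paths, chunk_size))
    done, failed = [], []
    while queue:
        chunk, attempt = queue.popleft()
        finished = run_chunk(worker_cmd, chunk)
        done.extend(finished)
        leftover = [p for p in chunk if p not in finished]
        if not leftover:
            continue
        # Reset the attempt counter whenever the worker made some progress,
        # so only chunks that repeatedly achieve nothing are given up on.
        next_attempt = 1 if finished else attempt + 1
        if next_attempt <= max_attempts:
            queue.append((leftover, next_attempt))
        else:
            failed.extend(leftover)
    return done, failed
```

A call would look like `done, failed = process_all(all_paths, WORKER_CMD)`. Because the parent only looks at what the worker managed to print, it does not matter how the worker died. With long paths, a large chunk could exceed the operating system's command-line length limit, so passing the chunk through a file or stdin may be a safer variation.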
Even the multiprocessing approach got clunky because of timeout problems. Certain MXDs, and I have no idea why, would hang indefinitely, so some subprocesses crashed while others never returned. It was messy in the end, but I got what I needed.
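One way to tell the hangs apart from the crashes is to give each child process a deadline. The following is only a sketch under assumptions: the worker script name (mxd_worker.py), the one-file-per-child layout, and the 300-second default are made up for illustration, and it relies on the timeout argument of subprocess.run, which exists in Python 3.5+ but not in ArcMap's bundled Python 2.7.

```python
import subprocess
import sys

# Hypothetical worker: a script that opens the single MXD path given on its
# command line, lists its layers, and exits with code 0 on success.
WORKER_CMD = [sys.executable, "mxd_worker.py"]


def run_with_timeout(cmd, timeout_seconds):
    """Run one child process and classify it as 'ok', 'crashed', or 'hung'."""
    try:
        # On timeout, subprocess.run kills the child before raising.
        result = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "crashed"


def triage(paths, worker_cmd, timeout_seconds=300):
    """Run the worker once per path and group the paths by outcome."""
    results = {"ok": [], "crashed": [], "hung": []}
    for path in paths:
        status = run_with_timeout(worker_cmd + [path], timeout_seconds)
        results[status].append(path)
    return results
```

Running one file per child is slower than handing out chunks, but it means a hang or crash can be pinned on a specific MXD, and the "hung" and "crashed" lists can be rechecked by hand in ArcMap afterwards.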
Fundamentally, ArcMap seems much more tolerant of problems in MXD structure than arcpy.mapping is. It would be nice if there were a mapping.isValid method or property that could catch structural errors and return a Boolean, rather than having the user start listing layers and either get errors raised or have the code crash.