Data Deduplication on the File System

10-27-2015 09:30 AM
RandyKreuziger
Occasional Contributor III

Our Network Group wants to turn on data deduplication on the file system containing our File and Personal Geodatabases.  It's already enabled on the non-GIS file system, where it has saved several hundred GB of disk space with no corruption.  Does anyone have experience with whether this slows the system down or whether it can cause corruption at some point?

5 Replies
MattWilkie1
Occasional Contributor II

How did this work out Randy?

Our file reporting tool is finding terabytes of duplicate files within file-gdb directories, but I'm not sure I believe it. Or, better said: I trust that the contents of these reported duplicates are currently identical, but I'm concerned that replacing the duplicates with hardlinks to a single source means that when ArcGIS changes one of them, the change will cascade through all the others, and that would almost certainly be wrong.
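
To make the worry concrete, here is a minimal Python sketch (the file names are made up, and it assumes the application rewrites the file in place) of what a hardlink-style "dedupe" would do:

```python
# Hypothetical file names; a hardlink is just another name for the same file
# data, so an in-place edit through either name is visible through both.
import os

with open("a0000000b.gdbtable", "w") as f:        # stand-in for a file-gdb table file
    f.write("original contents")

os.link("a0000000b.gdbtable", "copy_in_other_gdb.gdbtable")   # hardlink "dedupe"

with open("a0000000b.gdbtable", "w") as f:        # rewrite the original in place
    f.write("edited contents")

print(open("copy_in_other_gdb.gdbtable").read())  # -> "edited contents"
print(os.stat("a0000000b.gdbtable").st_nlink)     # -> 2: one inode, two names
```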

JoshuaBixby
MVP Esteemed Contributor

Data de-duplication doesn't work like that, or else no one would use it with active storage volumes.  Data is only de-duped when it is truly duplicative.  If a file is changed and no longer has duplicative contents/data, then it is no longer de-duped.  Data de-duplication is more advanced than just hardlinking files.  If someone is only hardlinking files and saying they are doing data de-dupe, it is time to get a new IT person.
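
To illustrate the difference, here is a toy Python model of content-addressed, chunk-level dedup. It is not how Windows Server or any vendor actually implements it, just a sketch of the idea that identical chunks are stored once while each file keeps its own references, so editing one file leaves the others untouched:

```python
# Toy illustration only (not a real dedup engine): identical chunks are stored
# once and referenced by hash, but each file keeps its own reference list.
import hashlib

chunk_store = {}    # hash -> chunk bytes, shared across files
file_table = {}     # filename -> list of chunk hashes

def write_file(name, data, chunk_size=4):
    refs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(h, chunk)    # duplicate chunks stored only once
        refs.append(h)
    file_table[name] = refs

def read_file(name):
    return b"".join(chunk_store[h] for h in file_table[name])

write_file("gdb_A/table", b"ABCDABCD")   # two identical chunks -> one stored copy
write_file("gdb_B/table", b"ABCDABCD")   # fully deduped against gdb_A/table
write_file("gdb_B/table", b"ABCDXYZ!")   # edit gdb_B: only its own refs change
print(read_file("gdb_A/table"))          # b'ABCDABCD' -- untouched
print(read_file("gdb_B/table"))          # b'ABCDXYZ!'
```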

MattWilkie1
Occasional Contributor II

Thanks for that clarification, Joshua, and for drawing out that Randy's scenario and mine are different. I was using the language of the reporting tool, TreeSize Pro, which says 'deduplicating' when it is talking about hard linking (ref). So we are asking very different questions. Mine is: is hardlinking file-gdb files à la TreeSize Pro deduplication dangerous?

JoshuaBixby
MVP Esteemed Contributor

The root of a file geodatabase is a directory, and hardlinks aren't allowed on directories.  Hardlinking individual files between separate file geodatabases will definitely cause corruption of one or more of the geodatabases; it is a matter of when, not if.
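
A quick way to see the first point for yourself (the .gdb name below is hypothetical):

```python
# The operating system refuses to hardlink a directory at all, and the root of
# a file geodatabase is a directory.
import os

os.makedirs("parcels.gdb", exist_ok=True)
try:
    os.link("parcels.gdb", "parcels_link.gdb")
except OSError as e:
    print("hardlinking a directory fails:", e)   # EPERM / "Access is denied"
```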

If you have lots of duplicative geospatial data, it is better to change your workflows and practices to reduce the duplication than to rely on filesystem-level functionality that is completely unaware of the internal structure of geospatial data formats.
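
If it helps, here is a rough, report-only Python sketch along those lines (the UNC path is hypothetical): it finds byte-identical files so the duplication can be cleaned up deliberately in the workflow rather than papered over at the filesystem level.

```python
# Report duplicate files under a share without touching them: group by size
# first (cheap), then confirm with a SHA-256 of the contents.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def report_duplicates(root):
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)
    for size, paths in by_size.items():
        if len(paths) < 2 or size == 0:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[sha256_of(p)].append(p)
        for digest, dupes in by_hash.items():
            if len(dupes) > 1:
                print(f"{size:>12} bytes x {len(dupes)}: {[str(p) for p in dupes]}")

report_duplicates(r"\\fileserver\gis_data")   # hypothetical UNC path
```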

Storage isn't free, and the cost can definitely add up if it is mindlessly wasted, but in general the price of storage pales in comparison to the cost of collecting or deriving data.

MattWilkie1
Occasional Contributor II

Hardlinking individual files between separate file geodatabases will definitely cause corruption of one or more of the geodatabases; it is a matter of when, not if.

Thank you for confirming. I thought this might be the case but wasn't sure.

You're right that changing workflow is not just the better solution but the only real one.  I first started raising the alarm that "this is a workflow problem" 10 years ago, when we were bumping against 2 TB. Twice since then the chosen mitigation has been "buy bigger servers", so now we're at 22 TB on a 24 TB volume, having the same conversation again. Whee.
