<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data Deduplication on the File System in Data Management Questions</title>
    <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364741#M44877</link>
    <description>&lt;P&gt;How did this work out, Randy?&lt;/P&gt;&lt;P&gt;Our file reporting tool is finding terabytes of duplicate files within file-gdb directories, but I'm not sure I believe it. Or, better said: I trust that the contents of these reported dupes are &lt;EM&gt;currently&lt;/EM&gt; identical, but I'm concerned that replacing the duplicates with hardlinks to a single source will mean that when ArcGIS changes one of them, the change will cascade through all the others, and that would almost certainly be wrong.&lt;/P&gt;</description>
    <pubDate>Thu, 28 Dec 2023 15:28:57 GMT</pubDate>
    <dc:creator>MattWilkie1</dc:creator>
    <dc:date>2023-12-28T15:28:57Z</dc:date>
    <item>
      <title>Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/270822#M15642</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Our Network Group wants to turn on "data deduplication" for the file system containing our File and Personal Geodatabases.&amp;nbsp; It's currently enabled on the non-GIS file system, where it has saved several hundred GB of disk space with no corruption.&amp;nbsp; Does anyone have experience to know whether it slows the system down or can ever cause corruption?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 27 Oct 2015 16:30:39 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/270822#M15642</guid>
      <dc:creator>RandyKreuziger</dc:creator>
      <dc:date>2015-10-27T16:30:39Z</dc:date>
    </item>
    <item>
      <title>Re: Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364741#M44877</link>
      <description>&lt;P&gt;How did this work out, Randy?&lt;/P&gt;&lt;P&gt;Our file reporting tool is finding terabytes of duplicate files within file-gdb directories, but I'm not sure I believe it. Or, better said: I trust that the contents of these reported dupes are &lt;EM&gt;currently&lt;/EM&gt; identical, but I'm concerned that replacing the duplicates with hardlinks to a single source will mean that when ArcGIS changes one of them, the change will cascade through all the others, and that would almost certainly be wrong.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Dec 2023 15:28:57 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364741#M44877</guid>
      <dc:creator>MattWilkie1</dc:creator>
      <dc:date>2023-12-28T15:28:57Z</dc:date>
    </item>
    <item>
      <title>Re: Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364784#M44878</link>
      <description>&lt;P&gt;Data de-duplication doesn't work like that, or else no one would use it on active storage volumes.&amp;nbsp; Data is only de-duped while it is truly duplicative.&amp;nbsp; If a file is changed and its contents are no longer duplicative, it is no longer de-duped.&amp;nbsp; Data de-duplication is more advanced than just hardlinking files.&amp;nbsp; If someone is only hardlinking files and calling it data de-dupe, it is time to get a new IT person.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Dec 2023 17:25:43 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364784#M44878</guid>
      <dc:creator>JoshuaBixby</dc:creator>
      <dc:date>2023-12-28T17:25:43Z</dc:date>
    </item>
    <item>
      <title>Re: Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364796#M44879</link>
      <description>&lt;P&gt;Thanks for that clarification, Joshua, and for drawing out that Randy's scenario and mine are different. I was using the language of the reporting tool, Tree Size Pro, which says 'deduplicating' when it's talking about hard linking (&lt;A href="https://www.jam-software.com/treesize/deduplicate_files.shtml" target="_self"&gt;ref&lt;/A&gt;). So we are asking &lt;EM&gt;very&lt;/EM&gt; different questions. Mine is: is hardlinking file-gdb files a la Tree Size Pro deduplication dangerous?&lt;/P&gt;</description>
      <pubDate>Thu, 28 Dec 2023 18:13:49 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364796#M44879</guid>
      <dc:creator>MattWilkie1</dc:creator>
      <dc:date>2023-12-28T18:13:49Z</dc:date>
    </item>
    <item>
      <title>Re: Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364867#M44880</link>
      <description>&lt;P&gt;The root of a file geodatabase is a directory, and hardlinks aren't allowed on directories.&amp;nbsp; Hardlinking individual files between separate file geodatabases will definitely cause corruption of one or more of the geodatabases; it's a matter of when, not if.&lt;/P&gt;&lt;P&gt;If you have lots of duplicative geospatial data, it is better to change your workflows and practices to reduce the duplication than to rely on filesystem-level functionality that is completely unaware of the internal structure of geospatial data.&lt;/P&gt;&lt;P&gt;Storage isn't free, and the cost can definitely add up if it's mindlessly wasted, but in general the price of storage pales in comparison to the cost of collecting or deriving data.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Dec 2023 20:52:06 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364867#M44880</guid>
      <dc:creator>JoshuaBixby</dc:creator>
      <dc:date>2023-12-28T20:52:06Z</dc:date>
    </item>
    <item>
      <title>Re: Data Deduplication on the File System</title>
      <link>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364871#M44881</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;Hardlinking individual files between separate file geodatabases will definitely cause corruption of one or more of the geodatabases, it would be a matter of when and not if.&lt;/BLOCKQUOTE&gt;&lt;P&gt;Thank you for confirming. I thought this might be the case but wasn't sure.&lt;/P&gt;&lt;P&gt;You're right that changing workflow is the &lt;STRIKE&gt;better&lt;/STRIKE&gt; only real solution.&amp;nbsp; I first started raising the alarm over "this is a workflow problem" 10 years ago, when we were bumping against 2 TB. Twice since then the chosen mitigation has been "buy bigger servers", so now we're at 22 TB on a 24 TB volume, having the same conversation again. Whee.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Dec 2023 21:02:37 GMT</pubDate>
      <guid>https://community.esri.com/t5/data-management-questions/data-deduplication-on-the-file-system/m-p/1364871#M44881</guid>
      <dc:creator>MattWilkie1</dc:creator>
      <dc:date>2023-12-28T21:02:37Z</dc:date>
    </item>
  </channel>
</rss>

