Existing Approaches to the Problem

     
Background: Native File Management Tools

 
Excess storage consumption and inefficient data management operations stem in part from the following:
  • Current tools for scrubbing the system are slow and crude.
  • Tools focus on individual storage locations and are not well suited to data spread across multiple physical storage locations.
  • Doing the job properly is too time consuming. Data managers already spend as much time as they can on the problem.
  The de facto approach to data management is to use the file manager tools that ship with the operating system. These tools have the following limitations:
  • Native file managers (Explorer, Finder, and Nautilus) all slow down and become massive time wasters when looking at multiple terabytes. Waiting for roll-up information can take minutes to hours.
  • A roll-up shows how much space is used but not why. Answering "why" requires further roll-ups at each lower level of the directory tree.
  • Reports based on du and df are slow and usually require follow-up reports.
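
To make the cost of a roll-up concrete, here is a minimal Python sketch of the kind of calculation a du-style report performs. It is an illustration under stated assumptions, not the implementation of any tool named above: the function name rollup_sizes is hypothetical, and it sums apparent file sizes rather than allocated disk blocks. The point is that every file under the root must be visited once before a single total can be shown.

```python
import os


def rollup_sizes(root):
    """Return {directory: total bytes of every file at or below it}.

    Hypothetical helper for illustration. Uses apparent size (st_size),
    so results can differ from du, which reports allocated blocks.
    """
    totals = {}
    # Walk bottom-up so each child's total is final before its parent is summed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
        for name in dirnames:
            # Symlinked directories are not walked, so they contribute 0.
            total += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = total
    return totals


if __name__ == "__main__":
    # Print the ten largest directories under the current directory.
    sizes = rollup_sizes(".")
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{size:>15,d}  {path}")
```

Because this sketch keeps the total for every directory, drilling down needs no second scan. A file manager or a plain du summary that reports only the top-level number has to walk the tree again for each lower level.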
     
Ad-Hoc Data Management in Excel: Limitations of Data Management via Excel

 
Customers develop their own processes to try to streamline this work, with imperfect results. Typically, a data manager runs a tool such as du, or a custom script, and enters the results in a spreadsheet:
  1. Run scripts that use the find, df, and du commands.
  2. Dump the data to a raw spreadsheet, typically one row per file.
  3. Create a pivot table.
  4. Manually attach business tags and metadata.
  5. Sort by size, business unit, user, etc.
  6. Rinse and repeat as needed!
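
The steps above can be sketched in a few lines of Python. This is a hedged illustration of the ad-hoc workflow, not any customer's actual script: the function names, the CSV column names, and the tags mapping from path prefix to business unit are all assumptions made for the example. The standard csv module stands in for the spreadsheet, and a simple grouped sum stands in for the pivot table.

```python
import csv
import os
from collections import defaultdict


def tag_for(path, tags):
    """Step 4: return the tag of the longest matching path prefix.

    `tags` is a hypothetical {path_prefix: business_unit} mapping that
    someone maintains by hand. Plain string prefix matching, sketch only.
    """
    best = ""
    for prefix in tags:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return tags.get(best, "untagged")


def scan_files(root, tags):
    """Step 1: yield (path, owner_uid, business_unit, size_bytes) per file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # st_uid is a numeric user id on POSIX systems (0 on Windows).
            yield path, st.st_uid, tag_for(path, tags), st.st_size


def dump_csv(rows, csv_path):
    """Step 2: write one row per file to a raw sheet."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "owner_uid", "business_unit", "size_bytes"])
        writer.writerows(rows)


def pivot_by(csv_path, column):
    """Steps 3 and 5: total size per value of `column`, largest first."""
    totals = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[column]] += int(row["size_bytes"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

A run would call dump_csv(scan_files(root, tags), "report.csv") and then pivot_by("report.csv", "business_unit") or pivot_by("report.csv", "owner_uid"). Every run repeats the full scan, and the tags mapping has to be kept current by hand.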
  Analysis and ad-hoc data management in Excel yield imperfect results because of:
  • Slow scans. File system scanning can take days! The script at one animation studio takes 2.5 days to complete, so the information is already out of date by the time it finishes.
  • Wake of debris. Cleanups usually target only the largest space wasters, leaving an ever-increasing pile of small files. Repeatedly re-crawling those small files can add as much as 10x to future scanning times.
  • Infirmity. As engineers move on, their custom scripts age and deteriorate.