Unstructured Data
“If we don’t carefully name and organize our file system data, the only guarantee is we will quickly lose track of it” - a major CG Animation Studio
Definition: Data stored on the file system outside of any database as an independent, stand alone file.
Research analysts estimate that 80-85% of all data is unstructured. It is further estimated that 30-40% of that data is no longer of any use to the organization that generated or stored it. Nevertheless, corporations continue to buy nearly 50% more storage every year to hold their digital assets. Because hardware acquisition accounts for only roughly 10% of storage management cost, corporations have a material, growing problem related to: 1) finding their data, 2) managing (performance, throughput, etc.), backing up, restoring, and quality checking their data, and 3) controlling their IT costs. More and more, a company’s ability to drive down IT costs will hinge in part on its effectiveness in dealing with unstructured data.
Why does this happen? Organizations have hundreds or thousands of users creating, extending, copying, or distributing data every day of every year. Because users are typically measured on their impact on the core business, not on their ability to manage data, the creators and owners of the data spend little time or effort managing its growth. Unfortunately, the IT groups with the mandate to manage the corporate data assets lack the information about the company’s business objectives and schedules, and the detailed knowledge, needed to bring precision to the process. Thus, each small group of users organizes its data on the company’s shared drives with varying degrees of organization, consistency, and discipline. Since there is no database to force structure, little knowledge within the internal IT group to determine a better structure, and no obvious technology suite available to support a more “standard” structure across the organization, the data simply collects across a host of file servers based on very localized schema and temporary project definitions. Not only does this make the data hard to find later, users also exacerbate the problem by copying or emailing their version of file X all around the network as the data migrates between groups or across boundaries of responsibility.
Poor data management practices can be identified as the root cause of storage growth in the catch-all “unstructured” category.
What approaches can be used to manage or reduce the propagation of unstructured data?
Due to the sheer size and complexity of the unstructured data problem set, organizations will likely deploy a combination of the following approaches.
1. Attempt to round up and make sense of the data after the fact using search and classification schemes.
The search engine approach creates a new database of metadata based on a scan of the existing storage and the files hosted. The search technology relies on file names, file metadata, or, in some cases, content within the files to build an index structure allowing the user to search, classify, and view available files. Depending upon the quality, presence, and consistency of the metadata and the nature of the file (binary or ASCII), the search results may improve the user’s ability to find files.
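As a rough illustration of the scan-and-index idea, the sketch below walks a directory tree, records basic file metadata, and answers simple name and extension queries. It is a minimal sketch only: the function names `build_index` and `search` and the record fields are assumptions made for this example, and a real search product would also index file content and persist the index in a database.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def build_index(root: Path) -> List[Dict]:
    """Scan a directory tree and collect one metadata record per file."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            st = path.stat()
            index.append({
                "path": str(path),
                "name": name,
                "ext": path.suffix.lower(),  # e.g. ".txt"
                "size": st.st_size,          # bytes
                "mtime": st.st_mtime,        # last modified, epoch seconds
            })
    return index


def search(index: List[Dict], term: str = "", ext: Optional[str] = None) -> List[Dict]:
    """Return records whose file name contains term (case-insensitive),
    optionally restricted to one extension such as ".txt"."""
    term = term.lower()
    return [
        entry for entry in index
        if term in entry["name"].lower()
        and (ext is None or entry["ext"] == ext.lower())
    ]
```

Because the index here relies only on names and file system metadata, its usefulness depends directly on how consistently files were named in the first place.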
2. Investigate more efficient ways to store unstructured data after the fact by reducing duplication via data de-duplication or content addressed schemes.
These approaches typically use specialized storage hardware that often assigns a unique identifier or checksum to each individual file and attempts to store only a single instance of each unique file.
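The checksum idea can be sketched in software, assuming file-level (not block-level) comparison and SHA-256 as the checksum. Both are choices made for this illustration, not a description of any particular storage product, and the helper names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by content digest.

    Only digests shared by two or more files are returned; each such
    group is a candidate for storing a single instance.
    """
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group holds files with identical content regardless of name or location, which is the situation that arises when users copy or email their version of a file around the network.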
3. Catalog data into specialized databases
This approach results in “silos” of data that can be effective for some of the data but rarely provides a solution for all of a company’s data types. In general, no application database holds the breadth of required data items, and performance typically becomes an issue as database applications attempt to hold more and more diverse data items. In addition, these schemes often hinder the data management strategy by preventing or making it extremely difficult to manage data outside the control of the silo.
4. Consider the underlying file systems to be an effective database. Introduce consistency into nomenclature and data organization.
In essence, the approach considers and manages the file systems as a hierarchical database, with files as the atomic data elements, file names as semantic tags, and the directory structure as the schema. The collection of file systems now becomes (or comprises) the largest possible data silo (or catalog), thereby removing limitations on the file types and formats that can be managed. Traditional catalog limitations are removed, as the file systems tend to be the lowest common denominator for software applications. By implementing standards, rules, and naming conventions in the file system, structure is automatically built into what would otherwise become unstructured. An organization can effectively QC, find, and manage data that has historically been without structure, project or business context, or ownership. The file systems now become a vehicle to detach a company’s data management practice from its application infrastructure.
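To make the "file names are semantic tags, directory structure is the schema" idea concrete, here is a minimal sketch that checks a path against a naming rule and extracts its fields. The convention itself (`<project>/<department>/<project>_<asset>_v<NNN>.<ext>`) and the names `NAME_RULE` and `parse_path` are hypothetical, invented for this example; an organization would substitute its own standards.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical file-name rule: <project>_<asset>_v<3-digit version>.<ext>
NAME_RULE = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<asset>[a-z0-9]+)_v(?P<version>\d{3})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_path(path: str) -> Optional[Dict[str, str]]:
    """Treat a path as a database record.

    Returns the fields encoded in the directory structure and file name,
    or None if the path does not follow the (assumed) convention.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3:  # schema: project / department / file
        return None
    project_dir, department, filename = parts
    match = NAME_RULE.match(filename)
    # The project tag in the file name must agree with its directory.
    if match is None or match.group("project") != project_dir:
        return None
    record = match.groupdict()
    record["department"] = department
    return record
```

Under these assumed rules, a path such as `apollo/lighting/apollo_lamp_v003.exr` yields a record with project, department, asset, version, and extension fields, while a file named `final_FINAL2.exr` is rejected, which is the kind of quality check the approach makes possible.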
DataFrameworks allows an organization to implement its policies and business logic INTO the existing file systems. By structuring the file systems according to the business standards within your corporation, data of any origin or file type can be organized. For the first time, a user’s business knowledge can be captured and leveraged across the file systems. Unlike with various specialized database applications, users remain free to save or retrieve files through their existing workflow applications or browsers, with no interruption to their process. Data can be managed outside the control of a database.
Reduce storage requirements, lower your costs, and mitigate the user and IT burdens and frustration created by rampant storage growth - DataFrameworks.