Estimates suggest that upwards of 80% of business information is unstructured data.
That could cause headaches for anyone who needs to manage, organise and keep all that data secure.
One survey, commissioned by Igneous, an unstructured data management provider, found that 82% of respondents manage one billion or more files and objects. In fact, a staggering 59% of those asked manage more than 10 billion files.
Unstructured data is, broadly, all data and information that does not have a predefined data model.
In practical terms, for IT that means information that is outside a relational database, or stored outside an application environment such as an ERP or HR system that sits on top of a database.
But a growing volume of information could best be described as semi-structured. Although this data is not held in a database, there is some structure there, mostly in its metadata.
And as technology, including object storage, allows for richer metadata, the boundaries between structured and unstructured information could blur further.
Unstructured data in context
Business information is, for the most part, generated by systems, or by people. Data from systems is most likely to be structured. An order number created by a sales system, and stored in a database, is a typical example.
Unstructured data is often created by people. An email from a sales team confirming the order would be unstructured, as would a social media message or voicemail complaining the order was late.
A photograph of a damaged item in delivery would, superficially, also be unstructured data – although metadata from the camera files is semi-structured information.
Data can also move between unstructured and structured during its lifecycle. A business seeing a spike in delivery complaints could combine metadata from customer photographs with geo-tracking information from delivery vehicles in a business intelligence tool.
Although free text-based analysis – and even image analysis – is becoming more powerful, most text analysis tools use a database engine of some sort.
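To make the delivery-complaint example concrete, here is a minimal sketch of the kind of join a business intelligence tool might perform. The field names (`order_id`, `taken_at`, `vehicle`, `depot`) and the sample records are assumptions for illustration only; they do not come from any particular product.

```python
from collections import Counter

# Hypothetical metadata extracted from customer complaint photographs,
# for example camera timestamps plus the order each photo was attached to.
photo_metadata = [
    {"order_id": "A100", "taken_at": "2024-03-01T10:15:00"},
    {"order_id": "A101", "taken_at": "2024-03-01T11:40:00"},
    {"order_id": "A205", "taken_at": "2024-03-02T09:05:00"},
]

# Hypothetical geo-tracking records from delivery vehicles, keyed by order.
delivery_tracking = {
    "A100": {"vehicle": "VAN-7", "depot": "North"},
    "A101": {"vehicle": "VAN-7", "depot": "North"},
    "A205": {"vehicle": "VAN-2", "depot": "South"},
    "A300": {"vehicle": "VAN-2", "depot": "South"},
}


def complaints_per_vehicle(photos, tracking):
    """Join photo metadata to tracking records and count complaints per vehicle."""
    counts = Counter()
    for photo in photos:
        record = tracking.get(photo["order_id"])
        if record is not None:  # skip photos with no matching delivery
            counts[record["vehicle"]] += 1
    return counts


result = complaints_per_vehicle(photo_metadata, delivery_tracking)
print(result.most_common())
```

Once the photo metadata is keyed to an order, the unstructured image itself is no longer needed for the analysis; the structured join does the work.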
Structured data usually comprises small pieces of information, such as the value of a single database entry, although collectively data volumes can be large.
Unstructured data comes in a much wider range of sizes, from a few kilobytes for a message to potentially terabytes for uncompressed video footage.
Handling such a variety of records poses issues for storage managers. It is one reason businesses are looking to move more data – or at least metadata – to structured formats.
Sharad Patel, a data expert at PA Consulting, says companies want to move away from 80% or 90% of data being unstructured. One client he cites has a target to reduce unstructured data to 50%.
Reasons include better security and compliance, as well as improved systems performance. Building in more control over data is increasingly important, as organisations gather and store ever-greater volumes of information.
Semi-structured machine data
As well as dealing with more data, information and storage managers now need to handle a wider range of data types, both in centralised and end-user systems.
IT has moved a long way beyond spreadsheets and word-processor files on the desktop and a few shared databases, to a much richer range of information sets. Audio, image and video files now sit alongside information from the web and, increasingly, information from connected devices and the internet of things (IoT).
Data from sensors and connected devices is essentially semi-structured. Whether it comes from a temperature sensor in a factory or a surveillance camera stream, the raw data is of only limited use. Metadata, such as time and location, is essential for human or automated analysis of the raw information.
Without metadata, it is difficult, and perhaps impossible, to make informed decisions. It is also the metadata that allows analysts to categorise information and move it into a structured environment, such as a database, for processing.
Interrogating the data, whether for a simple historical report or sophisticated predictive analysis, is impossible without a framework of metadata.
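As a rough illustration of moving semi-structured sensor data into a structured environment, the sketch below loads a few temperature readings into an in-memory SQLite database and runs a simple report. The table name, column names and readings are invented for the example; a real deployment would have its own schema.

```python
import sqlite3

# Hypothetical raw readings: the value alone says little without
# the time and location metadata that accompanies it.
readings = [
    {"value": 21.5, "recorded_at": "2024-03-01T08:00:00", "location": "line-1"},
    {"value": 23.5, "recorded_at": "2024-03-01T09:00:00", "location": "line-1"},
    {"value": 30.0, "recorded_at": "2024-03-01T08:00:00", "location": "line-2"},
]

# Use the metadata as columns so the readings can be queried like any other table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temperature (location TEXT, recorded_at TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO temperature VALUES (:location, :recorded_at, :value)",
    readings,
)

# A simple historical report: average temperature per location.
rows = conn.execute(
    "SELECT location, AVG(value) FROM temperature "
    "GROUP BY location ORDER BY location"
).fetchall()
print(rows)
```

Without the `location` and `recorded_at` fields there would be nothing to group or filter on, which is the point the passage above makes about metadata.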
Industry observers expect IoT data volumes to grow rapidly: Gartner expects to see 20 billion IoT connected devices by 2020. Another estimate, from IDC, suggests that IoT data will reach 163 zettabytes (ZB) by 2025.
But the need to capture metadata is having a greater impact on businesses’ data management and storage needs. As much as 5.2ZB of data will need to be analysed and, perhaps because of that, 26% of data will be in the public cloud by 2025, IDC predicts.
Cloud storage is an attractive option for at least some types of unstructured data. The cloud is well-suited to information that needs to be accessed infrequently. One type of data already widely migrated to public cloud storage is archive material.
Firms that need to keep long-term records can make use of the cloud’s low cost per gigabyte, paying retrieval fees only for the data they need to bring back, for example in a disaster recovery or forensic investigation scenario.
Long-term data storage is also able to tolerate the performance lags that potentially come with use of the public cloud. Systems such as Microsoft SharePoint – a common repository for unstructured business information – are less affected by latency than a transactional system, such as ERP, which relies on a relational database.
For semi-structured data, the cloud is appealing because of its use of object storage.
With semi-structured data, the business value can lie as much in the metadata as the data itself. As object stores can distribute data – and metadata – across multiple locations, they raise the prospect of fast, localised searches for metadata while gaining the scale economies of the cloud for raw data.
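The idea of searching metadata separately from the raw data can be sketched with a toy, in-memory model. This is not the API of any real object storage product: the `ToyObjectStore` class, its methods and the sample keys are all hypothetical, and a real system would place the metadata index and the objects in different locations.

```python
class ToyObjectStore:
    """Toy model: objects and their metadata are held in separate structures."""

    def __init__(self):
        self._blobs = {}      # stands in for bulk storage of raw data
        self._metadata = {}   # stands in for a small, fast metadata index

    def put(self, key, data, **metadata):
        """Store the raw bytes and record their metadata under the same key."""
        self._blobs[key] = data
        self._metadata[key] = metadata

    def search(self, **criteria):
        """Return matching keys by reading metadata only, never the raw data."""
        return sorted(
            key
            for key, meta in self._metadata.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )

    def get(self, key):
        """Fetch the raw data for one object."""
        return self._blobs[key]


store = ToyObjectStore()
store.put("cam1/0001.jpg", b"image-bytes-1", camera="cam1", site="depot-north")
store.put("cam2/0001.jpg", b"image-bytes-2", camera="cam2", site="depot-north")
store.put("cam3/0001.jpg", b"image-bytes-3", camera="cam3", site="depot-south")

matches = store.search(site="depot-north")
print(matches)
```

Only the objects returned by `search` need to be retrieved with `get`, which is where the cost and latency of pulling raw data from bulk storage would apply.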
Object storage gaining traction
The larger the dataset, the more attractive the object model, and as a result, object storage is gaining traction in industries as diverse as media and entertainment, life sciences, and oil and gas.
According to Boris Evelson and Elizabeth Cullen, analysts at Forrester Research, cloud-based text analytics tools can be up and running in minutes, even if it takes rather longer to train algorithms to become productive. As businesses can now run analytics in the cloud, there is a stronger case for keeping data in the cloud too.
Performance requirements, though, will act to keep some unstructured and semi-structured datasets on-premises. Over the last decade, storage vendors have steadily improved the performance of network-attached storage – still the go-to architecture for on-premises, unstructured data.
Clustered NAS can offer performance close to that of direct-attached or SAN storage. Data that requires fast processing, such as real-time analytics or customer-facing systems, can be supported on NAS.
And CIOs are likely to favour NAS, or on-premises object storage, where data security and compliance considerations rule out the cloud. In this case, policy requirements may well trump technical or cost considerations.