Data scientists estimate that 80% of the world’s electronic information is “unstructured”: held as email, documents, video, photographs or free text.
“Unstructured data is data held outside data structures like tables and rows without predictable content patterns, such as documents, emails, photos or free text,” says Jacob Isaksen, a digital forensics expert and founder and CEO of Avian, a consulting firm based in Copenhagen.
Or, as Mathieu Gorge, CEO of compliance specialists Vigitrust, puts it, unstructured data looks rather like an unbuilt Lego model.
“Once it’s built you end up with a toy, but it starts with chaos,” he says. “Each piece of information is a brick scattered across the network or even cloud providers.”
Often, this is because an organisation has no defined process in place to categorise or tag data. And, given the volumes of information most businesses now deal with, that might be impossible, at least for older records.
Businesses are moving towards a more structured – or semi-structured – approach through data classification and metadata, to make it easier to manage information and extract value from it. But it remains a work in progress.
“The contents are less predictable in unstructured data. GDPR-relevant information, for example, can reside almost anywhere,” says Isaksen.
Untidy data is a compliance risk
Most unstructured data is never used: according to industry analysts IDC, more than 90% of it is never examined. This means businesses are not making the most of what could be a valuable asset. But it also means the organisation is probably not compliant with data protection laws.
“There are all kinds of ways organisations can end up in technical breach of regulations with unstructured data,” says Neil Harris, head of technical services at law firm DWF. “Data retention is a key one: you are likely to have some data for longer than you should.”
This “data debt”, he suggests, is unlikely to attract regulatory penalties, unless the data is lost or stolen. “If you don’t know what you have or where it is, you can’t protect it,” he warns.
The lack of data categorisation and classification is an ongoing challenge for commercial and public sector bodies, with too many organisations relying on individual employees to file or categorise information. At a low level, this includes using email rules and applying data classification to files on SharePoint, cloud and local document servers.
But the sheer variety of file types, and the volumes of data, make manual processes inefficient or impractical.
As Harris points out, businesses in sectors such as insurance have been forced to use rather arbitrary measures, such as the age of a document, to select files for deletion. Other organisations are less proactive.
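The age-based approach Harris describes can be sketched in a few lines of Python. This is a minimal illustration, not a description of any insurer’s actual process: the seven-year retention period and the function name past_retention are assumptions made for the example, and the file’s last-modified time stands in for the “age of a document”.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Assumption: a seven-year retention period. The appropriate figure
# depends on the sector and the type of record.
RETENTION = timedelta(days=7 * 365)


def past_retention(root, now=None, retention=RETENTION):
    """Return files under `root` last modified before the retention cutoff.

    The function only lists candidates for review; it deletes nothing.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    flagged = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if modified < cutoff:
            flagged.append(path)
    return sorted(flagged)
```

The sketch also shows why the measure is arbitrary: it looks only at a timestamp and says nothing about whether a file contains personal or sensitive data.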
“Unstructured data is largely governed in a decentralised way, often by each user,” says Avian’s Isaksen. “Many enterprises simply write off unstructured data as the responsibility of each employee, whether it is the mailbox owner, the SharePoint site owner, or the network folder owner. But, when a stash of documents containing sensitive data is leaked it very much becomes the enterprise’s problem.”
Mining data, growing risks
And there is a further challenge around unstructured data. Regulators such as the ICO say they will take a pragmatic view of technical breaches, such as keeping historic data longer than is justified.
But the picture is very different when that data is subject to active processing. And that processing – attempting to mine business value from the 90% of unstructured data that is untouched – is increasingly important to businesses.
“Unstructured data must be regarded as just as critical and sensitive as structured data,” says Matthias Reinwarth of KuppingerCole. “It often comes as composite objects with embedded documents of varying origin and source system. However, in most cases, these lack clear ownership or categorisation by criticality.”
This can force businesses to embark on time-consuming and often expensive data categorisation exercises, or face the risk that security measures such as data loss prevention and access control tools will miss sensitive documents. At the same time, organisations could be wasting resources protecting information that is less sensitive than they think.
But there is a further compliance risk: carrying out analysis or processing that breaches data protection rules, especially those around consent.
“The cloud has lowered the barriers to entry for the computing power needed to process unstructured data. This is a key reason why we are talking about unstructured data today,” says Luke Pritchard, CTO for data and artificial intelligence at IT consultants Avanade. “And specialised software has been developed to simplify the task of drawing insights and make more use from unstructured data via machine learning and deep learning algorithms.”
Those processes, though, can change the nature of the data, and alter the basis for consent.
As Mathieu Gorge warns, the use of personal information for marketing is just the most visible way that organisations can fall foul of data protection rules around consent. Most members of the public would, for example, support the use of CCTV for public safety or theft prevention, he says. But combining video footage with, say, facial recognition and loyalty card data would be a serious privacy breach.
“If you take data for one specific purpose and get access to a database or other data that I trusted for a different purpose, and mix that up to get a different dataset or perform business intelligence, that may give you the opportunity to sell products or services and you may be in breach,” he says.
And the further challenge with unstructured data is tracing consent back to the original source, or data subject. Implied consent for being filmed, for example, might be appropriate for a business conference or even visiting a shopping mall, but if a business wanted to process the images further it would be on very shaky ground – and data protection regulators would be far less tolerant.
“You need to design so that only consented data is combined and used in downstream applications such as advanced analytics,” warns Avanade’s Pritchard. “Tracking movement and usage becomes even more critical as data is mixed throughout the organisation’s data ecosystem. Those organisations that are able to figure this out and stay within compliance are going to have a leg up on their competitors in the marketplace – as the use of data drives business of the future.”
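One way to picture the design principle Pritchard describes is to carry the consented purposes with each record and check them before anything is combined. The Python sketch below is purely illustrative: the Record class, its fields, the purpose labels and the combine_for_purpose function are hypothetical names invented for this example, not part of any product or framework mentioned in the article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A piece of data about one person, tagged with what they agreed to."""
    subject_id: str   # hypothetical identifier for the data subject
    source: str       # where the data came from, e.g. "cctv" or "loyalty"
    payload: str      # the data itself, shown here as a simple reference
    consented_purposes: frozenset = field(default_factory=frozenset)


def combine_for_purpose(records, purpose):
    """Group records by data subject, keeping only those consented for `purpose`."""
    combined = {}
    for record in records:
        if purpose in record.consented_purposes:
            combined.setdefault(record.subject_id, []).append(record)
    return combined


# Illustrative data only: CCTV footage gathered for security, and loyalty
# card data whose owners agreed to analytics.
records = [
    Record("s1", "cctv", "footage-001", frozenset({"security"})),
    Record("s1", "loyalty", "card-778", frozenset({"marketing", "analytics"})),
    Record("s2", "loyalty", "card-912", frozenset({"analytics"})),
]

analytics_view = combine_for_purpose(records, "analytics")
```

In this example the CCTV footage never reaches the analytics dataset, because the person filmed agreed only to its use for security. Keeping the source field on every record also gives a simple trail back to where each item came from, which is the tracking problem Pritchard highlights.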