What is dark data and why should organisations start looking into it?

Since it’s Halloween 🎃🧛, I decided to shed some light on the subject of dark data. Time to talk about skeletons in the clos... I mean, databases 😈.
When I talk with different senior leaders, I get a lot of perplexed faces when I mention “dark data”. That’s when some of these questions arise:

How do we get a meaningful value out of our data?
How do we leverage our data to deliver various analytic and predictive insights?
Is dark data the same thing as unstructured data?
How do we use the cleaned-up data to surface trends that help us serve our customers more efficiently?

Many people think dark data is hard to access because of their company’s governance rules. They also say it’s challenging to manage all that data from both an infrastructure and an operational perspective.
I always respond that it depends on what you understand by dark data, as there are many layers to it. You can already get a lot of valuable information (e.g. better customer profiling) by linking the data already covered by your internal governance. There is no real magic here; it’s all within what current ML and NLP advances make possible.
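To make the idea of linking already-governed data a bit more concrete, here is a minimal sketch. It assumes a hypothetical list of CRM records and a pile of free-text notes, and joins them on any email address mentioned in the text. The field names, the sample data and the `link_by_email` helper are all made up for illustration; a real setup would need far more careful entity matching.

```python
import re

# Simple pattern for spotting email addresses in free text (an assumption,
# not a full RFC-compliant matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def link_by_email(crm_records, free_text_notes):
    """Attach each free-text note to the CRM records whose email it mentions."""
    profiles = {rec["email"].lower(): {**rec, "notes": []} for rec in crm_records}
    for note in free_text_notes:
        # De-duplicate so a note mentioning the same address twice is added once.
        for email in {m.lower() for m in EMAIL_RE.findall(note)}:
            if email in profiles:
                profiles[email]["notes"].append(note)
    return profiles


# Hypothetical sample data.
crm = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]
notes = [
    "Call with Ada@example.com about renewal",
    "No contact details here",
    "bob@example.com asked for invoice; cc ada@example.com",
]

profiles = link_by_email(crm, notes)
for email, profile in profiles.items():
    print(email, len(profile["notes"]))
```

Even a crude join like this turns notes that nobody queries into something attached to a customer profile.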

What is dark data?

Dark data is untapped, unstructured and untagged data that sits in data repositories and has not been used in any way to derive insights or support decision making. In some cases the organisation may not even be aware that the data is being collected. Dark data is also known as dusty data.

Dark data can contain important information about an entity, be it an individual or an organisation. In fact, around 80% of the data collected in a typical enterprise is hidden, largely because it’s either semi-structured or completely unstructured. We’re talking about the data found in word-processing documents, images, audio, videos, instant messages on Slack, and emails, among other types of files. If you think about it, human workers rely on this kind of data every day to get their jobs done, and their effectiveness depends on their ability to spot and process the links between fragmented pieces of information.

Where dark data is. Source: FactorDaily / HP

From an intra-organisational point of view, this information can be used for management purposes: information containment, know-how sharing, employee onboarding, compliance, fraud detection and threat prevention. From an external point of view, most of the information contained in dark data can feed a customer 360 view to strengthen the engagement process.

For most businesses, making sense of the vast amount of dark data can be an overwhelming challenge. Businesses generally cite legal issues, legacy workflows or architectural costs as reasons for their reluctance to make the most of their dark data, yet it need not be the elephant in the room. All it takes is a mindset that starts data-first, moves to analytics-first and ends up as an advocate of emerging tech.

Helping humans see the value between different data points

Organisations retain dark data for a multitude of reasons, and it is estimated that most companies are only analysing 1% of their data. Often it is stored for regulatory compliance and record keeping. Some organisations believe that dark data could be useful to them in the future, once they have acquired better business intelligence technology to process the information. Because storage is becoming more and more inexpensive, storing data is easy.

According to Computer Weekly, 60% of organisations believe that their own business intelligence reporting capability is “inadequate” and 65% say that they have “somewhat disorganised content management approaches”.

How do you find dark data?

Finding dark data in your organisation is the biggest challenge. How do you find something if you don’t know it exists? You could compare it to finding a needle in a haystack, but at least there you know what you’re looking for. Trying to find dark data is more like exploring a subterranean cave in total darkness. Maybe the cave is empty, maybe there’s a new species of organism. You can explore for days without finding anything. Even if you bump into something, you won’t know what it is. Conventional data analysis tools won’t work. Most analytics and business intelligence tools rely on structured data. So do relational databases.

As a lot of dark data is unstructured, the information is in formats that may be difficult for a computer to read, categorise and therefore analyse. Often businesses do not analyse their dark data because of the amount of resources it would take and the difficulty of getting that data analysed.

So you get into situations where human knowledge workers, who are capable of so much more, are forced to spend their valuable time extracting key information from semi-structured and unstructured files, usually locking that information in their heads and using it only for that particular process. That information is only as good as the person’s ability to link such fragmented pieces of information.

To extract value effectively from the data you hold, you need a platform that supports all of your organisation’s data formats, understands your queries in natural language and gets you answers.

Why and how should organisations make use of dark data?

Organisations need to understand that any data left unexplored is an opportunity lost and a potential security risk. Based on an organisation’s intent and investment appetite, dark data can either be tapped to generate more opportunities or remain in the dark. That, however, requires organisations to make strategic decisions and investment toward information protection, retention and mining.

The ideal technology for finding dark data is built to work with unstructured data. But there’s more to it than that. You need a platform that automatically detects what type of data it’s looking at, ingests it and prepares it for analysis. Query languages like SQL require you to structure your queries based on the structure of the data, which you can’t do if you don’t know the structure of the data. Solr does a pretty good job of helping you index and structure your text data, but it’s only part of the solution.
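As a rough illustration of the "detect what you are looking at first" step, here is a minimal sketch that sorts files into coarse categories before any ingestion happens. It assumes the file extension is a good enough hint, which is a big simplification: a real platform would inspect the content itself. The category names, the `CATEGORY_BY_SUFFIX` mapping and both helper functions are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a coarse data category.
CATEGORY_BY_SUFFIX = {
    ".csv": "structured",
    ".parquet": "structured",
    ".json": "semi-structured",
    ".xml": "semi-structured",
    ".txt": "unstructured text",
    ".docx": "unstructured text",
    ".pdf": "unstructured text",
    ".eml": "unstructured text",
    ".png": "unstructured media",
    ".jpg": "unstructured media",
    ".mp3": "unstructured media",
    ".mp4": "unstructured media",
}


def categorise(filename: str) -> str:
    """Return a coarse category for a file, or 'unknown' if the suffix is not mapped."""
    return CATEGORY_BY_SUFFIX.get(Path(filename).suffix.lower(), "unknown")


def plan_ingestion(filenames):
    """Group files by category so each group can go to a suitable pipeline."""
    plan = {}
    for name in filenames:
        plan.setdefault(categorise(name), []).append(name)
    return plan


plan = plan_ingestion(["sales.csv", "contract.PDF", "call.mp3", "notes", "events.json"])
for category, files in plan.items():
    print(category, files)
```

The "unknown" bucket is the interesting one: that is where the data nobody has looked at tends to pile up.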

You need a technology that lets you go into any situation and start asking questions in natural language right away. One that helps you link documents and data from different siloed sources to give you a better picture of the problem. One that doesn’t care whether your data is structured or unstructured and is built to be an investigative platform.
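To give a feel for what linking documents from different silos can look like under the hood, here is a toy sketch of a keyword index over text from several sources. It is plain keyword matching, not natural language understanding, and it is nowhere near what a real search or investigative platform does. The source labels, the sample documents and the function names are all invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, question):
    """Rank documents by how many distinct question terms they contain."""
    hits = defaultdict(int)
    for term in tokenize(question):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Most matching terms first; ties broken alphabetically by document id.
    return sorted(hits, key=lambda d: (-hits[d], d))


# Hypothetical documents from three different silos.
documents = {
    "email:1042": "Customer Acme reported a late delivery and asked for a refund",
    "slack:ops-17": "Delivery partner delays expected next week",
    "crm:acme": "Acme renewal due in March",
}

index = build_index(documents)
results = search(index, "acme delivery")
print(results)
```

One question pulls back an email, a CRM entry and a Slack message together, which is the kind of cross-silo picture the paragraph above is describing.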

Each goal requires different technology to deliver the information and insight. Your data could be telling you things. All kinds of things. With the right technology, all you need to do is to figure out what to ask it.