Data Quality Measures – Help or Hindrance
With so many organisations focusing on their readiness for the General Data Protection Regulation (GDPR), the quality of an organisation’s data has never been more in the spotlight.
The onus is now on organisations to be more rigorous and responsible about their data, and about what they can and can’t do with it. In practice, the pursuit of quality data can come into conflict with commercial practices and business operations. Let’s take a look at a couple of examples to explain the point…
When cleansing duplicate data isn’t necessarily the solution
We were called in by a large publisher who, faced with 10+ sources of data, wanted to be able to see the interaction between them. These sources included online sales data, membership data, loyalty programme data and video user data. The Publisher wanted to determine how customers were interacting with its various offerings.
The company’s system suggested that it had 20m unique email addresses. On further analysis and de-duplication, that figure whittled down to 5m. A further review by us concluded there were in fact 3m unique customer records across the various sources.
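The effect is easy to reproduce in miniature. The sketch below (in Python, with made-up source names and sample addresses, not the Publisher’s actual data) shows how summing sources separately overstates the customer base, while normalising and de-duplicating across all sources gives the true unique count:

```python
def normalise_email(email: str) -> str:
    """Lower-case and strip whitespace so trivially different
    spellings of the same address match."""
    return email.strip().lower()

# Hypothetical sources for illustration only
sources = {
    "online_sales": ["Jane@example.com", "bob@example.com "],
    "membership":   ["jane@example.com", "carol@example.com"],
    "loyalty":      ["BOB@example.com"],
}

# Naive count: adding each source separately double-counts customers
naive_total = sum(len(emails) for emails in sources.values())

# De-duplicated count: one set across all sources
unique_customers = {
    normalise_email(e) for emails in sources.values() for e in emails
}

print(naive_total)            # 5
print(len(unique_customers))  # 3
```

The gap between the two numbers is exactly the “loyalty” signal described above: the same people appearing across multiple touch points.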
At one level this presented a great picture of just how many touch points individual customers were engaging with. It was clearly a very loyal and active customer following for this publishing brand.
On the other hand, the business saw its customer list contract from 20m (when the individual sources were added up separately) to 3m unique customers. As the Publisher earned revenue from the promotional opportunities its media partners sought in targeting the audiences of the different sources, it did not make commercial sense to simply consolidate down to just 3m customers. On top of this, the Publisher’s systems were unable to merge duplicate customer IDs into one consolidated account.
A fundamental lesson here, then, is that data cleansing always involves a loss of information. In most cases it makes sense to remove duplicate email addresses. However, there may be instances where, commercially, you need to keep two email addresses for a reason.
This could become an issue under GDPR where different parts of a business get consent to communicate with customers, but then hold duplicate data in silos across the organisation (more on that in a blog later this month).
Are you comparing like with like?
Another point to consider when assessing data quality is: what are you comparing the quality against? Is all the data in the dataset on a level playing field, and are you comparing like with like? Irrespective of the quality of the data collected at source, if it is not aligned with the use for which it is destined, then its quality is questionable.
By way of a simple example, let’s take an insurance company that wants to sell its health insurance product via ‘aggregator’ sites. And let’s apply this to the categorisation of a ‘lapsed smoker’ in the questionnaires on those sites.
Depending on how recently the person gave up smoking, the insurer will expect to adjust the proposed insurance premium in line with its internal criteria and pricing structure. But different aggregator sites will typically have different criteria for defining a lapsed smoker.
|               | Aggregator site 1           | Aggregator site 2           | Aggregator site 3       |
|---------------|-----------------------------|-----------------------------|-------------------------|
| Lapsed smoker | Not smoked for past 5 years | Not smoked for past 2 years | Not smoked in past year |
For the insurer, the differing classifications mean that further work will have to be done on the data feeding through from these sources. Adjustments will likely have to be made so that its internal categorisations (and the resulting pricing of its premiums) are reflected consistently.
So, whilst each aggregator site collected quality data, the insurer knows it is not getting like-for-like data and will need to adjust it to gain the quality it needs to operate and remain commercial.
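One way to think about that adjustment: each site’s ‘lapsed smoker’ flag only guarantees a *minimum* number of years since quitting, so the insurer can conservatively map each flag to that lower bound before applying its own pricing bands. The sketch below illustrates the idea; the site names, thresholds, and loading figures are illustrative assumptions, not real actuarial values:

```python
# Years-since-quitting guaranteed by each site's 'lapsed smoker' definition
# (taken from the comparison table above; site names are hypothetical)
SITE_THRESHOLD_YEARS = {
    "aggregator_1": 5,  # not smoked for past 5 years
    "aggregator_2": 2,  # not smoked for past 2 years
    "aggregator_3": 1,  # not smoked in past year
}

def min_years_quit(site: str, lapsed_flag: bool) -> int:
    """Translate a site-specific 'lapsed smoker' flag into the
    minimum years since quitting that it actually guarantees."""
    return SITE_THRESHOLD_YEARS[site] if lapsed_flag else 0

def premium_loading(years_quit: int) -> float:
    """The insurer's internal pricing bands (illustrative figures only)."""
    if years_quit >= 5:
        return 1.0   # treated as a non-smoker
    if years_quit >= 2:
        return 1.15
    return 1.4

print(premium_loading(min_years_quit("aggregator_1", True)))  # 1.0
print(premium_loading(min_years_quit("aggregator_3", True)))  # 1.4
```

The same flag from two different sites produces two different premiums, which is exactly why the raw ‘quality’ data cannot be consumed without a mapping step.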
Lessons to take away
Whether you are merging records, rationalising data, creating data mapping, or doing some deeper text cleansing, you are always at risk of losing information. That loss may or may not make commercial sense to your organisation’s operations and legitimate business interests in the long term.
The best way to ensure data quality is prevention: ensuring the quality and robustness of data at the point of capture. But this is not always possible for commercial reasons. In those situations, a well-documented process needs to be in place.
Such a process should at one level retain the source data (so the original record can always be checked for auditing and quality referencing). It should also create standardised and approved mappings of the data that overcome any limitations in the use of the data.
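In code, that process can be as simple as never overwriting the source value: the raw field is carried through untouched for auditing, and the approved mapping is applied into a separate, standardised field. A minimal sketch, with hypothetical field names and mapping values:

```python
# Approved mapping of source values to the organisation's standard terms
# (illustrative assumption, not a real schema)
APPROVED_SMOKER_MAP = {
    "ex-smoker": "lapsed",
    "quit": "lapsed",
    "never smoked": "never",
}

def standardise(record: dict, mapping: dict) -> dict:
    """Apply the approved mapping while retaining the original value
    so the source record can always be checked for auditing."""
    raw = record["smoker_status"]
    return {
        **record,
        "smoker_status_raw": raw,  # source value preserved untouched
        "smoker_status": mapping.get(raw.strip().lower(), "unreviewed"),
    }

record = {"source": "aggregator_2", "smoker_status": "Ex-Smoker"}
print(standardise(record, APPROVED_SMOKER_MAP))
```

Unmapped source values fall through to a sentinel (here `"unreviewed"`) rather than being silently dropped, so gaps in the mapping surface for review instead of becoming lost information.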
When looking to improve your data quality it’s therefore important to think through the source data in relation to how it will be used in the organisation – not just the content of the data itself.
Can we help?
If you would like to learn more about how Fusion can help you improve the effectiveness and quality of your data, please get in touch. We are confident we can help you achieve your business goals, so much so we are giving you the opportunity to receive a Free Consultation – which can be organised here.