Using AI to cleanse data?

Posted on May 31, 2024

Do we trust AI to clean data?

Maybe…?

Sometimes…?

It depends…?

True data cleansing means doing a lot of really granular work across entire files or entire databases. That can mean some or potentially all of the rows, and potentially all of the columns.

50,000 rows of data and 25 columns may sound like a pretty ‘small’ set of data, but that means there are a total of 1,250,000 fields that might require cleaning. 🤯

Some fields are BLANK, some contain the WRONG data, some contain MISSPELLED data, some are MISSING data, and there are numerous other possibilities.

The first step in cleaning data is to understand the input and to define the desired output. In other words, you need to know what kinds of problems exist and the appropriate solution for each of those problems.

In the example above, with 50k rows and 25 columns, that means knowing the problems across all 1,250,000 fields, knowing the desired future state, and knowing how to appropriately resolve each issue in the data set.

Data profiling tools help with this because they tell you about the content and the structure of the entire data set, column by column. They make it much easier to identify many of the problems.

Phone numbers in the email field or vice versa? Easy to spot. A mix of 5- and 9-character postal codes? That’s simple too. These tools help scope out all of the different issues that need to be resolved, column by column, with less manual work.
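To make the idea concrete, here is a minimal Python sketch of one thing a profiling pass might do: reduce every value in a column to a ‘shape’ and count the shapes. The sample rows, the column names and the `shape` and `profile` helpers are all invented for this illustration and are not taken from any particular profiling tool.

```python
import re
from collections import Counter

def shape(value):
    """Reduce a value to a pattern: digits become 9, letters become A."""
    if value is None or value.strip() == "":
        return "<blank>"
    pattern = re.sub(r"[0-9]", "9", value.strip())
    return re.sub(r"[A-Za-z]", "A", pattern)

def profile(rows, column):
    """Count how often each pattern appears in one column."""
    return Counter(shape(row.get(column)) for row in rows)

# Hypothetical sample data, for illustration only
rows = [
    {"email": "ann@example.com", "postal_code": "30301"},
    {"email": "555-123-4567", "postal_code": "30301-1234"},
    {"email": "", "postal_code": "3030"},
    {"email": "bob@example.com", "postal_code": "30302"},
]

postal_profile = profile(rows, "postal_code")
email_profile = profile(rows, "email")

for pattern, count in email_profile.most_common():
    print(pattern, count)
```

In the email column, the phone number shows up as its own `999-999-9999` shape next to the email-shaped values, and the postal code column splits into 5-character, 9-digit and too-short shapes. That is the kind of column-level summary that makes these problems easy to spot.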

Data cleansing tools make it simple to apply big, ‘broad stroke’ changes that make the data more useful. Once we better understand all of the different problems with the data, we can use smart tools to resolve those problems.
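A ‘broad stroke’ change is usually just one rule applied to every value in a column. The Python sketch below shows the idea under some loud assumptions: the column names, the rules themselves and the 10-digit phone format are hypothetical choices for this example, not a recommendation for any real data set.

```python
import re

def clean_phone(value):
    """Strip punctuation; assumes (hypothetically) 10-digit phone numbers.
    Values that don't fit are left untouched so they can be reviewed."""
    digits = re.sub(r"\D", "", value)
    return digits if len(digits) == 10 else value

def clean_email(value):
    """Trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()

# One rule per column: the desired future state, written down explicitly
RULES = {"phone": clean_phone, "email": clean_email}

def apply_rules(rows, rules):
    """Return cleaned copies of the rows; the originals are not modified."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for column, fix in rules.items():
            if column in new_row:
                new_row[column] = fix(new_row[column])
        cleaned.append(new_row)
    return cleaned

# Hypothetical sample data
rows = [
    {"phone": "(555) 123-4567", "email": "  Ann@Example.COM "},
    {"phone": "12345", "email": "bob@example.com"},
]
cleaned = apply_rules(rows, RULES)
```

Note that the rule for the second phone number deliberately does nothing. A value that doesn’t fit the expected format is a decision for someone who knows the data, not something to silently ‘fix’.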

Then we can automate. But we can’t effectively automate before we understand the current state, and define the desired future state.

Asking an AI to simply ‘clean the data’, without that understanding, is probably going to fall short.

When you’re looking for duplicates, it’ll be easy to find the obvious ones, where records contain exactly the same data across multiple columns.
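The obvious case really is simple. As a minimal Python sketch, assuming the rows are dictionaries and that the key columns listed here (hypothetical names) are the ones that matter:

```python
from collections import defaultdict

def exact_duplicates(rows, key_columns):
    """Group the indexes of rows that share identical values in key_columns."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in key_columns)
        groups[key].append(index)
    # Keep only the keys that appear more than once
    return [indexes for indexes in groups.values() if len(indexes) > 1]

# Hypothetical sample data
rows = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "5551234567"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "5557654321"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "5551234567"},
]

duplicates = exact_duplicates(rows, ["name", "email", "phone"])
print(duplicates)
```

This only catches records that match character for character. A stray space or a different capitalization is enough to hide a duplicate from it.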

But most duplicates aren’t so easy to find because the data isn’t exactly the same across all columns.

You might try looking for ‘same or similar looking’ values across 3 to 5 key fields. To do that, you’ll need to compare the data in 3 to 5 fields, across all 50k records. 🤯
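To put a number on that: comparing every record with every other record means n × (n − 1) / 2 comparisons, which for 50,000 records is 1,249,975,000 pairs, each checked across several fields. The Python sketch below shows what a single ‘same or similar’ comparison could look like, using `SequenceMatcher` from the standard `difflib` module. The key fields, the 0.85 threshold and the sample records are assumptions made up for this example, and the brute-force loop is only practical for small samples.

```python
from difflib import SequenceMatcher
from itertools import combinations

KEY_FIELDS = ["name", "address", "phone"]  # hypothetical key fields

def similarity(a, b):
    """Similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_score(rec_a, rec_b, fields=KEY_FIELDS):
    """Average similarity across the key fields."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

def candidate_pairs(records, threshold=0.85):
    """Compare every record with every other one and keep the likely matches.
    The threshold is an arbitrary illustrative value, not a tuned one."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = record_score(a, b)
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

# Hypothetical sample data
records = [
    {"name": "Jon Smith", "address": "12 Main St", "phone": "5551234567"},
    {"name": "John Smith", "address": "12 Main St", "phone": "5551234567"},
    {"name": "Mary Jones", "address": "98 Oak Ave", "phone": "5559876543"},
]

matches = candidate_pairs(records)
```

Here the first two records come out as a candidate pair even though the names are spelled differently, while the third record doesn’t match either of them.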

This is where it gets tricky. For example, the name, address and phone number match but the email address doesn’t. It could be the same person using two different email addresses, or it could be another person in the same household.
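One way to handle a case like that is to not decide automatically at all. The Python sketch below is a hypothetical rule, with invented field names and labels, that merges only when everything agrees and sends the ambiguous pairs to a person:

```python
def classify_pair(rec_a, rec_b, match_fields, tiebreak_field):
    """Label a pair of records as 'different', 'duplicate' or 'needs_review'."""
    if any(rec_a[f] != rec_b[f] for f in match_fields):
        return "different"
    if rec_a[tiebreak_field] == rec_b[tiebreak_field]:
        return "duplicate"
    # Same name, address and phone but a different email:
    # same person, or someone else in the household? Let a human decide.
    return "needs_review"

# Hypothetical sample data
a = {"name": "Ann Lee", "address": "12 Main St", "phone": "5551234567",
     "email": "ann@example.com"}
b = dict(a, email="ann.lee@example.org")

label = classify_pair(a, b, ["name", "address", "phone"], "email")
print(label)
```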

There will always be a lot of simple ‘pairs’ of duplicate records, where one record matches another. But with data matching and entity resolution tools, you might end up with very large groups of duplicate records, where many different records all match each other.
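Those large groups appear because matches chain together: if record A matches B, and B matches C, all three land in one group even if A and C were never matched directly. A minimal Python sketch of that grouping step, using a union-find structure, is below. The function name and the sample pairs are invented for this example, and real matching tools may well group records differently.

```python
def group_matches(pairs, record_count):
    """Merge matched pairs of record indexes into groups of duplicates."""
    parent = list(range(record_count))  # each record starts in its own group

    def find(x):
        # Walk up to the group's root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the higher-numbered root under the lower-numbered one
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in range(record_count):
        groups.setdefault(find(record), []).append(record)
    # Records that matched nothing are not duplicates, so leave them out
    return [members for members in groups.values() if len(members) > 1]

# Hypothetical matched pairs among 7 records
pairs = [(0, 1), (1, 2), (3, 4), (2, 5)]
groups = group_matches(pairs, 7)
print(groups)
```

Four simple pairs turn into one group of four records and one group of two. The same chaining can also pull in records that don’t belong together, which is one reason large groups deserve a closer look.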

AI has been, and will continue to be, used for all of this. But to be effective, it’ll always need to know what kinds of issues need to be resolved, and how to resolve them.
