Using LLMs for Data Cleanup – Managing Expectations, Process Explanation, Internal Costs, and Techniques
February 5, 2026
Zack Carlson
Senior Solution Architect
Zack is a full stack developer with experience building processes and workflows that create engaging product experiences. He draws on techniques and tools learned over 15+ years to reduce overhead and keep product experiences consistent. To Zack, the tools are secondary to the business needs; he bridges the gaps that can arise between software solutions and client needs.
Zack currently resides in Minneapolis and enjoys spending time with family, working out, gardening, nerd stuff, and cooking. On his days off, you might find him hiking one of the many trails or lake paths around the Twin Cities with his family or working in his garden.
Or as I like to call it:
“How to shove data into an LLM and not be sad”
I made the leap back into consulting a bit more than a year ago, and it is common for clients to have product or marketing data spanning hundreds of thousands of rows and thousands of columns, all of it there for some purpose.
On nearly every project I get asked, “Couldn’t we fix this with AI?” or a coworker might quip, “So, I took the sample spreadsheet and uploaded it to ChatGPT and it gave me good results. We just need to scale it across the whole catalog.” If you’re cursed with the ability to script, you shudder at the coming moment when they hit you up for help because they’ve hit a wall.
Somehow, almost all AI classification jobs relating to data cleanup I have dealt with start this way, in some form or another. The advent of multimodal LLMs and Agents has made this scenario a new reality and also freakishly easy to show as a proof of concept. Throw in MCP, ever-increasing context windows, and the fact that it can be blazing fast to throw a script together with Claude or any other Agent, and it might seem like a recipe for turning your consultants into superheroes.
AI is amazing. However, in my experience with real-life data, AI can struggle mightily at contextual tasks without excellent prompting, system design, an understanding of the dataset, and clear objectives for success. Very often, proofs of concept are just that: concepts. They appear to solve 95% of the problem. That makes the remaining 5% seem like a manageable amount of work, worthy of investing time in, and definitely not an endless time sink.
Often, a limited amount of data that worked in a single Excel document starts to meet many other challenges as business rules and data begin to scale. Quickly, “one-shot magic” starts to become a multi-phased curated funnel with rounds of revisions, quality assurance, unit tests, bug fixing, scaling, batching, queueing, and so on.
The ask usually boils down to: “Script the data into an LLM with a good prompt.”
To which your answer would be, “Of course we can do that,” followed immediately by the ten different reasons you can think of why investing time elsewhere would keep you saner and make life more enjoyable. The truth is, accuracy from single-stage LLM passes varies wildly by task: anecdotally as low as 20%, with the rosiest claims reaching 90%.
To compound this, AI benchmarks are notoriously unreliable indicators of how well a model will perform on your task, which depends on your prompting, the specific model, your solution design, and whether your business rules are repeatable.
As consultants, we often get to decide which problems we solve, especially when data completeness prevents a business from realizing the full potential of expensive business software. I would like to take a moment to share my observations on how to frame this kind of work in a way that gives you solid structure. Many of the items I outline apply generally to any well-run project, but they are extremely important here because AI will be solving subjective tasks with probabilities.
Scope Boundaries
First, since we are an e-commerce, MDM, and PIM agency, let’s set the edges of the universe, at least for this blog.
We are not working with:
- Model tuning
- Training individual models
- RAG classification pipelines
- Chunking strategies
We are taking core business data and using multimodal LLMs or off-the-shelf open-source models to clean up errors or inaccuracies that have accumulated over time.
If embeddings are generated, they’re used to guide accuracy in the enrichment loop.
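To make that concrete: one lightweight way embeddings can guide accuracy is to compare the embedding of the model's proposed value against the embedding of the source text and flag low-similarity answers for review. A minimal sketch, where `get_embedding()` is a stand-in you would replace with your embedding provider of choice, and the 0.3 threshold is purely illustrative:

```python
import math
from typing import List

def get_embedding(text: str) -> List[float]:
    """Stand-in for a real embedding call; fakes a vector from character counts
    so this sketch runs without any provider configured."""
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    return vec

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Flag enrichment results whose proposed value drifts too far from the source text.
source_text = "Heavy-duty navy canvas tote, 12 oz cotton"
proposed_value = "Color: Blue"
similarity = cosine_similarity(get_embedding(source_text), get_embedding(proposed_value))
print(f"similarity={similarity:.2f} needs_review={similarity < 0.3}")
```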
Now let’s cover the types of AI data work Sitation may be doing at any given moment.
Types of AI Data Work
Image classification
Take an image, look at the image, and add metadata based on a prompt or rules.
Image renaming
Take an image and, based on asset data, metadata, or other rules, assign a new name based on client-provided requirements.
Image metadata enhancement
Add metadata to existing assets by looking at the image, either on the data model side or by filling in missing meta properties.
Product data enrichment
Take a single product record and add fields or data based on client-specific business rules or deductions that can be made from existing information in other systems.
Product data correction
View product records and identify bad eggs, such as incorrect tone in descriptions or descriptions not matching images.
Product data normalization
For example: lbs to LB to pound, Navy or Midnight Blue to Blue.
Product relationship or assignment
Link product A to product B based on a subjective set of rules.
Project “Artifacts” (a.k.a. Things That Grow Legs)
In a project, each one of these items is going to have similar elements. In Agile, they call them “artifacts.” Others might call them project components. As with any data project, they tend to grow legs, so I’m going with that term.
The corpus
All the data you are starting with.
You absolutely need an inventory of every dataset, including known or estimated data quality. System A might be good with measurements, but System B might have the best pricing. Know what you are getting into.
You must know whether the corpus changes, how often it changes, and how change management will be handled. If you are working on stale data, what is the plan to work on the real data?
Decision makers
If the data comes from systems managed by others not on your project, you need at least a path to communicate in order to remove roadblocks.
Plan for pre-cleaning data
An LLM cannot easily read a Quark document, InDesign file, MAME Zelda save file, or other uncommon file types. If they need to be transformed to be analyzed, plan for this in the pipeline or pre-pipeline stages.
Guidelines (How to “Solve” the Problem)
Guidelines should be client-provided documentation defining what a correct outcome looks like. They can be vague about how the solution is implemented, but clear about expectations: how the data should be corrected, described in human terms or with examples.
You must interrogate these guidelines aggressively. Find gaps. Edge cases. Logical inconsistencies. Just because example scenarios work does not mean production data will.
Win / Loss Scenarios
AI is non-deterministic. The same prompt and data can return a different result on a different run.
Success criteria must be defined collaboratively:
- Scoring
- Percentages
- UAT criteria
100% accuracy is unrealistic. Chasing it becomes an endless void. Start by defining what is useful versus a non-starter.
One-time cleanup projects may tolerate more manual correction. Continuous pipelines require stricter guardrails.
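Because results are non-deterministic, one cheap guardrail is to call the model more than once per record and only auto-accept a value when the runs agree; everything else falls into whatever manual-correction budget the project can tolerate. A rough sketch, where `classify` stands in for whatever wrapper you already have around your LLM call:

```python
from collections import Counter
from typing import Callable, Optional

def classify_with_agreement(record: dict,
                            classify: Callable[[dict], str],
                            runs: int = 3,
                            min_agreement: int = 2) -> Optional[str]:
    """Run the same classification several times and keep only a label that repeats."""
    votes = Counter(classify(record) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    # No stable majority: return None so the record can be routed to manual review.
    return label if count >= min_agreement else None

# Toy usage with a stand-in classifier so the sketch runs end to end.
import random
fake_llm = lambda rec: random.choice(["Blue", "Blue", "Navy"])
print(classify_with_agreement({"sku": "A-100"}, fake_llm))
```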
Exceptions
No two businesses operate the same way, even with the same software.
Find exceptions as early as you can, spell them out, and watch whether the exception count keeps rising as you process more test data. That is a sign you may be taking the wrong approach.
Define fallbacks. If the data does not pass muster, what is the plan?
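The fallback plan is easier to enforce when it lives in the loop rather than in someone's head. Here is one way it might look, with made-up field names and thresholds you would tune per project:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnrichedRecord:
    sku: str
    proposed_value: Optional[str]  # what the loop wants to write back
    confidence: float              # however your loop scores its own output
    matched_exception: bool        # did a known exception rule fire?

def route(record: EnrichedRecord) -> str:
    """Decide what happens to a record after it has been through the enrichment loop."""
    if record.matched_exception:
        return "human_review"      # known exceptions always get human eyes
    if record.proposed_value is None:
        return "fallback_queue"    # the model punted; keep the original data untouched
    if record.confidence >= 0.9:
        return "auto_accept"
    if record.confidence >= 0.6:
        return "human_review"
    return "fallback_queue"

print(route(EnrichedRecord("A-100", "Blue", 0.95, False)))  # auto_accept
```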
The Enrichment Loop
This is the core work.
Prepare → Prompt → Process → Evaluate → Repeat
It might be:
- A bash script
- Laravel
- Python
- n8n
- Low-code
- No-code
- Or a Claude-built app
Understand that an LLM may be only one part of this loop or may be called repeatedly. Standard normalization techniques that might be part of the loop include:
- Calculating Levenshtein distance
- Parsing content with regular expressions
- Remote lookups
- Memoization
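To make those concrete, here is a small sketch of the deterministic normalization that can run before (or instead of) an LLM call: a regex for units, and a memoized edit-distance match against a controlled color vocabulary. The vocabularies and the distance threshold are illustrative, not a standard:

```python
import re
from functools import lru_cache
from typing import Optional

CANONICAL_UNITS = {"lb": "pound", "lbs": "pound", "pound": "pound", "oz": "ounce"}
CANONICAL_COLORS = ["Blue", "Red", "Green", "Black", "White"]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalize_weight(raw: str) -> Optional[str]:
    """Parse '12 LBS' / '12lb' style strings into a canonical unit."""
    m = re.match(r"\s*([\d.]+)\s*([A-Za-z]+)", raw)
    if not m:
        return None
    unit = CANONICAL_UNITS.get(m.group(2).lower())
    return f"{m.group(1)} {unit}" if unit else None

@lru_cache(maxsize=None)  # memoize: the same messy value tends to repeat across rows
def nearest_color(raw: str, max_distance: int = 3) -> Optional[str]:
    distance, color = min((levenshtein(raw.title(), c), c) for c in CANONICAL_COLORS)
    return color if distance <= max_distance else None  # None means hand it to the LLM

print(normalize_weight("12 LBS"))  # 12 pound
print(nearest_color("Blu"))        # Blue
print(nearest_color("Midnight"))   # None, falls through to the enrichment loop
```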
Other specialized models can be fine-tuned and run locally on modest hardware.
Do some basic math up front: tokens per record, records per batch, batches per hour. Track the timing as you go.
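Putting it together, the loop itself does not need to be elaborate. A bare-bones skeleton in Python, with a hypothetical `call_llm()` standing in for whatever model and prompt you settle on, batched so the timing math falls out for free:

```python
import time
from typing import Callable, Dict, Iterable, List

def batched(rows: List[dict], size: int) -> Iterable[List[dict]]:
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def enrichment_loop(rows: List[dict],
                    prepare: Callable[[dict], dict],
                    call_llm: Callable[[List[dict]], List[dict]],
                    evaluate: Callable[[dict], bool],
                    batch_size: int = 50) -> Dict[str, list]:
    """Prepare -> Prompt -> Process -> Evaluate, one batch at a time."""
    accepted, rejected = [], []
    for batch in batched(rows, batch_size):
        start = time.monotonic()
        prepared = [prepare(r) for r in batch]   # Prepare: cheap deterministic normalization first
        results = call_llm(prepared)             # Prompt + Process: one model call per batch
        for row in results:                      # Evaluate: apply the agreed win/loss rules
            (accepted if evaluate(row) else rejected).append(row)
        elapsed = time.monotonic() - start
        # The basic math: rows per second tells you what the full corpus costs in wall-clock time.
        print(f"{len(batch)} rows in {elapsed:.1f}s ({len(batch) / elapsed:.1f} rows/s)")
    return {"accepted": accepted, "rejected": rejected}
```

Repeat is the step people forget: every time the prompt or the rules change, the loop gets run again against the same graded sample.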
Test Data & Grading
Tokens cost time and money.
Tweaking your enrichment loop on a small, representative sample of data until accuracy approaches the desired level will get you moving in the right direction. Subtle problems may not reveal themselves until you look at real results.
Use the same grading criteria throughout to track regressions. If grading criteria change, assume you are starting over.
Do not always trust AI to grade itself fairly. Build tooling that allows human evaluators to quickly score results.
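A small harness that grades each run against a human-labeled sample and compares the score to the previous run is usually enough to catch regressions before more tokens get burned. The file name and the idea that results are keyed by SKU are illustrative assumptions:

```python
import json
from pathlib import Path
from typing import Dict

def grade(results: Dict[str, str], answer_key: Dict[str, str]) -> float:
    """Share of records where the loop's value matches the human-labeled answer."""
    correct = sum(1 for sku, expected in answer_key.items() if results.get(sku) == expected)
    return correct / len(answer_key)

def check_for_regression(score: float, history_file: str = "grade_history.json") -> bool:
    """Append this run's score to a history file and report whether we slipped."""
    path = Path(history_file)
    history = json.loads(path.read_text()) if path.exists() else []
    regressed = bool(history) and score < history[-1]
    history.append(score)
    path.write_text(json.dumps(history))
    return regressed

answer_key = {"A-100": "Blue", "A-101": "Green"}   # human-graded sample
results = {"A-100": "Blue", "A-101": "Navy"}       # latest loop output
score = grade(results, answer_key)
print(f"accuracy={score:.0%} regressed={check_for_regression(score)}")
```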
Analytics & Cost Control
Track token usage and where it occurs. Storage is cheap, and transparency around cost is always a plus.
Keep the grades of your results. Every tweak to the enrichment loop should be graded against test data to determine whether you are improving. You may find that additional tweaks lead to diminishing returns.
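Cost tracking can be equally plain. Appending token counts per stage to a CSV gives you the transparency; the per-1K-token prices below are placeholders you would swap for your provider's current rates:

```python
import csv
import datetime
from pathlib import Path

# Placeholder prices per 1K tokens; substitute your provider's published rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}
LOG_FILE = Path("token_usage.csv")

def log_usage(stage: str, input_tokens: int, output_tokens: int) -> float:
    """Append one LLM call's usage to the log and return its estimated cost in dollars."""
    cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]
    is_new = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "stage", "input_tokens", "output_tokens", "est_cost_usd"])
        writer.writerow([datetime.datetime.now().isoformat(), stage,
                         input_tokens, output_tokens, round(cost, 6)])
    return cost

print(log_usage("classification", input_tokens=1200, output_tokens=150))
```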
Now that I have covered the components of these types of projects, it is important to agree upfront on how changing requirements will be handled. If you have done your due diligence and are not running into closed doors, the process may be fairly linear.
If access to test data, source systems, or clear answers about change is hard to come by, push back. Working with bad information, unclear goals, or loose procedures sets you up for sadness.
Managing Change & Scope (Avoiding Scope Creep)
By their very nature, projects like this (cleaning large sets of data) are prone to an endless refinement loop when defects that never appeared in the test data show up in the full production output. It is important to set guardrails that not only delineate the phases of the project (Discovery, Plan Sign-off, Test Data Enrichment Complete, and so on) but also give you an off-ramp:
“This edge case wasn’t in the test data. If you want, we can rewind to that phase of the project, redefine the test data, and work back through refinement, but that may mean modifying the scope of work from one refinement phase to two.”
Tooling
There is an abundance of tools available, and AI lowers the barrier to working with them. However, it is difficult to compare which tools will meet both known and unknown needs. Tools like n8n, Zapier, Make, and Pipedream require little to no programming knowledge.
I have been a full-stack developer most of my career, so I stick with the basics. Any interpretive language with reasonable adoption can work well for data cleanup. Be cautious with no-code platforms if you need scale. Latency and cost can become barriers.
Personally, I have enjoyed instructing Claude to write scripts, query databases via MCPs, and analyze data from the command line. Keep track of your loop and what it does. Do not let AI litter your repo with junk as requirements change. You will need to debug it.
Take time during vibe coding to stop and ask the LLM to explain how it is executing certain tasks and whether a design pattern would help.
Certain keywords and framing will trigger very different behavior. A naive prompt might be:
“Write me a script that parses this file and compares column D to G with an LLM.”
That will likely produce a quick and dirty script: the LLM decides the language (Python or Bash) for you, may attempt to read the entire file into the context window regardless of size, and loops over every row with no batching, no scaling, and poor structure. A better prompt frames the project, defines the artifacts, success criteria, grading, and scale requirements, and asks the LLM to explain its design decisions.
A better example would be:
This is a classification project. We will run multiple rounds of refinement and revision on the data provided in this project / database until it meets a threshold.
Here are the artifacts:
<<corpus definition>>
The steps are:
<<guidelines: for each row, read columns D and G, compare them against the database description, and store a guess as the value>>
This is what defines success and failure:
<<win / loss>>
Here is how it is graded:
<<grading>>
Eventually I need to handle <<data quantity / frequency / etc>>
The UAT process will involve grading a sample of the data. Write it in <<language with features you like>> and install libraries to work with all major LLMs. The script needs to scale with workers or handle asynchronous jobs. Frame out the project so that it can run on a small VPS and keep results in a relational database. Use design patterns, memoization, and caches where feasible, handle recursion, and structure the project so a developer can audit it. Include documentation and prompt helpers to keep context concise. Pause and explain the pros and cons of the design decisions you want to make.
The above prompt clues the LLM in that you want a more serious application, and it will likely be easier for you to understand what is going on and to debug manually.
Take the time to understand every step of the process, and if you can, review important business logic with a fine-tooth comb. When you are “vibe coding,” it is tempting to disengage from the minutiae of the logic and stay “big picture,” but if you don’t understand what is going on under the hood, you will eventually get called to court.
Try out different LLMs and different tiers. Google, Anthropic, and OpenAI are all likely competitive enough, but some are subjectively better at certain tasks than others, and the trial-and-error experience alone makes it worth considering. Sometimes a higher tier or a different model will improve scores on test data.
Final Thoughts
At the end of the day, a data pipeline powered by AI can accomplish tasks that were previously impossible. That doesn’t mean it’s magic. Even with unlimited tokens, you can still mess things up.
Follow sensible project management steps, use artifacts with agreed definitions (invent your own if you don’t like mine), and don’t skip important project structure (avoid the cowboys who say “where we’re going, we don’t need sign-offs”). And don’t be afraid to immerse yourself in all the new methods, now that computers can accomplish tasks we once presumed couldn’t be done algorithmically in a generic way.
If you’re considering a project like this, drop me a line.