How to stop AI from inventing your product data

AI tools love to invent product attributes that aren't there. Here's how Merchkit's enrichment guesses less, scores its confidence, and says when it's unsure.

Bijan Vaez
Reviewing AI-generated product attributes and confidence scores in Merchkit

A few weeks ago I watched an AI fill in "solid walnut" for the frame material on a dining chair. Confident, clean, and completely made up. The supplier sheet never said what the frame was. The model decided walnut sounded right for a chair at that price, so it wrote it down.

That's the uncomfortable part of AI enrichment. These models are trained to be helpful, and helpful usually means "produce an answer." Ask one to fill an empty cell and it will almost always fill it, whether or not the truth is anywhere in your data. For a marketing blurb, a confident guess is fine. For the material a customer is paying for, a confident guess is a return and a chargeback.

We spent a big chunk of this year rebuilding how Merchkit's AI fills in product data, and most of that work came down to one unglamorous goal: guess less, and say so when it's guessing. Here's what actually changed.

Teach it to say "I don't know"

The first fix sounds almost too simple. We taught the model to admit when it can't actually tell.

When a supplier sheet doesn't list the fabric content on a shirt, the right answer isn't a plausible-sounding fabric. It's "I can't confirm this from what you gave me," plus a flag so a person can finish it. So enrichment now leans hard on two things: the source data you actually gave it, and the acceptable values you set for each attribute. If an answer isn't supported by your sources and doesn't fit your list of valid options, the model raises its hand for review instead of quietly inventing one. That isn't leaving the work undone. It's the difference between a gap you can see and fix, and a confident wrong answer that hides in plain sight until a customer finds it.
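To make that concrete, here is a minimal sketch of the gate in Python. It is not Merchkit's implementation: the names `AttributeResult` and `resolve_attribute` are made up for illustration, and the plain substring check is a crude stand-in for whatever a real system uses to decide that a value is supported by the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttributeResult:
    value: Optional[str]   # None means "left blank on purpose"
    needs_review: bool     # True means a person should finish this field
    reason: str


def resolve_attribute(candidate: Optional[str],
                      source_text: str,
                      allowed_values: List[str]) -> AttributeResult:
    """Accept a model's candidate only if it is both an allowed value and
    literally present in the source text (an assumed, simplistic test)."""
    if candidate is None:
        return AttributeResult(None, True, "model could not determine a value")

    normalized = candidate.strip().lower()
    allowed = {v.lower(): v for v in allowed_values}

    if normalized not in allowed:
        return AttributeResult(None, True, "value is not in the allowed list")
    if normalized not in source_text.lower():
        return AttributeResult(None, True, "value is not supported by the source data")

    # Return the canonical spelling from the allowed list.
    return AttributeResult(allowed[normalized], False, "supported by source")


FRAME_MATERIALS = ["Solid Walnut", "Oak", "Steel"]

# The supplier sheet never mentions the frame, so the guess is rejected.
unsupported = resolve_attribute(
    "solid walnut", "Dining chair, upholstered seat, 18 in seat height", FRAME_MATERIALS
)

# Here the sheet does say it, so the value goes through.
supported = resolve_attribute(
    "solid walnut", "Dining chair. Frame: solid walnut. Seat: linen.", FRAME_MATERIALS
)
```

In the first call the walnut chair comes back blank with `needs_review` set, which is the visible gap described above. In the second, the same candidate is accepted because the sheet actually says it.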

Put a confidence score on every value

Blanks help, but they don't tell you which of the filled-in answers to trust. So every value the AI generates now carries a confidence score, and you can filter your catalog down to just the low-confidence ones.

This changes how review actually works. Instead of re-reading 40,000 rows because you don't trust any of them, you filter to the few hundred the model wasn't sure about and look at those. The confident values you spot-check. The shaky ones you fix. Your attention goes where the risk is, which was the whole idea. If you want the longer version of what we think "good" looks like, it's in the enrichment quality guide.
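As a rough picture of that review flow, here is a small Python sketch. Everything in it is an assumption for illustration: the `EnrichedValue` shape, the 0 to 1 confidence scale, and the 0.8 cutoff are not taken from Merchkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EnrichedValue:
    sku: str
    attribute: str
    value: str
    confidence: float  # assumed scale: 0.0 (no idea) to 1.0 (certain)


def review_queue(values: List[EnrichedValue], threshold: float = 0.8) -> List[EnrichedValue]:
    """Return only the values below the threshold, least confident first."""
    low = [v for v in values if v.confidence < threshold]
    return sorted(low, key=lambda v: v.confidence)


catalog = [
    EnrichedValue("CH-001", "frame_material", "Oak", 0.97),
    EnrichedValue("CH-002", "frame_material", "Solid Walnut", 0.41),
    EnrichedValue("SH-010", "fabric_content", "100% Cotton", 0.92),
    EnrichedValue("SH-011", "fabric_content", "Cotton Blend", 0.63),
]

queue = review_queue(catalog)
for item in queue:
    print(f"{item.sku} {item.attribute}: {item.value} ({item.confidence:.2f})")
```

Two of the four rows land in the queue, shakiest first. The other two are the ones you would spot-check rather than re-read.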

Don't generate what you can calculate

This one took me too long to figure out: a lot of "attributes" aren't judgment calls. They're math, or a lookup, or string work. A sale price derived from MSRP. A SKU built from a brand code and a product ID. A shipping class that follows from weight.

You don't want an AI guessing at any of those. You want the same answer every time, built from columns you already have. So we added formula attributes: spreadsheet-style logic that computes a value from your other fields and recomputes on its own when you bulk-edit or re-import. It can even pull a value off a linked vendor or product. The rule of thumb we use now: if a new teammate could fill the field by following a rule, write the rule. Don't ask the AI to roleplay it.
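The three examples above look something like this when written as plain rules. This is Python rather than Merchkit's formula syntax, and the discount, SKU format, and weight cutoffs are invented for the sketch.

```python
from decimal import Decimal, ROUND_HALF_UP


def sale_price(msrp: Decimal, discount_pct: Decimal) -> Decimal:
    """Sale price derived from MSRP, rounded to cents."""
    price = msrp * (Decimal("100") - discount_pct) / Decimal("100")
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def build_sku(brand_code: str, product_id: int) -> str:
    """SKU built from a brand code and a zero-padded product ID (assumed format)."""
    return f"{brand_code.upper()}-{product_id:06d}"


def shipping_class(weight_kg: float) -> str:
    """Shipping class that follows from weight (hypothetical cutoffs)."""
    if weight_kg <= 2:
        return "parcel"
    if weight_kg <= 30:
        return "oversize"
    return "freight"


row = {"msrp": Decimal("249.00"), "brand_code": "mk", "product_id": 4821, "weight_kg": 12.5}

computed = {
    "sale_price": sale_price(row["msrp"], Decimal("15")),
    "sku": build_sku(row["brand_code"], row["product_id"]),
    "shipping_class": shipping_class(row["weight_kg"]),
}
print(computed)
```

Run it twice on the same row and you get the same three values both times, which is the whole point of not handing these fields to a model.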

Classify into your taxonomy, not a made-up one

Category trees are where hallucination gets expensive. A wrong category quietly kills discoverability, and a marketplace may reject the listing outright. The classic failure: ask an AI to categorize a product and it hands back a category that sounds real but doesn't exist anywhere in your taxonomy. "Seating," when your tree says "Dining Chairs."

So classification now walks your tree one level at a time, and at each step it can only pick from the categories that actually exist under the parent. Five to seven levels deep, well over a thousand categories, and every choice lands on a real node. A manual sorting job becomes a single pass, and the output is something you can hand straight to Wayfair or your other channels without another round of cleanup.

Humans in the loop, where it counts

None of this is about taking people out of the process. It's about moving them. The old way had you checking everything because you couldn't trust anything. The version we shipped this year does the confident first pass, leaves honest gaps where it should, and points you at the handful of values that actually deserve a second look. You still own your catalog. You're just not retyping it.

If AI inventing things in your product data is a problem you've lived, that's exactly what we've had our heads down on this year. You can see everything that shipped in the changelog, or bring your own catalog and see how the enrichment holds up.