AI-Powered Product Matching at Scale

The Client

The client is a market leader in B2B product data integration for the building materials sector. It connects hundreds of manufacturers to hundreds of wholesalers and manages a database of millions of unique articles. Its team of data specialists transforms chaotic supplier data into clean, standardized formats, and a large share of that team's time was being consumed by one bottleneck.

The Problem

Product matching, linking a supplier's product to the correct article in a wholesaler's catalog, was eating roughly 20% of the team's total operational capacity.

The challenge is deceptively complex. One supplier calls it "A-plate horizon board 4xAK 2600x1200x12.5 mm", while another lists the exact same product as "GYPSUM PLATE 4XAK 12.5MM 260X120CM". The descriptions use different abbreviations, different unit formats, different field structures, and sometimes different languages entirely.
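To illustrate why unit formats alone defeat naive string comparison, here is a minimal Python sketch that pulls the dimensions out of the two descriptions above and converts them to millimetres. It is an illustration only, not the client's normalization logic; the regular expression, the `TO_MM` table, and the `dimensions_mm` helper are assumptions made for this example.

```python
import re

# Conversion factors to millimetres for the units seen in the two descriptions.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

# One or more numbers joined by "x", followed by a unit, e.g. "2600x1200x12.5 mm".
DIMENSIONS = re.compile(
    r"(\d+(?:[.,]\d+)?(?:\s*x\s*\d+(?:[.,]\d+)?)*)\s*(mm|cm|m)\b",
    re.IGNORECASE,
)


def dimensions_mm(description: str) -> tuple:
    """Return every dimension found in the text, in millimetres, sorted."""
    values = []
    for numbers, unit in DIMENSIONS.findall(description):
        factor = TO_MM[unit.lower()]
        for number in re.split(r"\s*x\s*", numbers, flags=re.IGNORECASE):
            # Accept both "12.5" and "12,5" as decimal notation.
            values.append(float(number.replace(",", ".")) * factor)
    return tuple(sorted(values))


a = dimensions_mm("A-plate horizon board 4xAK 2600x1200x12.5 mm")
b = dimensions_mm("GYPSUM PLATE 4XAK 12.5MM 260X120CM")
print(a, b, a == b)
```

Once both descriptions are reduced to the same set of millimetre values, they agree on size even though the raw strings share almost no characters. Abbreviations and product names still need a separate treatment.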

When product codes exist, matching is straightforward. But when the only information available is a free-text description, which happens frequently, matching becomes a manual puzzle that only domain experts can solve.

The client had already invested in multiple internal matching tools, but none could deliver the automation rate needed to meaningfully reduce the manual workload. Their target was clear: 60-70% of matches handled automatically, humans reviewing the rest.

Our Approach

We structured the project into four phases, each with a go/no-go decision point. Phase 1 was entirely at our own risk: the client needed to make no commitment to get started.

Phase 1: Analysis and Discovery

Before writing any code, we analyzed sample datasets, mapped data quality issues, and studied the existing matching tools. We found untapped opportunities: tens of millions of historical match records that had never been fully leveraged, an internal synonym database mapping domain-specific terms, and strong enrichment extraction logic worth building on.

Phase 2: Matching Pipeline

We built a three-layer matching architecture:

Exact matching: pure code comparison on EAN numbers, barcodes, and supplier codes. No AI needed. Deterministic and fast, this alone caught a significant portion of matches.
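The exact-matching layer can be sketched as a dictionary lookup on normalized codes. This is a minimal sketch under assumptions: the field names (`ean`, `supplier_code`, `article_id`), the normalization rule, and the sample records are all hypothetical and do not come from the client's data.

```python
CODE_FIELDS = ("ean", "supplier_code")  # hypothetical field names


def normalize_code(code) -> str:
    """Drop separators and case so 'gp-125' and 'GP 125' compare equal."""
    return "".join(ch for ch in str(code) if ch.isalnum()).upper()


def build_index(catalog):
    """Map (field, normalized code) to the catalog article id."""
    index = {}
    for article in catalog:
        for field in CODE_FIELDS:
            value = article.get(field)
            if value:
                index[(field, normalize_code(value))] = article["article_id"]
    return index


def exact_match(product, index):
    """Return the matching article id, or None when no code matches."""
    for field in CODE_FIELDS:
        value = product.get(field)
        if value:
            article_id = index.get((field, normalize_code(value)))
            if article_id is not None:
                return article_id
    return None


# Made-up sample catalog for illustration.
catalog = [
    {"article_id": "ART-1", "ean": "0000000000017", "supplier_code": "gp-125"},
    {"article_id": "ART-2", "ean": "0000000000024", "supplier_code": "cp-22"},
]
index = build_index(catalog)

print(exact_match({"ean": "0000 0000 00017"}, index))
print(exact_match({"supplier_code": "GP 125"}, index))
print(exact_match({"description": "gypsum plate"}, index))
```

Products that return `None` here are the ones that would fall through to the semantic layer.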

Semantic matching: for products without matching codes, we implemented a hybrid search combining vector-based similarity (understanding meaning) with keyword-based matching (catching exact terms). Both signals run simultaneously, and the system learns which to trust for each specific match.
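The idea of blending a vector signal with a keyword signal can be sketched in plain Python. Two simplifications are assumed here and should be read as stand-ins: character-trigram count vectors replace the trained embeddings, and a fixed `vector_weight` replaces the learned per-match weighting described above. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word-like tokens used for the keyword signal."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def keyword_score(query, candidate):
    """Jaccard overlap of tokens: rewards exact shared terms."""
    q, c = tokens(query), tokens(candidate)
    return len(q & c) / len(q | c) if (q | c) else 0.0


def trigram_vector(text):
    """Character-trigram counts, a crude stand-in for an embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(query, candidate, vector_weight=0.5):
    """Weighted blend of the vector signal and the keyword signal."""
    vector = cosine(trigram_vector(query), trigram_vector(candidate))
    keyword = keyword_score(query, candidate)
    return vector_weight * vector + (1.0 - vector_weight) * keyword


def rank(query, candidates, vector_weight=0.5):
    """Candidates sorted from best to worst hybrid score."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(query, c, vector_weight),
        reverse=True,
    )


# Made-up catalog entries for illustration.
candidates = [
    "Copper pipe 22 mm x 3 m",
    "Gypsum plate 4xAK 12.5 mm 2600x1200",
    "Plywood sheet 18 mm 2440x1220",
]
query = "GYPSUM PLATE 4XAK 12.5MM 260X120CM"
print(rank(query, candidates)[0])
```

In a production setup the trigram vectors would be replaced by embedding lookups against a vector index, and the blend weight would be tuned or learned rather than fixed.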

Confidence scoring: every match gets a confidence score. High-confidence matches are auto-approved, medium-confidence matches are suggested for review, low-confidence matches go to the manual queue.
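The three-way routing can be sketched as two thresholds on the confidence score. The threshold values below are assumptions chosen for illustration; the source does not state the cut-offs used in production.

```python
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    REVIEW = "review"
    MANUAL = "manual"


# Hypothetical cut-offs; real values would be tuned on validation data.
AUTO_APPROVE_AT = 0.90
REVIEW_AT = 0.60


def route(confidence: float) -> Route:
    """Send a match to auto-approval, review, or the manual queue."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_APPROVE_AT:
        return Route.AUTO_APPROVE
    if confidence >= REVIEW_AT:
        return Route.REVIEW
    return Route.MANUAL


for score in (0.95, 0.75, 0.10):
    print(score, route(score).value)
```

Raising `AUTO_APPROVE_AT` trades automation rate for fewer wrong auto-approvals, which is the main dial for hitting a target such as 60-70% automation safely.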

Phase 3: Custom Model Training

This is where the client's biggest asset came into play. We trained a custom embedding model specifically on building materials terminology, using the historical matches as ground truth and the synonym database to bootstrap the vector space.
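One plausible way to prepare such training data is to turn confirmed historical matches and synonym entries into positive text pairs. The sketch below shows only that preparation step, under assumptions: the record fields (`supplier_text`, `catalog_text`, `confirmed`), the synonym entry, and the sample records are hypothetical. Pairs of this shape are the kind of input that contrastive losses in Sentence Transformers (for example `MultipleNegativesRankingLoss`) accept, but the source does not say which loss or training setup was used.

```python
def build_training_pairs(history, synonyms):
    """Collect (anchor, positive) text pairs for contrastive training."""
    pairs = []
    # Confirmed historical matches act as ground truth.
    for record in history:
        if record["confirmed"]:
            pairs.append((record["supplier_text"], record["catalog_text"]))
    # Synonym entries add term-level pairs to bootstrap the vector space.
    for term, alternatives in synonyms.items():
        pairs.extend((term, alternative) for alternative in alternatives)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(pairs))


# Made-up sample data for illustration.
history = [
    {
        "supplier_text": "GYPSUM PLATE 4XAK 12.5MM 260X120CM",
        "catalog_text": "A-plate horizon board 4xAK 2600x1200x12.5 mm",
        "confirmed": True,
    },
    {
        "supplier_text": "GYPSUM PLATE 4XAK 12.5MM 260X120CM",
        "catalog_text": "A-plate horizon board 4xAK 2600x1200x12.5 mm",
        "confirmed": True,
    },
    {
        "supplier_text": "copper pipe 22mm",
        "catalog_text": "Plywood sheet 18 mm",
        "confirmed": False,
    },
]
synonyms = {"stainless steel": ["inox"]}

pairs = build_training_pairs(history, synonyms)
print(len(pairs))
```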

The improvement was dramatic. Generic models do not understand that two different words can mean the same product, or that a domain-specific abbreviation refers to stainless steel. The custom model does, because it learned from decades of expert matching decisions.

We also built dynamic matching strategies that adapt per product category. Electrical components prioritize exact codes. Raw materials prioritize semantic matching. The system routes each product through the most effective strategy for its category.
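Category-based routing can be sketched as a lookup table of per-category weights. The category keys, the weight values, and the `Strategy` type are assumptions made for this example; only the direction of the weighting (codes first for electrical components, semantics first for raw materials) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    code_weight: float      # trust placed in code-based evidence
    semantic_weight: float  # trust placed in semantic similarity


# Hypothetical weights per product category.
STRATEGIES = {
    "electrical": Strategy(code_weight=0.8, semantic_weight=0.2),
    "raw_materials": Strategy(code_weight=0.2, semantic_weight=0.8),
}
DEFAULT_STRATEGY = Strategy(code_weight=0.5, semantic_weight=0.5)


def combined_score(category, code_score, semantic_score):
    """Blend the two signals using the strategy for this category."""
    strategy = STRATEGIES.get(category, DEFAULT_STRATEGY)
    return (
        strategy.code_weight * code_score
        + strategy.semantic_weight * semantic_score
    )


# The same evidence scores differently depending on the category.
print(combined_score("electrical", code_score=1.0, semantic_score=0.0))
print(combined_score("raw_materials", code_score=1.0, semantic_score=0.0))
```

Unknown categories fall back to an even split, so a new product group still gets a score before anyone tunes a strategy for it.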

Phase 4: Production Deployment

We shipped a clean web interface with drag-and-drop file upload, real-time processing with a live match counter, a color-coded review table, and bulk approval for high-confidence matches. Every specialist action feeds back into the training data for periodic model retraining, so the system gets measurably better over time.
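The feedback loop can be sketched as a log of review decisions that is later turned into labeled examples. This is a minimal sketch under assumptions: the `ReviewDecision` fields, the JSON-lines format, and the in-memory list standing in for persistent storage are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReviewDecision:
    supplier_text: str
    article_id: str
    accepted: bool  # True when the specialist confirmed the match


def append_decision(log_lines, decision):
    """Store one decision as a JSON line (a list stands in for a file)."""
    log_lines.append(json.dumps(asdict(decision)))


def labeled_examples(log_lines):
    """Turn logged decisions into (text, article id, label) examples."""
    examples = []
    for line in log_lines:
        record = json.loads(line)
        label = 1 if record["accepted"] else 0
        examples.append((record["supplier_text"], record["article_id"], label))
    return examples


log = []
append_decision(log, ReviewDecision("GYPSUM PLATE 4XAK 12.5MM", "ART-1", True))
append_decision(log, ReviewDecision("copper pipe 22mm", "ART-1", False))
print(labeled_examples(log))
```

Rejected matches are kept as well as confirmed ones, since negative examples tell a retraining job which near misses to push apart.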

The Result

The matching automation rate exceeded the 60-70% target. Processing time per supplier onboarding dropped from days of manual matching to minutes. Specialists now focus on genuine edge cases rather than routine matches.

The system improves continuously: every confirmed or rejected match feeds back into retraining. Infrastructure costs remain minimal because the core system runs on open-source technology.

"We needed something practical, not a science project. 60-70% automation was the goal. They delivered that and more, and the system keeps getting better the more we use it." , Director of Operations

Technology

Python for the core matching engine
Sentence Transformers with custom-trained embeddings for building materials
FAISS for vector similarity search at scale (millions of articles)
Streamlit for the web-based review interface
AWS Frankfurt for EU-based cloud deployment