
Beyond Keywords: Why Semantic Matching Changes the Sourcing Calculus

2025-07-30 Jaakko Laine

When I was building NLP pipelines at the Nordic job matching platform, we spent a significant amount of time measuring the specific failure modes of keyword-based matching before we built the embedding-based system to replace it. That measurement work was useful, not just for validating the migration, but for understanding exactly which categories of error the keyword approach generated and which of them the semantic approach would actually resolve. The gap between the two approaches was larger than I expected when I started, and the failure modes were more systematic than "it doesn't understand synonyms."

This piece is a technical practitioner's account of where keyword matching fails, what embedding-based retrieval does differently, and where the limits of the semantic approach sit. It is aimed at people who are evaluating sourcing tools and want a more precise framework than "AI matching is better than keywords."

How Keyword Matching Fails — Specifically

Keyword matching between a job description and a CV operates on exact or near-exact token overlap. The failure modes fall into a few distinct categories that are worth distinguishing because they have different implications for which candidates are systematically excluded.

The first category is vocabulary fragmentation: the same role or skill described using different terminology by the recruiter and the candidate. "Machine learning engineer" and "ML engineer" are trivially equivalent to a human reader. TF-IDF and similar term-frequency approaches do not see them as equivalent. Stemming normalizes inflections but cannot map an abbreviation to its expansion, so these approaches depend on synonym expansion, and synonym expansion for technical skill vocabularies is harder to maintain than it sounds because the vocabulary shifts faster than the expansion rules can be updated. When we audited the keyword system at the platform, roughly 18–22% of false negatives in technical roles were attributable to vocabulary fragmentation: candidates who had the relevant skills but described them in their CVs with different terminology than the job description used.
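To make the vocabulary fragmentation case concrete, here is a minimal sketch of token-overlap scoring on the two titles from the paragraph above. The `tokens` and `jaccard` helpers are illustrative names, not the platform's actual matcher, and Jaccard overlap stands in for whatever term-based score a real keyword system would use.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a phrase and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap |A & B| / |A | B|; 0.0 if either side is empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


jd_title = "Machine learning engineer"
cv_title = "ML engineer"

# Only "engineer" is shared: 1 shared token out of 4 distinct tokens.
print(jaccard(jd_title, cv_title))  # 0.25

# A required-phrase filter on "machine learning" rejects the CV outright.
print("machine learning" in cv_title.lower())  # False
```

The low score comes entirely from surface form. Nothing in the token sets tells the matcher that "ml" abbreviates "machine learning", which is the gap a synonym table would have to fill by hand.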

The second category is experience context collapse. A keyword match on "Python" does not distinguish between a candidate who has used Python for three years as the primary language in a production engineering role, and a candidate who completed a Python course six months ago and listed it on their CV. These are categorically different profiles for most roles that require Python, but keyword matching treats them identically. Because a keyword hit says nothing about depth of experience, match quality and actual relevance diverge, and sourcers absorb the resulting noise as manual review burden.
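The collapse is easy to reproduce. The sketch below applies a whole-word keyword filter to two made-up CV snippets modeled on the two profiles just described; `keyword_match` and both snippets are hypothetical illustrations, not data from the audit.

```python
import re


def keyword_match(cv_text: str, keyword: str) -> bool:
    """Case-insensitive whole-word match, as a simple keyword filter would apply it."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, cv_text, flags=re.IGNORECASE) is not None


# Hypothetical CV snippets for the two profiles.
cv_production = (
    "Three years as a backend engineer using Python as the primary "
    "language in a production environment."
)
cv_course = "Skills: Python (completed an introductory course six months ago)."

results = {
    "production": keyword_match(cv_production, "Python"),
    "course": keyword_match(cv_course, "Python"),
}
print(results)  # both profiles pass the filter
```

Both snippets return a match, so everything that separates the two candidates is left for the sourcer to work out by reading the CVs.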

The third category is role title heterogeneity. Job titles are not standardized across organizations, industries, or countries. "Senior Data Analyst" at a Nordic telecoms company carries a different scope and seniority than "Senior Data Analyst" at a startup, and both differ from "Senior Analytics Engineer" at a data-heavy tech company that uses engineering titles for analytical roles. A keyword match on titles generates both false positives (titles that match but describe fundamentally different scopes) and false negatives (relevant candidates whose employers use a different title for the same work).

What Embedding-Based Retrieval Does Differently

Sentence transformers and other embedding models represent text as dense vectors in a high-dimensional space where semantic similarity is captured by vector proximity. Two phrases that mean the same thing — even using completely different vocabulary — are embedded near each other. Two phrases that use the same words but mean different things in context are embedded further apart.
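Vector proximity is usually measured with cosine similarity. The sketch below shows the computation on hand-written three-dimensional vectors. These numbers are invented for illustration and are not the output of any embedding model; real sentence embeddings typically have hundreds of dimensions, and the actual similarities depend on the model used.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 for a zero vector."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, made up by hand. NOT produced by an embedding model.
toy_vectors = {
    "machine learning engineer": [0.90, 0.80, 0.10],
    "ML engineer": [0.85, 0.75, 0.15],
    "mechanical engineer": [0.10, 0.70, 0.90],
}

query = toy_vectors["machine learning engineer"]
close = cosine_similarity(query, toy_vectors["ML engineer"])
far = cosine_similarity(query, toy_vectors["mechanical engineer"])
print(close > far)  # True for these toy vectors
```

In a retrieval system the same comparison runs between the job description's vector and every candidate vector, with the highest-scoring candidates surfaced first.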

The practical consequences for sourcing: vocabulary fragmentation is substantially resolved because the model has learned, from pretraining on large text corpora, that "machine learning engineer" and "ML engineer" are semantically proximate. The synonym expansion problem goes away because the model handles it implicitly rather than requiring explicit rule maintenance.

Experience context is handled better but not perfectly. A candidate whose CV contains "Python" in a section labeled "Skills" — with no surrounding context — will produce a weaker embedding representation for Python expertise than a candidate whose CV describes "built and deployed Python-based data pipelines processing X billion records daily." The contextual signal in the second case is richer, and the embedding captures that. However, the model's ability to distinguish between different experience depths is bounded by the quality and density of contextual signal in the CV text.

Role title heterogeneity is substantially resolved for well-represented role families but less resolved for niche or emerging roles. The embedding model learns title-to-function relationships from pretraining corpora, but niche roles that are underrepresented in those corpora are handled less accurately. When we migrated to the embedding-based system, we saw the largest improvements for common technical roles and smaller improvements for specialized or regional role categories.

The Feedback Loop Requirement

The embedding model is only as good as its pretraining and fine-tuning. A model pretrained on general text corpora and not fine-tuned on recruiting-specific data will have weaker performance on the domain-specific vocabulary and context patterns that appear in CVs and job descriptions. The companies that are building durable sourcing products are the ones that have accumulated recruiter feedback signal — which candidates advanced, which were rejected, which accepted — and are using it to fine-tune the relevance model continuously.

This is the feedback loop I mentioned in other pieces. When we built it at the job platform, the improvement in recruiter acceptance rate from the feedback-tuned model versus the base embedding model was roughly 12–15 percentage points for the role families where we had sufficient labeled data. That gap matters commercially — it is the difference between a sourcing tool that is noticeably better than keyword search and one that is only marginally better but requires more infrastructure investment.
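As a rough sketch of what feeding recruiter feedback into fine-tuning involves, the code below turns outcome events into labeled job-candidate pairs. The `FeedbackEvent` fields, the outcome strings, and the mapping of "advanced" and "accepted" to positive labels are all assumptions made for illustration; this is not the schema or labeling scheme we used at the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One recruiter decision on one candidate (hypothetical schema)."""
    job_id: str
    candidate_id: str
    outcome: str  # assumed values: "advanced", "accepted", "rejected"


# Assumed label mapping: 1 = relevant to the job, 0 = not relevant.
OUTCOME_LABELS = {"advanced": 1, "accepted": 1, "rejected": 0}


def to_training_pairs(events: list[FeedbackEvent]) -> list[tuple[str, str, int]]:
    """Return (job_id, candidate_id, label) tuples, skipping unknown outcomes."""
    pairs = []
    for event in events:
        label = OUTCOME_LABELS.get(event.outcome)
        if label is None:
            continue  # ignore outcomes that carry no usable relevance signal
        pairs.append((event.job_id, event.candidate_id, label))
    return pairs


events = [
    FeedbackEvent("job-1", "cand-a", "advanced"),
    FeedbackEvent("job-1", "cand-b", "rejected"),
    FeedbackEvent("job-1", "cand-c", "viewed"),  # not a decision, so skipped
]
print(to_training_pairs(events))
```

Pairs like these are the raw material for fine-tuning a relevance model. A production pipeline would also have to handle conflicting events for the same pair and the fact that rejections happen for reasons unrelated to relevance.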

The challenge is cold-start: a new platform has no recruiter feedback signal and therefore no advantage from fine-tuning. The base embedding model is better than keyword search, but the gap to a feedback-tuned model is real. Platforms addressing this cold-start problem — either by using transfer learning from a larger base model that has been fine-tuned on publicly available hiring data, or by structuring the initial product experience to generate labeled signal quickly — are in a better position to close the gap faster for new customers.

Where Semantic Matching Still Fails

I am not saying that embedding-based retrieval solves sourcing. It improves specific, identifiable failure modes in the search quality problem while leaving others untouched.

Cultural fit and team dynamics are not captured in any text representation. A technically qualified candidate who has worked exclusively in flat, highly autonomous environments and is being evaluated for a role in a highly structured, process-heavy organization has a mismatch that no CV-to-JD matching algorithm will detect. This is not a failure of the matching algorithm; it is outside the information boundary of the inputs available to the algorithm.

Long-term potential is similarly opaque to text matching. Embedding similarity measures relevance based on current skills and experience as described. It does not have a model of trajectory — whether this candidate's development over the past five years suggests they will exceed the role requirements in two years, or whether they have plateaued. Trajectory inference from CV text is possible but unreliable with current approaches; the signal is too sparse and too ambiguous.

Sourcing channel distribution is a structural problem the matching algorithm cannot touch. If the sourcing channel reaches predominantly one demographic, the matching algorithm operates on a pre-filtered candidate pool that does not represent the available talent. Better matching within a biased inbound pool does not compensate for the bias introduced at the sourcing stage. Addressing this requires deliberate channel diversification, not algorithmic improvement.

The Practical Implication for Sourcing Teams

The shift from keyword to semantic matching is a real and measurable improvement in search quality for most sourcing workflows. It reduces the manual review burden, improves coverage of qualified candidates who used different vocabulary than the job description, and handles role title heterogeneity more gracefully. These are not small gains — for a sourcing team reviewing hundreds of applications per role, a 15–20% improvement in the fraction of surfaced candidates that are genuinely relevant translates into meaningful time savings and better initial shortlists.

What it does not do is replace sourcer judgment on the dimensions that are outside the information boundary of text — team fit, trajectory, cultural alignment. The tools that are most effective in practice are the ones that use semantic matching to eliminate the clearly irrelevant and surface the genuinely ambiguous for human review, rather than the ones that claim to replace the sourcer entirely. The sourcer's job changes from "reviewing everything" to "reviewing the ambiguous cases and exercising judgment on the dimensions the algorithm cannot assess." That is a better use of sourcer time and expertise, not a displacement of it.
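One way to picture that division of labor is a simple triage over similarity scores. The sketch below is an assumption about how such a step could look, not a description of any particular product; the bucket names and the `low` and `high` thresholds are hypothetical and would need calibrating against recruiter feedback.

```python
def triage(score: float, low: float = 0.35, high: float = 0.75) -> str:
    """Bucket a similarity score in [0, 1] using two hypothetical thresholds."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if score < low:
        return "filtered_out"      # clearly irrelevant, not shown to the sourcer
    if score >= high:
        return "likely_relevant"   # strong textual match, quick confirmation
    return "needs_judgment"        # ambiguous, where sourcer time is best spent


# Hypothetical scores for four candidates against one job description.
scores = {"cand-a": 0.82, "cand-b": 0.51, "cand-c": 0.12, "cand-d": 0.75}
buckets = {name: triage(score) for name, score in scores.items()}
print(buckets)
```

Even the "likely_relevant" bucket only reflects what the text says. Team fit, trajectory, and cultural alignment remain with the sourcer in every bucket.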

Vendors who honestly describe their products as doing the former, filtering out the clearly irrelevant and surfacing the ambiguous for human review, get better long-term adoption than those who overclaim. Sourcers who have spent years learning to read CVs are quick to detect when an algorithm is oversimplifying their job. Trust erodes quickly when the tool's recommended candidates consistently fail on dimensions the algorithm's promoters claimed it could handle.