Sammalkko

People Data Infrastructure: The Unsexy Layer That Determines Everything

2024-12-18 Jaakko Laine

Before you can build AI models on top of people data, you need people data that is clean, consistent, and accessible. Most enterprises don't have that. What they have is a collection of systems that were each bought to solve a specific problem — an HRIS for headcount records, an ATS for applicant tracking, an LMS for training completion, a performance module bolted to the HRIS, a payroll system with its own data model, and multiple shadow spreadsheets maintained by department heads because the official systems don't produce the reports they need. Each system has a different employee identifier. Almost none of them talk to each other in real time.

This is the infrastructure problem that sits beneath every AI-native HR product that touches enterprise customers. It is not a problem the AI layer can solve; it is a prerequisite that has to be addressed before the AI layer can function. The founders who understand this are building something different from the ones who don't.

What the Data Actually Looks Like

Jaakko spent four years building NLP infrastructure for a Nordic job matching platform. The first 18 months of that job were mostly data plumbing: cleaning, normalizing, and reconciling job postings that arrived in dozens of formats from hundreds of sources. That experience applies directly to enterprise HR, because the data quality problem there is structurally similar: heterogeneous sources, inconsistent identifiers, and no single version of the truth about any individual record.

A typical mid-size European enterprise — say, 2,000 employees across three countries — might have employee records maintained in: an SAP SuccessFactors instance as the system of record, a separate Workday deployment that was acquired through a subsidiary merger and never fully integrated, a learning management platform with its own user database that maps to employee email addresses rather than employee IDs, a compensation management spreadsheet updated quarterly by the rewards team, and two or three country-specific payroll systems that hold data on working hours, leave balances, and variable pay components.

Running any kind of analytics across this landscape requires recognizing that records held under different identifiers in different systems belong to the same person. That is a data engineering problem before it is an AI problem. The AI can't help you if the input is a mess of conflicting records.

Why This Has Not Been Solved Already

The obvious question: this problem has existed for 20 years, so why haven't the major HRIS vendors solved it? The answer is that large HRIS vendors have strong economic incentives to be the system of record and weak economic incentives to become the integration layer that connects all the other systems cleanly. When SAP or Workday adds an integration to a third-party ATS, they are helping their customer get more value from a competitor's product. The integration works well enough to avoid losing the customer but not well enough to eliminate the spreadsheet workarounds.

This creates a genuine market gap for infrastructure-first HR data companies — tools whose value proposition is not "we have a better recruiting module" but "we make your existing HR data stack queryable as a single coherent dataset." The category is sometimes called the people data layer, HR data warehouse, or workforce data platform, depending on who is selling it. The underlying pattern is the same: extract from multiple sources, resolve entity conflicts, apply a consistent schema, and expose a clean API that AI applications can call.
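To make the "apply a consistent schema" step concrete, here is a minimal Python sketch of mapping two source exports onto one canonical record. Everything in it is an assumption for illustration: the field names ("person_id", "work_email", "user_email", and so on) are hypothetical and do not reflect any vendor's actual export format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalEmployee:
    source: str               # which system the record came from
    source_id: str            # that system's own identifier
    email: Optional[str]      # trimmed and lowercased, or None when absent
    full_name: str
    country: Optional[str]


def from_hris(row: dict) -> CanonicalEmployee:
    # Hypothetical HRIS export keyed by a numeric person_id.
    email = (row.get("work_email") or "").strip().lower() or None
    return CanonicalEmployee(
        source="hris",
        source_id=str(row["person_id"]),
        email=email,
        full_name=f'{row["first"]} {row["last"]}'.strip(),
        country=row.get("country"),
    )


def from_lms(row: dict) -> CanonicalEmployee:
    # Hypothetical LMS export that identifies users by email only.
    email = row["user_email"].strip().lower()
    return CanonicalEmployee(
        source="lms",
        source_id=email,
        email=email,
        full_name=row.get("display_name", "").strip(),
        country=None,
    )
```

Keeping the source system and its original identifier on every canonical record is a deliberate choice in this sketch: it lets a later resolution step link records without losing the trail back to where each one came from.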

Findem, our most recent portfolio addition, approaches this from the people intelligence angle — the insight that the most important people data an enterprise has is not internal headcount records but the external labor market signal about who their employees are, where they came from, what skills they have that aren't in any internal system, and where they might go. That outside-in data model is a different architectural choice than inside-out HR data warehousing, and the two approaches are probably complementary rather than competing.

What "Infrastructure-First" Means in Practice

When we evaluate companies in the people data space, we look for three things that indicate genuine infrastructure thinking rather than a features-first product that happens to do some data normalization.

First, does the product have a coherent entity resolution layer? Matching employee records across systems with different identifiers is a well-understood ML problem, but it requires explicit design investment. Companies that treat it as a secondary feature tend to produce a data layer that works for simple cases but breaks on the edge cases that matter in enterprise — employees with name changes, part-time workers who appear in only some systems, contractors who are in the ATS but not the HRIS. The founders who have solved entity resolution at scale tend to talk about it as a core architectural component, not a utility function.
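A rough sketch of what the simplest version of such a layer does, in Python. This is deterministic key-based linking with union-find, not the ML matching a production system would add on top, and the field names ("email", "payroll_no") are hypothetical. It does show why a name change need not break the link: a record carrying a new email can still join its cluster through another shared key.

```python
from collections import defaultdict


def normalize_email(email):
    return email.strip().lower() if email else None


def resolve_entities(records):
    """Group records that share any strong key, using union-find.

    Each record is a dict with "source", "source_id", and optional
    "email" and "payroll_no" fields (hypothetical names).
    Returns a list of clusters, each a sorted list of (source, source_id).
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # key -> index of the first record carrying it
    for idx, rec in enumerate(records):
        keys = []
        email = normalize_email(rec.get("email"))
        if email:
            keys.append(("email", email))
        if rec.get("payroll_no"):
            keys.append(("payroll_no", str(rec["payroll_no"])))
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[find(idx)].append((rec["source"], rec["source_id"]))
    return [sorted(members) for _, members in sorted(clusters.items())]
```

A contractor who exists only in the ATS simply comes out as a cluster of one, which is the correct answer rather than an error. The hard cases the paragraph above describes begin where no shared key exists at all, and that is where probabilistic matching and human review have to take over.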

Second, is the data model designed around the analytics use cases or around the source systems? This sounds like a philosophical distinction but it has practical implications. A data model designed around source systems produces a clean copy of the existing data; it just happens to be in one place rather than five. A data model designed around analytics use cases makes different normalizations — career trajectory data has to be time-stamped and queryable as a history, not just as a current record; skills data has to be inferrable from behavioral signals as well as explicit tagging. The latter is harder to build and significantly more useful.
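One way to picture the difference is an effective-dated history that can answer "what was true on this date?" rather than only "what is true now?". The Python sketch below is a minimal illustration under our own assumptions; the class and field names (RoleEvent, CareerHistory, manager_id) are hypothetical and not drawn from any particular product.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RoleEvent:
    effective_from: date
    job_title: str
    manager_id: str


class CareerHistory:
    """Append-only, effective-dated history for one person."""

    def __init__(self):
        self._events = []  # kept sorted by effective_from

    def record(self, event: RoleEvent) -> None:
        # Events may arrive out of order from different source systems.
        self._events.append(event)
        self._events.sort(key=lambda e: e.effective_from)

    def as_of(self, day: date):
        """Return the event in force on `day`, or None if none had started."""
        starts = [e.effective_from for e in self._events]
        pos = bisect_right(starts, day)
        return self._events[pos - 1] if pos else None

    def manager_changes(self) -> int:
        """Count transitions where the manager differs from the previous event."""
        return sum(
            1
            for prev, cur in zip(self._events, self._events[1:])
            if prev.manager_id != cur.manager_id
        )
```

A source-system copy would hold only the latest row per person. Here nothing is overwritten, so a signal such as the number of manager changes can be computed for any past date, which is what a longitudinal model needs for training without leaking future information.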

Third, what is the GDPR and data residency strategy? This is not a secondary consideration in European enterprise sales. An HR data infrastructure product that consolidates employee records from multiple systems is creating a new personal data processing activity under GDPR, which requires a lawful basis, a data processing agreement, and clear records of retention and deletion. Companies that have thought through this carefully are much easier to sell into European enterprises than ones that treat compliance as an afterthought. We have seen promising products lose Nordic enterprise deals at the legal review stage because the data processing agreement didn't cover the consolidation activity clearly.

The Talent Intelligence Connection

The reason this infrastructure layer matters for AI HR tech specifically — not just for enterprise data management in general — is that the AI applications that create the most value in HR are longitudinal and relational rather than point-in-time. Skills forecasting needs to connect training history, performance signals, career progression, and external labor market data across time. Attrition prediction needs a time series of engagement signals, compensation history, internal mobility activity, and manager change events. None of those applications can function on static snapshot data pulled from a single system.

When we look at portfolio companies that have strong AI layer products — Eightfold's talent intelligence, Retrain.ai's skills forecasting — the ones that are making the most progress in enterprise are the ones where the customer's data infrastructure was good enough to actually train and validate the models on meaningful data. The companies where the deployment stalled are almost uniformly the ones where the first three months of the engagement became a data cleaning project rather than an AI deployment.

This is why we pay close attention to the infrastructure layer even when we are underwriting the AI application layer. The application product thesis can be right and the deployment can still fail if the infrastructure assumption underneath it — that enterprise customers have clean, accessible people data — is wrong. Sometimes the right investment is in the product that fixes the infrastructure. Sometimes it is in the AI application that is also good enough at data normalization to move forward despite the mess. Usually it is some combination of both, and knowing which combination is right for a specific company's stage and market is where the judgment has to be applied.