What is an ID Graph and Why You Want One

By Max Werner

On Mar 11, 2024

CDP

Analytics

Business Intelligence

scrabble tiles spelling "who are you" by Brett Jordan from Unsplash

Identity Resolution and ID Graphs

The basic principle behind ID Resolution is to create a graph connecting all different kinds of identifiers for a given person. Those identifiers usually cover things like internal user IDs, email addresses, phone numbers, anonymous cookie IDs, 3rd party SaaS tools' internal IDs and more. Anytime there is a group of two or more identifiers in a given data point, your graph can extend. For the purpose of this article we will follow the journey of John Smith and his interactions with our fictional SaaS company.

John sees a Google Ad, clicks on it and lands on our website for the first time. This generates a new anonymous cookie ID (AID for short) from your analytics or CDP SDK of choice. Your regular analytics report would show this user as something like this:

Website Visitor Diagram

Now John decides to sign up for a newsletter. We manage this in Hubspot, so a Hubspot contact is created. We now have two separate identities, a website visitor and a Hubspot contact. Both share the same email address, so we can connect them and start our ID Graph:

Website Visitor Hubspot Diagram

The vid attribute here is Hubspot's internal record ID for the contact. Different tools call them different things, but generally speaking, each tool has its own internal ID for a given record. The form submission is sent to an email marketing tool where a record is created with its own internal ID.
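To make the "two or more identifiers in a data point" rule concrete, here is a minimal Python sketch of how John's first data points could turn into graph edges. The function name, the email address john@example.com and the vid value hs-1001 are placeholders I made up for illustration, not values from a real system.

```python
from itertools import combinations


def edges_from_data_point(data_point):
    """Return one edge for every pair of identifiers found in a data point."""
    identifiers = sorted(
        (id_type, value) for id_type, value in data_point.items() if value
    )
    return list(combinations(identifiers, 2))


# John's very first pageview: only one identifier, so the graph cannot extend yet.
first_pageview = {"aid": "abc-123-def-456"}
# The newsletter form submission: the cookie ID and the email arrive together.
form_submission = {"aid": "abc-123-def-456", "email": "john@example.com"}
# The Hubspot contact created from that form (vid value is a placeholder).
hubspot_contact = {"email": "john@example.com", "vid": "hs-1001"}

graph_edges = []
for data_point in (first_pageview, form_submission, hubspot_contact):
    graph_edges.extend(edges_from_data_point(data_point))

for left, right in graph_edges:
    print(left, "<->", right)
```

The lone pageview contributes nothing, while the form submission and the Hubspot contact each add one edge. The email address is the node that ties the two together.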

Now we send John an email intended to get him to start a trial. John decides to click on the link in the email, but does so on a device different from the one he submitted the newsletter form with. So John's new device gets a new AID when he lands on the website. Since it's a new device, John would at this point show up twice in most analytics tools: Visitor A from the Paid Search campaign, and Visitor B from the email campaign. That's of course not actually true, and reports, targets, KPIs, etc. will not reflect what actually happened.

Website Visitors Diagram

John ends up starting a trial with the same email address and gets a User ID (uid for short) from our application. Because the email address matches, we can connect the second website visitor to the Hubspot contact and to the first visitor like this:

Website Visitor Signup Diagram

Whether or not the two visitors are reported as one depends on your analytics tool of choice, but a data warehouse has no problems calculating a single entity like this:

Finished ID Graph
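For readers who want to see how a single entity could be calculated, below is a rough Python sketch using a union-find (disjoint set) structure. This is one common way to compute connected components, not necessarily how your warehouse or CDP does it. The email, vid and uid values and the extra anonymous visitor are invented for the example.

```python
from collections import defaultdict


def resolve_identities(data_points):
    """Group identifiers into entities: identifiers that appear together in a
    data point, directly or through a chain, end up in the same set."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    for data_point in data_points:
        identifiers = list(data_point.items())
        if not identifiers:
            continue
        first_root = find(identifiers[0])  # also registers lone identifiers
        for other in identifiers[1:]:
            other_root = find(other)
            if other_root != first_root:
                parent[other_root] = first_root

    entities = defaultdict(set)
    for node in list(parent):
        entities[find(node)].add(node)
    return list(entities.values())


journey = [
    {"aid": "abc-123-def-456"},                               # visit from the Google Ad
    {"aid": "abc-123-def-456", "email": "john@example.com"},  # newsletter form
    {"email": "john@example.com", "vid": "hs-1001"},          # Hubspot contact
    {"aid": "ghi-789-jkl-0123"},                              # email click, second device
    {"aid": "ghi-789-jkl-0123", "email": "john@example.com", "uid": "u-42"},  # trial signup
    {"aid": "zzz-000-zzz-000"},                               # unrelated anonymous visitor
]

entities = resolve_identities(journey)
for entity in entities:
    print(sorted(entity))
```

John's two cookie IDs, his email, the Hubspot vid and the app uid collapse into one entity, while the unrelated visitor stays on their own.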

Why an ID Graph is Important

So why should you care about this? Well, without the ID Graph, your reporting would likely state:

You had two new visitors today (AID abc-123-def-456 and ghi-789-jkl-0123) from two channels: Paid Search and Email
The signup is attributed to an email campaign since that email clickthrough was the first time this AID was seen.

Cross-device ID resolution is a big issue when relying only on cookies as each person can have multiple different cookies on different devices and browsers. This can lead to all kinds of false conclusions and inflated numbers.

Improving the accuracy of basic reporting is but one of the benefits identity resolution can bring to your organization. With the ID Graph in place, you can now properly aggregate data from different tools, data that was previously isolated, such as:

Email open & click through events from Hubspot
Direct communication from sales reps with John from Hubspot
Pageview data from the website tracking
App related data points from the signup like trial start & end dates, onboarding questionnaire answers, features used / not used, etc.

Thanks to the ID Graph, you can build customer 360 table columns – all in one place, all as accurate as it can be. John's customer 360 table might look like this (with the data source in parentheses):

Name: John Smith (newsletter form submission)
Email: [email protected] (newsletter form submission, app signup)
First Seen: 2024-03-10 (website tracking)
Total Sessions: 2 (website tracking)
Total Sessions Before Signup: 2 (website tracking)
Channels Interacted With: Paid Search, Email (website tracking)
Trial Start Date: 2024-03-11 (app)
Trial End Date: 2024-03-25 (app)
Sales Rep Assigned: Jane Doe (Hubspot)
Last Email Opened: Start a trial now (Hubspot)
Last Email Opened On: 2024-03-11
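As a rough illustration of how a few of those columns could be assembled, here is a small Python sketch that pulls rows from three isolated sources using the identifiers the ID Graph resolved to John. The table shapes, field names, the vid and uid values and the extra "Organic" visitor are all assumptions for the example. In practice this would more likely be a set of SQL joins in your warehouse.

```python
# Identifiers the ID Graph has resolved to John (vid and uid are placeholders).
john_identifiers = {
    ("aid", "abc-123-def-456"),
    ("aid", "ghi-789-jkl-0123"),
    ("vid", "hs-1001"),
    ("uid", "u-42"),
}

# Rows as they might arrive from each isolated source.
sessions = [
    {"aid": "abc-123-def-456", "date": "2024-03-10", "channel": "Paid Search"},
    {"aid": "ghi-789-jkl-0123", "date": "2024-03-11", "channel": "Email"},
    {"aid": "zzz-000-zzz-000", "date": "2024-03-11", "channel": "Organic"},
]
hubspot_contacts = [{"vid": "hs-1001", "sales_rep": "Jane Doe"}]
app_users = [{"uid": "u-42", "trial_start": "2024-03-11", "trial_end": "2024-03-25"}]


def rows_for(rows, id_type, identifiers):
    """Keep only the rows whose identifier belongs to the resolved entity."""
    return [row for row in rows if (id_type, row[id_type]) in identifiers]


def build_customer_360(identifiers):
    own_sessions = sorted(
        rows_for(sessions, "aid", identifiers), key=lambda s: s["date"]
    )
    contact = next(iter(rows_for(hubspot_contacts, "vid", identifiers)), {})
    user = next(iter(rows_for(app_users, "uid", identifiers)), {})
    return {
        "first_seen": own_sessions[0]["date"] if own_sessions else None,
        "total_sessions": len(own_sessions),
        # dict.fromkeys de-duplicates while keeping first-seen order
        "channels": list(dict.fromkeys(s["channel"] for s in own_sessions)),
        "trial_start": user.get("trial_start"),
        "trial_end": user.get("trial_end"),
        "sales_rep": contact.get("sales_rep"),
    }


customer_360 = build_customer_360(john_identifiers)
for column, value in customer_360.items():
    print(f"{column}: {value}")
```

Without the graph, the two sessions would belong to two different "people" and neither could be joined to the Hubspot or app data.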

Beyond a customer 360 table, you may be interested in multi-touch attribution on your conversion KPIs. In the traditional "last-touch" model, the email campaign gets full credit for John's trial start. The person running Paid Search might take exception to that, since without the ad, John wouldn't have been on the website in the first place to fill out the newsletter form to get that email.

With the ID Graph, you can build reporting models that include or exclude as many sessions and touch points as you want, as well as how much credit you want to give each touch point. You can split the conversion 50/50 between Paid Search and Email, give Paid Search more weight for being the first session, include sales rep interactions or not - the choice is yours.
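To show how little machinery this needs once the touch points belong to one person, here is a minimal Python sketch of three credit-splitting rules. The model names and the 60% first-touch weight are arbitrary choices for illustration, not a recommendation.

```python
def attribute(touchpoints, model="linear", first_touch_weight=0.6):
    """Split one conversion across an ordered list of channel touch points."""
    if not touchpoints:
        return {}
    count = len(touchpoints)
    if model == "last_touch":
        weights = [0.0] * (count - 1) + [1.0]
    elif model == "linear":
        weights = [1.0 / count] * count
    elif model == "first_heavy":
        if count == 1:
            weights = [1.0]
        else:
            remainder = (1.0 - first_touch_weight) / (count - 1)
            weights = [first_touch_weight] + [remainder] * (count - 1)
    else:
        raise ValueError(f"unknown model: {model}")

    credit = {}
    for channel, weight in zip(touchpoints, weights):
        credit[channel] = credit.get(channel, 0.0) + weight
    return credit


# John's sessions in order, as resolved by the ID Graph.
john_touchpoints = ["Paid Search", "Email"]

for model in ("last_touch", "linear", "first_heavy"):
    print(model, attribute(john_touchpoints, model))
```

Last-touch hands the whole trial start to Email, linear gives the 50/50 split, and the first-heavy variant shifts more of the credit to Paid Search.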

The goal of multi-touch attribution is to give the business the ability to gauge the effectiveness of different initiatives according to what makes sense for that business. A paid search colleague of mine has an incredibly hard time showing value for their business-to-business SaaS client due to a last-touch only reporting model. Turns out not a lot of people see an ad and decide to spend tens of thousands of dollars right away - what a crazy idea! A multi-touch model drastically changes how many conversions my colleague is responsible for.

I only scratched the surface on customer 360 tables and multi-touch attribution here (more to come, stay tuned!) but I hope this gives you an idea of what an ID Graph is and why you should want one.

Considerations and Limitations

When creating the rules of how identities get grouped together, there are a few important things to consider. They mostly boil down to the following two topics:

Preventing false merges
Choosing a deterministic vs. a probabilistic approach

Preventing False Merges

The approach described above, at its most basic, will simply take any matching values and make a graph. This can produce problems when identities are merged that shouldn't be. For example, consider shared or public computers. If cookies aren't cleared between the people using them, you can have the same, pre-existing anonymous ID being used by lots of people, so you may end up with one ID Graph that has 10 signups with 10 different email addresses, all connected via the same anonymous ID. Another example would be when bad user-entered data is included. Should you allow phone numbers like 000-000-0000 or 555-555-5555 as an identifier in your graph? What about the email address [email protected]?

It is important to analyze your ID graph and remove those bad data points so that they don't collapse multiple sets of identities that should remain separate.
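One possible way to guard against both cases is sketched below in Python: a blocklist for known junk values plus a cap on how many different emails a single anonymous ID may connect to. The cap of 3 and the "library-pc" cookie ID are made-up values. The right rules depend on what your own data looks like.

```python
from collections import defaultdict

# Known junk values that should never act as identifiers.
BLOCKED_IDENTIFIERS = {
    ("phone", "000-000-0000"),
    ("phone", "555-555-5555"),
}
# Hypothetical cap: an AID tied to more emails than this is treated as shared.
MAX_EMAILS_PER_AID = 3


def clean_edges(edges):
    """Drop edges that touch a blocked identifier or a suspiciously shared AID."""
    emails_per_aid = defaultdict(set)
    for left, right in edges:
        for node, neighbour in ((left, right), (right, left)):
            if node[0] == "aid" and neighbour[0] == "email":
                emails_per_aid[node].add(neighbour[1])

    shared_aids = {
        aid for aid, emails in emails_per_aid.items()
        if len(emails) > MAX_EMAILS_PER_AID
    }
    bad_nodes = BLOCKED_IDENTIFIERS | shared_aids
    return [
        (left, right) for left, right in edges
        if left not in bad_nodes and right not in bad_nodes
    ]


# Ten signups on one public computer, plus John and a junk phone number.
edges = [
    (("aid", "library-pc"), ("email", f"user{i}@example.com")) for i in range(10)
]
edges.append((("aid", "abc-123-def-456"), ("email", "john@example.com")))
edges.append((("email", "john@example.com"), ("phone", "000-000-0000")))

cleaned = clean_edges(edges)
print(cleaned)
```

Only the edge between John's cookie ID and his email survives. The ten library signups remain ten separate people instead of collapsing into one.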

Deterministic vs Probabilistic Merging

So far we've only talked about deterministic merging, since we only allow connections that we know for certain identify a specific user or device. While this may result in bad merges due to bad data in those specific identifiers, it is not guessing at what identity may be the same as another. That is the probabilistic approach. It usually means that you purposefully include "identifiers" that aren't necessarily unique. Those identifiers include IP Addresses, civic addresses, browser fingerprinting information such as user-agent strings, screen size, etc. Using those data points as identifiers does not guarantee uniqueness.
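The difference can be sketched in a few lines of Python. The point values, the threshold and the two example records are invented for illustration. Real probabilistic matching is usually far more involved than adding up points.

```python
# Identifiers we trust to point at exactly one user or device.
DETERMINISTIC_KEYS = ("uid", "email", "aid")
# Made-up point values for signals that are NOT guaranteed to be unique.
PROBABILISTIC_POINTS = {"ip": 40, "user_agent": 30, "screen_size": 10}
MATCH_THRESHOLD = 70  # hypothetical cut-off for "probably the same person"


def shares(a, b, key):
    """True when both records carry the same non-empty value for key."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def deterministic_match(a, b):
    return any(shares(a, b, key) for key in DETERMINISTIC_KEYS)


def probabilistic_score(a, b):
    return sum(
        points for key, points in PROBABILISTIC_POINTS.items() if shares(a, b, key)
    )


# Two different people on the same office network with identical laptops.
john = {
    "email": "john@example.com",
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen_size": "1920x1080",
}
coworker = {
    "email": "coworker@example.com",
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen_size": "1920x1080",
}

score = probabilistic_score(john, coworker)
print("deterministic match:", deterministic_match(john, coworker))
print("probabilistic score:", score, "merge:", score >= MATCH_THRESHOLD)
```

The deterministic check keeps the two people apart because they share no trusted identifier. The probabilistic score clears the threshold and would merge them, which is exactly the kind of false positive to watch out for.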

After all, there is a reason that you sometimes get those annoying "this looks like a new device, please check your SMS / email for a verification code" prompts. Those systems see a probabilistic identifier that is different from what it was before (usually IP address) and they ask you to confirm your identity instead of just locking you out as a potential intruder. I generally advise clients not to use probabilistic identifiers unless they have a very good reason for it. The number of false positives tends to outweigh the benefits by a wide margin.

Conclusion

I hope this article sheds some light on why ID Graphs are a very important tool in more fully understanding a customer journey and making data-driven decisions. If you're interested in more information or on how to implement something like this in your organization, please feel free to reach out!

Last but not least, I'll leave you with this little bit of food for thought: User ID Graphs are great, but what about a second ID Graph that merges up to the account / group level? Especially in the SaaS space, you might want to know if a given account has used a new feature, not necessarily which user on that account.

Stay tuned for future articles on some cool things you can do with ID Graphs!