How to Ingest Email CSVs Into a CDP Like RudderStack

By Max Werner

On Mar 14, 2022

Featured Image with rudderstack logo

Overview & TL;DR

First of all, big shoutout to MacMillan Search's Article on MOZ! It is amazing and the inspiration for this article.

  1. We’ll be using a email trigger which gives you an email address you can run code bits off when it receives an email.
  2. We’ll set our email client of choice (gmail in my case) up to automatically forward emails you get your reports from to this pipedream email
  3. We’ll use just one easy code step to get the data and send it to our CDP of choice

Why Pipedream? What alternatives are there?

Working with emails is a bit trickier than you might think. There aren’t easy ways to save attachments automatically, but pipedream lets you run code in a variety of languages (node14, python3.8, golang, bash at the time of writing).

There are a few things you need to do that pipedream handles for you:

  1. Run code over an email body
  2. Lets you use any dependency / package without hassle
  3. Runs as a trigger rather than a schedule

That last one is the tricky one. There are commercial solutions to work with email and code but for the scope of this use-case the cost tends to outweigh the benefit. Lots of them also only work on schedules, not triggers which pipedream supports very easily and cost effectively, even at a higher scale.

With all that in mind here are a few theoretical alternatives. If you would like help having them implemented, feel free to reach out to us ;)

Apache Airflow

  1. Set up your email system to flag emails from your report sender in some way (e.g. gmail tags)
  2. Use an Airflow DAG to connect to your email account and find emails with this tag (e.g. report_to_be_processed
  3. Do your CDP ingestion
  4. Untag the email and re-tag it with something else to maintain idempotency (e.g. reports_processed)

Cloud Functions & Data Lakes

Similar to Airflow, use a cloud function to do the same logging in, processing, and re-tagging. Either send the data to a CDP or directly to a data lake from a Cloud Provider of choice (S3 DataBricks, etc.)

Pipedream Email Triggers

Log into your pipedream account and create a new workflow with an email trigger: Trigger Overview Screen

This will generate a unique email address ending in for you. Note it down and setup your email client.

Gmail Email Client Setup

Note: This can all be skipped if you simply send the reports directly to the pipedream email. However having it in your inbox is a nice “backup” to have :)

Go to your email settings and Add a forwarding address and enter the pipedream email you got in the step above. It will ask you to confirm the email which is easy since your pipedream tab will record the confirmation email sent to it. Simply open the email text and copy the confirmation link from it.

Gmail Email Forwarding Screen

Not to worry, even though it will look like Gmail will forward all emails to this new address, it will not. It simply makes the address available as a forwarding address.

Next, go to your main Gmail inbox and use the search bar and create a filter to match the reports you are getting. I am using the same MOZ reports MacMillan Search is, so my filter looks like this:

Gmail Search filter screen

Gmail Forwarding Filter

Hit the Create filter button and simply check the Forward it to box and select your pipedream forwarding address.

Now we need pipedream to actually do something with these emails.

Pipedream Code Setup

This section will vary a bit, depending on how the email report you get looks like. MOZ’s reports are great for the tutorial as they are about the most complicated ones I can think of.

MOZ’s reports are not CSVs attached to an email. They are behind a download link. But that link gives you a ZIP file, not even the CSV itself.

So take a look at the code over here on GitHub where all my code samples will live. It includes the NodeJS code for now.

The code does the following:

  1. Parse out the download link from the email body
  2. Download the zip file to /tmp where pipedream lets you write small files during the execution of your function
  3. Unzip the zip file
  4. Loops through all CSV files in /tmp and corrects the encoding (moz reports aren’t UTF-8 for some reason)
  5. Normalizes the header to be lowercase and replaces spaces with underscores (e.g. Start Date -> start_date)
  6. Sends the data to the CDP. I’m using RudderStack here.

CDP Setup & Conclusion

Well, once this event stream hits your CDP, connect it to whatever you want, like a data lake or warehouse destination for example. That’s really it. From now on every report that matches your filter gets sent to pipedream and from there to your CDP. Thanks to headers in most CSV files, we get an automatic object building as event properties and can even re-use the same code for multiple different reports!

Some final thoughts to make this production ready:

  • Security and validation: Right now the code doesn’t check for what emails you would want to process. I’d recommend adding logic at the beginning to check if the email’s subject and sender match where you are expecting data from!
  • Multiple kinds of reports in one pipeline: You can of course re-use the one pipedream workflow for multiple kinds of reports. Simply adjust your logic to dynamically name events based on the report file and of course adjust downloading logic based on the email (i.e. don’t try to unzip a file that doesn’t exist)
  • Python: Pipedream is in the process of adding python support. It is in alpha right now but I will add the code soon nonetheless :)

I hope this helps and if you have any questions, feel free to reach out. Building data pipelines for our clients is what we do :)

© 2024 Obsessive Analytics Consulting. All rights reserved.