Three mistakes data practitioners make when thinking about privacy
by Josh Schwartz

Before starting Phaselab, I spent 11 years working in data science and analytics. More than perhaps any other group, data practitioners have enormous effects on their users’ privacy: they decide what data will be stored, where it will be housed, how it’ll be transformed and incorporated into models, and whether and when it will be deleted. In my experience data practitioners care deeply about respecting users’ privacy, but they’re also ill-equipped to make decisions about privacy (I have yet to meet a data scientist with a CIPT certification…but if you’re out there, I’d love to meet!).

If this sounds like you (and it described me a few years ago!), I’ll gently encourage you to stop and pay attention to privacy laws. Since GDPR came into effect in 2018, more than a dozen U.S. states and numerous countries have passed their own privacy laws. These laws govern how data about those regions’ residents must be handled, and they apply to companies regardless of where they’re based. That means your data practices may be subject to multiple regulations at once, and potentially dozens of regulatory bodies are empowered to investigate you for privacy violations. These regulators are officially not messing around.

And as your organization asks to do more with its data via AI, the risks only multiply.

While I can’t possibly cover all of the issues related to privacy for data practitioners in one blog post, I wanted to highlight a few issues that consistently come up in my conversations with data leaders.

Anonymization isn’t what you think it is

I’ve talked to countless data practitioners who say that their process for anonymizing data (for example, when a user requests deletion under GDPR) is to obfuscate the data by hashing it. For example, a company might have a customer account table structured like:

email, product_tier, created_at, …
“josh@phaselab.co”, “Free”, “2023-01-01 00:00:00”

Now let’s say that someone submits a request to that company to delete their personal data. We might want to retain a record that someone created an account on that date, so we might be tempted to turn that row into something like this:

“4ebfe0a26086d677d7b9412dfeb44cfd”, “Free”, “2023-01-01 00:00:00”

where 4ebf… is the md5 hash of the original email address. This seems great on its face: we’ve preserved much of the usable data (even aggregate queries like a COUNT DISTINCT will still work), and we’ve removed the user’s email address so that nobody looking at the table could tell that this record belonged to this user.

There’s a problem, though: from the perspective of privacy, we’ve done effectively nothing. If someone (our marketing department, law enforcement, a hacker) came to us and wanted to know if we had an account matching the email address josh@phaselab.co, it’d be trivial for us to answer that question: we’d just take the email address, apply the same hashing algorithm, do a search for the hash, and easily retrieve the record.
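
To make that concrete, here’s a minimal sketch in Python. The table layout follows the illustrative example above; the helper names are mine, not a real system’s:

    import hashlib

    def pseudonymize(email: str) -> str:
        # The "anonymization" from the example above: replace the
        # email with its md5 hash.
        return hashlib.md5(email.encode("utf-8")).hexdigest()

    # The account table after the "deletion" request.
    accounts = [(pseudonymize("josh@phaselab.co"), "Free", "2023-01-01 00:00:00")]

    def find_account(email: str) -> list:
        # Re-identification is trivial for anyone who knows the scheme:
        # hash the email in question and search for the digest.
        return [row for row in accounts if row[0] == pseudonymize(email)]

    print(find_account("josh@phaselab.co"))  # the "deleted" record comes right back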

Legally, the hashed email is what’s referred to as pseudonymized data: the personal data has been replaced with a pseudonym. Pseudonymization is a great technique for helping to secure data, but from the perspective of privacy laws pseudonymized data is still considered personal data, so merely pseudonymizing data often isn’t sufficient to say that you’ve deleted a user’s personal data. The definition of what is and isn’t considered truly anonymous data varies somewhat by jurisdiction; a full definition is beyond the scope of this article, but for a primer, check out Katharina Koerner’s talk from last year’s PEPR conference.

Depending on your needs, there are a variety of ways to truly anonymize data in a context like this — you could replace the email with a random string or a NULL value, for example, or you may be able to leverage a privacy vault in designing your data systems.
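
As a rough sketch of those first two options (column names follow the table above; a privacy vault is a separate architecture not shown here):

    import secrets

    def anonymize(row: dict) -> dict:
        # Sever the link to the person: replace the email with a random
        # token derived from nothing about the user. (Setting the field
        # to None/NULL also works if no stand-in key is needed.)
        return {**row, "email": f"deleted-{secrets.token_hex(8)}"}

    row = {"email": "josh@phaselab.co", "product_tier": "Free",
           "created_at": "2023-01-01 00:00:00"}
    print(anonymize(row))

Unlike a hash, a random token can’t be recomputed from the email address, so the lookup attack above no longer works, while aggregate queries like COUNT DISTINCT keep functioning.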

Personal data isn’t just PII

Similarly, I’ve heard many technical folks use the term “PII” (Personally Identifiable Information) when referring to what’s in scope for data protection laws. In reality, laws like the GDPR cover the significantly broader category of “personal data” (called “personal information” in many U.S. laws).

While these terms sound almost identical, the definition of what counts as personal information really matters when it comes to privacy laws. For example, online identifiers like IP addresses and cookie-based identifiers do qualify as personal information under GDPR (and under laws like Washington’s My Health My Data and California’s CPRA), even if they aren’t considered PII. I know many technical folks will be screaming out “but wait, IP addresses aren’t real personal identifiers; there are only 2^32 possible IPv4 addresses and almost 2^33 people!” And while you’re right about the arithmetic, regulators don’t agree with your conclusion: in a number of cases, companies have gotten in trouble specifically for their handling of IP addresses.

The upshot is that it’s very important that you work with privacy experts to understand the totality of the data you hold that counts as personal data, and to map your privacy policies across all of it.

Deletion is harder than it looks

Some of the most fundamental rights given to consumers under GDPR and other laws are the right of access (getting a copy of the data a company holds about them) and the right of deletion (asking a company to delete that data). While these rights are simple to articulate, they’re much harder to put into practice. A few examples:

  • Do you store some personal information in flat files in S3? How will you delete a single user’s data from a file upon request? (See the sketch after this list for one approach.)
  • Do you have a large scale data warehouse? How will you execute deletes against your largest tables in a performant way?
  • Do you have a complex graph of relationships between tables, for example a Users table whose id is a foreign key on a number of child tables? If a user is deleted from the base table, how will you ensure that all child tables are appropriately updated?
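
For the first question, here’s a minimal sketch of one approach, assuming a CSV layout like the account table above (the bucket name, key, and boto3 usage are illustrative, not a prescribed setup):

    import csv
    import io

    import boto3  # assumed dependency for this sketch

    def delete_user_from_csv(bucket: str, key: str, email: str) -> None:
        # Object stores don't support in-place row deletes: you have to
        # read the whole file, drop the user's rows, and write it back.
        s3 = boto3.client("s3")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))
        if not rows:
            return
        kept = [r for r in rows if r["email"] != email]
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)
        s3.put_object(Bucket=bucket, Key=key, Body=out.getvalue().encode("utf-8"))

    delete_user_from_csv("analytics-exports", "accounts.csv", "josh@phaselab.co")

Multiply that read-filter-rewrite cycle across every file a user might appear in, and the operational cost becomes clear.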

And those questions are just about your more structured data: how are you handling unstructured data like chat logs from your support team, where personal data might be anywhere?

Finally, I’ll note that building a deletion pipeline is only half the battle. Getting stakeholders to integrate with that pipeline, and to maintain those integrations over time, is consistently one of the top problems we’ve heard about from privacy officers.

In short, if you aren’t designing your systems with the idea of user-level deletion in mind, you might find that deleting isn’t as simple as issuing a couple of delete statements when requests start coming in.
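
As one small illustration of designing for deletion up front, here’s a sketch using SQLite with a hypothetical schema: declaring foreign keys with ON DELETE CASCADE means a single delete against the users table cleans up the child tables too.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("""
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(id) ON DELETE CASCADE
        )""")
    conn.execute("INSERT INTO users VALUES (1, 'josh@phaselab.co')")
    conn.execute("INSERT INTO orders VALUES (100, 1)")

    # One statement removes the user and every dependent child row.
    conn.execute("DELETE FROM users WHERE id = 1")
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (0,)

Large analytical warehouses often don’t enforce foreign keys at all, so the equivalent there is usually a scheduled job that walks the relationship graph; either way, the graph has to be known and maintained.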

So, what’s next?

I hope that this post has given you a lot to think about. If you’re interested in learning more, I’d recommend Nishant Bhajaria’s excellent book Data Privacy: A Runbook for Engineers for a thorough rundown of the basics, Debra Farber’s Shifting Privacy Left podcast for detailed dives into the cutting edge, and the PEPR conference to learn about the latest research. And if you’re interested in improving your privacy practices, I’d love to chat about what we’re building at Phaselab and see how we can help. My email is josh@phaselab.co; drop me a line.
