On DTNS we try to balance the idea that companies definitely need to improve data protection with the idea that sharing data isn't inherently bad; in fact, when done right, it can be a very good thing.
Not just companies: academic researchers and nonprofits benefit from being able to do research on shared datasets too. However, simply sharing data, even with names stripped off, can lead to trouble. As far back as 2000, researchers were showing that the right analysis of raw data sets could deduce who people were even when the data was anonymized. That year, Latanya Sweeney showed that 87% of people in the US could be identified from just their ZIP code, birthdate, and sex.
One attempt to make shared data both useful and private is called differential privacy. Apple mentioned its use of differential privacy in its 2016 WWDC keynote.
What is differential privacy?
An algorithm is differentially private if you can't tell whether any one person's data is in the dataset by looking at the output.
Here’s a simple example. Let’s say you want to publish the aggregate sales data of businesses by category. Stores want to keep sales data private. So you agree that only the total sales for a category will be published. That way you can’t tell how much came from which businesses. Which is great until you come to the category of Shark repellent sales. There’s only one shark repellent business in your region. If you publish that category you won’t be saying the name of the business but it will be easy to tell who it is.
So, you have an algorithm that looks for categories where that’s a problem and maybe it deletes them or maybe it folds them into another category.
This can get trickier if, say, there's also a total sales number for the region and only one category was deleted. You just add up all the published categories, subtract that sum from the published total, and the difference is the missing business's sales.
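Here's a minimal sketch of both ideas in Python, with made-up business names and numbers: the suppression rule described above, and the subtraction trick that defeats it when a regional total is also published.

```python
# Made-up sales data: each category lists its businesses and their sales.
category_sales = {
    "groceries": [("MegaMart", 350_000), ("CornerShop", 250_000)],
    "hardware": [("NailCity", 150_000), ("ToolTown", 250_000)],
    "shark repellent": [("SharkAway", 20_000)],   # only one business
}

# Suppression rule: only publish categories backed by at least 2 businesses.
published = {cat: sum(amt for _, amt in rows)
             for cat, rows in category_sales.items() if len(rows) >= 2}
regional_total = sum(sum(amt for _, amt in rows)
                     for rows in category_sales.values())

# The subtraction attack: if the regional total is also published,
# the suppressed category falls right out of the arithmetic.
leaked = regional_total - sum(published.values())
print(f"SharkAway's sales must be about ${leaked:,}")   # -> $20,000
```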
And remember there’s other data out there to use. Some attacks on data use data from elsewhere to deduce identities. Let’s say you study how people walk through a park and you discover that of 100 people observed 40 walk on the path and 60 cut through the grass. Seems private enough right. There’s no leakage of data in the published results.
But an adversary discovers the names of the people who participated in the study. And they want to find out of Bob walks on the grass so they can embarrass him. They also found out that of the 99 people in the study who weren’t Bob, 40 walked the path and 59 walked on the grass. BINGO! Bob is a grass walker. Now I know it sounds unrealistic that the adversary got that much info without just getting all of it. But differential privacy would protect Bob’s identity even if the adversary had all that info.
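Here's that same differencing attack as a quick Python sketch, using the park study's made-up numbers:

```python
# Minimal sketch of the differencing attack on Bob.

published = {"path": 40, "grass": 60}      # counts for all 100 participants
without_bob = {"path": 40, "grass": 59}    # adversary's side knowledge: the other 99

# Whichever count drops by one when Bob is excluded is Bob's behavior.
for choice in published:
    if published[choice] - without_bob[choice] == 1:
        print(f"Bob must be a {choice} walker")   # -> Bob must be a grass walker
```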
So what do we do? How do we do this differential privacy thing?
In 2003 Kobbi Nissim and Irit Dinur demonstrated that, mathematically speaking, you can't answer too many queries of a database too accurately without revealing some amount of private info. Thus the Fundamental Law of Information Recovery, which says that privacy cannot be protected without injecting noise. In 2006 Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published a paper formalizing how much noise needs to be added and how to add it. That work gave us the term differential privacy.
A little bit on what that means.
The ultimate protection of privacy is to not have your data in the dataset at all. Differential privacy gives each individual roughly the same protection as having their data removed. In more formal terms, the results of statistical queries should not depend noticeably on the data of any one individual.
So you introduce noise. The fewer data points you have, the more the noise matters relative to the real signal. The noise is random and averages out to zero, so it doesn't throw off the accuracy of the aggregate data much. But it makes it hard to tell which data points are real, and therefore hard to figure out who is who.
How does that work?
One example that may help: you ask somebody, "Have you ever stolen a car?" Then you have them flip a coin. If it's heads, they flip the coin again, ignore that second flip, and just answer honestly whether they stole a car or not. But if the first flip is tails, they go by the second flip and answer yes if it's heads and no if it's tails. This way 50% of the answers are honest and 50% are evenly random, and the random half can be subtracted back out so it doesn't throw off the end percentage.
People can afford to answer honestly because a "yes" is no longer incriminating: it might be the truth, or it might just be what the coins made them say.
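This coin-flip scheme is known as randomized response, and it's easy to simulate. Here's a rough Python sketch; the population size and the 10% "true" theft rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth, rng):
    """One survey answer using the coin-flip scheme described above."""
    if rng.random() < 0.5:          # first flip: heads -> answer honestly
        return truth
    return rng.random() < 0.5       # tails -> second flip decides the answer

# Simulate 10,000 respondents where the true rate of car theft is 10%.
true_answers = rng.random(10_000) < 0.10
reported = np.array([randomized_response(t, rng) for t in true_answers])

# Half the answers are honest and half are a 50/50 coin, so:
#   reported_rate = 0.5 * true_rate + 0.5 * 0.5
# which we can invert to recover an estimate of the true rate.
estimated_rate = 2 * reported.mean() - 0.5
print(f"Estimated theft rate: {estimated_rate:.1%}")   # close to 10%
```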
Back to our store example from earlier. Let's say I add noise to the sales data in each category. The category percentages are still roughly accurate, but I can't tell how much the one store sold anymore, because the shark repellent category's number now includes a bunch of noise.
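As a rough sketch of what that could look like, here's one standard way to add the noise, the Laplace mechanism, applied to the made-up category totals from before. The $25,000 cap and the epsilon of 1.0 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same made-up category totals as before (in dollars).
sales = {"groceries": 600_000, "hardware": 400_000, "shark repellent": 20_000}

# Laplace mechanism: add noise with scale = sensitivity / epsilon.
# For illustration we pretend a single business can shift a category by at
# most $25,000; a real deployment would have to calibrate this to the
# largest contribution any one business could actually make.
sensitivity, epsilon = 25_000, 1.0
noisy = {cat: round(v + rng.laplace(scale=sensitivity / epsilon))
         for cat, v in sales.items()}
print(noisy)
# The big categories move by only a few percent, but the shark repellent
# figure is now mostly noise, so the lone shop's real number is hidden
# (it can even come out negative).
```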
Finally, back to Bob. The revelation of Bob's grass walking came from knowing that there were 40 path walkers and 60 grass walkers and knowing the walking habits of the 99 other participants. So what if the published results said there were 61 grass walkers, or 59? You wouldn't be able to determine for sure whether Bob was a grass walker, because your inside info wouldn't line up with the published count. But the overall split of grass walkers to path walkers would still be around 60%.
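Here's a quick sketch of why a noisy count protects Bob: the differencing attack from earlier no longer gives a clean answer. The epsilon of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

true_grass_walkers = 60
epsilon = 0.5                       # illustrative privacy budget for this count
published_grass = round(true_grass_walkers + rng.laplace(scale=1 / epsilon))

known_without_bob = 59              # the adversary's info about the other 99 people
difference = published_grass - known_without_bob
print(published_grass, difference)
# The difference could be 1, but it could just as easily be 0, 3, or -2,
# so the adversary can't pin down whether Bob walked on the grass.
```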
You may have noticed in our examples that you lose some data accuracy in the service of differential privacy. For instance, in our company sales example the percentages are still roughly accurate, but the totals in each category would not be. These are simplified examples and there are ways around these particular issues, but overall it's true that the more you want to protect the privacy of a dataset, the less specificity you can get out of it. That tradeoff is controlled by a number called the privacy loss parameter, or "epsilon."
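Here's a rough sketch of how epsilon controls that tradeoff for a simple count, using the Laplace mechanism again; the epsilon values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(3)

true_count = 60                     # e.g. the grass walkers
for epsilon in (2.0, 1.0, 0.5, 0.1):
    # For a counting query (sensitivity 1), the noise scale is 1 / epsilon.
    noisy = true_count + rng.laplace(scale=1 / epsilon, size=10_000)
    typical_error = np.mean(np.abs(noisy - true_count))
    print(f"epsilon={epsilon:>4}: typical error ~ {typical_error:.1f}")
# Smaller epsilon means stronger privacy but noisier answers: around half a
# count of error at epsilon=2.0, but roughly ten counts of error at 0.1.
```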
You may remember stories back in 2017 saying that Apple's implementation of differential privacy wasn't good enough. That was because researchers reverse engineered Apple's system and concluded that its privacy loss parameter, its epsilon, allowed for too much specificity.
This doesn’t mean differential privacy doesn’t work but it does mean it’s not a magic word. You can’t just say “differential privacy” exists in a dataset and therefore it is now magically 100% secure. Anybody following privacy and security probably already guessed that. It’s a tradeoff. And companies who implement it have to decide that tradeoff. And companies, like Apple, that don’t publish their epsilon number make it more difficult for anyone to know how private a dataset is.
This is just an introduction to the concept. There are lots of great explainers out there if you really want to understand Differential Privacy and how it works mathematically.
For all DTNS shows, please SUBSCRIBE HERE.
A special thanks to all our supporters–without you, none of this would be possible.
If you are willing to support the show, you can give as little as 10 cents a day on Patreon. Thank you!
Big thanks to Mustafa A. from thepolarcat.com for the DTNS logo and Ryan Officer for the DTNS Labs take!