April 24, 2020
Thanks to the latest advances in synthetic data generation, compliant data management is no longer a matter of safeguarding data privacy at the expense of data integrity. Tonic’s partnership with Spirion exemplifies how robust data discovery capabilities, paired with the proper approach to data synthesis, enable companies to guarantee their customers’ privacy without losing the business value of the data they need to protect. But what is the “proper approach” to data synthesis when it comes to achieving these goals? After all, data anonymization comes in many flavors. We’re here to explain how our approach at Tonic takes anonymization and synthesis algorithms to the next level by enhancing them with a property that makes privacy a guarantee.
How Tonic Anonymizes Data Securely
Tonic’s anonymization capabilities run the gamut from the simplest data masking techniques (NULL substitutions, character scrambles, etc.) to complex cross-database data synthesis whose output is as realistic as its source. Our generators allow you to anonymize while preserving your data’s constraints, interdependencies, and distributions: they maintain consistency for entities across tables, link unlimited numbers of columns (such as events in a time series), and flag real-time schema changes to ensure that your output reflects the most up-to-date input available. All this keeps the data you generate with Tonic intact, realistic, and of as much value to your business as the original data on which it is based. But in terms of security, the jewel in our crown is our ability to generate data that is differentially private.
The Role of Differential Privacy
Put simply, differential privacy is a mathematical guarantee about the privacy of a process. In our case, the process is what Tonic does, namely, generating realistic data based on an existing dataset, stripped clean of PII. A process that is differentially private is guaranteed to never reveal anything attributable to a specific member of the original dataset. Instead, differentially private processes can only reveal information that is broadly knowable about a dataset as a whole.
For example, a differentially private process could never reveal the weight of patient #18372, but it can reveal the average weight of all patients. For a more detailed example: a differentially private process can reveal the average weight of all patients within a given zip code, adjusting its output appropriately given the number of patients in the zip code. For zip codes with many patients, little adjustment may be necessary; for zip codes with few patients, it may return a less accurate result in order to protect the privacy of its members.
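To make the example above concrete, here is a minimal sketch of the classic Laplace mechanism applied to an average. This is an illustration of the general technique, not Tonic’s actual implementation; the function name and parameters are our own. Note how the noise scale shrinks as the group grows, which is exactly the behavior described above for large versus small zip codes.

```python
import math
import random

def dp_average(values, lower, upper, epsilon):
    """Differentially private average via the Laplace mechanism (sketch).

    Each value is clamped to [lower, upper] so that one person can shift
    the average by at most (upper - lower) / n; Laplace noise scaled to
    that sensitivity divided by epsilon is then added to the true average.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clamped) / n
    # Smaller groups have higher sensitivity, so they receive more noise.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) noise via the inverse-CDF method.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_avg + noise
```

With a large group (large `n`), the released average is close to the truth; with only a handful of members, the noise dominates, which is what protects each individual.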
The Key Tonic Difference
One way Tonic achieves this is by implementing a differentially private histogram in our algorithms. This histogram is capable of making detailed approximations of the frequency distributions of numerical and categorical data. The weight in the above example is numerical data, and the zip codes are categorical data. Other examples of numerical data include age, salary, and household size, while examples of categorical data include SSN, gender, and birth date.
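The core idea behind a differentially private histogram can be sketched in a few lines. The version below uses the Laplace mechanism on per-category counts; it is a simplified illustration under our own assumptions, not a description of Tonic’s production algorithm.

```python
import math
import random
from collections import Counter

def dp_histogram(categories, epsilon):
    """Sketch of a differentially private histogram over categorical data.

    Each record falls into exactly one bin, so adding or removing one
    record changes a single count by at most 1. Laplace noise with scale
    1/epsilon on every bin therefore suffices to protect each record.
    """
    counts = Counter(categories)
    noisy = {}
    for category, count in counts.items():
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        # Round and clip so the released counts remain plausible integers.
        noisy[category] = max(0, round(count + noise))
    return noisy
```

The same noisy counts can then be sampled from to synthesize fresh categorical values, so the synthetic data mirrors the real frequency distribution without exposing any single record.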
Differentially private processes have a number of valuable properties. The first, and perhaps most important, is that no amount of post-processing or additional knowledge can break the guarantees of differential privacy. The same cannot be said for other data anonymization techniques, like k-anonymity, which are susceptible to a variety of attacks. What’s more, differentially private data can be combined with other differentially private data without losing its protection. In short, data protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised, no matter the adversary.
Tonic has built differential privacy into its offering to give customers mathematical guarantees around the safety of their sensitive data coupled with statistical guarantees of the realistic, representative nature of the synthetic output data we generate. This is a fundamental distinction as compared to other approaches to data synthesis. There’s a common misconception that, so long as it’s derived from the statistics of a source dataset, synthetic data is automatically safe. This is not the case. The proverbial key to secure data synthesis is differential privacy.
Why is Differential Privacy Important for a Company’s Goals?
Inputs and outputs aside, what does this mean for a company’s overall privacy and productivity goals? It means respecting each customer’s data processing preferences without hindering the development and rollout of the products and services they rely on. Finding and removing PII from a dataset does not have to mean blanket redactions and the total loss of the data’s business value. Differentially private data synthesis safeguards sensitive datasets and their utility at the same time. Privacy and integrity preserved.
The Best of Both Worlds
Our unique privacy solution partnership can be most succinctly described as follows: Spirion discovers. Tonic cures. A tonic by definition is something that restores your sense of well-being. It makes you feel better. When it comes to data, this has been our goal from the start: to make you feel better about your data management, anonymization, and synthesis. We’re excited to integrate Tonic’s capabilities with Spirion’s to ensure your sense of well-being in the realms of data privacy and data integrity.
This is a guest post from Ian Coe, the Founder of Tonic. Tonic is an integration partner with Spirion.