What is data integrity? How to ensure it while protecting sensitive data

Organizations rely on data to make strategic decisions—but if you can’t trust the accuracy and integrity of your data, how much growth and revenue are you leaving on the table?

Big data can provide organizations with insights to improve efficiencies and make strategic decisions. But before you utilize your data, you have to be sure that you can trust its accuracy—which is where data integrity comes into the picture.

Organizations are trending towards pairing data and analytics to drive decision-making. A recent Deloitte report found that organizations with the strongest business analytics culture were twice as likely to exceed their business goals. With so much growth on the horizon, enterprises leave a lot on the table by not ensuring data integrity. If you can’t rely on your organization’s data in terms of accuracy or comprehensiveness, then using that data becomes more of a guessing game than a strategic play.

So, how does your organization protect data integrity while simultaneously ensuring data privacy compliance? Here, we explain what data integrity is, the most common threats, and the steps your organization can take to protect data integrity.

What is data integrity?

Data integrity is the overall accuracy, comprehensiveness, reliability and consistency of data. In simpler terms, it is the trustworthiness of data. Can you rely on it? Are you certain that it’s true?

Often, data integrity is grouped into two categories: physical and logical. Physical data integrity refers to the data’s accuracy and completeness as a physical file. You might think of it in terms of how it’s stored, transferred, accessed or received. Typical scenarios in which physical data integrity may be compromised include natural disasters, such as floods, that damage the physical equipment where data is stored, or a cyberattack on a database.

The other category of data integrity, logical data integrity, refers to the data’s rationality and accuracy. For example, let’s say a certain data field on a form contains values that should be represented as a percentage. If your system or software allows users to change how that value is represented—as a dollar amount or unit of measurement—then it does not retain its logical integrity because it’s no longer understood with the original intent.
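To make the percentage example concrete, here is a minimal Python sketch of a field that can only ever be read as a percentage. The `Percentage` class and the `parse_percentage` helper are hypothetical names invented for this illustration, and the 0 to 100 range is an assumption about what the form field allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percentage:
    """A value that is only ever interpreted as a percentage (assumed range: 0 to 100)."""

    value: float

    def __post_init__(self):
        # Reject booleans and non-numeric values at creation time.
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise TypeError("percentage must be a number")
        # Reject out-of-range values; NaN also fails this comparison.
        if not 0 <= self.value <= 100:
            raise ValueError("percentage must be between 0 and 100")


def parse_percentage(raw: str) -> Percentage:
    """Parse form input such as '42' or '42%'; input like '$42' or '42 kg' is rejected."""
    text = raw.strip().rstrip("%").strip()
    # float() raises ValueError for anything that is not a plain number.
    return Percentage(float(text))
```

Because the type is frozen and validated on creation, a user cannot quietly turn the value into a dollar amount or a unit of measurement later on.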

Common data integrity threats

Several factors can have a negative impact on data integrity.

  1. Human error. Unintentional but unavoidable, human error accounts for many instances of damaged data integrity. Even the most minor mistakes can chip away at the integrity of your data.
  2. Transfer errors. Data is not always transferred intact from one location to another, and when it arrives incomplete or altered, integrity is compromised.
  3. Malware and viruses. Malicious software can alter, delete or corrupt data, and any of these outcomes has a huge impact on data integrity.
  4. Cyberattacks. When unauthorized parties gain access to data, they can alter or delete it, which compromises data integrity.
  5. Compromised hardware. This has a greater impact on physical data integrity and can include scenarios such as an unexpected computer crash or accidental liquid spill on hardware.
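Transfer errors in particular are easy to detect with a checksum. The sketch below assumes the sender publishes a SHA-256 digest alongside the data; the function names `sha256_hex` and `transfer_is_intact` are hypothetical and exist only for this example.

```python
import hashlib
import hmac


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def transfer_is_intact(received: bytes, expected_digest: str) -> bool:
    """Compare the digest of the received bytes with the digest the sender published."""
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_hex(received), expected_digest)
```

A checksum on its own only detects accidental corruption. If an attacker could replace both the data and the digest, a keyed signature would be needed instead.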

Simple steps you can take to protect data integrity

Your team can quickly implement precautions and safeguards to minimize common data integrity threats.

  1. Regularly perform back-ups. Backing up your data is critical and can save your team tons of trouble should an event like an outage or cyberattack occur. Daily back-ups are best practice and can help your team avoid permanent data loss.
  2. Log changes to data. Whenever data is added, updated, modified, or deleted, your team should keep a record of those events. If logical data integrity is compromised, having that paper trail to reference can save your team loads of time.
  3. Conduct internal audits. Audits help keep your team on track and accountable for any data protocols or procedures your organization has implemented. Additionally, having audits on record is helpful in the event of a cyberattack or other malicious event.
  4. Update user permissions. Who has access to your data and to what extent do they have access? Implementing a user permissions model is a simple step that pays dividends.
  5. Verify and validate data. Set up processes to ensure that data is accurate. For example, you might create a process that alerts you to validate data accuracy when a third-party user accesses data.
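As one way to picture the change log from step 2, here is a minimal in-memory sketch. The `ChangeLog` class is hypothetical, and chaining each entry to the hash of the previous one is an added assumption that makes later tampering detectable; values are assumed to be JSON-serializable.

```python
import hashlib
import json
from datetime import datetime, timezone


class ChangeLog:
    """Append-only record of data changes; each entry stores the previous entry's hash."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(entry):
        # Hash every field except the stored hash itself, in a stable key order.
        body = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def record(self, user, action, key, old, new):
        """Append one change event and return it."""
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,  # for example "add", "update" or "delete"
            "key": key,
            "old": old,
            "new": new,
            "prev_hash": self.entries[-1]["hash"] if self.entries else "",
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def is_intact(self):
        """Return True if no entry has been edited, removed or reordered."""
        prev_hash = ""
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True
```

A real deployment would persist the entries somewhere users cannot rewrite them; the in-memory list here only keeps the example short.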

Data integrity tools: how to ensure it

Thanks to the latest advances in synthetic data generation, compliant data management is no longer a matter of safeguarding data privacy at the expense of data integrity. When data discovery is paired with the proper approach to data synthesis, companies can guarantee their customers’ privacy without losing the business value of that protected data. But what is the “proper approach” to data synthesis when it comes to achieving data integrity and data privacy compliance?

Anonymize data securely

Data anonymization is usually a company’s first approach, but how do you maximize your results? After all, data anonymization comes in many flavors.

There are simple data masking techniques, such as NULL substitution and character scrambling, and then there are complex methods, such as cross-database data synthesis. Ideally, you would use a combination of methods.
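The two simple techniques named above can be sketched in a few lines of Python. The helper names `null_substitute` and `scramble_characters` are hypothetical, and rows are assumed to be plain dictionaries rather than records from any specific database.

```python
import random


def null_substitute(rows, column):
    """Return copies of the rows with the given column replaced by None (NULL)."""
    return [{**row, column: None} for row in rows]


def scramble_characters(value, rng):
    """Shuffle the characters of a string; its length and character set are preserved."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


# Usage: mask one column entirely and scramble another value.
rows = [{"name": "Ada", "ssn": "123-45-6789"}]
masked_rows = null_substitute(rows, "ssn")
scrambled = scramble_characters("123-45-6789", random.Random())
```

NULL substitution removes the value completely, while scrambling keeps the same characters in a different order. Neither preserves relationships between tables, which is one reason to combine them with other methods.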

In partnership with Spirion, Tonic’s anonymization capabilities run the gamut from simple to advanced with generators that allow you to anonymize while preserving your data’s constraints, interdependencies and distributions. This is achieved by maintaining consistency for entities across tables, linking unlimited numbers of columns (such as events in a time series) and flagging real-time schema changes to ensure that your output is representative of the most up-to-date input available.

These actions keep the data you generate intact, realistic and as valuable to your business as the original data on which it is based.
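To illustrate what consistency for entities across tables means, here is a minimal sketch that replaces the same input with the same token everywhere it appears. It is not a description of how Tonic works: keyed hashing with HMAC is a swapped-in technique chosen for illustration, and `pseudonymize`, `mask_column` and the `id_` prefix are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map a value to a stable token; the same value and key always give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:12]


def mask_column(rows, column, secret_key):
    """Return copies of the rows with one column replaced by its token."""
    return [{**row, column: pseudonymize(row[column], secret_key)} for row in rows]
```

Because the mapping is deterministic for a given key, a customer table and an orders table masked with the same key can still be joined on the masked column.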

Implement differentially private processes

Simply put, differential privacy is a mathematical guarantee about the privacy of a process. A process that is differentially private is guaranteed to reveal almost nothing attributable to a specific individual in the original dataset: its output is nearly the same whether or not that individual’s record is included. Instead, differentially private processes reveal only broad patterns that hold across the dataset.

For example, a differentially private process could never reveal the weight of patient #18372, but it can reveal the average weight of all patients. For a more detailed example: a differentially private process can reveal the average weight of all patients within a given zip code, adjusting its output appropriately given the number of patients in the zip code. For zip codes with many patients, little adjustment may be necessary. For zip codes with few patients, it may give a result that’s less accurate to protect the privacy of its members.
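The zip code example can be sketched with the Laplace mechanism, a standard way to make a numeric query differentially private. This is an illustration under stated assumptions, not any vendor's implementation: weights are clamped to assumed bounds, the number of patients in the zip code is treated as public, `epsilon` is the assumed privacy parameter, and all function names are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centered on 0."""
    # The difference of two independent exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noise_scale(n: int, lower: float, upper: float, epsilon: float) -> float:
    """Noise scale for a mean of n values bounded in [lower, upper]."""
    # Changing one person's value moves the mean by at most (upper - lower) / n.
    return (upper - lower) / (n * epsilon)


def private_mean(values, lower, upper, epsilon, rng):
    """Return the mean of the clamped values plus Laplace noise."""
    if not values or epsilon <= 0 or upper <= lower:
        raise ValueError("need values, epsilon > 0 and upper > lower")
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / len(clamped)
    scale = noise_scale(len(clamped), lower, upper, epsilon)
    return true_mean + laplace_noise(scale, rng)
```

With assumed bounds of 30 to 200 and an epsilon of 1.0, a zip code with 10 patients gets a noise scale of 17, while one with 1,000 patients gets 0.17. That matches the behavior described above: small groups receive a noticeably less accurate answer. The floating-point sampling here is a teaching sketch and is not hardened for production use.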

So, how does differential privacy work for your organization in terms of data integrity and your customers’ privacy? With Tonic and Spirion solutions paired together, Tonic is able to generate realistic data based on an existing dataset stripped clean of personally identifiable information (PII).

Why is differential privacy key for ensuring data integrity?

Differentially private processes have a number of valuable properties. The first, and perhaps most important, is that no amount of post-processing or additional knowledge can break the guarantees of differential privacy. The same cannot be said for other data anonymization techniques, like k-anonymity, which are susceptible to a variety of attacks.

What’s more, differentially private data can be combined with other differentially private data and remain protected, although the privacy loss of the combined release adds up and should be budgeted for. In short, within that budget, data protected by a differentially private process resists reverse engineering and re-identification, no matter the adversary’s resources or side knowledge.

Data integrity tools to preserve integrity and privacy

To ensure data integrity and privacy, you need the right tools working in synchrony. The first step to preserving integrity and privacy is sensitive data discovery. Data discovery lays a solid foundation for your organization’s data privacy, integrity and security measures. After all, how can you take action if you’re not completely aware of the data you’re dealing with?

Data discovery can be a heavy lift, especially for enterprises that use multiple endpoints, from the cloud to on-premises systems. Spirion’s Sensitive Data Platform uses proprietary technologies along with traditional keyword, dictionary and RegEx scanning for accurate discovery, classification and remediation of sensitive data and PII on all endpoints. With Spirion, you get real-time insights and a true view of your complete data landscape.
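For readers unfamiliar with keyword and RegEx scanning, the sketch below shows the general idea on a single string. It does not reflect Spirion's proprietary technology: the two patterns, the keyword list and the `scan_text` function are simplified assumptions made for illustration.

```python
import re

# Simplified, assumed patterns; real scanners use many more and validate the matches.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}
KEYWORDS = {"ssn", "social security", "date of birth"}


def scan_text(text):
    """Return (label, matched_text) pairs for every pattern or keyword hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    lowered = text.lower()
    for keyword in sorted(KEYWORDS):
        if keyword in lowered:
            findings.append(("keyword", keyword))
    return findings
```

For example, `scan_text("Contact jane.doe@example.com, SSN 123-45-6789.")` returns one `us_ssn` finding, one `email` finding and one `keyword` finding for "ssn". Pattern matching alone produces false positives, which is why it is typically paired with other techniques.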

Once data discovery is taken care of, implementing data anonymization and differentially private processes is key to preserving your data’s integrity. Tonic has built differential privacy into its offering to give customers mathematical guarantees around the safety of their sensitive data coupled with statistical guarantees of the realistic, representative nature of the synthetic output data generated.

The ideal data integrity and privacy solution can be summed up as: Spirion discovers, Tonic cures. To learn more about how these tools work together to ensure strong data integrity at the enterprise scale, contact us today.