Take control of your unruly data with privacy-preserving classification

Thanks to increasing data privacy regulations, such as the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), data classification is receiving renewed interest from organizations around the world. A key driver is the requirement that organizations classify sensitive personal information belonging to European Union or California residents.

What used to be a simple process of sorting data into a few buckets to streamline data management is evolving into a much more sophisticated one to meet organizations’ intensifying data privacy and security demands.

A recent Gartner report on using classification to improve unstructured data security highlights the benefits of data classification within the broader context of security and compliance. The report also identifies the key criteria and limitations of today’s classification tools and provides expert analysis and guidance. In this blog, we’ll discuss what we consider to be the key takeaways from Gartner’s findings.

The role of classification in the data life cycle

Spirion has long advocated that data discovery and data classification go hand in hand: one cannot exist without the other. Data discovery is the process of collecting data from databases and silos and consolidating it into a single source that can be easily and instantly accessed.

Once you have located all of your data, applying labels, tags, and visual markers helps both humans and computers determine its sensitivity and treat it consistently. Such data classifications ensure that only appropriate parties can access sensitive data as it moves through the organization. They also support information sharing with other data protection controls.

The Gartner report stresses, “Understanding the sensitive data estate is a prerequisite for effective security and data privacy compliance. Data loss prevention (DLP) is not sufficient as privacy is more than a data loss problem. Security and risk management technical professionals should use data classification to support these requirements.” They further add, “Data classification capabilities are therefore not only useful; they are often necessary to achieve compliance and make data-centric security controls effective.”

The business cases for classification

Why classify? By itself, data classification may not be all that useful, but in concert with a holistic data life cycle, it becomes a key enabler and necessary component for effective data governance and compliance programs.

Gartner explains, “Classification is used to provide insight and either deliver or support control activities. In doing this, it supports a variety of business drivers for data-focused security. The most commonly seen of these drivers that are grounded in security within Gartner inquiry is privacy, which forms the bulk of data security risk for many clients.”

Some of the most common business cases for data classification include:

  • A responsibility to uphold privacy and regulatory compliance
  • Confidentiality for organizations subject to regulations about the control of data (such as contractual obligations)
  • Data retention
  • Technical control support for tools, such as DLP and data access governance, which benefit from pre-classification and labeling of data

Classification policy: The prerequisite first step

There is a critical need to classify data according to organization policy. While there are many schemas that organizations can use for classifying data, there are two key categories for security:

  • Confidential – Sensitive data that could negatively impact operations if compromised, including harming the company, its customers, partners, or employees. Examples include vendor contracts, employee reviews and salaries, and customer information.
  • Restricted – Highly sensitive corporate data that could put the organization at financial, legal, regulatory, and reputational risk if compromised. Examples include customers’ Personally Identifiable Information (PII), Protected Health Information (PHI), and credit card information.
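To make the idea concrete, here is a minimal sketch of how the two categories above could be written down as a lookup table. It is an illustration only, not something taken from Gartner or from any product: the `Sensitivity` enum, the `POLICY` table, and the `label_for` helper are hypothetical names, and the data types are simply the examples listed above. Returning `None` for anything the policy does not mention is our assumption.

```python
from enum import Enum
from typing import Optional


class Sensitivity(Enum):
    CONFIDENTIAL = "Confidential"
    RESTRICTED = "Restricted"


# Hypothetical policy table built only from the examples given in the text.
POLICY = {
    "vendor contract": Sensitivity.CONFIDENTIAL,
    "employee review": Sensitivity.CONFIDENTIAL,
    "salary": Sensitivity.CONFIDENTIAL,
    "customer information": Sensitivity.CONFIDENTIAL,
    "pii": Sensitivity.RESTRICTED,
    "phi": Sensitivity.RESTRICTED,
    "credit card": Sensitivity.RESTRICTED,
}


def label_for(data_type: str) -> Optional[Sensitivity]:
    """Return the policy label for a data type, or None if the policy is silent."""
    return POLICY.get(data_type.strip().lower())
```

A call such as `label_for("PII")` returns `Sensitivity.RESTRICTED`, while a type the policy does not cover returns `None` and can be routed to a person for review.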

Gartner emphasizes: “Even if you do nothing else, put an information classification policy in place. If it demands that users treat and mark data in a certain way, then you have provided the foundation for user education, technical control and compliance.” They add, “Confidentiality labels will form part of an information classification policy, which should be accompanied by a data handling policy, which provides the logical basis for security standards and control requirements. Without such a policy, classification programs are likely to fail as there is no common understanding, classification labels can proliferate, and misclassification will be more frequent.”

Data classification solution landscape

Although there are few standalone data classification tools, classification commonly spans multiple product categories, including Data Access Governance (DAG), Data Loss Prevention (DLP), User-Driven Classification (UDC), and Software as a Service (SaaS) point solutions.

Let’s look at use cases and tools to consider:

Use case | Tools commonly used
Immediate control action only, no recording of the classification | DLP
Insight into data within SaaS and on-premises environments | DAG, SaaS, File Analysis
Insight into data within user endpoints | UDC, DLP
Tagging based on user input | UDC, SaaS
Automatic tagging | UDC, SaaS, DAG

Essential classification capabilities

Because data classification is an enabler for other aspects of the data life cycle, whether triggering other data security controls or imparting strategic insights, it must offer a broad range of capabilities.

Gartner suggests, “In order for classification to work, the following conditions must be met:

  • The data should meet some criteria that enable a decision to be made about what classification applies.
  • The presence of one of the following capabilities:
    • An automated system that can analyze the data and apply rules to make that decision
    • An interface for users to create, verify, or override a classification.
  • Discovery in a variety of data storage environments is a key capability for automated systems.
  • The provision of a recording of that classification that allows other systems and processes to leverage that decision.
  • The inclusion of a log, dashboard, or other method to allow data and security administrators to understand the data estate for a variety of reasons.”
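The conditions above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not a description of how any real classification tool works: the `Classifier` class, its `rule`, `records`, and `log` fields, and the `simple_rule` criterion are all hypothetical names we made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Classifier:
    # Automated rule: takes document text and returns a label (assumed signature).
    rule: Callable[[str], str]
    # Recording of each decision, keyed by document id, for other systems to read.
    records: Dict[str, str] = field(default_factory=dict)
    # Simple log so administrators can review what was decided and by whom.
    log: List[str] = field(default_factory=list)

    def classify(self, doc_id: str, text: str, user_label: Optional[str] = None) -> str:
        auto_label = self.rule(text)
        # A user may verify or override the automated decision.
        label = user_label if user_label is not None else auto_label
        source = "user" if user_label is not None else "auto"
        self.records[doc_id] = label
        self.log.append(f"{doc_id}: {label} ({source})")
        return label


def simple_rule(text: str) -> str:
    # Toy criterion, assumed purely for illustration.
    return "Restricted" if "ssn" in text.lower() else "Unclassified"
```

Here `records` plays the role of the recorded classification that other systems can read, and `log` stands in for the dashboard or report that administrators would consult.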

They also recommend the following criteria when evaluating classification tools:

  • Storage locations – With more than 20 different types of storage where sensitive data can hide, Gartner highlights how classification tools differ in their ability to discover key file types. While file stores are well-covered, “tools for mobile endpoints such as tablets and mobile phones are lacking.” However, they say that “some vendors such as Bitglass and Spirion are actively combining capabilities to help deliver in this space.”
  • Recording classification – Gartner states, “There is no point classifying a document if you’re not going to do something as a result, and in order to do that you need to record the outcome (unless you’re taking an immediate action, such as within DLP). There are two tagging methods available, tagging the document or recording the outcome in a metadata repository.”
  • Repositories – Gartner points out, “tools that use this outcome usually are automated and provide a wealth of metadata about not only the data, but the context in which it was found.”
  • Dashboards and reports – Gartner states, “All classification tools provide dashboards and reporting capabilities, but their depth varies considerably depending on the tool itself.” Some of the key elements that you will want to have visibility into include: access rights, data ownership, file usage recording and duplication reporting.
  • Tagging (Labeling) – Gartner provides solid guidance when it comes to tagging. Proper tagging makes it easier to find data when responding to a consumer’s “right to know” or “right to be forgotten” request, both of which are common among data privacy regulations. As a best practice, they suggest identifying “where the tag is to be used, and provide only enough information to allow that control to work correctly.”

Data classification technology strengths

Today’s classification tools, which encompass Data Access Governance (DAG), Data Loss Prevention (DLP), User-Driven Classification (UDC), and Software as a Service (SaaS) point solutions, generally perform well when it comes to the following capabilities:

  • Coverage of file types
  • Metadata reporting
  • Workflow and privacy
    • Specifically, Gartner emphasizes, “As privacy requirements expand globally, vendors are introducing privacy workflow and data subject access request (DSAR) support. These are common themes in most privacy regulations.”

Data classification technology weaknesses

These same tools also carry some inherent limitations:

  • Data tagging limitations – Gartner states that “tagging has a significant limitation in that it is not possible to tag all data objects. Some file types have room or even formal support for tagging in headers or document properties. However, the vast majority of file types have no such capability or are so limited as to be effectively useless for tagging or other forms of labeling.” They also offer the guidance: “If the requirement is to track data across multiple dimensions (for example, internal policy, privacy, health and department), then use file analysis tools for visibility. Use classification tagging tools for policy-dependent control.”
  • Classification change – They caution, “Classification tools support changing classifications. But automated tools are not good at automatic reclassification based on external conditions rather than changes in content.” They advise, “focus on data that might leave the company rather than moving internally, except where internal sharing would be absolutely unacceptable. Keep your most restrictive classification label reserved for exceptional rather than common use cases.”
  • Encrypted data – Encryption often interferes with the detection of data. Gartner recommends as a best practice: “The safest classification approach is to treat individual encrypted files that cannot be accessed by the classification tool as sensitive and use controls to prevent their movement or access.” They further add, “the best approach for any unreadable file is to use the discovery of these files to support a review of the business process and sensitive data handling.”
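The guidance on encrypted data lends itself to a short sketch: anything the tool cannot read is treated as sensitive and set aside for a review of the business process. The names below (`classify_files`, `read_text`, `rule`, `SENSITIVE_LABEL`) are hypothetical, and mapping "sensitive" to the Restricted label is our assumption rather than something Gartner specifies.

```python
from typing import Callable, Dict, List, Optional, Tuple

SENSITIVE_LABEL = "Restricted"  # assumption: the label used for "treat as sensitive"


def classify_files(
    paths: List[str],
    read_text: Callable[[str], Optional[str]],
    rule: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Label each file; unreadable files are labeled sensitive and flagged for review."""
    labels: Dict[str, str] = {}
    needs_review: List[str] = []
    for path in paths:
        text = read_text(path)  # hypothetical reader: returns None if unreadable
        if text is None:
            # Encrypted or otherwise unreadable: fall back to the sensitive label.
            labels[path] = SENSITIVE_LABEL
            needs_review.append(path)
        else:
            labels[path] = rule(text)
    return labels, needs_review
```

The `needs_review` list is what would feed the review of business processes and sensitive data handling that Gartner describes.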

User-driven classification and Zero Trust security

While user-driven classification has its place in data security, complete reliance on user inputs can introduce friction into the classification process while also increasing error rates. Alternatively, using a combination of user-driven classification and AI data classification allows for consistent data protection without the need for manual user input every time sensitive data is involved. These automated processes can then support the user and create guardrails within the labeling process.

By reducing the impact of user-driven classification in the data security process, solutions can be data-focused rather than employee-focused. This, in turn, makes a Zero Trust framework more feasible to implement.

In Zero Trust environments, users must be continually verified for access to all resources. While this is ideal for data security, the process can create roadblocks for users accessing sensitive information on a day-to-day basis. AI-enabled data classification allows for greater efficiency by limiting end-user participation in the classification process and instead using contextual information to recognize and categorize files more quickly.

Data classification trends

Several data classification technology trends are emerging right now. Among them are automated tools that use machine learning and AI to build privacy workflows and classify data.

  • Automated tools – Gartner suggests, “Automated classification tools must be configurable so that their output is reliable for a given client problem. Except in the simplest of use cases, 100% accuracy isn’t possible, the best tools will enable enough precision based on the data in the document.” They point out, “Automated tools get best results with well-known standard data types, such as driving license numbers, proper names, and social security numbers. If your intellectual property is consistently well-formatted (such as with an account number or project coding system), then automated systems will succeed there.”
  • Machine learning (ML) – Gartner acknowledges, “The ideal automated classification solution would use powerful ML and artificial intelligence capabilities to determine the sensitivity or other categorization of data. Machine learning in data classification is improving, but has some way to go.”

Gartner provides the following recommendations for technical professionals responsible for data security:

  • “Ensure that a data classification policy is in place as it is the root of data security governance. It provides clarity and authority, supports control standards and underpins user awareness efforts.
  • Align your security classification with any broader data governance programs. It is easy to confuse users and create technical complexity and possibly conflict. Focus on high-risk and high-value data, especially regulated data, to support such alignment.
  • Use automated data classification or AI-powered data classification to provide users with a baseline and you with insight. User classification is less expensive, but harder to introduce. Use both together for best results.
  • Aim for “good enough” solutions, recognizing that the automated technology has limits in precision. Phase your implementation carefully to avoid diminishing returns.
  • Use at least two labels. Sensitivity and either “owner” or “project/department.” Identify where the tag is to be used and provide only enough information to allow that control to work correctly.”
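The last recommendation, two labels carrying only as much information as the control needs, could look something like the sketch below. The `Tag` class, the `ALLOWED_SENSITIVITY` set, and the `to_metadata` helper are hypothetical names for illustration, and the allowed labels are simply the two categories discussed earlier in this post.

```python
from dataclasses import asdict, dataclass
from typing import Dict

# Assumption: the two sensitivity labels described earlier in this post.
ALLOWED_SENSITIVITY = {"Confidential", "Restricted"}


@dataclass(frozen=True)
class Tag:
    sensitivity: str
    owner: str  # could equally hold a project or department name

    def __post_init__(self) -> None:
        # Reject labels outside the policy so tags do not proliferate.
        if self.sensitivity not in ALLOWED_SENSITIVITY:
            raise ValueError(f"unknown sensitivity label: {self.sensitivity}")


def to_metadata(tag: Tag) -> Dict[str, str]:
    """Return the minimal key/value pairs a downstream control would read."""
    return asdict(tag)
```

Rejecting unknown labels at creation time is one simple way to keep labels from multiplying, which is the failure mode Gartner warns about when no policy is in place.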

Redefining automated data classification

Automating data classification overcomes many limitations by making the process more reliable, accurate, and persistent. A sophisticated platform, such as Spirion Sensitive Data Platform, can spot personal information by looking for data patterns, such as names, dates of birth, addresses, phone numbers, financial information, health information, and Social Security numbers. From there, it can classify these data files with context-rich labeling, ensuring that the right security measures get applied and your organization remains compliant with the corresponding privacy regulations. More importantly, automated systems can also reclassify data as needed, such as after changes within the business or in compliance regulations.
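As a rough illustration of what "looking for data patterns" can mean in its simplest form, the sketch below scans text with two regular expressions. It is not how Spirion Sensitive Data Platform or any other product works; the `PATTERNS` table and `find_sensitive` function are hypothetical, and the patterns are deliberately simplified US-style formats that a real detector would back up with validation and context.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns only; both assume US-style formatting.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return the matches for each pattern name, omitting patterns with no match."""
    hits: Dict[str, List[str]] = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

Well-formatted values like these are exactly the "well-known standard data types" where Gartner says automated tools do best; free-form names or addresses are much harder to catch with patterns alone.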

Every organization should assess the data classification automation options available in the marketplace and determine which solutions provide the capabilities they need to take their data privacy protection to the next level. Ideally, organizations should choose a platform that is purpose-built to deliver the key functions of a robust data privacy program. The right automation system can streamline data classification by automatically analyzing and categorizing data based on predetermined parameters, continually and in real time.
