Three Big Data Indexing Challenges — and Three Best Practices

By Gabe Gumbs, Chief Innovation Officer

Gaining rapid data access and analysis with indexing is a no-brainer, right? Not so fast. Done right, data indexing can deliver big data benefits. Done wrong, it can create big data headaches.

Data indexing gained popularity in the mid-2000s, just as big data began to overwhelm organizations and the volume and variety of data grew exponentially. By 2017, quintillions of bytes of data were being produced daily around the world. This rapid expansion is making fast access to and analysis of data across enterprises more difficult.

In this environment, indexing has become a trendy pursuit. At the same time, indexing technologies have improved since the mid-2010s. However, deploying indexing is not a silver-bullet solution to the problem of accessing and analyzing data. Without considering the multiple challenges, businesses can create more problems than they solve — especially around today’s critical demands for data privacy, security, and compliance.

Data Indexing Defined

Simply stated, indexing is a data structure technique that collects, parses, and stores data to enhance the speed and performance of retrieving and analyzing relevant documents. Indexes are used to quickly locate data without having to search every row in a database table every time a table is accessed.

Without an index, searches will scan every document across an organization, which requires considerable time and computing power. While an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
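The difference between scanning and looking up can be illustrated with a minimal sketch of an inverted index, one common indexing structure. The sample documents and the function names below are assumptions made for illustration and do not describe any particular product.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def sequential_scan(documents, term):
    """Read every document on every query; work grows with total text size."""
    term = term.lower()
    return {doc_id for doc_id, text in documents.items()
            if term in text.lower().split()}

def index_lookup(index, term):
    """Answer the query with one dictionary lookup against the prebuilt index."""
    return set(index.get(term.lower(), set()))

# Hypothetical sample corpus.
documents = {
    1: "Quarterly audit report for the finance team",
    2: "Customer onboarding checklist",
    3: "Internal audit schedule and customer contacts",
}

index = build_inverted_index(documents)
scan_hits = sequential_scan(documents, "audit")
index_hits = index_lookup(index, "audit")
print(scan_hits == index_hits)
```

Both approaches return the same documents. The index pays its cost once, when it is built, while the scan pays it again on every search.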

For example, an index could be created for customers by their last names. When anyone in the organization searches for a customer by his or her last name, the search engine would only need to scan the appropriate index, instead of searching all data across the entire organization.
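As a rough sketch of the last-name example, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table, column, and index names, along with the sample customers, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO customers (first_name, last_name) VALUES (?, ?)",
    [("Linh", "Nguyen"), ("Maria", "Garcia"), ("An", "Nguyen")],
)

# Build the index on last_name so lookups by last name can avoid a full scan.
conn.execute("CREATE INDEX idx_customers_last_name ON customers (last_name)")

# Ask SQLite how it plans to run the query; the last column of each row
# is a human-readable description of the chosen strategy.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, first_name FROM customers WHERE last_name = ?",
    ("Nguyen",),
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)

rows = conn.execute(
    "SELECT first_name FROM customers WHERE last_name = ? ORDER BY first_name",
    ("Nguyen",),
).fetchall()
print(plan_detail)
print(rows)
```

The query plan should name the index rather than describe a scan of the whole table, although the exact wording of the plan varies between SQLite versions.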

Indexing Challenges and Best Practices

Data indexes can result in better query performance and lower resource consumption — but only if an organization can overcome the three key challenges of data indexing.

Challenge 1 — Not knowing exactly what you want to achieve with data indexing

Best Practice: To ensure that the time and effort spent on indexing achieves the desired results, an organization needs to begin by defining what it wants to achieve. Indexing should fulfill a specific business purpose, rather than be pursued just because it’s the latest hot trend in data management.

As with anything that falls within the purview of data security, privacy, and compliance, the narrower the scope, the better. For example, if an organization does not need an inventory of data subjects in order to respond rapidly to subject rights requests and fulfill a compliance obligation, then it should not spend time and resources creating that index.

Examples of valuable compliance-related business goals for indexing include timely responses to internal audit requests, support for external compliance audits, and the processing and fulfillment of customer rights requests.

To ensure indexing initiatives perform a valid business function, consider these questions:

  • Will indexing shore up a weakness in our organization?
  • How large will our indexes be and how will they be managed?
  • Is there a better solution than indexing to solve our business problems?

Challenge 2 — Choosing the right indexing technology

Best Practice: Executing indexing tasks across an organization’s stores of data can initially look like the organization is solving critical business problems. But if the organization doesn’t understand the pros and cons of an indexing application’s functionality, it could be complicating its big data management problems rather than simplifying them.

Today’s indexing technologies have different architectures and solve different problems. So companies need to consider the functionality and match it to their specific needs.

Here are just a few questions about indexing technology capabilities to consider:

  • Are you more interested in fast access to text, better analytical queries on your data, or both?
  • Will your indexes be more text-oriented or subject-oriented?
  • What do you want to achieve in terms of scaling and clustering with your index creation?

Challenge 3 — Creating more data that unnecessarily expands the data footprint, and puts data security at risk

Best Practice: Having a large and ever-expanding data footprint is one of the underlying problems in achieving data security, privacy, and compliance. Organizations contribute to this issue by inadvertently duplicating existing data, in many cases creating multiple copies of the same data throughout the organization without knowing where it’s located and who has access to each version.

In the same way, when not done right, indexing can create more copies of data and, therefore, put data at risk of exposure. To avoid data privacy, security, and compliance risks, organizations must know what data needs to be locked down before they can enable index searches without putting themselves at risk.

They can accomplish this critical goal by deploying the appropriate data management tool features:

  • Data discovery — know what data exists across the organization, where it resides, and who has access to it — and maintain this level of control over any data added to indexes.
  • Data classification — appropriately label all data to organize it according to critical issues, such as all relevant compliance regulations.
  • Data security — ensure that appropriate and effective data security functionality is installed and practices are followed to keep all data secure and private.
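One way these features could work together is sketched below: classification labels decide which records may be indexed, and the index stores only record IDs rather than copies of the content. The label names, the record layout, and the build_reference_index function are all assumptions made for illustration. A real deployment would take its labels from the organization's own classification policy.

```python
from collections import defaultdict

# Hypothetical labels that policy allows to be searchable.
INDEXABLE_LABELS = {"public", "internal"}

def build_reference_index(records):
    """Index only permitted records, storing IDs instead of content copies.

    Returns the index and the list of record IDs that were skipped
    because their classification label is not indexable.
    """
    index = defaultdict(set)
    skipped = []
    for record in records:
        if record["label"] not in INDEXABLE_LABELS:
            skipped.append(record["id"])
            continue
        for word in record["text"].lower().split():
            index[word].add(record["id"])
    return dict(index), skipped

# Hypothetical records that have already been discovered and classified.
records = [
    {"id": 1, "label": "public", "text": "Retention policy overview"},
    {"id": 2, "label": "restricted", "text": "Payroll policy with employee bank details"},
    {"id": 3, "label": "internal", "text": "Audit policy checklist"},
]

index, skipped = build_reference_index(records)
print(sorted(index["policy"]))
print(skipped)
```

Because the index holds only references, a search result still has to be fetched from the source system, where existing access controls apply. Restricted records never appear in the index at all.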

Indexing is a powerful tool in the fight to better access and analyze big data. However, it’s not a silver bullet. Sometimes there’s a better solution to a specific problem. As with any data management initiative, organizations need to bring an informed approach to indexing their data. Understanding the stakes will ensure that indexing achieves the desired goals without creating new business challenges.