Understand common pitfalls, create strategies to avoid them, and develop robust, scalable, and efficient database infrastructure.
Introduction
The use of vector databases within modern applications is steadily increasing, especially in the era of generative AI. With the increased adoption of vector databases, it becomes even more important for IT leaders and decision-makers to understand the common pitfalls encountered when leveraging them.
Understanding these pitfalls, why and when they occur, and how to avoid them leads to the development of robust, scalable, and efficient database infrastructure.
This article covers the following:
- What is a Vector Database?
- Applications of Vector Databases.
- Common Pitfalls and Avoidance Strategies.
What is a Vector Database?
A vector database is a specialised data storage solution that stores, indexes and enables efficient querying of vectors for information retrieval through similarity-based search. Vectors are high-dimensional numerical representations of data objects such as images, audio, video, text, etc. Vectors are created by using an embedding model that takes unstructured data as input and outputs a vector: an array of numbers that captures the semantics, context and meaning of the input data object.
The array size of a vector is referred to as the number of dimensions of the vector embedding, which can be as large as several thousand; in general, a higher number of dimensions allows the embedding model to capture more context in the vector.
When multiple vector embeddings are projected into a high-dimensional space, referred to as a vector space, the distance between two vectors represents how similar they are based on the captured context: the closer the vectors, the more similar the data objects they represent. Vector databases have built-in capabilities for efficiently calculating similarity distances between vectors to retrieve information based on a query vector, an operation known as semantic similarity search. This information retrieval method differs from the search methods of traditional databases, which typically rely on keyword matches.
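To make the idea of similarity search concrete, here is a minimal sketch in plain Python. It scores every stored vector against a query with cosine similarity and keeps the best matches. The three-dimensional embeddings and their names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would use an index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def nearest(query, stored, k=2):
    """Brute-force search: score every stored vector and keep the top k names."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in stored.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, invented for this example
stored = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
print(nearest([1.0, 0.1, 0.0], stored))
```

The query vector points in nearly the same direction as "cat" and "kitten", so those two are returned ahead of "invoice", even though no keyword was matched.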
Popular Vector Databases
Common examples of top vector databases are Qdrant, Milvus, Pinecone, MongoDB Atlas Vector Search, Databricks Vector Search, ChromaDB, Vespa, etc. There are database providers who operate as specialized vector databases like Pinecone and Qdrant. In contrast, some database solution providers like MongoDB and Databricks also offer a vector database solution in addition to an operational database solution for traditional database functionality to handle transactional, analytical and operational data.
Applications of Vector Databases
Vector databases provide various functionality within software applications. Conducting relationship mappings based on the semantics and context of captured data is helpful in the use cases and scenarios explored below.
- Recommendation Systems: Retrieving a collection of content based on a query vector can be used to develop a recommendation system, where an item or several pieces of information about an entity are used to retrieve semantically similar content from a vector database to provide content, product or item recommendations.
- Retrieval Augmented Generation (RAG): Foundation models or LLMs (Large Language Models) within generative AI applications are limited to their parametric knowledge. The RAG design pattern uses an embedding of the prompt to retrieve semantically similar and up-to-date information relevant to the query. The retrieved information is added to the initial query and serves as input to the foundation model, grounding the LLM’s response in factual and relevant data. Vector databases are used in this use case to store the relevant information and conduct efficient information retrieval based on vector similarity computations.
- Image and Facial Recognition Systems: An embedding model captures the features of a data object. Deep learning models using convolutional neural networks, specialized neural network architectures designed to capture hierarchical spatial features at varying levels of abstraction, can provide a vector embedding output that captures an image’s unique patterns and features. As seen in security systems, these vector embeddings can be stored within a vector database and incorporated within a facial recognition system. Such a system can be used for identity verification by passing in the embedding of a processed face image and conducting a similarity search against the vector embeddings in the vector database.
Common Pitfalls
This section covers some common pitfalls encountered when working with vector databases. Although not an exhaustive list, the pitfalls covered in this article are critical areas where attention to detail can significantly impact the performance and security of your vector database infrastructure. These include:
- Choosing the wrong indexing strategy can lead to suboptimal search performance and increased operational costs.
- Underestimating scalability needs, potentially resulting in system bottlenecks and degraded user experience as data volume and query loads grow.
- Overlooking metadata and operational data, missing opportunities to optimize query performance and system monitoring.
- Inefficient query construction can result in slow response times and irrelevant search results, undermining the effectiveness of the database.
Pitfall 1: Choosing the Wrong Indexing Strategy
Explanation
An index is a data structure constructed to retrieve information in a database efficiently. It comprises the fields or columns of the database table that have been indexed and organized in an ordered manner. Each value in the index is a key with an entry that points to the location of the value held in the database; these entries are described as pointers. Pointers make indexes efficient; instead of scanning every row in a table to find matches, the database engine can use the index to locate the rows that meet the query criteria quickly. A database’s lack of an index leads to scanning every record held whenever a search query is executed.
An indexing strategy refers to creating a database index to ensure the efficient retrieval of records from the database. This involves selecting the data’s columns, attributes, or features for efficient retrieval based on common search patterns. Having the right indexing strategy becomes increasingly important within a vector database as the volume of data and the complexity of queries made to retrieve records increase. Utilizing an appropriate indexing strategy is crucial in vector databases that deal with high-dimensional data, where selecting an inappropriate indexing strategy might lead to inefficient retrieval operations.
Common indexing strategies and algorithms used in vector databases to facilitate efficient similarity search and nearest neighbour search operations include Hierarchical Navigable Small World (HNSW), Inverted File Index (IVF), Product Quantization (PQ), Annoy (Approximate Nearest Neighbors Oh Yeah) and Locality-Sensitive Hashing (LSH).
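As an illustration of how such an index avoids scanning every record, below is a toy sketch of the inverted-file (IVF) idea in plain Python. Vectors are grouped under their nearest centroid, and a query only scans the group belonging to its own nearest centroid. The centroids here are hand-picked and the data is invented; production systems learn centroids from the data and usually probe several groups, so treat this as a simplified sketch rather than how any particular database implements IVF.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_centroid(vector, centroids):
    """Index of the centroid nearest to the vector."""
    return min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))

def build_ivf(vectors, centroids):
    """Group vector ids into one inverted list per centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        lists[closest_centroid(vec, centroids)].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1):
    """Scan only the inverted list of the query's nearest centroid."""
    candidates = lists[closest_centroid(query, centroids)]
    ranked = sorted(candidates, key=lambda vec_id: euclidean(query, vectors[vec_id]))
    return ranked[:k], len(candidates)

# Hypothetical two-dimensional data forming two obvious clusters
vectors = {
    "a": [0.1, 0.2], "b": [0.2, 0.1], "c": [0.0, 0.3],
    "d": [5.0, 5.1], "e": [5.2, 4.9], "f": [4.8, 5.0],
}
centroids = [[0.0, 0.0], [5.0, 5.0]]  # hand-picked for the example
lists = build_ivf(vectors, centroids)
top, scanned = ivf_search([5.0, 5.05], vectors, centroids, lists)
print(top, scanned)
```

Only three of the six vectors are compared against the query. The trade-off is that the search is approximate: a true nearest neighbour sitting just across a group boundary can be missed when only one group is scanned.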
Implications
Some of the implications of having the wrong indexing strategy in a vector database are increased query latency, increased operational cost, scalability issues, and decreased query accuracy and result relevance.
Increased Query Latency
Query latency refers to the time taken for the database server to process a query and return the results to the client during an information retrieval process. If the indexing strategy leveraged within a vector database does not appropriately reflect the patterns of queries made by the client and received by the database server, it could lead to an increase in the duration of query execution by the database server, overall leading to increased latency.
The increase in query latency due to using an inappropriate indexing strategy can result from the database server scanning a larger than required segment of the index to find the appropriate record or, in some cases, reverting to scanning the entire database to retrieve the relevant records.
Increased Operational Cost
Operational cost refers to the monetary investment and expense required to maintain components of a vector database, such as hardware and software resources, licenses, cloud service platforms, human resources, and more.
The wrong indexing strategy can increase the operational cost of a vector database solution. For example, an inefficient indexing strategy that reverts to a complete database scan results in an operational cost that includes the full scan of records and the additional cost of storing the index itself. In this scenario, the benefit of indexing isn’t reaped, but the storage cost of having an index or multiple indexes remains.
Poor Scalability
When dealing with a few hundred thousand records within a vector database, the negative implications of having the wrong indexing strategy may not be obvious. However, within a production environment where thousands and millions of records are processed in a single query, the downsides of the wrong indexing strategy become more apparent.
The implications of indexing strategies are amplified as the volume of data held begins to increase. Vector databases store embeddings with high dimensions and require operations considering hundreds or thousands of vector embeddings in a single operation. The wrong indexing strategy in this scenario quickly becomes a bottleneck, especially as the size of the data increases, which consequently increases the number of vector embedding candidates considered during a single vector search operation.
Avoidance Strategy
Data characteristics
Understanding data characteristics, such as volume, size, dimensionality, distribution, attributes, and more, can effectively enable the selection of the appropriate indexing strategy within a vector database.
Consider the scenario of a large-scale image retrieval system where the vector database contains millions of vector embeddings of high-resolution images; in this scenario, each vector embedding has a very large dimension size, which means that the sheer volume and size of data should be considered when selecting an indexing strategy. Such systems are standard in airport security systems, where system latency and accuracy are critical.
In this scenario, leveraging an indexing strategy that optimizes for faster query execution time, memory usage, and accuracy is appropriate. HNSW, in this scenario, would be a proper indexing strategy to use for the database due to the following reasons:
- HNSW is appropriate for handling vector similarity searches over high-volume, large datasets because it navigates through vector embeddings using a multi-layered graph structure.
- HNSW enables the incremental addition of new data points, so the index can scale alongside the vector database. In the example scenario, which reflects systems found in high-level security settings, this means new images can be added without degrading indexing performance.
Resource constraints
Awareness of the hardware and software constraints on the vector database helps avoid the common pitfall of using the wrong indexing strategy. Vector database solution providers typically offer a tier-based service, in which each tier corresponds to a given level of storage resources, computational power, software versions and other available functionality.
When selecting an indexing strategy, it’s essential to consider the vector database’s memory capacity and computational performance; the processing power available on the vector database can limit the choices of indexing strategies. Indexes also take up storage of their own, and for a high volume of data, a complex indexing strategy can use more storage capacity than a simpler one.
In some cases, a dense index structure, which references every record in the database, is beneficial for efficient record lookup and information retrieval. However, the disadvantage of using a dense index structure, which is the increased data storage required for the index itself, must be considered. An awareness of the volume of the data and how it scales enables an efficient selection of indexing strategy.
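A quick back-of-the-envelope calculation helps when weighing resource constraints. The sketch below estimates the space needed for the raw vectors alone, assuming each value is stored as a 4-byte float32; the record count and dimension are illustrative figures, not measurements. Any index structure adds overhead on top of this number, and the amount depends on the index type and its configuration.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to hold the raw vectors (4 bytes per value assumes float32)."""
    return num_vectors * dimensions * bytes_per_value

def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / (1024 ** 3)

# Illustrative figures: one million embeddings with 1536 dimensions each
size = raw_vector_bytes(1_000_000, 1536)
print(size, round(to_gib(size), 2))
```

Running the same estimate at the expected data volume a year out shows early on whether the current service tier can hold both the vectors and their index.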
Pitfall 2: Underestimating Scalability Needs
Explanation
Scalability in the context of vector databases refers to maintaining performance characteristics such as query latency, lookup performance, and information retrieval relevance as the volume of data held in the database grows or as query requests and information retrieval operations increase.
A common pitfall when using vector databases is not considering future scenarios where the volume of data and the query load of the application grow. The term ‘scalability needs’ refers to the vector database’s capacity to expand its compute resources and maintain expected performance when the workload increases. Workload in this context refers to the query or vector search operations conducted on the database server.
Scalability in a database can be viewed from two perspectives:
- Vertical Scalability: This involves increasing the computational resources of an existing hardware or software component of the vector database by adding more resources to a single node or server, such as processing power (CPU), memory (RAM) or storage capacity. Vertical scalability is called ‘scaling up’ because the increase is applied to existing resources.
- Horizontal Scalability: Adding more server instances to distribute the workload can increase a vector database solution’s computational or storage capacity. This is typically referred to as ‘scaling out’, as the increase in computation is accomplished by adding more hardware components, such as servers, to the existing infrastructure.
Implications
Tons of e-commerce platforms exist on the internet today, typically providing personalized recommendations and services to their customers. The personalized product recommendations seen in e-commerce platforms can be powered by semantic similarity searches enabled by vector databases.
There are certain times of the year when e-commerce platforms receive increased traffic, which can be due to seasonal or holiday events. During these periods, the amount of query requests made to a vector database increases the workload the database executes. A common pitfall when using vector databases is not considering the impact of seasonal and spike events on the overall application infrastructure, including the vector database. In the worst scenario, this can lead to an increased overhead in database operations, bottlenecks and, eventually, system failure.
Slow Query Response
As mentioned in the e-commerce scenario above, when the volume of data within a vector database increases along with the frequency of query requests made by the client, the overall database infrastructure performance might degrade because the infrastructure is not configured to adapt to increased workload.
For mission-critical or user-facing applications, having a vector database that doesn’t scale in conjunction with the workload can affect the user experience of the application’s functionalities. This is mainly because, under increased workload, the vector database takes longer to conduct information retrieval and vector search operations as it spreads its compute resources across more requests. The latency of each request increases, and the user waits longer for information retrieval results.
Performance Bottlenecks
A vector database comprises components that can act as performance bottlenecks if they do not scale adequately alongside other areas of the database. Such components are network bandwidth, disk I/O operations, CPU limitations, memory capacity, etc. There are instances where the indexing strategies used within a vector database work well with a small volume of data. Still, as the database’s record size increases, the initial indexing strategy might not scale appropriately and begin to slow down the execution of database operations, which inherently acts as a bottleneck of the entire database infrastructure.
Also, if the number of queries made to the database increases and, as a result, the workload increases and there isn’t a workload distribution mechanism across servers and nodes, then again, there could be a backlog of unprocessed queries, which becomes another point of bottleneck within the database infrastructure.
Avoidance Strategy
Capacity Planning
The database team must conduct capacity planning to mitigate against scalability issues, such as slower query response time and database infrastructure bottlenecks, that arise when query frequency increases alongside data volume.
Capacity planning predicts future increases in query loads and data volumes based on the patterns collected and observed via database metrics such as query execution time, throughput, resource utilization, latency, and connection count. The capacity planning process enables the allocation of appropriate resources to ensure the database can handle future demands efficiently. This can involve adding more hardware or computing power, optimizing existing resources for better efficiency, redesigning database schemas, or updating indexing strategies to improve performance.
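As a simple illustration of capacity planning, the sketch below projects how many months remain before a record count reaches a capacity limit. It assumes linear growth at the average rate observed so far, which is a deliberate simplification; the monthly counts and the capacity figure are invented, and real planning would also account for seasonality and query load.

```python
import math

def average_monthly_growth(counts):
    """Average absolute increase between consecutive monthly observations."""
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    return sum(deltas) / len(deltas)

def months_until_capacity(counts, capacity):
    """Months before the projected count reaches capacity, or None if not growing."""
    growth = average_monthly_growth(counts)
    if growth <= 0:
        return None
    remaining = capacity - counts[-1]
    if remaining <= 0:
        return 0
    return math.ceil(remaining / growth)

# Hypothetical record counts taken from four monthly metric snapshots
monthly_counts = [1_000_000, 1_200_000, 1_450_000, 1_600_000]
print(months_until_capacity(monthly_counts, capacity=3_000_000))
```

The same projection can be run on query throughput or storage usage, as long as the metric is collected consistently over time.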
This avoidance strategy can only be effective if your chosen vector database solution provider offers metrics visualization and database monitoring tools as part of the service. Vector database providers such as MongoDB, Pinecone, and Weaviate provide a suite of monitoring tools to observe key database usage metrics.
Scalable Architecture
Vector databases handle high-dimensional data that can scale rapidly. In practical scenarios, there are instances where the need for an increase or decrease in the embedding dimension may arise, mainly to leverage better, more efficient embedding models.
A scalable architecture strategy should focus on choosing a vector database designed for adaptability in dimensional expansion for vector embeddings stored within the system. This approach ensures that as the complexity and size of the data grow, the database architecture can dynamically adjust, maintaining high performance and responsiveness without significant manual intervention or restructuring.
When selecting a vector database provider and prioritizing a database solution with a scalable architecture, ensure the provider offers auto-scaling functionality. Auto-scaling is a feature that automatically increases or decreases database resources based on the current state of key metrics such as network traffic, query processing time, data volume increase, etc. The key advantage of auto-scaling is not having to manually upgrade the database infrastructure during sudden spikes in resource usage. The ability to scale down as well as up is important, because memory and compute power that are provisioned but not required increase operational costs.
Data Sharding and Partitioning
Data sharding refers to splitting up a dataset into smaller parts or chunks and holding each part in separate databases across distributed servers. Data partitioning refers to segmenting the dataset based on distinct attributes and features and holding each segment within the same database. Both data sharding and partitioning can be used as an avoidance strategy against the implications of underestimating scalability needs within vector databases.
Data sharding can enhance the performance and scalability of the vector database by distributing the dataset and query workload across several servers or nodes. This strategy reduces the load on a single server when there is an increased frequency of workload and network traffic.
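The sketch below illustrates one common way to shard records: hashing each record id to pick a shard. It also shows the merge step a similarity query needs when records are spread this way, since every shard must be searched and the partial results combined. The function names and sample data are hypothetical, and managed vector databases typically handle both steps internally.

```python
import hashlib
import heapq

def shard_for(record_id, num_shards):
    """Map a record id to a shard index using a stable hash."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def distribute(record_ids, num_shards):
    """Assign every record id to exactly one shard."""
    shards = {i: [] for i in range(num_shards)}
    for record_id in record_ids:
        shards[shard_for(record_id, num_shards)].append(record_id)
    return shards

def merge_top_k(per_shard_results, k):
    """Merge (score, id) hits returned by every shard and keep the k best scores."""
    all_hits = [hit for hits in per_shard_results for hit in hits]
    return heapq.nlargest(k, all_hits)

# Hypothetical record ids and per-shard search results
shards = distribute([f"doc-{n}" for n in range(10)], num_shards=3)
print({shard: len(ids) for shard, ids in shards.items()})
print(merge_top_k([[(0.9, "a"), (0.5, "b")], [(0.8, "c")]], k=2))
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash`, which is randomized per process for strings and would send the same id to different shards across restarts.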
Pitfall 3: Overlooking Metadata and Operational Data
Explanation
Metadata typically accompanies vector embeddings when stored in the vector database; metadata associated with vector embeddings are attributes or features that describe the embeddings’ context, origin or characteristics. Particular examples of metadata are the creation date of the embedding, input values to the embedding model to create the embedding, tags, categories, etc.
{
  "vectorEmbeddingMetadata": {
    "embeddingId": "123456789",
    "creationDate": "2024-03-31",
    "source": "Image Processing",
    "modelUsed": "ResNet-50",
    "modelVersion": "1.4",
    "dimensionality": 512,
    "preprocessing": ["Normalized", "Grayscale"],
    "tags": ["image", "feature_extraction", "deep_learning"],
    "description": "Vector embedding generated from an image using ResNet-50 model for feature extraction.",
    "additionalInfo": {
      "dataset": "ImageNet",
      "usage": "Recommendation System"
    },
    "embeddingVector": [0.00123, 0.0345, 0.6789, ..., 0.0123]
  }
}
Operational data refers to data created due to transactional processes or business operations within the database systems or connected components of the overall system infrastructure, such as applications, servers, etc. Examples of operational data generated by transactional processes are user session data, sales transactions, real-time data, etc. Examples of operational data from business operations can include error and system logs, database usage statistics, query performance metrics, etc.
{
  "operationalData": {
    "embeddingId": "123456789",
    "queryPerformance": {
      "queryId": "query_456",
      "responseTimeMs": 200,
      "status": "success"
    },
    "systemHealth": {
      "CPUUsagePercent": 85,
      "MemoryUsagePercent": 65
    },
    "usageStats": {
      "queryCount": 450,
      "peakUsageTime": "2024-03-31T11:00:00Z"
    },
    "errors": [
      {
        "errorId": "error_789",
        "message": "Timeout during query processing",
        "time": "2024-03-31T10:45:00Z"
      }
    ]
  }
}
A common pitfall when utilizing vector databases and working with metadata and operational data is the underutilization or lack of use of these additional metadata to conduct efficient queries. This oversight results in overlooked opportunities to optimize query performance, improve database infrastructure and detect critical issues before they propagate to other infrastructure components.
Implications
Overlooking and not using metadata and operational data in vector databases can result in missed opportunities for optimization, such as improving query performance by refining indexing strategies based on metadata characteristics or diagnosing system health issues through operational data analysis.
Inefficient Query Performance
Most vector database providers offer a crucial feature that allows for the inclusion of metadata filtering in addition to vector search during information retrieval tasks based on an embedding query. This means that the scope of candidates for vector search similarity can be reduced by adding a prefiltering stage based on metadata before the vector search is conducted, thereby reducing the computational load. As a result, search operations become more efficient and targeted, focusing only on the subset of embeddings that meet the specified metadata criteria.
However, not taking advantage of metadata filtering stages in vector search information retrieval operations can lead to unnecessarily broad searches that consume more computational resources and time. This inefficiency can result in slower query response times, increased operational costs, and potentially less relevant search results.
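The effect of a metadata prefilter can be shown with a small, self-contained sketch. Records are first narrowed down by a metadata field, and only the survivors are scored against the query. The records, the `country` field, and the dot-product scoring are assumptions made for the example; a vector database applies the same idea inside its index rather than with a Python list.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query, records, country, k=2):
    """Keep records matching the metadata, then rank only those by dot product."""
    candidates = [r for r in records if r["country"] == country]
    ranked = sorted(candidates, key=lambda r: dot(query, r["embedding"]), reverse=True)
    return [r["id"] for r in ranked[:k]], len(candidates)

# Hypothetical records with a metadata field and a two-dimensional embedding
records = [
    {"id": 1, "country": "United States", "embedding": [0.9, 0.1]},
    {"id": 2, "country": "Portugal", "embedding": [0.95, 0.05]},
    {"id": 3, "country": "United States", "embedding": [0.2, 0.8]},
    {"id": 4, "country": "Brazil", "embedding": [0.1, 0.9]},
]
ids, scanned = filtered_search([1.0, 0.0], records, "United States")
print(ids, scanned)
```

Record 2 is the closest vector overall, but it is excluded before any similarity is computed, so only two of the four embeddings are scored and the results stay within the requested country.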
Reduced System Reliability
Vector databases typically provide metrics associated with their performance and overall health; these metrics and indicators can be leveraged to identify problems before they become critical. Issues such as slow query response times, system overloads, memory leaks, and inefficient indexing can be identified by monitoring and logging operational data and metrics such as query execution times, CPU and memory usage, error rates, disk I/O operations, and network bandwidth utilization.
When operational data is ignored or left unused, issues that degrade system performance and reduce system reliability can go undetected. Ignoring operational data also means missed opportunities to optimize vector database performance.
Avoidance Strategy
Compound Indexes
Indexing ensures that the database is optimized in terms of record structure for retrieving information based on regular query operations and patterns. Vector databases require the definition of a vector search index based on the field that holds the vector embeddings. In addition to specifying the vector embedding field in the vector search index definition, top vector database providers also enable the composition of a vector search index that includes the specification of one or more metadata fields.
Below is an example of a compound index that includes vector embedding and a metadata field.
{
  "definition": {
    "mappings": {
      "dynamic": true,
      "fields": {
        "text_embedding_field_name": {
          "dimensions": 1536,
          "similarity": "cosine",
          "type": "knnVector"
        },
        "address.country": {
          "type": "filter"
        }
      }
    }
  },
  "name": "vector_search_index_name_text"
}
The approach of adding metadata filter criteria to the vector search index definition facilitates multi-faceted search capabilities, enabling queries to efficiently filter and sort results not just based on the similarity of vector embeddings but also according to relevant metadata attributes.
Using Metadata for Query Optimization
Using metadata to narrow vector search operations is one of the simplest ways to improve query performance in a vector database. It can reduce application latency and operational costs while improving retrieval performance.
Vector databases with a full suite of features allow users to use metadata to conduct search operations such as keyword matching, conditional filtering, and range queries, allowing for more refined and efficient data retrieval processes.
Below is an example of a MongoDB aggregation pipeline with a vector search operation. However, before the vector search operation is conducted, the descriptive metadata of the dataset ‘country’ is used to reduce the scope of the search to a subset of data. Also, a conditional query was added to reduce further the results returned from the vector search operation. This ensures the vector search operation is more focused and efficient, potentially reducing computational load and improving response times by narrowing down the number of documents to be considered based on the specified metadata criteria.
def vector_search(user_query, db, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        db (MongoClient.database): The database object.
        collection (MongoCollection): The MongoDB collection to search.

    Returns:
        list: A list of matching documents, or an error message string
        if the query embedding could not be generated.
    """
    # Generate embedding for the user query
    # (get_embedding is assumed to be defined elsewhere in the application)
    query_embedding = get_embedding(user_query)
    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search stage. $vectorSearch must be the first stage
    # of the pipeline, so the country prefilter goes in its "filter" option.
    # "address.country" must be indexed as a filter field in the search index.
    vector_search_stage = {
        "$vectorSearch": {
            "index": "vector_search_index_name_text",
            "queryVector": query_embedding,
            "path": "text_embedding_field_name",
            "filter": {"address.country": {"$eq": "United States"}},
            "numCandidates": 150,
            "limit": 20
        }
    }

    # Conditional query that further reduces the vector search results
    match_stage = {
        "$match": {
            "accommodates": {"$gt": 2, "$lt": 7}
        }
    }

    # Define the aggregate pipeline: vector search first, then the match stage
    pipeline = [vector_search_stage, match_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
Read more about MongoDB Atlas vector search here.
Monitoring Operational Data
Modern vector database providers have web- or application-based user interfaces that provide visualization or logs of the key metrics that indicate the overall vector database health and performance. Taking full advantage of your database solution provider’s monitoring capabilities is a straightforward and easy-to-implement strategy. Proactively selecting a vector database provider based on its monitoring and logging capabilities can save on infrastructure costs and investments.
Mature database providers offer alerting systems that notify administrators of potential issues, such as performance bottlenecks, excessive query latency, and resource limitations, when alert thresholds are met. Apart from database-native monitoring solutions for visualizing operational data, there are standalone monitoring tools and standards, such as Datadog and OpenMetrics.
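The threshold-based alerting described above can be sketched in a few lines. The function below compares a snapshot of operational data, shaped like the sample document shown earlier in this pitfall, against a set of thresholds. The threshold names and values are assumptions chosen for the example, not defaults from any monitoring product.

```python
def check_thresholds(operational_data, thresholds):
    """Return an alert message for every metric above its threshold."""
    health = operational_data["systemHealth"]
    observed = {
        "cpu_percent": health["CPUUsagePercent"],
        "memory_percent": health["MemoryUsagePercent"],
        "response_time_ms": operational_data["queryPerformance"]["responseTimeMs"],
    }
    return [
        f"{metric} is {value}, above the threshold of {thresholds[metric]}"
        for metric, value in observed.items()
        if metric in thresholds and value > thresholds[metric]
    ]

# Snapshot using the field names from the operational data sample
sample = {
    "queryPerformance": {"queryId": "query_456", "responseTimeMs": 200, "status": "success"},
    "systemHealth": {"CPUUsagePercent": 85, "MemoryUsagePercent": 65},
}
# Hypothetical alert thresholds
thresholds = {"cpu_percent": 80, "memory_percent": 90, "response_time_ms": 500}
alerts = check_thresholds(sample, thresholds)
print(alerts)
```

With these values, only CPU usage crosses its threshold, so a single alert is produced.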
Pitfall 4: Inefficient Query Construction
Explanation
Inefficient query construction in the context of vector databases refers to composing a vector search query that does not utilize the full capabilities of the database’s search and retrieval features, leading to a retrieval process that is slow and produces inaccurate results. This typically occurs in a few cases, including the following:
- Using the wrong similarity distance measure: When conducting vector search operations, there are a variety of distance functions that can be used to compute the distance, or the closeness, between two or more vector embeddings. The cosine similarity measure uses an angle-based comparison that ignores vector magnitude, the Euclidean distance measure uses the straight-line distance between vectors and is therefore sensitive to magnitude, and the dot product measure evaluates the direct product of two vectors, serving as a proxy for measuring vector alignment in multi-dimensional space. Each measure has its specific applications and implications for query accuracy and efficiency.
- Mismatched Embedding Dimension:
  - In Vector Search Index Definition: Creating a vector search index with dimensions that are not the same as those of the actual embeddings leads to inefficient indexing and, consequently, slower searches and query operation failure. In practice, this is common and can occur if the index was created based on an outdated embedding model version or is otherwise out of step with the model in use.
  - For Embedding the Query Vector: Using query vectors of dimensions different from those stored in the database can result in inaccurate retrieved information, as the dimensionality directly impacts the calculation of distances or similarities. More often than not, a vector search query returns an error when the dimension of the query vector does not match that of the vector embeddings in the database.
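To see why the choice of measure matters, the following sketch computes all three measures for the same pair of vectors. The two vectors point in the same direction but differ in length, a case chosen on purpose because the measures disagree about it. The vectors are illustrative only.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance; zero only when the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity; ignores the length of the vectors."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print("cosine:", cosine_similarity(a, b))
print("euclidean:", euclidean_distance(a, b))
print("dot:", dot_product(a, b))
```

Cosine similarity treats the two vectors as a near-perfect match, while Euclidean distance reports them as clearly apart. When embeddings are normalized to unit length, the dot product and cosine similarity produce the same ranking, which is one reason the embedding model's documentation is worth checking before picking a measure.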
Implications
Increased Response Time and Database Load
Inefficient query construction, especially using mismatched embedding dimensions or inappropriate similarity measures, can significantly increase the time it takes for the vector database to execute and complete vector search queries and return results to the client. The database must perform additional computations to navigate through records and retrieve appropriate results based on the query.
It’s also possible to notice an increase in database load when the queries are not optimised for the database’s capabilities. Inefficient indexing due to mismatched dimensions demands more processing and computational resources to compute the distance measures between the query vector embedding and the embeddings stored in the database. This scenario results in increased latency and overall database performance degradation.
Avoidance Strategy
Understanding Similarity Measures
The best method to select an appropriate similarity distance measure is to understand the dataset’s content, characteristics and makeup and benchmark sample vector search query results against various distance measures during the early phase of the database creation processes and before data ingestion operations. This strategy aims to identify the best distance measure to use for the specific use case and the dataset at hand.
Regular Review and Refinement
To avoid dimensionality mismatches in the vector search index definition or the query vector embedding, it is important to enforce the practice of regularly reviewing and refining the database configuration, particularly the vector search index definition. Whenever there’s a significant model update, a review process should be conducted to confirm whether the vector search index needs to be updated.
Additionally, on the client side, precautionary steps and checks can be made to ensure that query vectors match the dimensionality specified in the vector search index to maintain consistency and accuracy in search results.
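A client-side check of this kind can be very small. The sketch below validates a query vector's length against the dimension configured in the index before the query is sent. The function name is hypothetical, and the value 1536 simply mirrors the index definition example shown earlier.

```python
def validate_query_vector(query_vector, index_dimensions):
    """Raise before sending the query if its length differs from the index definition."""
    if len(query_vector) != index_dimensions:
        raise ValueError(
            f"query vector has {len(query_vector)} dimensions, "
            f"but the index expects {index_dimensions}"
        )
    return query_vector

# Dimension taken from the index definition example (assumed to be 1536)
INDEX_DIMENSIONS = 1536

validate_query_vector([0.0] * 1536, INDEX_DIMENSIONS)  # passes silently
try:
    validate_query_vector([0.0] * 768, INDEX_DIMENSIONS)
except ValueError as error:
    print(error)
```

Failing early on the client gives a clearer error message than a rejected query and makes an outdated embedding model easier to spot.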
Conclusion
This article explores the common pitfalls encountered when working with vector databases. Awareness and understanding of these pitfalls and their avoidance strategies are crucial for leveraging vector databases effectively. By addressing these areas, organizations can ensure their vector database infrastructure is performant, efficient, secure, and scalable.
For IT leaders and database administrators, the next step is to thoroughly review their current vector database strategies, identify any potential pitfalls, and implement the necessary measures to avoid them. Continuous learning and adaptation are crucial to maintaining a robust vector database ecosystem.
Several more approaches can be taken to improve vector database usage and performance, such as fine-tuning a custom embedding model to create more appropriate vector embeddings, or taking a step back and observing best practices for data collection, processing, and preparation.