A Visual Guide to Vector Search and K-Nearest Neighbors

May 1, 2024

Generative AI has sparked interest in similarity search and vector databases, particularly their role in Retrieval-Augmented Generation (RAG) systems. Vector databases enable fast similarity search across numerous documents, grounding large language models (LLMs) with real-world data retrieved by similarity search, improving understanding and response accuracy of these LLMs.

This post provides an overview of the long-standing problem of similarity search in computer science and dives into the inner workings of the primary way to solve it: K-Nearest Neighbors (KNN). This post is part one of a visual series covering multiple topics related to vector search. If you find this article interesting, stay tuned!

Some understanding of basic computer science concepts like data structures and time complexity will be helpful to the understanding of this article. To those who need a refresher, I recommend checking out the following resources:

The Problem of Similarity Search

Similarity search, a process that identifies similar items within a large dataset, is crucial in applications such as recommendation systems, search engines, and information retrieval systems. This process is often applied to diverse data types, including text documents, images, and audio files. However, the mechanism of actually comparing two different peices of information to see how “similar” they are may seem unintuitive at first.

For example, how would you compare two text documents or two images to see how similar they are? Unlike numbers, where you can subtract two numbers to get the difference between them, there is no straightforward way to compare two text documents or images.

To remedy this, we can attempt to convert these items into a numerical representation known as a vector, and then compare the resulting vectors. This conversion allows us to perform mathematical comparisons across diverse data more straightforwardly.

For example, let’s consider an example where we want to figure out which two of the three following sentences are most similar to each other:

“Ham Ham Ham Ham Ham"
"Spam Spam Spam Spam Spam"
"Ham Ham Spam Spam Spam”

We can compare these sentences by counting the frequency of each word, therefore converting each sentence into its numerical representation/vector:

“Ham Ham Ham Ham Ham” -> [5, 0]
“Spam Spam Spam Spam Spam” -> [0, 5]
“Ham Ham Spam Spam Spam” -> [2, 3]

Given the above vectors, let us revisit the question—which sentence above, 1 or 2, is most similar to sentence 3? The answer to this example is pretty obvious, but let’s work through this example mathematically. We could calculate the distance between the vectors to determine their difference (or, conversely, similarity). The smaller the distance, the more similar the two vectors are.

One method to calculate the distance between vectors is using the Euclidean distance, or the “straight-line” distance between two vectors.

For the two-dimensional vectors we created above, it’s easy for us to use the Euclidian distance to visualize the distance between the vectors by plotting the vectors as points on a 2D plot:

The animation above makes it easy to see that the distance between vectors 2 and 3 (we’ve labeled vector 3 as q for ‘query vector’) is less than the distance between vectors 1 and 3, which makes sense. Sentence 3 is more similar to sentence 2 than to sentence 1 because it has more instances of “Spam” than “Ham.”

While the Euclidian distance is a simple and intuitive way to measure distance between vectors, it is not always the best choice. High-dimensional spaces can suffer from the “curse of dimensionality,” where noisy or unrelated values in the vector can overshadow valuable values, skewing the distance between vectors. This issue can be partially mitigated with data normalization and dimension weighting. However, other distance metrics are less susceptible to these issues.

One such distance metric that is commonly used is Cosine similarity. This can be a highly performant metric, as it is a fast calculation that ignores the magnitude of the vectors and focuses on the direction.

In our examples on this page, we will continue to use Euclidean distance as it leads to simple visualizations.

Methods to Generate Vectors from Data

There are many techniques for generating vectors from data. In the above example, we used the bag-of-words method to generate a vector from a text document, where each number in the vector represents the frequency of a particular word. More sophisticated methods, such as the use of embeddings generated from large language models (LLM) as vectors, can also be used. The choice of which method is used to generate a vector representation depends on your specific use case.

Hey Neighbour, Can I Borrow Some Sugar?

As the example above shows, a similarity search with vectors is intuitive. Suppose we plot the vectors in a multi-dimensional space. Similar vectors should be close to each other for any distance metric.

This is the intuition behind the k-nearest neighbors search (KNN) algorithm. It calculates the distance between the query vector and all the other vectors in the dataset and returns the k vectors with the smallest distance to the query vector.

Below is an illustration of KNN (with k = 1, k representing the number of similar vectors to return), where we have query vector (q) with a value of [2,3] that we want to find the closest vector to from the following set:

1(6, 1)2(5, 4)3(5, 6)4(1, 5)5(0, 1)q(2, 3)
Searching for the closest node to q.... 

We can see that the query vector is closest to vector 4, so the KNN algorithm would return vector 4 as the nearest neighbor.

Is KNN All You Need?

Despite its simplicity and intuitiveness, the k-nearest neighbors (KNN) algorithm has limitations, primarily due to poor performance in high-dimensional and large-scale datasets due to its linear time complexity.

To touch on algorithmic complexity in the naive KNN brute force search:

Let n represent the number of points in the dataset
Let d represent the data dimensionality (how large a vector is)
Let k be the number of neighbors considered for voting.

Given that, the time complexity for:

finding one nearest neighbor is O(n * d)
for finding k nearest neighbors, it is O(n * d * k)

As you can see in the demo below, the larger the number of vectors we have in a dataset, the slower the search process is.

What strategies can we utilize to make similarity search faster? If we are willing to make the trade off between accuracy of the closest vectors for speed, we can use another class of algorithms known as Approximate Nearest Neighbor (ANN) algorithms, which can run in faster than linear time.

In our next post in this series, we will talk about a specific approximate nearest neighbor algorithm, Hierarchical Navigable Small World (HNSW), and describe how it speeds up vector search via interactive visualizations.