Add more docs #1455

Merged · 12 commits · May 24, 2024
Changes from 1 commit
checkpoint
montanalow committed May 11, 2024
commit ff32794e7c8202d1043fcd299481e75e200668b0
5 changes: 3 additions & 2 deletions pgml-cms/docs/SUMMARY.md
@@ -57,9 +57,10 @@
* [In-database Generation]()
* [Dimensionality Reduction]()
* [Re-ranking nearest neighbors]()
* [Indexing w/ IVFFLat vs HNSW]()
* [Indexing w/ pgvector]()
* [Aggregation]()
* [Personalization]()
* [Vector Similarity](guides/embeddings/vector_similarity.md)
* [Normalizing]()
* [Search]()
* [Keyword Search]()
* [Vector Search]()
71 changes: 63 additions & 8 deletions pgml-cms/docs/guides/embeddings/README.md
@@ -4,27 +4,82 @@ description: Embeddings are a key building block with many applications in moder

# Embeddings

As the demand for sophisticated data analysis and machine learning capabilities within databases grows, so does the need for efficient and scalable solutions. PostgresML offers a powerful platform for integrating machine learning directly into PostgreSQL, enabling you to perform complex computations and predictive analytics without ever leaving your database.

Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. A common use case for embeddings is to provide semantic search capabilities that go beyond traditional keyword matching to the underlying meaning in the data.
Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. They allow computers to operate on natural language and other high-level concepts by reducing them to billions of simple arithmetic operations.

## Applications of Embeddings

- **Search and Information Retrieval**: Embeddings can transform search queries and documents into vectors, making it easier to find the most relevant documents for a given query based on semantic similarity.
- **Personalization**: In recommendation systems, embeddings can help understand user queries and preferences, enhancing the accuracy of recommendations.
- **Text Generation**: Large language models use embeddings to generate coherent and contextually relevant text, which can be applied in scenarios ranging from chatbots to content creation.
- **Natural Language Understanding (NLU)**: Embeddings enable models to perform tasks such as sentiment analysis, named entity recognition, and summarization by understanding the context and meaning of texts.
- **Translation**: In machine translation, embeddings help models understand the semantic and syntactic structures of different languages, facilitating the translation process.

This guide will introduce you to the fundamentals of embeddings within PostgresML. Whether you are looking to enhance text processing capabilities, improve image recognition functions, or simply incorporate more advanced machine learning models into your database, embeddings can play a pivotal role. By integrating these capabilities directly within PostgreSQL, you benefit from streamlined operations, reduced data movement, and the ability to leverage the full power of SQL alongside advanced machine learning techniques.

Throughout this guide, we will cover:
In this guide, we will cover:

* [In-database Generation]()
* [Dimensionality Reduction]()
* [Re-ranking nearest neighbors]()
* [Indexing w/ IVFFLat vs HNSW]()
* [Indexing w/ pgvector]()
* [Aggregation]()
* [Personalization]()

# Embeddings are Vectors
## Embeddings are Vectors

In the context of large language models, embeddings are representations of words, phrases, or even entire sentences. Each word or text snippet is mapped to a vector in a high-dimensional space. These vectors capture semantic and syntactic nuances, meaning that similar words have vectors that are close together in this space. For instance, "king" and "queen" would be represented by vectors that are closer together than "king" and "apple".
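To build intuition, here is a minimal sketch that uses made-up three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions) and Python's built-in `math.dist` to show how "closer together" can be checked numerically. The values are invented purely for illustration and are not output from any actual model.

```python
import math

# Toy "embeddings" -- the values are invented purely for illustration.
# Real models produce vectors with hundreds or thousands of dimensions.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

# math.dist computes the straight-line (Euclidean) distance between two points.
print(math.dist(king, queen))  # smaller distance -> more similar concepts
print(math.dist(king, apple))  # larger distance -> less similar concepts
```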

Embeddings are represented mathematically as a vector, and can be stored in the native Postgres [`ARRAY[]`](https://www.postgresql.org/docs/current/arrays.html) datatype which is compatible with many application programming languages' native datatype. Modern CPUs and GPUs offer hardware acceleration for common array operations, which can give substantial performance benefits when operating at scale, but which are typically unused in a Postgres database. This is referred to as "vectorization" to enable these instruction sets. You'll need to ensure you're compiling your full stack with support for your hardware to get the most bang for your buck, or you can get full acceleration in a PostgresML cloud database.
Vectors can be stored in the native Postgres [`ARRAY[]`](https://www.postgresql.org/docs/current/arrays.html) datatype which is compatible with many application programming languages' native datatypes. Modern CPUs and GPUs offer hardware acceleration for common array operations, which can give substantial performance benefits when operating at scale, but which are typically not enabled in a Postgres database. You'll need to ensure you're compiling your full stack with support for your hardware to get the most bang for your buck, or you can leave that up to us, and get full hardware acceleration in a PostgresML cloud database.
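As a rough illustration of why vectorized execution matters, the sketch below (assuming NumPy is available, used here only as an example of a library that dispatches to compiled, vectorized routines) compares a pure-Python element-wise addition with the same operation on arrays, which can take advantage of SIMD instruction sets.

```python
import time
import numpy as np  # assumed to be installed; used only to illustrate vectorized execution

n = 1_000_000
x = [1.0] * n
y = [2.0] * n

# Naive element-wise addition in pure Python: one interpreted operation per element.
start = time.time()
z = [x_i + y_i for x_i, y_i in zip(x, y)]
print("pure Python:", time.time() - start)

# The same operation on NumPy arrays runs in compiled code that can use SIMD instructions.
x_arr, y_arr = np.array(x), np.array(y)
start = time.time()
z_arr = x_arr + y_arr
print("NumPy:", time.time() - start)
```

The same principle applies inside the database: distance computations over large numbers of stored vectors benefit substantially when the underlying instructions are vectorized for your hardware.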

!!! warning

Other cloud providers claim to offer embeddings "inside the database", but if you benchmark their calls you'll see that their implementations are 10-100x slower than PostgresML. The reason is they aren't actually running inside the database. They are thin wrapper functions that making network calls to other datacenters to compute the embeddings. PostgresML is the only cloud that puts GPU hardware in the database for full acceleration, and it shows.
Other cloud providers claim to offer embeddings "inside the database", but [benchmarks](../../resources/benchmarks/mindsdb-vs-postgresml.md) show that they are orders of magnitude slower than PostgresML. The reason is they don't actually run inside the database with hardware acceleration. They are thin wrapper functions that make network calls to remote service providers. PostgresML is the only cloud that puts GPU hardware in the database for full acceleration, and it shows.

!!!

## Vectors support arithmetic

Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This can be useful for combining two concepts into a single embedding: "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language.

Most vector operations are simple enough to implement in a few lines of code. Here's a naive implementation (no hardware acceleration) of vector addition in some popular languages:

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function add_vectors(x, y) {
    // Sum the pairwise elements of two equal-length vectors
    let result = [];
    for (let i = 0; i < x.length; i++) {
        result[i] = x[i] + y[i];
    }
    return result;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
add_vectors(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
def add_vectors(x, y):
    # Sum the pairwise elements of two equal-length vectors
    return [x_i + y_i for x_i, y_i in zip(x, y)]

x = [1, 2, 3]
y = [1, 2, 3]
add_vectors(x, y)
```

{% endtab %}
{% endtabs %}

### Vector Similarity

Similar embeddings should represent similar concepts. Embedding similarity (≈) is measured by the distance between the two vectors. There are three primary ways to measure the distance between two vectors, each with slight tradeoffs in performance and accuracy.

If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about the available [distance metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.
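As a quick check of that intuition, here's a tiny sketch using made-up vectors and Python's built-in `math.dist` (the Euclidean distance): identical vectors have a distance of 0, and nearby vectors have a small, non-zero distance.

```python
import math

a = [0.5, 0.1, 0.4]
b = [0.5, 0.1, 0.4]   # identical to a
c = [0.5, 0.2, 0.4]   # slightly different from a

print(math.dist(a, b))  # 0.0 -- identical vectors
print(math.dist(a, c))  # small and non-zero -- similar, but not identical
```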

Empty file.
181 changes: 181 additions & 0 deletions pgml-cms/docs/guides/embeddings/vector_similarity.md
@@ -0,0 +1,181 @@
# Vector Distances

There are many distance functions that can be used to measure the similarity or differences between vectors. We list a few of the more common ones here, with details on how they work, to help you choose. They are listed in order of computational complexity, although modern hardware-accelerated implementations can typically compare on the order of 100,000 vectors per second per processor. Modern CPUs may have tens to hundreds of cores, and GPUs have tens of thousands.

## Manhattan Distance

You can think of this distance metric as how long it takes you to walk from one building in Manhattan to another, when you can only walk along streets that run in the four cardinal directions, with no diagonals. It's the fastest distance measure to implement, because it just adds up the absolute values of all the pairwise element differences. It's also referred to as the L1 distance.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function manhattanDistance(x, y) {
    // Sum the absolute pairwise element differences (L1 distance)
    let result = 0;
    for (let i = 0; i < x.length; i++) {
        result += Math.abs(x[i] - y[i]);
    }
    return result;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
manhattanDistance(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
def manhattan_distance(x, y):
    # Sum the absolute pairwise element differences (L1 distance)
    return sum(abs(x_i - y_i) for x_i, y_i in zip(x, y))

x = [1, 2, 3]
y = [1, 2, 3]
manhattan_distance(x, y)
```

{% endtab %}
{% endtabs %}

## Euclidean Distance

This is a simple refinement of Manhattan Distance that applies the Pythagorean theorem to find the length of the straight line between the two points. It involves squaring the differences and taking a final square root, which is a more expensive operation, so it may be slightly slower, but it is also a more accurate representation in high-dimensional spaces. It's also referred to as the L2 distance.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function euclideanDistance(x, y) {
    // Sum the squared pairwise differences, then take the square root (L2 distance)
    let result = 0;
    for (let i = 0; i < x.length; i++) {
        result += Math.pow(x[i] - y[i], 2);
    }
    return Math.sqrt(result);
}

let x = [1, 2, 3];
let y = [1, 2, 3];
euclideanDistance(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
import math

def euclidean_distance(x, y):
    # Sum the squared pairwise differences, then take the square root (L2 distance)
    return math.sqrt(sum((x_i - y_i) ** 2 for x_i, y_i in zip(x, y)))

x = [1, 2, 3]
y = [1, 2, 3]
euclidean_distance(x, y)
```

{% endtab %}
{% endtabs %}

## Inner product

The inner product (the dot product in Euclidean space) can be used to measure how similar any two vectors are, by summing the products of their corresponding elements, which compares the directions they point. Two completely different (orthogonal) vectors have an inner product of 0. If vectors point in opposite directions, the inner product will be negative. Positive numbers indicate the vectors point in the same direction, and are more similar.

This metric is as fast to compute as the Euclidean Distance, but may provide more relevant results if all vectors are normalized. If vectors are not normalized, it will bias results toward vectors with larger magnitudes, and you should consider using the cosine distance instead.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function innerProduct(x, y) {
    // Sum the products of the pairwise elements (dot product)
    let result = 0;
    for (let i = 0; i < x.length; i++) {
        result += x[i] * y[i];
    }
    return result;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
innerProduct(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
def inner_product(x, y):
    # Sum the products of the pairwise elements (dot product)
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

x = [1, 2, 3]
y = [1, 2, 3]
inner_product(x, y)
```

{% endtab %}
{% endtabs %}


## Cosine Distance

Cosine distance is a popular metric because it normalizes the vectors, which means it only considers the difference of the angle between the two vectors, not their magnitudes. If you don't know whether your vectors have been normalized, this may be a safer bet than the inner product. It is one of the more complicated algorithms to implement, but the difference may be negligible with modern hardware-accelerated instruction sets, depending on your workload profile.

You can also use PostgresML to [normalize all your vectors](vector_normalization.md) as a separate processing step, paying that cost only at indexing time; then the inner product will provide equivalent distance measures, as the sketch after the code samples below illustrates.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function cosineDistance(a, b) {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    normA = Math.sqrt(normA);
    normB = Math.sqrt(normB);

    if (normA === 0 || normB === 0) {
        throw new Error("Norm of one or both vectors is 0, cannot compute cosine similarity.");
    }

    const cosineSimilarity = dotProduct / (normA * normB);
    return 1 - cosineSimilarity;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
cosineDistance(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
import math

def cosine_distance(a, b):
    dot_product = 0
    norm_a = 0
    norm_b = 0

    for a_i, b_i in zip(a, b):
        dot_product += a_i * b_i
        norm_a += a_i * a_i
        norm_b += b_i * b_i

    norm_a = math.sqrt(norm_a)
    norm_b = math.sqrt(norm_b)

    if norm_a == 0 or norm_b == 0:
        raise ValueError("Norm of one or both vectors is 0, cannot compute cosine similarity.")

    cosine_similarity = dot_product / (norm_a * norm_b)
    return 1 - cosine_similarity

x = [1, 2, 3]
y = [1, 2, 3]
cosine_distance(x, y)
```

{% endtab %}
{% endtabs %}
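
To make the normalization point above concrete, here is a minimal, self-contained sketch (with invented example vectors) showing that computing the cosine distance directly gives the same result as normalizing the vectors first and then taking `1 - inner_product`.

```python
import math

def normalize(v):
    # Scale a vector to unit length (its L2 norm becomes 1)
    norm = math.sqrt(sum(v_i * v_i for v_i in v))
    return [v_i / norm for v_i in v]

def inner_product(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

def cosine_distance(a, b):
    norm_a = math.sqrt(sum(a_i * a_i for a_i in a))
    norm_b = math.sqrt(sum(b_i * b_i for b_i in b))
    return 1 - inner_product(a, b) / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]

# Cosine distance computed directly...
print(cosine_distance(a, b))                          # ≈ 0.2857
# ...matches 1 minus the inner product of the pre-normalized vectors.
print(1 - inner_product(normalize(a), normalize(b)))  # ≈ 0.2857
```

If vectors are normalized once at indexing time, the cheaper inner product can be used for every subsequent comparison.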