Add more docs #1455

Merged: 12 commits, May 24, 2024
Changes from 1 commit
normalization
montanalow committed May 11, 2024
commit ccad0b76cb4aed4c1b58dc4a4744944c39d65b09
2 changes: 1 addition & 1 deletion pgml-cms/docs/SUMMARY.md
```diff
@@ -60,7 +60,7 @@
 * [Indexing w/ pgvector]()
 * [Aggregation]()
 * [Vector Similarity](guides/embeddings/vector_similarity.md)
-* [Normalizing]()
+* [Vector Normalization](guides/embeddings/vector_normalization.md)
 * [Search]()
 * [Keyword Search]()
 * [Vector Search]()
```
6 changes: 3 additions & 3 deletions pgml-cms/docs/guides/embeddings/README.md
```diff
@@ -40,7 +40,7 @@ Other cloud providers claim to offer embeddings "inside the database", but [benc
 
 ## Vectors support arithmetic
 
-Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language.
+Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single new embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language and precipitation.
 
 Most vector operations are simple enough to implement in a few lines of code. Here's a naive implementation (no hardware acceleration) of vector addition in some popular languages:
 
@@ -79,7 +79,7 @@ add(x, y)
 
 ### Vector Similarity
 
-Similar embeddings should represent similar concepts. Embedding similarity (≈) is defined as the distance between their two vectors. There are 3 primary ways to measure the distance between two vectors, that have slight tradeoffs in performance and accuracy.
+Similar embeddings should represent similar concepts. If we have one embedding created from a user query and a bunch of other embeddings from documents, we can find documents that are most similar to the query by calculating the similarity between the query and each document. Embedding similarity (≈) is defined as the distance between the two vectors.
 
-If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [difference metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.
+There are several ways to measure the distance between two vectors, with tradeoffs in latency and accuracy. If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [distance metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.
```
89 changes: 89 additions & 0 deletions pgml-cms/docs/guides/embeddings/vector_normalization.md
@@ -0,0 +1,89 @@
# Vector Normalization

The purpose of vector normalization is to convert a vector into a unit vector: a vector that retains the same direction but has a magnitude (or length) of 1. This process is essential for computational techniques where the magnitude of a vector would otherwise influence the outcome undesirably, such as calculating cosine similarity or comparing vectors based solely on direction.
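
Concretely, L2 normalization divides each component of a vector by the vector's Euclidean length. Here's a quick sanity check, a sketch that assumes the `pgml.normalize_l2` function introduced below accepts a `FLOAT[]` literal:

```sql
-- Normalize [3, 4]: its L2 length is 5, so the result is [0.6, 0.8],
-- and the length of the normalized vector should be 1
SELECT sqrt(sum(x * x)) AS magnitude
FROM unnest(pgml.normalize_l2(ARRAY[3.0, 4.0]::FLOAT[])) AS x;
```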

## Purpose and Benefits

- **Cosine Similarity**: In machine learning and data science, normalized vectors let you use the inner product instead of the more expensive cosine similarity metric. The inner product only equals the cosine of the angle between two vectors when both have unit length, so normalization is a prerequisite. L2-normalized vectors indexed with the inner product can give a 3x speedup compared to computing the cosine similarity, while yielding otherwise identical results (see the sketch after this list).

- **Directionality**: Normalization strips away the magnitude of the vector, leaving a descriptor of direction only. This is useful when direction matters more than length, such as feature scaling in machine learning, where you want features to have equal influence regardless of their absolute values.

- **Stability in Computations**: When vectors are normalized, numerical computations involving them are often more stable and less susceptible to problems due to very large or very small scale factors.
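
Here's a sketch of the cosine equivalence, assuming `pgml.cosine_similarity` is available as described in the [distance metrics](vector_similarity.md) guide; both columns should return the same value:

```sql
-- The dot product of two L2-normalized vectors equals their cosine similarity
WITH vectors AS (
    SELECT ARRAY[1.0, 2.0, 2.0]::FLOAT[] AS a,
           ARRAY[2.0, 0.0, 1.0]::FLOAT[] AS b
)
SELECT
    pgml.cosine_similarity(a, b) AS cosine,
    pgml.dot_product(pgml.normalize_l2(a), pgml.normalize_l2(b)) AS dot_of_normalized
FROM vectors;
```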

## Storing and Normalizing Data

Assume you've created a table in your database that stores embeddings generated using [pgml.embed()](../../api/sql-extension/pgml.embed.md), although you can normalize any vector.

```sql
-- Create a table to store your text data and its vector representation
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    embedding FLOAT[] GENERATED ALWAYS AS (pgml.embed('intfloat/e5-small-v2', body)) STORED
);
```

```sql
-- Example of inserting text and its corresponding embedding
INSERT INTO documents (body)
VALUES -- embedding vectors are generated automatically
    ('Example text data'),
    ('Another example document'),
    ('Some other thing');
```

You could create a new table from your documents and their embeddings that uses normalized embeddings.

```sql
CREATE TABLE documents_normalized_vectors AS
SELECT
    id AS document_id,
    pgml.normalize_l2(embedding) AS normalized_l2_embedding
FROM documents;
```

Another valid approach would be to store the normalized embedding directly in the documents table.

```sql
-- Create a table to store your text data and its normalized vector representation
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED
);
```

## Normalization Functions

Normalization is critical for ensuring that the magnitudes of feature vectors do not distort the performance of machine learning algorithms. Three normalization functions are available (compared on a concrete vector in the sketch after this list):

- **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions.

  ```sql
  SELECT pgml.normalize_l1(embedding) FROM documents;
  ```

- **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning.

  ```sql
  SELECT pgml.normalize_l2(embedding) FROM documents;
  ```

- **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. This normalization is less common but can be useful when the maximum value represents a bounded capacity.

  ```sql
  SELECT pgml.normalize_max(embedding) FROM documents;
  ```
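
Here's a sketch comparing all three on the same vector, using a `FLOAT[]` literal rather than a stored embedding:

```sql
-- Comparing the three norms on the vector [3, -4]
SELECT
    pgml.normalize_l1(ARRAY[3.0, -4.0]::FLOAT[]) AS l1,   -- absolute values sum to 1: {0.43, -0.57}
    pgml.normalize_l2(ARRAY[3.0, -4.0]::FLOAT[]) AS l2,   -- squared components sum to 1: {0.6, -0.8}
    pgml.normalize_max(ARRAY[3.0, -4.0]::FLOAT[]) AS max; -- largest absolute value becomes 1: {0.75, -1.0}
```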

## Querying and Using Normalized Vectors

After normalization, you can use these vectors for various applications, such as similarity searches, clustering, or as input for further machine learning models within PostgresML.

```sql
-- Pairwise similarity: the dot product of L2-normalized vectors
-- is equivalent to their cosine similarity
WITH normalized_vectors AS (
    SELECT id, pgml.normalize_l2(embedding) AS norm_vector
    FROM documents
)
SELECT a.id, b.id, pgml.dot_product(a.norm_vector, b.norm_vector) AS similarity
FROM normalized_vectors a, normalized_vectors b
WHERE a.id <> b.id;
```
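
A more common pattern is ranking stored documents against a single query embedding. Here's a sketch, assuming the `documents` table uses the normalized generated column shown earlier and `'user query text'` stands in for a real query:

```sql
-- Rank documents against a query; since both sides are L2-normalized,
-- the dot product is their cosine similarity
SELECT id, body,
    pgml.dot_product(
        embedding,
        pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', 'user query text'))
    ) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 5;
```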

## Considerations and Best Practices

- **Performance**: Normalization can be computationally intensive, especially with large datasets. Consider batch processing and appropriate indexing.
- **Storage**: Normalized vectors might not need to be persisted if they are only used transiently, which can save storage.
2 changes: 2 additions & 0 deletions pgml-extension/requirements.txt
```diff
@@ -45,6 +45,8 @@ sentence-transformers
 rouge
 sacrebleu
 sacremoses
+evaluate
+trl
 
 # Utils
 datasets
```