Add more docs #1455

Merged: 12 commits, May 24, 2024
Changes from 1 commit
normalization
montanalow committed May 11, 2024
commit ccad0b76cb4aed4c1b58dc4a4744944c39d65b09
2 changes: 1 addition & 1 deletion pgml-cms/docs/SUMMARY.md
```diff
@@ -60,7 +60,7 @@
 * [Indexing w/ pgvector]()
 * [Aggregation]()
 * [Vector Similarity](guides/embeddings/vector_similarity.md)
-* [Normalizing]()
+* [Vector Normalization](guides/embeddings/vector_normalization.md)
 * [Search]()
 * [Keyword Search]()
 * [Vector Search]()
```
6 changes: 3 additions & 3 deletions pgml-cms/docs/guides/embeddings/README.md
```diff
@@ -40,7 +40,7 @@ Other cloud providers claim to offer embeddings "inside the database", but [benc
 
 ## Vectors support arithmetic
 
-Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language.
+Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single new embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language and precipitation.
 
 Most vector operations are simple enough to implement in a few lines of code. Here's a naive implementation (no hardware acceleration) of vector addition in some popular languages:
 
@@ -79,7 +79,7 @@ add(x, y)
 
 ### Vector Similarity
 
-Similar embeddings should represent similar concepts. Embedding similarity (≈) is defined as the distance between their two vectors. There are 3 primary ways to measure the distance between two vectors, that have slight tradeoffs in performance and accuracy.
+Similar embeddings should represent similar concepts. If we have one embedding created from a user query and a bunch of other embeddings from documents, we can find documents that are most similar to the query by calculating the similarity between the query and each document. Embedding similarity (≈) is defined as the distance between the two vectors.
 
-If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [difference metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.
+There are several ways to measure the distance between two vectors, with tradeoffs in latency and accuracy. If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [distance metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.
```
89 changes: 89 additions & 0 deletions pgml-cms/docs/guides/embeddings/vector_normalization.md
@@ -0,0 +1,89 @@
# Vector Normalization

The purpose of vector normalization is to convert a vector into a unit vector: a vector that retains the same direction but has a magnitude (or length) of 1. This process is essential for computational techniques where the magnitude of a vector would otherwise influence the outcome undesirably, such as calculating cosine similarity or comparing vectors based solely on direction.
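
Concretely, L2 normalization divides each component of a vector by the vector's Euclidean length. Here's a quick sanity check, a sketch that assumes the `pgml.normalize_l2` function introduced below accepts a `FLOAT[]` literal:

```sql
-- Normalize [3, 4]: its L2 length is 5, so the result is [0.6, 0.8],
-- and the length of the normalized vector should be 1
SELECT sqrt(sum(x * x)) AS magnitude
FROM unnest(pgml.normalize_l2(ARRAY[3.0, 4.0]::FLOAT[])) AS x;
```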

## Purpose and Benefits

- **Cosine Similarity**: In machine learning and data science, normalized vectors let you use the inner product instead of the more expensive cosine similarity metric. The inner product only equals the cosine of the angle between two vectors when both have unit length, so normalization is a prerequisite. L2-normalized vectors indexed with the inner product can give a 3x speedup compared to computing the cosine similarity, while yielding otherwise identical results (see the sketch after this list).

- **Directionality**: Normalization strips away the magnitude of the vector, leaving a descriptor of direction only. This is useful when direction matters more than length, such as feature scaling in machine learning, where you want features to have equal influence regardless of their absolute values.

- **Stability in Computations**: When vectors are normalized, numerical computations involving them are often more stable and less susceptible to problems due to very large or very small scale factors.
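
Here's a sketch of the cosine equivalence, assuming `pgml.cosine_similarity` is available as described in the [distance metrics](vector_similarity.md) guide; both columns should return the same value:

```sql
-- The dot product of two L2-normalized vectors equals their cosine similarity
WITH vectors AS (
    SELECT ARRAY[1.0, 2.0, 2.0]::FLOAT[] AS a,
           ARRAY[2.0, 0.0, 1.0]::FLOAT[] AS b
)
SELECT
    pgml.cosine_similarity(a, b) AS cosine,
    pgml.dot_product(pgml.normalize_l2(a), pgml.normalize_l2(b)) AS dot_of_normalized
FROM vectors;
```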

## Storing and Normalizing Data

Assume you've created a table in your database that stores embeddings generated using [pgml.embed()](../../api/sql-extension/pgml.embed.md), although you can normalize any vector.

```sql
-- Create a table to store your text data and its vector representation
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    embedding FLOAT[] GENERATED ALWAYS AS (pgml.embed('intfloat/e5-small-v2', body)) STORED
);
```

```sql
-- Example of inserting text and its corresponding embedding
INSERT INTO documents (body)
VALUES -- embedding vectors are generated automatically
    ('Example text data'),
    ('Another example document'),
    ('Some other thing');
```

You could create a new table from your documents and their embeddings that uses normalized embeddings.

```sql
CREATE TABLE documents_normalized_vectors AS
SELECT
    id AS document_id,
    pgml.normalize_l2(embedding) AS normalized_l2_embedding
FROM documents;
```

Another valid approach would be to store the normalized embedding directly in the documents table.

```sql
-- Create a table to store your text data and its normalized vector representation
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED
);
```

## Normalization Functions

Normalization is critical for ensuring that the magnitudes of feature vectors do not distort the performance of machine learning algorithms. Three normalization functions are available (compared on a concrete vector in the sketch after this list):

- **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions.

  ```sql
  SELECT pgml.normalize_l1(embedding) FROM documents;
  ```

- **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning.

  ```sql
  SELECT pgml.normalize_l2(embedding) FROM documents;
  ```

- **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. This normalization is less common but can be useful when the maximum value represents a bounded capacity.

  ```sql
  SELECT pgml.normalize_max(embedding) FROM documents;
  ```
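
Here's a sketch comparing all three on the same vector, using a `FLOAT[]` literal rather than a stored embedding:

```sql
-- Comparing the three norms on the vector [3, -4]
SELECT
    pgml.normalize_l1(ARRAY[3.0, -4.0]::FLOAT[]) AS l1,   -- absolute values sum to 1: {0.43, -0.57}
    pgml.normalize_l2(ARRAY[3.0, -4.0]::FLOAT[]) AS l2,   -- squared components sum to 1: {0.6, -0.8}
    pgml.normalize_max(ARRAY[3.0, -4.0]::FLOAT[]) AS max; -- largest absolute value becomes 1: {0.75, -1.0}
```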

## Querying and Using Normalized Vectors

After normalization, you can use these vectors for various applications, such as similarity searches, clustering, or as input for further machine learning models within PostgresML.

```sql
-- Pairwise similarity: the dot product of L2-normalized vectors
-- is equivalent to their cosine similarity
WITH normalized_vectors AS (
    SELECT id, pgml.normalize_l2(embedding) AS norm_vector
    FROM documents
)
SELECT a.id, b.id, pgml.dot_product(a.norm_vector, b.norm_vector) AS similarity
FROM normalized_vectors a, normalized_vectors b
WHERE a.id <> b.id;
```
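
A more common pattern is ranking stored documents against a single query embedding. Here's a sketch, assuming the `documents` table uses the normalized generated column shown earlier and `'user query text'` stands in for a real query:

```sql
-- Rank documents against a query; since both sides are L2-normalized,
-- the dot product is their cosine similarity
SELECT id, body,
    pgml.dot_product(
        embedding,
        pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', 'user query text'))
    ) AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 5;
```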

## Considerations and Best Practices

- **Performance**: Normalization can be computationally intensive, especially with large datasets. Consider batch processing and appropriate indexing.
- **Storage**: Normalized vectors might not need to be persisted if they are only used transiently, which can save storage.
2 changes: 2 additions & 0 deletions pgml-extension/requirements.txt
```diff
@@ -45,6 +45,8 @@ sentence-transformers
 rouge
 sacrebleu
 sacremoses
+evaluate
+trl
 
 # Utils
 datasets
```