From b15340b7f858b7e423ee6cb5c74498a97535ce0e Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 10 May 2024 13:43:10 -0700 Subject: [PATCH 01/11] Revert "Revert "outline"" This reverts commit 2f781807d5eb656489140b4d2e4943644450c044. --- pgml-cms/docs/SUMMARY.md | 49 +++++++++++++++++++++++ pgml-cms/docs/guides/embeddings/README.md | 30 ++++++++++++++ 2 files changed, 79 insertions(+) create mode 100644 pgml-cms/docs/guides/embeddings/README.md diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 6c463d16c..48568ea8e 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -51,6 +51,55 @@ * [Semantic Search](api/client-sdk/tutorials/semantic-search.md) * [Semantic Search Using Instructor Model](api/client-sdk/tutorials/semantic-search-1.md) +## Guides + +* [Embeddings](guides/embeddings/README.md) + * [In-database Generation]() + * [Dimensionality Reduction]() + * [Re-ranking nearest neighbors]() + * [Indexing w/ IVFFLat vs HNSW]() + * [Aggregation]() + * [Personalization]() +* [Search]() + * [Keyword Search]() + * [Vector Search]() + * [Hybrid Search]() + * [Ranking]() +* [Transformers & LLMs]() + * [Text Generation]() + * [Prompt Engineering]() + * [Unified RAG]() +* [Personalization]() +* [Recommendations]() +* [Forecasting]() + * [Time series]() + * [Events]() +* [Fraud Detection]() +* [Incentive Optimization]() +* [Sentiment Analysis]() +* [Summarization]() + +## Reference + +* [SQL]() + * [Explain plans]() + * [Composition]() +* [Machine Learning]() + * [Feature Engineering]() + * [Regression]() + * [Classification]() + * [Clustering]() + * [Matrix Decomposition]() +* [Natural Language Processing]() + * [Tokenization]() + * [Chunking]() + * [Text Generation]() +* [LLMs]() + * [LLama]() + * [GPT]() + * [Facon]() +* [Glossary]() + ## Product * [Cloud database](product/cloud-database/README.md) diff --git a/pgml-cms/docs/guides/embeddings/README.md b/pgml-cms/docs/guides/embeddings/README.md new file mode 100644 index 000000000..87356d7e8 --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/README.md @@ -0,0 +1,30 @@ +--- +description: Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. A common use case for embeddings is to provide semantic search capabilities that go beyond traditional keyword matching to the underlying meaning in the data. +--- + +# Embeddings + +As the demand for sophisticated data analysis and machine learning capabilities within databases grows, so does the need for efficient and scalable solutions. PostgresML offers a powerful platform for integrating machine learning directly into PostgreSQL, enabling you to perform complex computations and predictive analytics without ever leaving your database. + +Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. A common use case for embeddings is to provide semantic search capabilities that go beyond traditional keyword matching to the underlying meaning in the data. + +This guide will introduce you to the fundamentals of embeddings within PostgresML. 
Whether you are looking to enhance text processing capabilities, improve image recognition functions, or simply incorporate more advanced machine learning models into your database, embeddings can play a pivotal role. By integrating these capabilities directly within PostgreSQL, you benefit from streamlined operations, reduced data movement, and the ability to leverage the full power of SQL alongside advanced machine learning techniques. + +Throughout this guide, we will cover: + +* [In-database Generation]() +* [Dimensionality Reduction]() +* [Re-ranking nearest neighbors]() +* [Indexing w/ IVFFLat vs HNSW]() +* [Aggregation]() +* [Personalization]() + +# Embeddings are Vectors + +Embeddings are represented mathematically as a vector, and can be stored in the native Postgres [`ARRAY[]`](https://www.postgresql.org/docs/current/arrays.html) datatype which is compatible with many application programming languages' native datatype. Modern CPUs and GPUs offer hardware acceleration for common array operations, which can give substantial performance benefits when operating at scale, but which are typically unused in a Postgres database. This is referred to as "vectorization" to enable these instruction sets. You'll need to ensure you're compiling your full stack with support for your hardware to get the most bang for your buck, or you can get full acceleration in a PostgresML cloud database. + +!!! warning + +Other cloud providers claim to offer embeddings "inside the database", but if you benchmark their calls you'll see that their implementations are 10-100x slower than PostgresML. The reason is they aren't actually running inside the database. They are thin wrapper functions that making network calls to other datacenters to compute the embeddings. PostgresML is the only cloud that puts GPU hardware in the database for full acceleration, and it shows. + +!!! From ff32794e7c8202d1043fcd299481e75e200668b0 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 10 May 2024 17:47:31 -0700 Subject: [PATCH 02/11] checkpoint --- pgml-cms/docs/SUMMARY.md | 5 +- pgml-cms/docs/guides/embeddings/README.md | 71 ++++++- .../guides/embeddings/vector_normalization.md | 0 .../guides/embeddings/vector_similarity.md | 181 ++++++++++++++++++ 4 files changed, 247 insertions(+), 10 deletions(-) create mode 100644 pgml-cms/docs/guides/embeddings/vector_normalization.md create mode 100644 pgml-cms/docs/guides/embeddings/vector_similarity.md diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 48568ea8e..facbfe742 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -57,9 +57,10 @@ * [In-database Generation]() * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() - * [Indexing w/ IVFFLat vs HNSW]() + * [Indexing w/ pgvector]() * [Aggregation]() - * [Personalization]() + * [Vector Similarity](guides/embeddings/vector_similarity.md) + * [Normalizing]() * [Search]() * [Keyword Search]() * [Vector Search]() diff --git a/pgml-cms/docs/guides/embeddings/README.md b/pgml-cms/docs/guides/embeddings/README.md index 87356d7e8..494934992 100644 --- a/pgml-cms/docs/guides/embeddings/README.md +++ b/pgml-cms/docs/guides/embeddings/README.md @@ -4,27 +4,82 @@ description: Embeddings are a key building block with many applications in moder # Embeddings -As the demand for sophisticated data analysis and machine learning capabilities within databases grows, so does the need for efficient and scalable solutions. 
PostgresML offers a powerful platform for integrating machine learning directly into PostgreSQL, enabling you to perform complex computations and predictive analytics without ever leaving your database. +As the demand for sophisticated data analysis and machine learning capabilities within databases grows, so does the need for efficient and scalable solutions. PostgresML offers a powerful platform for integrating machine learning directly into PostgreSQL, enabling you to perform complex computations and predictive analytics without ever leaving your database. -Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. A common use case for embeddings is to provide semantic search capabilities that go beyond traditional keyword matching to the underlying meaning in the data. +Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. They allow computers to operate on natural language and other high level concepts by reducing them to billions of simple arithmetic operations. + +## Applications of Embeddings + +- **Search and Information Retrieval**: Embeddings can transform search queries and documents into vectors, making it easier to find the most relevant documents for a given query based on semantic similarity. +- **Personalization**: In recommendation systems, embeddings can help understand user queries and preferences, enhancing the accuracy of recommendations. +- **Text Generation**: Large language models use embeddings to generate coherent and contextually relevant text, which can be applied in scenarios ranging from chatbots to content creation. +- **Natural Language Understanding (NLU)**: Embeddings enable models to perform tasks such as sentiment analysis, named entity recognition, and summarization by understanding the context and meaning of texts. +- **Translation**: In machine translation, embeddings help models understand the semantic and syntactic structures of different languages, facilitating the translation process. This guide will introduce you to the fundamentals of embeddings within PostgresML. Whether you are looking to enhance text processing capabilities, improve image recognition functions, or simply incorporate more advanced machine learning models into your database, embeddings can play a pivotal role. By integrating these capabilities directly within PostgreSQL, you benefit from streamlined operations, reduced data movement, and the ability to leverage the full power of SQL alongside advanced machine learning techniques. -Throughout this guide, we will cover: +In this guide, we will cover: * [In-database Generation]() * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() -* [Indexing w/ IVFFLat vs HNSW]() +* [Indexing w/ pgvector]() * [Aggregation]() -* [Personalization]() -# Embeddings are Vectors +## Embeddings are Vectors + +In the context of large language models, embeddings are representations of words, phrases, or even entire sentences. Each word or text snippet is mapped to a vector in a high-dimensional space. These vectors capture semantic and syntactic nuances, meaning that similar words have vectors that are close together in this space. 
For instance, "king" and "queen" would be represented by vectors that are closer together than "king" and "apple".

Vectors can be stored in the native Postgres [`ARRAY[]`](https://www.postgresql.org/docs/current/arrays.html) datatype which is compatible with many application programming languages' native datatypes. Modern CPUs and GPUs offer hardware acceleration for common array operations, which can give substantial performance benefits when operating at scale, but which are typically not enabled in a Postgres database. You'll need to ensure you're compiling your full stack with support for your hardware to get the most bang for your buck, or you can leave that up to us, and get full hardware acceleration in a PostgresML cloud database.

!!! warning

Other cloud providers claim to offer embeddings "inside the database", but [benchmarks](../../resources/benchmarks/mindsdb-vs-postgresml.md) show that they are orders of magnitude slower than PostgresML. The reason is they don't actually run inside the database with hardware acceleration. They are thin wrapper functions that make network calls to remote service providers. PostgresML is the only cloud that puts GPU hardware in the database for full acceleration, and it shows.

!!!

## Vectors support arithmetic

Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single embedding. For example, "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language.

Most vector operations are simple enough to implement in a few lines of code. Here's a naive implementation (no hardware acceleration) of vector addition in some popular languages:

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function add_vectors(x, y) {
    let result = [];
    for (let i = 0; i < x.length; i++) {
        result[i] = x[i] + y[i];
    }
    return result;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
add_vectors(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
def add_vectors(x, y):
    return [x + y for x, y in zip(x, y)]

x = [1, 2, 3]
y = [1, 2, 3]
add_vectors(x, y)
```

{% endtab %}
{% endtabs %}

### Vector Similarity

Similar embeddings should represent similar concepts. 
Embedding similarity (≈) is defined as the distance between their two vectors. There are 3 primary ways to measure the distance between two vectors, that have slight tradeoffs in performance and accuracy.

If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [difference metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place.

diff --git a/pgml-cms/docs/guides/embeddings/vector_normalization.md b/pgml-cms/docs/guides/embeddings/vector_normalization.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/pgml-cms/docs/guides/embeddings/vector_similarity.md b/pgml-cms/docs/guides/embeddings/vector_similarity.md
new file mode 100644
index 000000000..3bf484075
--- /dev/null
+++ b/pgml-cms/docs/guides/embeddings/vector_similarity.md
@@ -0,0 +1,181 @@
# Vector Distances
There are many distance functions that can be used to measure the similarity or differences between vectors. We list a few of the more common ones here with details on how they work, to help you choose. They are listed here in order of computational complexity, although modern hardware accelerated implementations can typically compare on the order of 100,000 vectors per second per processor. Modern CPUs may have tens to hundreds of cores, and GPUs have tens of thousands.

## Manhattan Distance

You can think of this distance metric as how long it takes you to walk from one building in Manhattan to another, when you can only walk along streets that go the 4 cardinal directions, with no diagonals. It's the fastest distance measure to implement, because it just adds up all the pairwise element differences. It's also referred to as the L1 distance.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function manhattanDistance(x, y) {
    let result = 0;
    for (let i = 0; i < x.length; i++) {
        result += Math.abs(x[i] - y[i]);
    }
    return result;
}

let x = [1, 2, 3];
let y = [1, 2, 3];
manhattanDistance(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
def manhattan_distance(x, y):
    return sum([abs(x - y) for x, y in zip(x, y)])

x = [1, 2, 3]
y = [1, 2, 3]
manhattan_distance(x, y)
```

{% endtab %}
{% endtabs %}

## Euclidean Distance

This is a simple refinement of Manhattan Distance that applies the Pythagorean theorem to find the length of the straight line between the two points. It involves squaring the differences and then taking the final square root, which is a more expensive operation, so it may be slightly slower, but is also a more accurate representation in high dimensional spaces. It's also referred to as the L2 distance.

{% tabs %}
{% tab title="JavaScript" %}

```javascript
function euclideanDistance(x, y) {
    let result = 0;
    for (let i = 0; i < x.length; i++) {
        result += Math.pow(x[i] - y[i], 2);
    }
    return Math.sqrt(result);
}

let x = [1, 2, 3];
let y = [1, 2, 3];
euclideanDistance(x, y)
```

{% endtab %}

{% tab title="Python" %}

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum([(x - y) * (x - y) for x, y in zip(x, y)]))

x = [1, 2, 3]
y = [1, 2, 3]
euclidean_distance(x, y)
```

{% endtab %}
{% endtabs %}

## Inner product

The inner product (the dot product in Euclidean space) can be used to find how similar any two vectors are, by measuring the overlap of each element, which compares the direction they point. 
Two completely different (orthogonal) vectors have an inner product of 0. If vectors point in opposite directions, the inner product will be negative. Positive numbers indicate the vectors point in the same direction, and are more similar. + +This metric is as fast to compute as the Euclidean Distance, but may provide more relevant results if all vectors are normalized. If vectors are not normalized, it will bias results toward vectors with larger magnitudes, and you should consider using the cosine distance instead. + +{% tabs %} +{% tab title="JavaScript" %} + +```javascript +function innerProduct(x, y) { + let result = 0; + for (let i = 0; i < x.length; i++) { + result += x[i] * y[i]; + } + return result; +} + +let x = [1, 2, 3]; +let y = [1, 2, 3]; +innerProduct(x, y) +``` + +{% endtab %} + +{% tab title="Python" %} + +```python +def inner_product(x, y): + return sum([x*y for x,y in zip(x,y)]) + +x = [1, 2, 3] +y = [1, 2, 3] +inner_product(x, y) +``` + +{% endtab %} +{% endtabs %} + + +## Cosine Distance + +Cosine distance is a popular metric, because it normalizes the vectors, which means it only considers the difference of the angle between the two vectors, not their magnitudes. If you don't know that your vectors have been normalized, this may be a safer bet than the inner product. It is one of the more complicated algorithms to implement, but differences may be negligible w/ modern hardware accelerated instruction sets depending on your workload profile. + +You can also use PostgresML to [normalize all your vectors](vector_normalization.md) as a separate processing step to pay that cost only at indexing time, and then the inner product will provide equivalent distance measures. + +{% tabs %} +{% tab title="JavaScript" %} + +```javascript +function cosineDistance(a, b) { + let dotProduct = 0; + let normA = 0; + let normB = 0; + + for (let i = 0; i < a.length; i++) { + dotProduct += a[i] * b[i]; + normA += a[i] * a[i]; + normB += b[i] * b[i]; + } + + normA = Math.sqrt(normA); + normB = Math.sqrt(normB); + + if (normA === 0 || normB === 0) { + throw new Error("Norm of one or both vectors is 0, cannot compute cosine similarity."); + } + + const cosineSimilarity = dotProduct / (normA * normB); + const cosineDistance = 1 - cosineSimilarity; + + return cosineDistance; +} +``` + +{% endtab %} + +{% tab title="Python" %} + +```python +def cosine_distance(a, b): + dot_product = 0 + normA = 0 + normB = 0 + + for a, b in zip(a, b): + dot_product += a * b + normA += a * a + normB += b * b + + normA = math.sqrt(normA) + normB = math.sqrt(normB) + + if normA == 0 or normB == 0: + raise ValueError("Norm of one or both vectors is 0, cannot compute cosine similarity.") + + cosine_similarity = dot_product / (normA * normB) + cosine_distance = 1 - cosine_similarity + + return cosine_distance +``` + +{% endtab %} +{% endtabs %} From ccad0b76cb4aed4c1b58dc4a4744944c39d65b09 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 10 May 2024 21:47:44 -0700 Subject: [PATCH 03/11] normalization --- pgml-cms/docs/SUMMARY.md | 2 +- pgml-cms/docs/guides/embeddings/README.md | 6 +- .../guides/embeddings/vector_normalization.md | 89 +++++++++++++++++++ pgml-extension/requirements.txt | 2 + 4 files changed, 95 insertions(+), 4 deletions(-) diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index facbfe742..1c5a04cb7 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -60,7 +60,7 @@ * [Indexing w/ pgvector]() * [Aggregation]() * [Vector 
Similarity](guides/embeddings/vector_similarity.md) - * [Normalizing]() + * [Vector Normalization](guides/embeddings/vector_normalization.md) * [Search]() * [Keyword Search]() * [Vector Search]() diff --git a/pgml-cms/docs/guides/embeddings/README.md b/pgml-cms/docs/guides/embeddings/README.md index 494934992..000888470 100644 --- a/pgml-cms/docs/guides/embeddings/README.md +++ b/pgml-cms/docs/guides/embeddings/README.md @@ -40,7 +40,7 @@ Other cloud providers claim to offer embeddings "inside the database", but [benc ## Vectors support arithmetic -Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language. +Vectors can be operated on mathematically with simple equations. For example, vector addition is defined as the sum of all the pairs of elements in the two vectors. This might be useful to combine two concepts into a single new embedding. For example "frozen" + "rain" should be similar to (≈) "snow" if the embedding model has encoded the nuances of natural language and precipitation. Most vector operations are simple enough to implement in a few lines of code. Here's a naive implementation (no hardware acceleration) of vector addition in some popular languages: @@ -79,7 +79,7 @@ add(x, y) ### Vector Similarity -Similar embeddings should represent similar concepts. Embedding similarity (≈) is defined as the distance between their two vectors. There are 3 primary ways to measure the distance between two vectors, that have slight tradeoffs in performance and accuracy. +Similar embeddings should represent similar concepts. If we have one embedding created from a user query and a bunch of other embeddings from documents, we can find documents that are most similar to the query by calculating the similarity between the query and each document. Embedding similarity (≈) is defined as the distance between the two vectors. -If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [difference metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place. +There are several ways to measure the distance between two vectors, that have tradeoffs in latency and accuracy. If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [distance metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place. diff --git a/pgml-cms/docs/guides/embeddings/vector_normalization.md b/pgml-cms/docs/guides/embeddings/vector_normalization.md index e69de29bb..18c9bab64 100644 --- a/pgml-cms/docs/guides/embeddings/vector_normalization.md +++ b/pgml-cms/docs/guides/embeddings/vector_normalization.md @@ -0,0 +1,89 @@ +# Vector Normalization + +The purpose of vector normalization is to convert a vector into a unit vector — that is, a vector that retains the same direction but has a magnitude (or length) of 1. 
This process is essential for various computational techniques where the magnitude of a vector may influence the outcome undesirably, such as when calculating cosine similarity or when needing to compare vectors based solely on direction. + +## Purpose and Benefits + +- **Cosine Similarity**: In machine learning and data science, normalized vectors are crucial when using the inner product, instead of the more expensive cosine similarity metric. Inner product inherently requires vectors of unit length to accurately measure angles between vectors. Normalized L2 vectors indexed with the inner product can give a 3x speedup compared to computing the cosine similarity, while yielding otherwise identical results. + +- **Directionality**: Normalization strips away the magnitude of the vector, leaving a descriptor of direction only. This is useful when direction matters more than length, such as in feature scaling in machine learning where you want to normalize features to have equal influence regardless of their absolute values. + +- **Stability in Computations**: When vectors are normalized, numerical computations involving them are often more stable and less susceptible to problems due to very large or very small scale factors. + +## Storing and Normalizing Data + +Assume you've created a table in your database that stores embeddings generated using [pgml.embed()](../../api/sql-extension/pgml.embed.md), although you can normalize any vector. + +```sql +-- Create a table to store your text data and its vector representation +CREATE TABLE documents ( + id SERIAL PRIMARY KEY, + body TEXT, + embedding FLOAT[] GENERATED ALWAYS AS (pgml.embed('intfloat/e5-small-v2', body)) STORED +); +``` + +```sql +-- Example of inserting text and its corresponding embedding +INSERT INTO documents (body) +VALUES -- Normalized embedding vectors are automatically generated + ('Example text data'), + ('Another example document'), + ('Some other thing'); +``` + +You could create a new table from your documents and their embeddings, that uses normalized embeddings. + +```sql +CREATE TABLE documents_normalized_vectors AS +SELECT + id AS document_id, + pgml.normalize_l2(embedding) AS normalized_l2_embedding +FROM documents; +``` + +Another valid approach would be to just store the normalized embedding in the documents table. + +```sql +-- Create a table to store your text data and its normalized vector representation +CREATE TABLE documents ( + id SERIAL PRIMARY KEY, + body TEXT, + embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED +); +``` + +## Normalization Functions + Normalization is critical for ensuring that the magnitudes of feature vectors do not distort the performance of machine learning algorithms. + +- **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions. +```sql +SELECT pgml.normalize_l1(embedding) FROM documents; +``` +- **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning. +```sql +SELECT pgml.normalize_l2(embedding) FROM documents; +``` +- **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. 
This normalization is less common but can be useful when the maximum value represents a bounded capacity. +```sql +SELECT pgml.normalize_max(embedding) FROM documents; +``` + +## Querying and Using Normalized Vectors + After normalization, you can use these vectors for various applications, such as similarity searches, clustering, or as input for further machine learning models within PostgresML. + +```sql +-- Querying for similarity using cosine similarity +WITH normalized_vectors AS ( + SELECT id, pgml.normalize_l2(embedding) AS norm_vector + FROM documents +) +SELECT a.id, b.id, pgml.dot_product(a.norm_vector, b.norm_vector) +FROM normalized_vectors a, normalized_vectors b +WHERE a.id <> b.id; +``` + +## Considerations and Best Practices + +- **Performance**: Normalization can be computationally intensive, especially with large datasets. Consider batch processing and appropriate indexing. +- **Storage**: Normalized vectors might not need to be persisted if they are only used transiently, which can save storage. diff --git a/pgml-extension/requirements.txt b/pgml-extension/requirements.txt index 625af5e1a..a66c35e65 100644 --- a/pgml-extension/requirements.txt +++ b/pgml-extension/requirements.txt @@ -45,6 +45,8 @@ sentence-transformers rouge sacrebleu sacremoses +evaluate +trl # Utils datasets From 3d3e3d410d6051cfd7b74d135c0a918c03c0edb4 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Sat, 11 May 2024 13:23:54 -0700 Subject: [PATCH 04/11] benchmarks --- .../guides/embeddings/vector_normalization.md | 31 ++- .../guides/embeddings/vector_similarity.md | 178 +++++++++++++++++- 2 files changed, 189 insertions(+), 20 deletions(-) diff --git a/pgml-cms/docs/guides/embeddings/vector_normalization.md b/pgml-cms/docs/guides/embeddings/vector_normalization.md index 18c9bab64..0574dcefc 100644 --- a/pgml-cms/docs/guides/embeddings/vector_normalization.md +++ b/pgml-cms/docs/guides/embeddings/vector_normalization.md @@ -1,10 +1,10 @@ # Vector Normalization -The purpose of vector normalization is to convert a vector into a unit vector — that is, a vector that retains the same direction but has a magnitude (or length) of 1. This process is essential for various computational techniques where the magnitude of a vector may influence the outcome undesirably, such as when calculating cosine similarity or when needing to compare vectors based solely on direction. +Vector normalization converts a vector into a unit vector — that is, a vector that retains the same direction but has a magnitude (or length) of 1. This process is essential for various computational techniques where the magnitude of a vector may influence the outcome undesirably, such as when calculating the inner product instead of cosine similarity or when needing to compare vectors based solely on direction. ## Purpose and Benefits -- **Cosine Similarity**: In machine learning and data science, normalized vectors are crucial when using the inner product, instead of the more expensive cosine similarity metric. Inner product inherently requires vectors of unit length to accurately measure angles between vectors. Normalized L2 vectors indexed with the inner product can give a 3x speedup compared to computing the cosine similarity, while yielding otherwise identical results. +- **Cosine Similarity**: In machine learning and data science, normalized vectors are crucial when using the inner product, instead of the more expensive cosine similarity metric. 
Inner product inherently requires vectors of unit length to accurately measure angles between vectors. L2 Normalized vectors indexed with the inner product can reduce computational complexity 3x in the inner loop compared to cosine similarity, while yielding otherwise identical results. - **Directionality**: Normalization strips away the magnitude of the vector, leaving a descriptor of direction only. This is useful when direction matters more than length, such as in feature scaling in machine learning where you want to normalize features to have equal influence regardless of their absolute values. @@ -15,7 +15,6 @@ The purpose of vector normalization is to convert a vector into a unit vector Assume you've created a table in your database that stores embeddings generated using [pgml.embed()](../../api/sql-extension/pgml.embed.md), although you can normalize any vector. ```sql --- Create a table to store your text data and its vector representation CREATE TABLE documents ( id SERIAL PRIMARY KEY, body TEXT, @@ -23,8 +22,9 @@ CREATE TABLE documents ( ); ``` +Example of inserting text and its corresponding embedding + ```sql --- Example of inserting text and its corresponding embedding INSERT INTO documents (body) VALUES -- Normalized embedding vectors are automatically generated ('Example text data'), @@ -45,7 +45,6 @@ FROM documents; Another valid approach would be to just store the normalized embedding in the documents table. ```sql --- Create a table to store your text data and its normalized vector representation CREATE TABLE documents ( id SERIAL PRIMARY KEY, body TEXT, @@ -57,23 +56,23 @@ CREATE TABLE documents ( Normalization is critical for ensuring that the magnitudes of feature vectors do not distort the performance of machine learning algorithms. - **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions. -```sql -SELECT pgml.normalize_l1(embedding) FROM documents; -``` + ```sql + SELECT pgml.normalize_l1(embedding) FROM documents; + ``` - **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning. -```sql -SELECT pgml.normalize_l2(embedding) FROM documents; -``` + ```sql + SELECT pgml.normalize_l2(embedding) FROM documents; + ``` - **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. This normalization is less common but can be useful when the maximum value represents a bounded capacity. -```sql -SELECT pgml.normalize_max(embedding) FROM documents; -``` + ```sql + SELECT pgml.normalize_max(embedding) FROM documents; + ``` ## Querying and Using Normalized Vectors After normalization, you can use these vectors for various applications, such as similarity searches, clustering, or as input for further machine learning models within PostgresML. ```sql --- Querying for similarity using cosine similarity +-- Querying for similarity using l2 normalized dot product, which is equivalent to cosine similarity WITH normalized_vectors AS ( SELECT id, pgml.normalize_l2(embedding) AS norm_vector FROM documents @@ -86,4 +85,4 @@ WHERE a.id <> b.id; ## Considerations and Best Practices - **Performance**: Normalization can be computationally intensive, especially with large datasets. 
Consider batch processing and appropriate indexing. -- **Storage**: Normalized vectors might not need to be persisted if they are only used transiently, which can save storage. +- **Storage**: Normalized vectors might not need to be persisted if they are only used transiently, which can save storage or IO latency. diff --git a/pgml-cms/docs/guides/embeddings/vector_similarity.md b/pgml-cms/docs/guides/embeddings/vector_similarity.md index 3bf484075..748e70fe4 100644 --- a/pgml-cms/docs/guides/embeddings/vector_similarity.md +++ b/pgml-cms/docs/guides/embeddings/vector_similarity.md @@ -1,11 +1,29 @@ # Vector Distances -There are many distance functions that can be used to measure the similarity or differences between vectors. We list a few of the more common ones here with details on how they work, to help you choose. They are listed here in order of computational complexity, although modern hardware accelerated implementations can typically compare on the order of 100,000 vectors per second per processor. Modern CPUs may have tens to hundreds of cores, and GPUs have tens of thousands. +There are many distance functions that can be used to measure the similarity or differences between vectors. We list a few of the more common ones here with details on how they work, to help you choose. It's worth taking the time to understand the differences between these simple formulas, because they are the inner loop that accounts for almost all computation when doing nearest neighbor search. They are listed here in order of computational complexity, although modern hardware accelerated implementations can typically compare on the order of 100,000 vectors per second per processor. Modern CPUs may also have tens to hundreds of cores, and GPUs have tens of thousands, to further parallelize searches of large numbers of vectors. + +!!! note + +If you just want the cliff notes: [Normalize your vectors](vector_normalization.md) and use the inner product as your distance metric between two vectors. This is implemented as: `pgml.dot_product(a, b)` + +!!! + +All of these distance measures are implemented by PostgresML for the native Postgres `ARRAY[]` types, and separately implemented by pgvector as operators for its `VECTOR` types using operators. ## Manhattan Distance You can think of this distance metric as how long it takes you to walk from one building in Manhattan to another, when you can only walk along streets that go the 4 cardinal directions, with no diagonals. It's the fastest distance measure to implement, because it just adds up all the pairwise element differences. It's also referred to as the L1 distance. +!!! tip + +Most applications should use Euclidean Distance instead, unless accuracy has relatively little value, and nanoseconds are important to your user experience. + +!!! + +**Algorithm** + + {% tabs %} + {% tab title="JavaScript" %} ```javascript @@ -38,9 +56,38 @@ manhattan_distance(x, y) {% endtab %} {% endtabs %} +An optimized version is provided by: + +!!! code_block time="1191.069 ms" + +```sql +WITH query AS ( + SELECT vector + FROM test_data + LIMIT 1 +) +SELECT id, pgml.distance_l1(query.vector, test_data.vector) +FROM test_data, query +ORDER BY distance_l1; +``` + +!!! + +The equivalent pgvector operator is `<+>`. + + ## Euclidean Distance -This is a simple refinement of Manhattan Distance that applies the Pythagorean theorem to find the length of the straight line between the two points. 
It involves squaring the differences and then taking the final square root, which is a more expensive operation, so it may be slightly slower, but is also a more accurate representation in high dimensional spaces. It's also referred to as the L2 distance.
This is a simple refinement of Manhattan Distance that applies the Pythagorean theorem to find the length of the straight line between the two points. It's also referred to as the L2 distance. It involves squaring the differences and then taking the final square root, which is a more expensive operation, so it may be slightly slower, but is also a more accurate representation in high dimensional spaces. When finding nearest neighbors, the final square root computation can be omitted, but there are still twice as many operations in the inner loop.

!!! tip

Most applications should use Inner product for better accuracy with less computation, unless you can't afford to normalize your vectors before indexing for some extremely write heavy application.

!!!

**Algorithm**

{% tabs %}
{% tab title="JavaScript" %}

```javascript
@@ -75,12 +122,39 @@ euclidean_distance(x, y)
{% endtab %}
{% endtabs %}

An optimized version is provided by:

!!! code_block time="1359.114 ms"

```sql
WITH query AS (
    SELECT vector
    FROM test_data
    LIMIT 1
)
SELECT id, pgml.distance_l2(query.vector, test_data.vector)
FROM test_data, query
ORDER BY distance_l2;
```

!!!

The equivalent pgvector operator is `<->`.

## Inner product

The inner product (the dot product in Euclidean space) can be used to find how similar any two vectors are, by measuring the overlap of each element, which compares the direction they point. Two completely different (orthogonal) vectors have an inner product of 0. If vectors point in opposite directions, the inner product will be negative. Positive numbers indicate the vectors point in the same direction, and are more similar.

This metric is as fast to compute as the Euclidean Distance, but may provide more relevant results if all vectors are normalized. If vectors are not normalized, it will bias results toward vectors with larger magnitudes, and you should consider using the cosine distance instead.

!!! tip

This is probably the best all around distance metric. It's computationally simple, but also twice as fast due to optimized assembly instructions. It's also able to place more weight on the dominating dimensions of the vectors, which can improve relevance during recall, as long as [your vectors are normalized](vector_normalization.md).

!!!

**Algorithm**

{% tabs %}
{% tab title="JavaScript" %}

```javascript
@@ -114,12 +188,37 @@ inner_product(x, y)
{% endtab %}
{% endtabs %}

An optimized version is provided by:

!!! code_block time="498.649 ms"

```sql
WITH query AS (
    SELECT vector
    FROM test_data
    LIMIT 1
)
SELECT id, pgml.dot_product(query.vector, test_data.vector)
FROM test_data, query
ORDER BY dot_product;
```

!!!

The equivalent pgvector operator is `<#>`.


## Cosine Distance

Cosine distance is a popular metric, because it normalizes the vectors, which means it only considers the difference of the angle between the two vectors, not their magnitudes. If you don't know that your vectors have been normalized, this may be a safer bet than the inner product. It is one of the more complicated algorithms to implement, but differences may be negligible w/ modern hardware accelerated instruction sets depending on your workload profile. 
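Cosine similarity is just the inner product of the two vectors after each one has been scaled to unit length. As a quick sanity check of that relationship, here is a sketch that reuses the `pgml` array functions shown in this guide; it assumes they accept `FLOAT4[]` arguments, so adjust the casts to match your schema. Both columns should return the same value:

!!! code_block

```sql
-- Cosine similarity equals the dot product of L2-normalized vectors
SELECT
    pgml.cosine_similarity(ARRAY[1, 2, 3]::FLOAT4[], ARRAY[2, 3, 4]::FLOAT4[]) AS cosine_similarity,
    pgml.dot_product(
        pgml.normalize_l2(ARRAY[1, 2, 3]::FLOAT4[]),
        pgml.normalize_l2(ARRAY[2, 3, 4]::FLOAT4[])
    ) AS normalized_dot_product;
```

!!!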
-You can also use PostgresML to [normalize all your vectors](vector_normalization.md) as a separate processing step to pay that cost only at indexing time, and then the inner product will provide equivalent distance measures. +!!! tip + +Use PostgresML to [normalize all your vectors](vector_normalization.md) as a separate processing step to pay that cost only at indexing time, and then switch to the inner product which will provide equivalent distance measures, at 1/3 of the computation in the inner loop. _That's not exactly true on all platforms_, because the inner loop is implemented with optimized assembly that can take advantage of additional hardware acceleration, so make sure to always benchmark on your own hardware. On our hardware, the performance difference is negligible. + +!!! + +**Algorithm** {% tabs %} {% tab title="JavaScript" %} @@ -149,7 +248,6 @@ function cosineDistance(a, b) { return cosineDistance; } ``` - {% endtab %} {% tab title="Python" %} @@ -179,3 +277,75 @@ def cosine_distance(a, b): {% endtab %} {% endtabs %} + +The optimized version is provided by: + +!!! code_block time="508.587 ms" + +```sql +WITH query AS ( + SELECT vector + FROM test_data + LIMIT 1 +) +SELECT id, 1 - pgml.cosine_similarity(query.vector, test_data.vector) AS cosine_distance +FROM test_data, query +ORDER BY cosine_distance; +``` + +!!! + +Or you could reverse order by `cosine_similarity` for the same ranking: + +!!! code_block time="502.461 ms" + +```sql +WITH query AS ( + SELECT vector + FROM test_data + LIMIT 1 +) +SELECT id, pgml.cosine_similarity(query.vector, test_data.vector) +FROM test_data, query +ORDER BY cosine_similarity DESC; +``` + +!!! + +The equivalent pgvector operator is `<=>`. + +## Benchmarking + +You should benchmark and compare the computational cost of these distance metrics to see how much they algorithmic differences matters for latency using the same vector sizes as your own data. We'll create some test data to demonstrate the relative costs associated with each distance metric. + +!!! code_block + +```sql +\timing on +``` + +!!! + +!!! code_block + +```sql +CREATE TABLE test_data ( + id BIGSERIAL NOT NULL, + vector FLOAT4[] +); +``` + +!!! + +Insert 10k vectors, that have 1k dimensions each + +!!! code_block + +```sql +INSERT INTO test_data (vector) +SELECT array_agg(random()) +FROM generate_series(1,10000000) i +GROUP BY i % 10000; +``` + +!!! 
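If you want to see how the different measures relate on your own data before committing to one, you can also compute them side by side with the same functions used in the timed examples above. This is only a sketch — it reuses the `test_data` table created here and the `pgml` functions already shown, and the relative timings you observe will depend on your own hardware:

!!! code_block

```sql
-- Compare all four metrics for the same query vector against a few rows
WITH query AS (
    SELECT vector
    FROM test_data
    LIMIT 1
)
SELECT
    id,
    pgml.distance_l1(query.vector, test_data.vector) AS manhattan_distance,
    pgml.distance_l2(query.vector, test_data.vector) AS euclidean_distance,
    pgml.dot_product(query.vector, test_data.vector) AS inner_product,
    1 - pgml.cosine_similarity(query.vector, test_data.vector) AS cosine_distance
FROM test_data, query
LIMIT 10;
```

!!!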
From e5f74405901d80d34467f87fef97ddf7038abcb3 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Sat, 11 May 2024 13:59:46 -0700 Subject: [PATCH 05/11] aggregates --- pgml-cms/docs/SUMMARY.md | 6 +- .../guides/embeddings/vector_aggregation.md | 98 +++++++++++++++++++ .../guides/embeddings/vector_normalization.md | 7 +- 3 files changed, 107 insertions(+), 4 deletions(-) create mode 100644 pgml-cms/docs/guides/embeddings/vector_aggregation.md diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 1c5a04cb7..e6cd65e58 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -58,9 +58,9 @@ * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() * [Indexing w/ pgvector]() - * [Aggregation]() - * [Vector Similarity](guides/embeddings/vector_similarity.md) - * [Vector Normalization](guides/embeddings/vector_normalization.md) + * [Aggregation](guides/embeddings/vector_aggregation.md) + * [Similarity](guides/embeddings/vector_similarity.md) + * [Normalization](guides/embeddings/vector_normalization.md) * [Search]() * [Keyword Search]() * [Vector Search]() diff --git a/pgml-cms/docs/guides/embeddings/vector_aggregation.md b/pgml-cms/docs/guides/embeddings/vector_aggregation.md new file mode 100644 index 000000000..e5f1cd721 --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/vector_aggregation.md @@ -0,0 +1,98 @@ +--- +description: Vector aggregation is extensively used across various machine learning applications, including NLP, Image Processing, Recommender Systems, Time Series Analysis with strong benefits. +--- + +# Vector Aggregation + +Vector aggregation in the context of embeddings refers to the process of combining multiple vector representations into a single, unified vector. This technique is particularly useful in machine learning and data science, especially when dealing with embeddings from natural language processing (NLP), image processing, or any domain where objects are represented as high-dimensional vectors. + +## Understanding Vector Aggregation +Embeddings are dense vector representations of objects (like words, sentences, or images) that capture their underlying semantic properties in a way that is understandable by machine learning models. When dealing with multiple such embeddings, it might be necessary to aggregate them to produce a single representation that captures the collective properties of all the items in the set. + +## Applications in Machine Learning +Vector aggregation is extensively used across various machine learning applications. + +### Natural Language Processing +**Sentence or Document Embedding**: Individual word embeddings within a sentence or document can be aggregated to form a single vector representation of the entire text. This aggregated vector can then be used for tasks like text classification, sentiment analysis, or document clustering. + +**Information Retrieval**: Aggregated embeddings can help in summarizing multiple documents or in query refinement, where the query and multiple documents' embeddings are aggregated to improve search results. + +### Image Processing +**Feature Aggregation**: In image recognition or classification, features extracted from different parts of an image (e.g., via convolutional neural networks) can be aggregated to form a global feature vector. + +### Recommender Systems +**User or Item Profiles**: Aggregating item embeddings that a user has interacted with can create a dense representation of a user's preferences. 
Similarly, aggregating user embeddings for a particular item can help in understanding the item’s appeal across different user segments. + +### Time Series Analysis +**Temporal Data Aggregation**: In scenarios where temporal dynamics are captured via embeddings at different time steps (e.g., stock prices, sensor data), these can be aggregated to form a representation of the overall trend or to capture cyclical patterns. + +## Benefits of Vector Aggregation +- **Dimensionality Reduction**: Aggregation can reduce the complexity of handling multiple embeddings, making the data easier to manage and process. +- **Noise Reduction**: Averaging and other aggregation methods can help mitigate the effect of noise in individual data points, leading to more robust models. +- **Improved Learning Efficiency**: By summarizing data, aggregation can speed up learning processes and improve the performance of machine learning algorithms on large datasets. + +## Available Methods of Vector Aggregation + +### Example Data +```sql +CREATE TABLE documents ( + id SERIAL PRIMARY KEY, + body TEXT, + embedding FLOAT[] GENERATED ALWAYS AS (pgml.embed('intfloat/e5-small-v2', body)) STORED +); +``` + +Example of inserting text and its corresponding embedding + +```sql +INSERT INTO documents (body) +VALUES -- embedding vectors are automatically generated + ('Example text data'), + ('Another example document'), + ('Some other thing'); +``` + +### Summation +Adding up all the vectors element-wise. This method is simple and effective, preserving all the information from the original vectors, but can lead to large values if many vectors are summed. + +```sql +SELECT id, pgml.sum(embedding) +FROM documents +GROUP BY id; +``` + +### Averaging (Mean) +Computing the element-wise mean of the vectors. This is probably the most common aggregation method, as it normalizes the scale of the vectors against the number of vectors being aggregated, preventing any single vector from dominating the result. + +```sql +SELECT id, pgml.divide(pgml.sum(embedding), count(*)) AS avg +FROM documents +GROUP BY id; +``` + +### Weighted Average +Similar to averaging, but each vector is multiplied by a weight that reflects its importance before averaging. This method is useful when some vectors are more significant than others. + +```sql +SELECT id, pgml.divide(pgml.sum(pgml.multiply(embedding, id)), count(*)) AS id_weighted_avg +FROM documents +GROUP BY id; +``` + +### Max Pooling +Taking the maximum value of each dimension across all vectors. This method is particularly useful for capturing the most pronounced features in a set of vectors. + +```sql +SELECT id, pgml.max_abs(embedding) +FROM documents +GROUP BY id; +``` + +### Min Pooling +Taking the minimum value of each dimension across all vectors, useful for capturing the least dominant features. 
+ +```sql +SELECT id, pgml.min_abs(embedding) +FROM documents +GROUP BY id; +``` \ No newline at end of file diff --git a/pgml-cms/docs/guides/embeddings/vector_normalization.md b/pgml-cms/docs/guides/embeddings/vector_normalization.md index 0574dcefc..b54513e58 100644 --- a/pgml-cms/docs/guides/embeddings/vector_normalization.md +++ b/pgml-cms/docs/guides/embeddings/vector_normalization.md @@ -26,7 +26,7 @@ Example of inserting text and its corresponding embedding ```sql INSERT INTO documents (body) -VALUES -- Normalized embedding vectors are automatically generated +VALUES -- embedding vectors are automatically generated ('Example text data'), ('Another example document'), ('Some other thing'); @@ -56,14 +56,19 @@ CREATE TABLE documents ( Normalization is critical for ensuring that the magnitudes of feature vectors do not distort the performance of machine learning algorithms. - **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions. + ```sql SELECT pgml.normalize_l1(embedding) FROM documents; ``` + - **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning. + ```sql SELECT pgml.normalize_l2(embedding) FROM documents; ``` + - **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. This normalization is less common but can be useful when the maximum value represents a bounded capacity. + ```sql SELECT pgml.normalize_max(embedding) FROM documents; ``` From 13db9c8a0408145c2033ddeb2bb7d0d9746470f2 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Sat, 11 May 2024 19:37:19 -0700 Subject: [PATCH 06/11] update css --- pgml-cms/docs/SUMMARY.md | 27 +- pgml-cms/docs/guides/embeddings/README.md | 18 +- .../embeddings/in-database-generation.md | 328 +++++++++++++++++ ...r_aggregation.md => vector-aggregation.md} | 0 ...rmalization.md => vector-normalization.md} | 0 ...tor_similarity.md => vector-similarity.md} | 15 +- ...s-with-open-source-models-in-postgresml.md | 343 ------------------ pgml-cms/docs/use-cases/fraud-detection.md | 3 - .../docs/use-cases/recommendation-engine.md | 3 - .../docs/use-cases/time-series-forecasting.md | 2 - pgml-dashboard/src/utils/markdown.rs | 2 - .../static/css/scss/base/_base.scss | 4 - .../css/scss/components/_admonitions.scss | 3 - .../static/css/scss/components/_code.scss | 3 +- .../static/css/scss/pages/_docs.scss | 60 ++- 15 files changed, 406 insertions(+), 405 deletions(-) create mode 100644 pgml-cms/docs/guides/embeddings/in-database-generation.md rename pgml-cms/docs/guides/embeddings/{vector_aggregation.md => vector-aggregation.md} (100%) rename pgml-cms/docs/guides/embeddings/{vector_normalization.md => vector-normalization.md} (100%) rename pgml-cms/docs/guides/embeddings/{vector_similarity.md => vector-similarity.md} (79%) delete mode 100644 pgml-cms/docs/use-cases/fraud-detection.md delete mode 100644 pgml-cms/docs/use-cases/recommendation-engine.md delete mode 100644 pgml-cms/docs/use-cases/time-series-forecasting.md diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index e6cd65e58..acc24ce3f 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -54,13 +54,13 @@ ## Guides * [Embeddings](guides/embeddings/README.md) - * [In-database 
Generation]() + * [In-database Generation](guides/embeddings/in-database-generation.md) * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() * [Indexing w/ pgvector]() - * [Aggregation](guides/embeddings/vector_aggregation.md) - * [Similarity](guides/embeddings/vector_similarity.md) - * [Normalization](guides/embeddings/vector_normalization.md) + * [Aggregation](guides/embeddings/vector-aggregation) + * [Similarity](guides/embeddings/vector-similarity) + * [Normalization](guides/embeddings/vector-normalization) * [Search]() * [Keyword Search]() * [Vector Search]() @@ -77,14 +77,6 @@ * [Events]() * [Fraud Detection]() * [Incentive Optimization]() -* [Sentiment Analysis]() -* [Summarization]() - -## Reference - -* [SQL]() - * [Explain plans]() - * [Composition]() * [Machine Learning]() * [Feature Engineering]() * [Regression]() @@ -95,6 +87,14 @@ * [Tokenization]() * [Chunking]() * [Text Generation]() + * [Sentiment Analysis]() + * [Summarization]() + +## Reference + +* [SQL]() + * [Explain plans]() + * [Composition]() * [LLMs]() * [LLama]() * [GPT]() @@ -125,9 +125,6 @@ * [Personalize embedding results with application data in your database](use-cases/embeddings/personalize-embedding-results-with-application-data-in-your-database.md) * [Supervised Learning](use-cases/supervised-learning.md) * [Natural Language Processing](use-cases/natural-language-processing.md) -* [Fraud Detection](use-cases/fraud-detection.md) -* [Recommendation Engine](use-cases/recommendation-engine.md) -* [Time-series Forecasting](use-cases/time-series-forecasting.md) ## Resources diff --git a/pgml-cms/docs/guides/embeddings/README.md b/pgml-cms/docs/guides/embeddings/README.md index 000888470..888582144 100644 --- a/pgml-cms/docs/guides/embeddings/README.md +++ b/pgml-cms/docs/guides/embeddings/README.md @@ -8,7 +8,7 @@ As the demand for sophisticated data analysis and machine learning capabilities Embeddings are a key building block with many applications in modern AI/ML systems. They are particularly valuable for handling various types of unstructured data like text, images, and more, providing a pathway to richer insights and improved performance. They allow computers to operate on natural language and other high level concepts by reducing them to billions of simple arithmetic operations. -## Applications of Embeddings +## Applications of embeddings - **Search and Information Retrieval**: Embeddings can transform search queries and documents into vectors, making it easier to find the most relevant documents for a given query based on semantic similarity. - **Personalization**: In recommendation systems, embeddings can help understand user queries and preferences, enhancing the accuracy of recommendations. @@ -20,15 +20,17 @@ This guide will introduce you to the fundamentals of embeddings within PostgresM In this guide, we will cover: -* [In-database Generation]() +* [In-database Generation](guides/embeddings/in-database-generation.md) * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() * [Indexing w/ pgvector]() -* [Aggregation]() +* [Aggregation](guides/embeddings/vector-aggregation) +* [Similarity](guides/embeddings/vector-similarity) +* [Normalization](guides/embeddings/vector-normalization) -## Embeddings are Vectors +## Embeddings are vectors -In the context of large language models, embeddings are representations of words, phrases, or even entire sentences. Each word or text snippet is mapped to a vector in a high-dimensional space. 
These vectors capture semantic and syntactic nuances, meaning that similar words have vectors that are close together in this space. For instance, "king" and "queen" would be represented by vectors that are closer together than "king" and "apple". +In the context of large language models (LLMs), embeddings are representations of words, phrases, or even entire sentences. Each word or text snippet is mapped to a vector in a high-dimensional space. These vectors capture semantic and syntactic nuances, meaning that similar words have vectors that are close together in this space. For instance, "king" and "queen" would be represented by vectors that are closer together than "king" and "apple". Vectors can be stored in the native Postgres [`ARRAY[]`](https://www.postgresql.org/docs/current/arrays.html) datatype which is compatible with many application programming languages' native datatypes. Modern CPUs and GPUs offer hardware acceleration for common array operations, which can give substantial performance benefits when operating at scale, but which are typically not enabled in a Postgres database. You'll need to ensure you're compiling your full stack with support for your hardware to get the most bang for your buck, or you can leave that up to us, and get full hardware acceleration in a PostgresML cloud database. @@ -77,9 +79,5 @@ add(x, y) {% endtab %} {% endtabs %} -### Vector Similarity - -Similar embeddings should represent similar concepts. If we have one embedding created from a user query and a bunch of other embeddings from documents, we can find documents that are most similar to the query by calculating the similarity between the query and each document. Embedding similarity (≈) is defined as the distance between the two vectors. - -There are several ways to measure the distance between two vectors, that have tradeoffs in latency and accuracy. If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). You can read more about some of the available [distance metrics](vector_similarity.md) if you'd like to build a stronger intuition, but the defaults are usually a good starting place. +If we pass the vectors for "snow" and "rain" into this function, we'd hope to get a vector similar to "snow" as the result, depending on the quality of the model that was used to create the word embeddings. diff --git a/pgml-cms/docs/guides/embeddings/in-database-generation.md b/pgml-cms/docs/guides/embeddings/in-database-generation.md new file mode 100644 index 000000000..c4badaa7e --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/in-database-generation.md @@ -0,0 +1,328 @@ +# In-database Embedding Generation + +PostgresML makes it easy to generate embeddings from text in your database using a large selection of state-of-the-art models with one simple call to **`pgml.embed`**`(model_name, text)`. + +## Introduction + +Different models have been trained on different types of text and with different algorithms. Each one has its own tradeoffs, generally latency vs quality, although recent progress in the LLMs . + +You can train your own model from scratch, or download + +## It always starts with data + +Most general purpose databases are full of all sorts of great data for machine learning use cases. 
Text data has historically been more difficult to deal with using complex Natural Language Processing techniques, but embeddings created from open source models can effectively turn unstructured text into structured features, perfect for more straightforward implementations. + +In this example, we'll demonstrate how to generate embeddings for products on an e-commerce site. We'll use a public dataset of millions of product reviews from the [Amazon US Reviews](https://huggingface.co/datasets/amazon\_us\_reviews). It includes the product title, a text review written by a customer and some additional metadata about the product, like category. With just a few pieces of data, we can create a full-featured and personalized product search and recommendation engine, using both generic embeddings and later, additional fine-tuned models trained with PostgresML. + +PostgresML includes a convenience function for loading public datasets from [HuggingFace](https://huggingface.co/datasets) directly into your database. To load the DVD subset of the Amazon US Reviews dataset into your database, run the following command: + +!!! code\_block + +```postgresql +SELECT * +FROM pgml.load_dataset('amazon_us_reviews', 'Video_DVD_v1_00'); +``` + +!!! + +It took about 23 minutes to download the 7.1GB raw dataset with 5,069,140 rows into a table within the `pgml` schema (where all PostgresML functionality is name-spaced). Once it's done, you can see the table structure with the following command: + +!!! generic + +!!! code\_block + +```postgresql +\d pgml.amazon_us_reviews +``` + +!!! + +!!! results + +| Column | Type | Collation | Nullable | Default | +| ------------------ | ------- | --------- | -------- | ------- | +| marketplace | text | | | | +| customer\_id | text | | | | +| review\_id | text | | | | +| product\_id | text | | | | +| product\_parent | text | | | | +| product\_title | text | | | | +| product\_category | text | | | | +| star\_rating | integer | | | | +| helpful\_votes | integer | | | | +| total\_votes | integer | | | | +| vine | bigint | | | | +| verified\_purchase | bigint | | | | +| review\_headline | text | | | | +| review\_body | text | | | | +| review\_date | text | | | | + +!!! + +!!! + +Let's take a peek at the first 5 rows of data: + +!!! code\_block + +```postgresql +SELECT * +FROM pgml.amazon_us_reviews +LIMIT 5; +``` + +!!! results + +| marketplace | customer\_id | review\_id | product\_id | product\_parent | product\_title | product\_category | star\_rating | helpful\_votes | total\_votes | vine | verified\_purchase | review\_headline | review\_body | review\_date | +| ----------- | ------------ | -------------- | ----------- | --------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------ | -------------- | ------------ | ---- | ------------------ | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | +| US | 27288431 | R33UPQQUZQEM8 | B005T4ND06 | 400024643 | Yoga for Movement Disorders DVD: Rebuilding Strength, Balance, and Flexibility for Parkinson's Disease and Dystonia | Video DVD | 5 | 3 | 3 | 0 | 1 | This was a gift for my aunt who has Parkinson's ... | This was a gift for my aunt who has Parkinson's. 
While I have not previewed it myself, I also have not gotten any complaints. My prior experiences with yoga tell me this should be just what the doctor ordered. | 2015-08-31 | +| US | 13722556 | R3IKTNQQPD9662 | B004EPZ070 | 685335564 | Something Borrowed | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Teats my heart out. | 2015-08-31 | +| US | 20381037 | R3U27V5QMCP27T | B005S9EKCW | 922008804 | Les Miserables (2012) \[Blu-ray] | Video DVD | 5 | 1 | 1 | 0 | 1 | Great movie! | Great movie. | 2015-08-31 | +| US | 24852644 | R2TOH2QKNK4IOC | B00FC1ZCB4 | 326560548 | Alien Anthology and Prometheus Bundle \[Blu-ray] | Video DVD | 5 | 0 | 1 | 0 | 1 | Amazing | My husband was so excited to receive these as a gift! Great picture quality and great value! | 2015-08-31 | +| US | 15556113 | R2XQG5NJ59UFMY | B002ZG98Z0 | 637495038 | Sex and the City 2 | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Love this series. | 2015-08-31 | + +!!! + +!!! + +## Generating embeddings from natural language text + +PostgresML provides a simple interface to generate embeddings from text in your database. You can use the [`pgml.embed`](https://postgresml.org/docs/transformers/embeddings) function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can see a list of potential good candidate models to generate embeddings on the [Massive Text Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard). + +Since our corpus of documents (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small`](https://huggingface.co/intfloat/e5-small) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models. + +It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast. + +Note how we prefix the text we want to embed with either `passage:` or `query:` , the e5 model requires us to prefix our data with `passage:` if we're generating embeddings for our corpus and `query:` if we want to find semantically similar content. + +```postgresql +SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom'); +``` + +This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres: + +```postgresql +\timing on +``` + +Aside from using this function with strings passed from a client, we can use it on strings already present in our database tables by calling **pgml.embed** on columns. For example, we can generate an embedding for the first review using a pretty simple query: + +!!! generic + +!!! code\_block time="54.820 ms" + +```postgresql +SELECT + review_body, + pgml.embed('intfloat/e5-small', 'passage: ' || review_body) +FROM pgml.amazon_us_reviews +LIMIT 1; +``` + +!!! + +!!! results + +``` +CREATE INDEX +``` + +!!! + +!!! + +Time to generate an embedding increases with the length of the input text, and varies widely between different models. If we up our batch size (controlled by `LIMIT`), we can see the average time to compute an embedding on the first 1000 reviews is about 17ms per review: + +!!! 
code\_block time="17955.026 ms" + +```postgresql +SELECT + review_body, + pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding +FROM pgml.amazon_us_reviews +LIMIT 1000; +``` + +!!! + +## Comparing different models and hardware performance + +This database is using a single GPU with 32GB RAM and 8 vCPUs with 16GB RAM. Running these benchmarks while looking at the database processes with `htop` and `nvidia-smi`, it becomes clear that the bottleneck in this case is actually tokenizing the strings which happens in a single thread on the CPU, not computing the embeddings on the GPU which was only 20% utilized during the query. + +We can also do a quick sanity check to make sure we're really getting value out of our GPU by passing the device to our embedding function: + +!!! code\_block time="30421.491 ms" + +```postgresql +SELECT + reviqew_body, + pgml.embed( + 'intfloat/e5-small', + 'passage: ' || review_body, + '{"device": "cpu"}' + ) AS embedding +FROM pgml.amazon_us_reviews +LIMIT 1000; +``` + +!!! + +Forcing the embedding function to use `cpu` is almost 2x slower than `cuda` which is the default when GPUs are available. + +If you're managing dedicated hardware, there's always a decision to be made about resource utilization. If this is a multi-workload database with other queries using the GPU, it's probably great that we're not completely hogging it with our multi-decade-Amazon-scale data import process, but if this is a machine we've spun up just for this task, we can up the resource utilization to 4 concurrent connections, all running on a subset of the data to more completely utilize our CPU, GPU and RAM. + +Another consideration is that GPUs are much more expensive right now than CPUs, and if we're primarily interested in backfilling a dataset like this, high concurrency across many CPU cores might just be the price-competitive winner. + +With 4x concurrency and a GPU, it'll take about 6 hours to compute all 5 million embeddings, which will cost $72 on PostgresML Cloud. If we use the CPU instead of the GPU, we'll probably want more cores and higher concurrency to plug through the job faster. A 96 CPU core machine could complete the job in half the time our single GPU would take and at a lower hourly cost as well, for a total cost of $24. It's overall more cost-effective and faster in parallel, but keep in mind if you're interactively generating embeddings for a user facing application, it will add double the latency, 30ms CPU vs 17ms for GPU. + +For comparison, it would cost about $299 to use OpenAI's cheapest embedding model to process this dataset. Their API calls average about 300ms, although they have high variability (200-400ms) and greater than 1000ms p99 in our measurements. They also have a default rate limit of 200 tokens per minute which means it would take 1,425 years to process this dataset. You better call ahead. + +| Processor | Latency | Cost | Time | +| --------- | ------- | ---- | --------- | +| CPU | 30ms | $24 | 3 hours | +| GPU | 17ms | $72 | 6 hours | +| OpenAI | 300ms | $299 | millennia | + +You can also find embedding models that outperform OpenAI's `text-embedding-ada-002` model across many different tests on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case. 
+ +> _HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage._ + +## Instructor embedding models + +The current leading model is `hkunlp/instructor-xl`. Instructor models take an additional `instruction` parameter which includes context for the embeddings use case, similar to prompts before text generation tasks. + +Instructions can provide a "classification" or "topic" for the text: + +#### Classification + +!!! code\_block time="17.912ms" + +```postgresql +SELECT pgml.embed( + transformer => 'hkunlp/instructor-xl', + text => 'The Federal Reserve on Wednesday raised its benchmark interest rate.', + kwargs => '{"instruction": "Represent the Financial statement:"}' +); +``` + +!!! + +They can also specify particular use cases for the embedding: + +#### Querying + +!!! code\_block time="24.263 ms" + +```postgresql +SELECT pgml.embed( + transformer => 'hkunlp/instructor-xl', + text => 'where is the food stored in a yam plant', + kwargs => '{ + "instruction": "Represent the Wikipedia question for retrieving supporting documents:" + }' +); +``` + +!!! + +#### Indexing + +!!! code\_block time="30.571 ms" + +```postgresql +SELECT pgml.embed( + transformer => 'hkunlp/instructor-xl', + text => 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.', + kwargs => '{"instruction": "Represent the Wikipedia document for retrieval:"}' +); +``` + +!!! + +#### Clustering + +!!! code\_block time="18.986 ms" + +```postgresql +SELECT pgml.embed( + transformer => 'hkunlp/instructor-xl', + text => 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity"}', + kwargs => '{"instruction": "Represent the Medicine sentence for clustering:"}' +); +``` + +!!! + +Performance remains relatively good, even with the most advanced models. + +## Generating embeddings for a large dataset + +For our use case, we want to generate an embedding for every single review in the dataset. We'll use the `vector` datatype available from the `pgvector` extension to store (and later index) embeddings efficiently. All PostgresML cloud installations include [pgvector](https://github.com/pgvector/pgvector). To enable this extension in your database, you can run: + +```postgresql +CREATE EXTENSION vector; +``` + +Then we can add a `vector` column for our review embeddings, with 384 dimensions (the size of e5-small embeddings): + +```postgresql +ALTER TABLE pgml.amazon_us_reviews +ADD COLUMN review_embedding_e5_large vector(1024); +``` + +It's best practice to keep running queries on a production database relatively short, so rather than trying to update all 5M rows in one multi-hour query, we should write a function to issue the updates in smaller batches. To make iterating over the rows easier and more efficient, we'll add an `id` column with an index to our table: + +```postgresql +ALTER TABLE pgml.amazon_us_reviews +ADD COLUMN id SERIAL PRIMARY KEY; +``` + +Every language/framework/codebase has its own preferred method for backfilling data in a table. The 2 most important considerations are: + +1. 
Keep the number of rows per query small enough that the queries take less than a second +2. More concurrency will get the job done faster, but keep in mind the other workloads on your database + +Here's an example of a very simple back-fill job implemented in pure PGSQL, but I'd also love to see example PRs opened with your techniques in your language of choice for tasks like this. + +```postgresql +DO $$ +BEGIN + FOR i in 1..(SELECT max(id) FROM pgml.amazon_us_reviews) by 10 LOOP + BEGIN RAISE NOTICE 'updating % to %', i, i + 10; END; + + UPDATE pgml.amazon_us_reviews + SET review_embedding_e5_large = pgml.embed( + 'intfloat/e5-large', + 'passage: ' || review_body + ) + WHERE id BETWEEN i AND i + 10 + AND review_embedding_e5_large IS NULL; + + COMMIT; + END LOOP; +END; +$$; +``` + +## What's next? + +That's it for now. We've got an Amazon scale table with state-of-the-art machine learning embeddings. As a premature optimization, we'll go ahead and build an index on our new column to make our future vector similarity queries faster. For the full documentation on vector indexes in Postgres see the [pgvector docs](https://github.com/pgvector/pgvector). + +!!! code\_block time="4068909.269 ms (01:07:48.909)" + +```postgresql +CREATE INDEX CONCURRENTLY index_amazon_us_reviews_on_review_embedding_e5_large +ON pgml.amazon_us_reviews +USING ivfflat (review_embedding_e5_large vector_cosine_ops) +WITH (lists = 2000); +``` + +!!! + +!!! tip + +Create indexes `CONCURRENTLY` to avoid locking your table for other queries. + +!!! + +Building a vector index on a table with this many entries takes a while, so this is a good time to take a coffee break. In the next article we'll look at how to query these embeddings to find the best products and make personalized recommendations for users. We'll also cover updating an index in real time as new data comes in. diff --git a/pgml-cms/docs/guides/embeddings/vector_aggregation.md b/pgml-cms/docs/guides/embeddings/vector-aggregation.md similarity index 100% rename from pgml-cms/docs/guides/embeddings/vector_aggregation.md rename to pgml-cms/docs/guides/embeddings/vector-aggregation.md diff --git a/pgml-cms/docs/guides/embeddings/vector_normalization.md b/pgml-cms/docs/guides/embeddings/vector-normalization.md similarity index 100% rename from pgml-cms/docs/guides/embeddings/vector_normalization.md rename to pgml-cms/docs/guides/embeddings/vector-normalization.md diff --git a/pgml-cms/docs/guides/embeddings/vector_similarity.md b/pgml-cms/docs/guides/embeddings/vector-similarity.md similarity index 79% rename from pgml-cms/docs/guides/embeddings/vector_similarity.md rename to pgml-cms/docs/guides/embeddings/vector-similarity.md index 748e70fe4..1c6f24596 100644 --- a/pgml-cms/docs/guides/embeddings/vector_similarity.md +++ b/pgml-cms/docs/guides/embeddings/vector-similarity.md @@ -1,9 +1,14 @@ -# Vector Distances -There are many distance functions that can be used to measure the similarity or differences between vectors. We list a few of the more common ones here with details on how they work, to help you choose. It's worth taking the time to understand the differences between these simple formulas, because they are the inner loop that accounts for almost all computation when doing nearest neighbor search. They are listed here in order of computational complexity, although modern hardware accelerated implementations can typically compare on the order of 100,000 vectors per second per processor. 
Modern CPUs may also have tens to hundreds of cores, and GPUs have tens of thousands, to further parallelize searches of large numbers of vectors. +# Vector Similarity + +Similar embeddings should represent similar concepts. If we have one embedding created from a user query and a bunch of other embeddings from documents, we can find documents that are most similar to the query by calculating the similarity between the query and each document. Embedding similarity (≈) is defined as the distance between the two vectors. + +There are several ways to measure the distance between two vectors, that have tradeoffs in latency and accuracy. If two vectors are identical (=), then the distance between them is 0. If the distance is small, then they are similar (≈). Here, we explore a few of the more common ones here with details on how they work, to help you choose. It's worth taking the time to understand the differences between these simple formulas, because they are the inner loop that accounts for almost all computation when doing nearest neighbor search. + +They are listed here in order of computational complexity, although modern hardware accelerated implementations can typically compare on the order of 100,000 vectors per second per processor, depending on how many dimensions the vectors have. Modern CPUs may also have tens to hundreds of cores, and GPUs have tens of thousands, to further parallelize searches across large numbers of vectors. !!! note -If you just want the cliff notes: [Normalize your vectors](vector_normalization.md) and use the inner product as your distance metric between two vectors. This is implemented as: `pgml.dot_product(a, b)` +If you just want the cliff notes: [Normalize your vectors](vector-normalization) and use the inner product as your distance metric between two vectors. This is implemented as: `pgml.dot_product(a, b)` !!! @@ -149,7 +154,7 @@ This metric is as fast to compute as the Euclidean Distance, but may provide mor !!! tip -This is probably the best all around distance metric. It's computationally simple, but also twice as fast due to optimized assembly intructions. It's also able to places more weight on the dominating dimensions of the vectors which can improve relevance during recall. As long as [your vectors are normalized](vector_normalization.md). +This is probably the best all around distance metric. It's computationally simple, but also twice as fast due to optimized assembly intructions. It's also able to places more weight on the dominating dimensions of the vectors which can improve relevance during recall. As long as [your vectors are normalized](vector-normalization). !!! @@ -214,7 +219,7 @@ Cosine distance is a popular metric, because it normalizes the vectors, which me !!! tip -Use PostgresML to [normalize all your vectors](vector_normalization.md) as a separate processing step to pay that cost only at indexing time, and then switch to the inner product which will provide equivalent distance measures, at 1/3 of the computation in the inner loop. _That's not exactly true on all platforms_, because the inner loop is implemented with optimized assembly that can take advantage of additional hardware acceleration, so make sure to always benchmark on your own hardware. On our hardware, the performance difference is negligible. 
+Use PostgresML to [normalize all your vectors](vector-normalization) as a separate processing step to pay that cost only at indexing time, and then switch to the inner product which will provide equivalent distance measures, at 1/3 of the computation in the inner loop. _That's not exactly true on all platforms_, because the inner loop is implemented with optimized assembly that can take advantage of additional hardware acceleration, so make sure to always benchmark on your own hardware. On our hardware, the performance difference is negligible. !!! diff --git a/pgml-cms/docs/use-cases/embeddings/generating-llm-embeddings-with-open-source-models-in-postgresml.md b/pgml-cms/docs/use-cases/embeddings/generating-llm-embeddings-with-open-source-models-in-postgresml.md index 396301e59..e69de29bb 100644 --- a/pgml-cms/docs/use-cases/embeddings/generating-llm-embeddings-with-open-source-models-in-postgresml.md +++ b/pgml-cms/docs/use-cases/embeddings/generating-llm-embeddings-with-open-source-models-in-postgresml.md @@ -1,343 +0,0 @@ -# Generating LLM embeddings with open source models - -PostgresML makes it easy to generate embeddings from text in your database using a large selection of state-of-the-art models with one simple call to **`pgml.embed`**`(model_name, text)`. Prove the results in this series to your own satisfaction, for free, by signing up for a GPU accelerated database. - -This article is the first in a multipart series that will show you how to build a post-modern semantic search and recommendation engine, including personalization, using open source models. - -1. Generating LLM Embeddings with HuggingFace models -2. Tuning vector recall with pgvector -3. Personalizing embedding results with application data -4. Optimizing semantic results with an XGBoost ranking model - coming soon! - -## Introduction - -In recent years, embeddings have become an increasingly popular technique in machine learning and data analysis. They are essentially vector representations of data points that capture their underlying characteristics or features. In most programming environments, vectors can be efficiently represented as native array datatypes. They can be used for a wide range of applications, from natural language processing to image recognition and recommendation systems. - -They can also turn natural language into quantitative features for downstream machine learning models and applications. - -_Embeddings show us the relationships between rows in the database._ - -A popular use case driving the adoption of "vector databases" is doing similarity search on embeddings, often referred to as "Semantic Search". This is a powerful technique that allows you to find similar items in large datasets by comparing their vectors. For example, you could use it to find similar products in an e-commerce site, similar songs in a music streaming service, or similar documents given a text query. - -Postgres is a good candidate for this type of application because it's a general purpose database that can store both the embeddings and the metadata in the same place, and has a rich set of features for querying and analyzing them, including fast vector indexes used for search. - -This chapter is the first in a multipart series that will show you how to build a modern semantic search and recommendation engine, including personalization, using PostgresML and open source models. 
We'll show you how to use the **`pgml.embed`** function to generate embeddings from text in your database using an open source pretrained model. Further chapters will expand on how to implement many of the different use cases for embeddings in Postgres, like similarity search, personalization, recommendations and fine-tuned models. - -## It always starts with data - -Most general purpose databases are full of all sorts of great data for machine learning use cases. Text data has historically been more difficult to deal with using complex Natural Language Processing techniques, but embeddings created from open source models can effectively turn unstructured text into structured features, perfect for more straightforward implementations. - -In this example, we'll demonstrate how to generate embeddings for products on an e-commerce site. We'll use a public dataset of millions of product reviews from the [Amazon US Reviews](https://huggingface.co/datasets/amazon\_us\_reviews). It includes the product title, a text review written by a customer and some additional metadata about the product, like category. With just a few pieces of data, we can create a full-featured and personalized product search and recommendation engine, using both generic embeddings and later, additional fine-tuned models trained with PostgresML. - -PostgresML includes a convenience function for loading public datasets from [HuggingFace](https://huggingface.co/datasets) directly into your database. To load the DVD subset of the Amazon US Reviews dataset into your database, run the following command: - -!!! code\_block - -```postgresql -SELECT * -FROM pgml.load_dataset('amazon_us_reviews', 'Video_DVD_v1_00'); -``` - -!!! - -It took about 23 minutes to download the 7.1GB raw dataset with 5,069,140 rows into a table within the `pgml` schema (where all PostgresML functionality is name-spaced). Once it's done, you can see the table structure with the following command: - -!!! generic - -!!! code\_block - -```postgresql -\d pgml.amazon_us_reviews -``` - -!!! - -!!! results - -| Column | Type | Collation | Nullable | Default | -| ------------------ | ------- | --------- | -------- | ------- | -| marketplace | text | | | | -| customer\_id | text | | | | -| review\_id | text | | | | -| product\_id | text | | | | -| product\_parent | text | | | | -| product\_title | text | | | | -| product\_category | text | | | | -| star\_rating | integer | | | | -| helpful\_votes | integer | | | | -| total\_votes | integer | | | | -| vine | bigint | | | | -| verified\_purchase | bigint | | | | -| review\_headline | text | | | | -| review\_body | text | | | | -| review\_date | text | | | | - -!!! - -!!! - -Let's take a peek at the first 5 rows of data: - -!!! code\_block - -```postgresql -SELECT * -FROM pgml.amazon_us_reviews -LIMIT 5; -``` - -!!! 
results - -| marketplace | customer\_id | review\_id | product\_id | product\_parent | product\_title | product\_category | star\_rating | helpful\_votes | total\_votes | vine | verified\_purchase | review\_headline | review\_body | review\_date | -| ----------- | ------------ | -------------- | ----------- | --------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------ | -------------- | ------------ | ---- | ------------------ | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | -| US | 27288431 | R33UPQQUZQEM8 | B005T4ND06 | 400024643 | Yoga for Movement Disorders DVD: Rebuilding Strength, Balance, and Flexibility for Parkinson's Disease and Dystonia | Video DVD | 5 | 3 | 3 | 0 | 1 | This was a gift for my aunt who has Parkinson's ... | This was a gift for my aunt who has Parkinson's. While I have not previewed it myself, I also have not gotten any complaints. My prior experiences with yoga tell me this should be just what the doctor ordered. | 2015-08-31 | -| US | 13722556 | R3IKTNQQPD9662 | B004EPZ070 | 685335564 | Something Borrowed | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Teats my heart out. | 2015-08-31 | -| US | 20381037 | R3U27V5QMCP27T | B005S9EKCW | 922008804 | Les Miserables (2012) \[Blu-ray] | Video DVD | 5 | 1 | 1 | 0 | 1 | Great movie! | Great movie. | 2015-08-31 | -| US | 24852644 | R2TOH2QKNK4IOC | B00FC1ZCB4 | 326560548 | Alien Anthology and Prometheus Bundle \[Blu-ray] | Video DVD | 5 | 0 | 1 | 0 | 1 | Amazing | My husband was so excited to receive these as a gift! Great picture quality and great value! | 2015-08-31 | -| US | 15556113 | R2XQG5NJ59UFMY | B002ZG98Z0 | 637495038 | Sex and the City 2 | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Love this series. | 2015-08-31 | - -!!! - -!!! - -## Generating embeddings from natural language text - -PostgresML provides a simple interface to generate embeddings from text in your database. You can use the [`pgml.embed`](https://postgresml.org/docs/transformers/embeddings) function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can see a list of potential good candidate models to generate embeddings on the [Massive Text Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard). - -Since our corpus of documents (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small`](https://huggingface.co/intfloat/e5-small) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models. - -It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast. - -Note how we prefix the text we want to embed with either `passage:` or `query:` , the e5 model requires us to prefix our data with `passage:` if we're generating embeddings for our corpus and `query:` if we want to find semantically similar content. 
- -```postgresql -SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom'); -``` - -This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres: - -```postgresql -\timing on -``` - -Aside from using this function with strings passed from a client, we can use it on strings already present in our database tables by calling **pgml.embed** on columns. For example, we can generate an embedding for the first review using a pretty simple query: - -!!! generic - -!!! code\_block time="54.820 ms" - -```postgresql -SELECT - review_body, - pgml.embed('intfloat/e5-small', 'passage: ' || review_body) -FROM pgml.amazon_us_reviews -LIMIT 1; -``` - -!!! - -!!! results - -``` -CREATE INDEX -``` - -!!! - -!!! - -Time to generate an embedding increases with the length of the input text, and varies widely between different models. If we up our batch size (controlled by `LIMIT`), we can see the average time to compute an embedding on the first 1000 reviews is about 17ms per review: - -!!! code\_block time="17955.026 ms" - -```postgresql -SELECT - review_body, - pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding -FROM pgml.amazon_us_reviews -LIMIT 1000; -``` - -!!! - -## Comparing different models and hardware performance - -This database is using a single GPU with 32GB RAM and 8 vCPUs with 16GB RAM. Running these benchmarks while looking at the database processes with `htop` and `nvidia-smi`, it becomes clear that the bottleneck in this case is actually tokenizing the strings which happens in a single thread on the CPU, not computing the embeddings on the GPU which was only 20% utilized during the query. - -We can also do a quick sanity check to make sure we're really getting value out of our GPU by passing the device to our embedding function: - -!!! code\_block time="30421.491 ms" - -```postgresql -SELECT - reviqew_body, - pgml.embed( - 'intfloat/e5-small', - 'passage: ' || review_body, - '{"device": "cpu"}' - ) AS embedding -FROM pgml.amazon_us_reviews -LIMIT 1000; -``` - -!!! - -Forcing the embedding function to use `cpu` is almost 2x slower than `cuda` which is the default when GPUs are available. - -If you're managing dedicated hardware, there's always a decision to be made about resource utilization. If this is a multi-workload database with other queries using the GPU, it's probably great that we're not completely hogging it with our multi-decade-Amazon-scale data import process, but if this is a machine we've spun up just for this task, we can up the resource utilization to 4 concurrent connections, all running on a subset of the data to more completely utilize our CPU, GPU and RAM. - -Another consideration is that GPUs are much more expensive right now than CPUs, and if we're primarily interested in backfilling a dataset like this, high concurrency across many CPU cores might just be the price-competitive winner. - -With 4x concurrency and a GPU, it'll take about 6 hours to compute all 5 million embeddings, which will cost $72 on PostgresML Cloud. If we use the CPU instead of the GPU, we'll probably want more cores and higher concurrency to plug through the job faster. A 96 CPU core machine could complete the job in half the time our single GPU would take and at a lower hourly cost as well, for a total cost of $24. 
It's overall more cost-effective and faster in parallel, but keep in mind if you're interactively generating embeddings for a user facing application, it will add double the latency, 30ms CPU vs 17ms for GPU. - -For comparison, it would cost about $299 to use OpenAI's cheapest embedding model to process this dataset. Their API calls average about 300ms, although they have high variability (200-400ms) and greater than 1000ms p99 in our measurements. They also have a default rate limit of 200 tokens per minute which means it would take 1,425 years to process this dataset. You better call ahead. - -| Processor | Latency | Cost | Time | -| --------- | ------- | ---- | --------- | -| CPU | 30ms | $24 | 3 hours | -| GPU | 17ms | $72 | 6 hours | -| OpenAI | 300ms | $299 | millennia | - -You can also find embedding models that outperform OpenAI's `text-embedding-ada-002` model across many different tests on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case. - -> _HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage._ - -## Instructor embedding models - -The current leading model is `hkunlp/instructor-xl`. Instructor models take an additional `instruction` parameter which includes context for the embeddings use case, similar to prompts before text generation tasks. - -Instructions can provide a "classification" or "topic" for the text: - -#### Classification - -!!! code\_block time="17.912ms" - -```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'The Federal Reserve on Wednesday raised its benchmark interest rate.', - kwargs => '{"instruction": "Represent the Financial statement:"}' -); -``` - -!!! - -They can also specify particular use cases for the embedding: - -#### Querying - -!!! code\_block time="24.263 ms" - -```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'where is the food stored in a yam plant', - kwargs => '{ - "instruction": "Represent the Wikipedia question for retrieving supporting documents:" - }' -); -``` - -!!! - -#### Indexing - -!!! code\_block time="30.571 ms" - -```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.', - kwargs => '{"instruction": "Represent the Wikipedia document for retrieval:"}' -); -``` - -!!! - -#### Clustering - -!!! code\_block time="18.986 ms" - -```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity"}', - kwargs => '{"instruction": "Represent the Medicine sentence for clustering:"}' -); -``` - -!!! - -Performance remains relatively good, even with the most advanced models. - -## Generating embeddings for a large dataset - -For our use case, we want to generate an embedding for every single review in the dataset. 
We'll use the `vector` datatype available from the `pgvector` extension to store (and later index) embeddings efficiently. All PostgresML cloud installations include [pgvector](https://github.com/pgvector/pgvector). To enable this extension in your database, you can run: - -```postgresql -CREATE EXTENSION vector; -``` - -Then we can add a `vector` column for our review embeddings, with 384 dimensions (the size of e5-small embeddings): - -```postgresql -ALTER TABLE pgml.amazon_us_reviews -ADD COLUMN review_embedding_e5_large vector(1024); -``` - -It's best practice to keep running queries on a production database relatively short, so rather than trying to update all 5M rows in one multi-hour query, we should write a function to issue the updates in smaller batches. To make iterating over the rows easier and more efficient, we'll add an `id` column with an index to our table: - -```postgresql -ALTER TABLE pgml.amazon_us_reviews -ADD COLUMN id SERIAL PRIMARY KEY; -``` - -Every language/framework/codebase has its own preferred method for backfilling data in a table. The 2 most important considerations are: - -1. Keep the number of rows per query small enough that the queries take less than a second -2. More concurrency will get the job done faster, but keep in mind the other workloads on your database - -Here's an example of a very simple back-fill job implemented in pure PGSQL, but I'd also love to see example PRs opened with your techniques in your language of choice for tasks like this. - -```postgresql -DO $$ -BEGIN - FOR i in 1..(SELECT max(id) FROM pgml.amazon_us_reviews) by 10 LOOP - BEGIN RAISE NOTICE 'updating % to %', i, i + 10; END; - - UPDATE pgml.amazon_us_reviews - SET review_embedding_e5_large = pgml.embed( - 'intfloat/e5-large', - 'passage: ' || review_body - ) - WHERE id BETWEEN i AND i + 10 - AND review_embedding_e5_large IS NULL; - - COMMIT; - END LOOP; -END; -$$; -``` - -## What's next? - -That's it for now. We've got an Amazon scale table with state-of-the-art machine learning embeddings. As a premature optimization, we'll go ahead and build an index on our new column to make our future vector similarity queries faster. For the full documentation on vector indexes in Postgres see the [pgvector docs](https://github.com/pgvector/pgvector). - -!!! code\_block time="4068909.269 ms (01:07:48.909)" - -```postgresql -CREATE INDEX CONCURRENTLY index_amazon_us_reviews_on_review_embedding_e5_large -ON pgml.amazon_us_reviews -USING ivfflat (review_embedding_e5_large vector_cosine_ops) -WITH (lists = 2000); -``` - -!!! - -!!! tip - -Create indexes `CONCURRENTLY` to avoid locking your table for other queries. - -!!! - -Building a vector index on a table with this many entries takes a while, so this is a good time to take a coffee break. In the next article we'll look at how to query these embeddings to find the best products and make personalized recommendations for users. We'll also cover updating an index in real time as new data comes in. 
diff --git a/pgml-cms/docs/use-cases/fraud-detection.md b/pgml-cms/docs/use-cases/fraud-detection.md deleted file mode 100644 index dbe05b5dd..000000000 --- a/pgml-cms/docs/use-cases/fraud-detection.md +++ /dev/null @@ -1,3 +0,0 @@ -# Fraud Detection - -Describe this app, write a GitHub issue and ask people to do a :thumbsup:on the issue diff --git a/pgml-cms/docs/use-cases/recommendation-engine.md b/pgml-cms/docs/use-cases/recommendation-engine.md deleted file mode 100644 index 73e132a6e..000000000 --- a/pgml-cms/docs/use-cases/recommendation-engine.md +++ /dev/null @@ -1,3 +0,0 @@ -# Recommendation Engine - -Describe this app, write a GitHub issue and ask people to do a :thumbsup:on the issue diff --git a/pgml-cms/docs/use-cases/time-series-forecasting.md b/pgml-cms/docs/use-cases/time-series-forecasting.md deleted file mode 100644 index a7f7ab998..000000000 --- a/pgml-cms/docs/use-cases/time-series-forecasting.md +++ /dev/null @@ -1,2 +0,0 @@ -# Time-series Forecasting - diff --git a/pgml-dashboard/src/utils/markdown.rs b/pgml-dashboard/src/utils/markdown.rs index 84f9243bf..4cb4b136c 100644 --- a/pgml-dashboard/src/utils/markdown.rs +++ b/pgml-dashboard/src/utils/markdown.rs @@ -262,8 +262,6 @@ impl SyntaxHighlighterAdapter for SyntaxHighlighter { fn build_pre_tag(&self, _attributes: &HashMap) -> String { String::from("
copy#codeCopy\" class=\"material-symbols-outlined btn-code-toolbar\">content_copy - link - edit
") } diff --git a/pgml-dashboard/static/css/scss/base/_base.scss b/pgml-dashboard/static/css/scss/base/_base.scss index c6cc402b1..80ca64b33 100644 --- a/pgml-dashboard/static/css/scss/base/_base.scss +++ b/pgml-dashboard/static/css/scss/base/_base.scss @@ -41,10 +41,6 @@ pre { } } -pre[data-controller="copy"] { - padding-top: 2rem; -} - // links a { text-decoration: none; diff --git a/pgml-dashboard/static/css/scss/components/_admonitions.scss b/pgml-dashboard/static/css/scss/components/_admonitions.scss index 6e3dde527..e145e7dc8 100644 --- a/pgml-dashboard/static/css/scss/components/_admonitions.scss +++ b/pgml-dashboard/static/css/scss/components/_admonitions.scss @@ -69,9 +69,6 @@ pre { margin: 0px; } - pre[data-controller="copy"] { - padding-top: 2rem !important; - } div.code-block { border: none !important; diff --git a/pgml-dashboard/static/css/scss/components/_code.scss b/pgml-dashboard/static/css/scss/components/_code.scss index f7c97f2a0..a9973069b 100644 --- a/pgml-dashboard/static/css/scss/components/_code.scss +++ b/pgml-dashboard/static/css/scss/components/_code.scss @@ -143,7 +143,7 @@ pre { pre { background-color: #{$gray-500}; code { - background-color: #{$gray-500}; + background-color: #{$gray-600}; } } @@ -222,6 +222,7 @@ pre { pre { background-color: #{$gray-600}; border-radius: #{$border-radius}; + border: solid 2px white; code { border: none; diff --git a/pgml-dashboard/static/css/scss/pages/_docs.scss b/pgml-dashboard/static/css/scss/pages/_docs.scss index 2bf785658..c1354f99d 100644 --- a/pgml-dashboard/static/css/scss/pages/_docs.scss +++ b/pgml-dashboard/static/css/scss/pages/_docs.scss @@ -3,17 +3,18 @@ div.results { overflow-x: auto; - margin: 24px 24px; + margin: 0; + padding: 0; + .code-toolbar { - display: none; + display: none !important; } pre { - background-color: #{$gray-500}; - code { - background-color: #{$gray-500}; - } + padding: 0 !important; + border: none; + margin: 0; } .title { @@ -32,13 +33,15 @@ border-start-end-radius: 0px; } } + + > * { + padding: 0.5rem 1rem; + } } div.code-block { overflow-x: auto; - border: 2px solid $slate-tint-1000; - border-radius: 8px; - margin: 24px 0px; + border-bottom: 2px solid white; .title { background-color: #{$gray-700}; @@ -50,6 +53,8 @@ pre { margin: 0px; + border: none; + padding: 0px !important; } &.with-title { @@ -106,10 +111,41 @@ pre { background-color: #{$gray-600}; border-radius: #{$border-radius}; + padding: 0; + position: relative; code { border: none; white-space: inherit; + padding: 0; + } + + .code-toolbar { + display: none; + z-index: 1; + border: 2px solid white; + border-bottom-left-radius: 8px; + border-top-right-radius: 8px; + right: -2px; + top: -2px; + } + + .cm-gutters { + background: $gray-800; + } + + .cm-activeLineGutter { + background: $gray-800; + } + + .cm-content { + padding: 0.75rem; + } + } + + pre:hover { + .code-toolbar { + display: flex; } } @@ -186,15 +222,11 @@ // Codemirror overrideds .cm-editor { background: inherit; - - // default no line numbers. 
- .cm-gutters { - display: none; - } } .cm-gutters { background: inherit; + border-right: 1px solid white; } .code-highlight { From 436b44790f90a0378e7901d6fc9921990099e2e1 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Sat, 11 May 2024 20:16:33 -0700 Subject: [PATCH 07/11] css --- ...s-with-open-source-models-in-postgresml.md | 4 ++-- pgml-cms/docs/SUMMARY.md | 6 ++--- .../css/scss/components/_admonitions.scss | 1 - .../static/css/scss/pages/_docs.scss | 23 +++++++++++++++++-- 4 files changed, 26 insertions(+), 8 deletions(-) diff --git a/pgml-cms/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md b/pgml-cms/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md index 317a0d346..b66129614 100644 --- a/pgml-cms/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md +++ b/pgml-cms/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md @@ -142,7 +142,7 @@ Aside from using this function with strings passed from a client, we can use it !!! generic -!!! code\_block time="54.820 ms" +!!! code_block time="54.820 ms" ```postgresql SELECT @@ -156,7 +156,7 @@ LIMIT 1; !!! results -``` +```postgressql CREATE INDEX ``` diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index acc24ce3f..686688728 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -58,9 +58,9 @@ * [Dimensionality Reduction]() * [Re-ranking nearest neighbors]() * [Indexing w/ pgvector]() - * [Aggregation](guides/embeddings/vector-aggregation) - * [Similarity](guides/embeddings/vector-similarity) - * [Normalization](guides/embeddings/vector-normalization) + * [Aggregation](guides/embeddings/vector-aggregation.md) + * [Similarity](guides/embeddings/vector-similarity.md) + * [Normalization](guides/embeddings/vector-normalization.md) * [Search]() * [Keyword Search]() * [Vector Search]() diff --git a/pgml-dashboard/static/css/scss/components/_admonitions.scss b/pgml-dashboard/static/css/scss/components/_admonitions.scss index e145e7dc8..ed9e13153 100644 --- a/pgml-dashboard/static/css/scss/components/_admonitions.scss +++ b/pgml-dashboard/static/css/scss/components/_admonitions.scss @@ -81,7 +81,6 @@ .execution-time { border-top: 2px solid #{$gray-100}; - border-bottom: 2px solid #{$gray-100}; background-color: #{$gray-600}; padding: 12px 12px; margin: 0px !important; diff --git a/pgml-dashboard/static/css/scss/pages/_docs.scss b/pgml-dashboard/static/css/scss/pages/_docs.scss index c1354f99d..2fb5aebde 100644 --- a/pgml-dashboard/static/css/scss/pages/_docs.scss +++ b/pgml-dashboard/static/css/scss/pages/_docs.scss @@ -5,7 +5,7 @@ overflow-x: auto; margin: 0; padding: 0; - + border-top: 2px solid white; .code-toolbar { display: none !important; @@ -17,6 +17,25 @@ margin: 0; } + .overflow-auto { + margin: 0; + } + + table { + margin: 0; + border-spacing: 0; + background-color: #{$gray-900}; + + tr { + padding: 0 0.5rem; + } + + td, th { + border: 1px solid #{$gray-800}; + padding: 0.1rem 0.5rem; + } + } + .title { background-color: #{$gray-700}; border-start-start-radius: 8px; @@ -35,7 +54,7 @@ } > * { - padding: 0.5rem 1rem; + margin: 0.5rem 1rem; } } From e2a28ebc7196f09b3fa3e6b8852f1605e4d8bce4 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Thu, 23 May 2024 11:55:12 -0700 Subject: [PATCH 08/11] postgres dialect --- README.md | 102 +++--- ...lm-support-for-huggingface-transformers.md | 22 +- ...ve-search-results-with-machine-learning.md | 26 +- pgml-cms/blog/mindsdb-vs-postgresml.md | 8 +- 
...xtension-for-querying-system-statistics.md | 8 +- .../postgres-full-text-search-is-awesome.md | 2 +- ...l-is-moving-to-rust-for-our-2.0-release.md | 24 +- ...nalysis-using-express-js-and-postgresml.md | 4 +- pgml-cms/blog/the-1.0-sdk-is-here.md | 2 +- .../which-database-that-is-the-question.md | 2 +- pgml-cms/docs/README.md | 2 +- pgml-cms/docs/SUMMARY.md | 4 +- pgml-cms/docs/api/sql-extension/pgml.chunk.md | 10 +- .../docs/api/sql-extension/pgml.decompose.md | 4 +- .../docs/api/sql-extension/pgml.deploy.md | 22 +- pgml-cms/docs/api/sql-extension/pgml.embed.md | 45 +-- .../api/sql-extension/pgml.predict/README.md | 20 +- .../pgml.predict/batch-predictions.md | 10 +- .../api/sql-extension/pgml.train/README.md | 6 +- .../pgml.train/classification.md | 12 +- .../sql-extension/pgml.train/clustering.md | 4 +- .../pgml.train/data-pre-processing.md | 10 +- .../sql-extension/pgml.train/decomposition.md | 4 +- .../pgml.train/hyperparameter-search.md | 2 +- .../pgml.train/joint-optimization.md | 2 +- .../sql-extension/pgml.train/regression.md | 12 +- .../sql-extension/pgml.transform/fill-mask.md | 2 +- .../pgml.transform/question-answering.md | 2 +- .../pgml.transform/summarization.md | 2 +- .../pgml.transform/text-classification.md | 14 +- .../pgml.transform/text-generation.md | 14 +- .../pgml.transform/text-to-text-generation.md | 4 +- .../pgml.transform/token-classification.md | 4 +- .../pgml.transform/translation.md | 2 +- .../zero-shot-classification.md | 2 +- pgml-cms/docs/api/sql-extension/pgml.tune.md | 56 ++-- .../embeddings/dimensionality-reduction.md | 6 + .../embeddings/in-database-generation.md | 313 ++++++------------ .../guides/embeddings/indexing-w-pgvector.md | 1 + .../re-ranking-nearest-neighbors.md | 3 + .../guides/embeddings/vector-aggregation.md | 14 +- .../guides/embeddings/vector-normalization.md | 16 +- .../guides/embeddings/vector-similarity.md | 16 +- .../getting-started/import-your-data/copy.md | 2 +- pgml-cms/docs/product/vector-database.md | 8 +- ...lm-support-for-huggingface-transformers.md | 22 +- .../benchmarks/mindsdb-vs-postgresml.md | 8 +- ...x-faster-than-python-http-microservices.md | 2 +- .../data-storage-and-retrieval/README.md | 8 +- .../data-storage-and-retrieval/documents.md | 14 +- .../partitioning.md | 22 +- .../resources/developer-docs/contributing.md | 6 +- .../resources/developer-docs/gpu-support.md | 4 +- .../self-hosting/replication.md | 2 +- pgml-cms/docs/use-cases/embeddings/README.md | 14 +- ...ve-search-results-with-machine-learning.md | 26 +- .../docs/use-cases/supervised-learning.md | 18 +- .../static/images/gym/quick_start.md | 12 +- pgml-extension/examples/regression.sql | 8 + pgml-extension/src/api.rs | 2 +- 60 files changed, 456 insertions(+), 562 deletions(-) create mode 100644 pgml-cms/docs/guides/embeddings/dimensionality-reduction.md create mode 100644 pgml-cms/docs/guides/embeddings/indexing-w-pgvector.md create mode 100644 pgml-cms/docs/guides/embeddings/re-ranking-nearest-neighbors.md diff --git a/README.md b/README.md index 5d16fd988..382d28c6e 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ PostgresML is a machine learning extension for PostgreSQL that enables you to pe *SQL query* -```sql +```postgresql SELECT pgml.transform( 'translation_en_to_fr', inputs => ARRAY[ @@ -76,7 +76,7 @@ SELECT pgml.transform( ``` *Result* -```sql +```postgresql french ------------------------------------------------------------ @@ -89,7 +89,7 @@ SELECT pgml.transform( **Sentiment Analysis** *SQL query* -```sql +```postgresql SELECT 
pgml.transform( task => 'text-classification', inputs => ARRAY[ @@ -99,7 +99,7 @@ SELECT pgml.transform( ) AS positivity; ``` *Result* -```sql +```postgresql positivity ------------------------------------------------------ [ @@ -117,7 +117,7 @@ SELECT pgml.transform( **Training a classification model** *Training* -```sql +```postgresql SELECT * FROM pgml.train( 'Handwritten Digit Image Classifier', algorithm => 'xgboost', @@ -128,7 +128,7 @@ SELECT * FROM pgml.train( ``` *Inference* -```sql +```postgresql SELECT pgml.predict( 'My Classification Project', ARRAY[0.1, 2.0, 5.0] @@ -203,7 +203,7 @@ PostgresML integrates 🤗 Hugging Face Transformers to bring state-of-the-art N You can call different NLP tasks and customize using them using the following SQL query. -```sql +```postgresql SELECT pgml.transform( task => TEXT OR JSONB, -- Pipeline initializer arguments inputs => TEXT[] OR BYTEA[], -- inputs for inference @@ -220,7 +220,7 @@ Text classification involves assigning a label or category to a given text. Comm Sentiment analysis is a type of natural language processing technique that involves analyzing a piece of text to determine the sentiment or emotion expressed within it. It can be used to classify a text as positive, negative, or neutral, and has a wide range of applications in fields such as marketing, customer service, and political analysis. *Basic usage* -```sql +```postgresql SELECT pgml.transform( task => 'text-classification', inputs => ARRAY[ @@ -242,7 +242,7 @@ The default model trained on around 40,000 English tweets and that has POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I love how amazingly simple ML has become!', @@ -265,7 +265,7 @@ SELECT pgml.transform( By selecting a model that has been specifically designed for a particular industry, you can achieve more accurate and relevant text classification. An example of such a model is FinBERT, a pre-trained NLP model that has been optimized for analyzing sentiment in financial text. FinBERT was created by training the BERT language model on a large financial corpus, and fine-tuning it to specifically classify financial sentiment. When using FinBERT, the model will provide softmax outputs for three different labels: positive, negative, or neutral. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'Stocks rallied and the British pound gained.', @@ -295,7 +295,7 @@ The GLUE dataset is the benchmark dataset for evaluating NLI models. There are d If you want to use an NLI model, you can find them on the :hugs: Hugging Face model hub. Look for models with "mnli". -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'A soccer game with multiple males playing. Some men are playing a sport.' @@ -316,7 +316,7 @@ The QNLI task involves determining whether a given question can be answered by t If you want to use an QNLI model, you can find them on the :hugs: Hugging Face model hub. Look for models with "qnli". -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'Where is the capital of France?, Paris is the capital of France.' @@ -339,7 +339,7 @@ The Quora Question Pairs model is designed to evaluate whether two given questio If you want to use an QQP model, you can find them on the :hugs: Hugging Face model hub. Look for models with `qqp`. 
-```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'Which city is the capital of France?, Where is the capital of France?' @@ -362,7 +362,7 @@ Linguistic Acceptability is a task that involves evaluating the grammatical corr If you want to use a grammatical correctness model, you can find them on the :hugs: Hugging Face model hub. Look for models with `cola`. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I will walk to home when I went through the bus.' @@ -388,7 +388,7 @@ In the example provided below, we will demonstrate how to classify a given sente Look for models with `mnli` to use a zero-shot classification model on the :hugs: Hugging Face model hub. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I have a problem with my iphone that needs to be resolved asap!!' @@ -421,7 +421,7 @@ Token classification is a task in natural language understanding, where labels a ### Named Entity Recognition Named Entity Recognition (NER) is a task that involves identifying named entities in a text. These entities can include the names of people, locations, or organizations. The task is completed by labeling each token with a class for each named entity and a class named "0" for tokens that don't contain any entities. In this task, the input is text, and the output is the annotated text with named entities. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I am Omar and I live in New York City.' @@ -443,7 +443,7 @@ SELECT pgml.transform( PoS tagging is a task that involves identifying the parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. In this task, the model labels each word with a specific part of speech. Look for models with `pos` to use a zero-shot classification model on the :hugs: Hugging Face model hub. -```sql +```postgresql select pgml.transform( inputs => array [ 'I live in Amsterdam.' @@ -470,7 +470,7 @@ Translation is the task of converting text written in one language into another You have the option to select from over 2000 models available on the Hugging Face hub for translation. -```sql +```postgresql select pgml.transform( inputs => array[ 'How are you?' @@ -491,7 +491,7 @@ Summarization involves creating a condensed version of a document that includes ![summarization](pgml-cms/docs/images/summarization.png) -```sql +```postgresql select pgml.transform( task => '{"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6" @@ -509,7 +509,7 @@ select pgml.transform( ``` You can control the length of summary_text by passing `min_length` and `max_length` as arguments to the SQL query. -```sql +```postgresql select pgml.transform( task => '{"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6" @@ -535,7 +535,7 @@ Question Answering models are designed to retrieve the answer to a question from ![question answering](pgml-cms/docs/images/question-answering.png) -```sql +```postgresql SELECT pgml.transform( 'question-answering', inputs => ARRAY[ @@ -564,7 +564,7 @@ Text generation is the task of producing new text, such as filling in incomplete ![text generation](pgml-cms/docs/images/text-generation.png) -```sql +```postgresql SELECT pgml.transform( task => 'text-generation', inputs => ARRAY[ @@ -584,7 +584,7 @@ SELECT pgml.transform( To use a specific model from :hugs: model hub, pass the model name along with task name in task. 
-```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -603,7 +603,7 @@ SELECT pgml.transform( ``` To make the generated text longer, you can include the argument `max_length` and specify the desired maximum length of the text. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -625,7 +625,7 @@ SELECT pgml.transform( ``` If you want the model to generate more than one output, you can specify the number of desired output sequences by including the argument `num_return_sequences` in the arguments. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -651,7 +651,7 @@ SELECT pgml.transform( ``` Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high probability word combinations. Beam search achieves this by retaining the num_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses reached the EOS token. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -681,7 +681,7 @@ However, the randomness of the sampling method can also result in less coherent You can pass `do_sample = True` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p` but not both. *Temperature* -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -702,7 +702,7 @@ SELECT pgml.transform( ``` *Top p* -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -726,7 +726,7 @@ Text-to-text generation methods, such as T5, are neural network architectures de ![text-to-text](pgml-cms/docs/images/text-to-text-generation.png) *Translation* -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text2text-generation" @@ -745,7 +745,7 @@ SELECT pgml.transform( ``` Similar to other tasks, we can specify a model for text-to-text generation. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text2text-generation", @@ -762,7 +762,7 @@ SELECT pgml.transform( Fill-mask refers to a task where certain words in a sentence are hidden or "masked", and the objective is to predict what words should fill in those masked positions. Such models are valuable when we want to gain statistical insights about the language used to train the model. ![fill mask](pgml-cms/docs/images/fill-mask.png) -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "fill-mask" @@ -794,7 +794,7 @@ Using a vector database involves three key steps: creating embeddings, indexing To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data. In the following section, we will demonstrate how to use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. To generate the embeddings, we will use the `pgml.embed` function, which will generate an embedding for each tweet in the dataset. These embeddings will then be inserted into a table called tweet_embeddings. 
-```sql +```postgresql SELECT pgml.load_dataset('tweet_eval', 'sentiment'); SELECT * @@ -831,13 +831,13 @@ The index is being created on the embedding column in the tweet_embeddings table By creating an index on the embedding column, the database can quickly search for and retrieve records that are similar to a given query vector. This can be useful for a variety of machine learning applications, such as similarity search or recommendation systems. -```sql +```postgresql CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops); ``` ## Step 3: Querying the index using embeddings for your queries Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you'll need to provide a query embedding that represents the query you want to perform. The index will then return the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings. -```sql +```postgresql WITH query AS ( SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding ) @@ -876,7 +876,7 @@ In this section, we will provide a step-by-step walkthrough for fine-tuning a La ### 1. Loading the Dataset To begin, create a table to store your dataset. In this example, we use the 'imdb' dataset from Hugging Face. IMDB dataset contains three splits: train (25K rows), test (25K rows) and unsupervised (50K rows). In train and test splits, negative class has label 0 and positive class label 1. All rows in unsupervised split has a label of -1. -```sql +```postgresql SELECT pgml.load_dataset('imdb'); ``` @@ -888,7 +888,7 @@ We will create a view of the dataset by performing the following operations: - Shuffled view of the dataset to ensure randomness in the distribution of data. - Remove all the unsupervised splits that have label = -1. -```sql +```postgresql CREATE VIEW pgml.imdb_shuffled_view AS SELECT label, @@ -910,7 +910,7 @@ Before splitting the data into training and test sets, it's essential to perform To analyze the distribution of labels in the shuffled dataset, you can use the following SQL query: -```sql +```postgresql -- Count the occurrences of each label in the shuffled dataset pgml=# SELECT class, @@ -931,7 +931,7 @@ This query provides insights into the distribution of labels, helping you unders #### 3.2 Sample Records To get a glimpse of the data, you can retrieve a sample of records from the shuffled dataset: -```sql +```postgresql -- Retrieve a sample of records from the shuffled dataset pgml=# SELECT LEFT(text,100) AS text, class FROM pgml.imdb_shuffled_view @@ -957,7 +957,7 @@ Feel free to explore other aspects of the data, such as the distribution of text Create views for training and test data by splitting the shuffled dataset. In this example, 80% is allocated for training, and 20% for testing. We will use `pgml.imdb_test_view` in [section 6](#6-inference-using-fine-tuned-model) for batch predictions using the finetuned model. -```sql +```postgresql -- Create a view for training data CREATE VIEW pgml.imdb_train_view AS SELECT * @@ -975,7 +975,7 @@ OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.imdb_shuffled_view); Now, fine-tune the Language Model for text classification using the created training view. In the following sections, you will see a detailed explanation of different parameters used during fine-tuning. Fine-tuned model is pushed to your public Hugging Face Hub periodically. 
A new repository will be created under your username using your project name (`imdb_review_sentiment` in this case). You can also choose to push the model to a private repository by setting `hub_private_repo: true` in training arguments. -```sql +```postgresql SELECT pgml.tune( 'imdb_review_sentiment', task => 'text-classification', @@ -1122,7 +1122,7 @@ Now, that we have fine-tuned model on Hugging Face Hub, we can use [`pgml.transf **Real-time predictions** Here is an example pgml.transform call for real-time predictions on the newly minted LLM fine-tuned on IMDB review dataset. -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -1143,7 +1143,7 @@ Time: 175.264 ms **Batch predictions** -```sql +```postgresql pgml=# SELECT LEFT(text, 100) AS truncated_text, class, @@ -1179,14 +1179,14 @@ Sometimes, it's necessary to restart the training process from a previously trai Specify the name of the existing model you want to use as a starting point. This is achieved by setting the `model_name` parameter in the `pgml.tune` function. In the example below, it is set to 'santiadavani/imdb_review_sentiement'. -```sql +```postgresql model_name => 'santiadavani/imdb_review_sentiement', ``` ### Adjust Hyperparameters Fine-tune hyperparameters as needed for the restarted training process. This might include modifying learning rates, batch sizes, or training epochs. In the example below, hyperparameters such as learning rate, batch sizes, and epochs are adjusted. -```sql +```postgresql "training_args": { "learning_rate": 2e-5, "per_device_train_batch_size": 16, @@ -1201,7 +1201,7 @@ Fine-tune hyperparameters as needed for the restarted training process. This mig ### Ensure Consistent Dataset Configuration Confirm that the dataset configuration remains consistent, including specifying the same text and class columns as in the previous training. This ensures compatibility between the existing model and the restarted training process. -```sql +```postgresql "dataset_args": { "text_column": "text", "class_column": "class" @@ -1211,7 +1211,7 @@ Confirm that the dataset configuration remains consistent, including specifying ### Run the pgml.tune Function Execute the `pgml.tune` function with the updated parameters to initiate the training restart. The function will leverage the existing model and adapt it based on the adjusted hyperparameters and dataset configuration. -```sql +```postgresql SELECT pgml.tune( 'imdb_review_sentiement', task => 'text-classification', @@ -1250,7 +1250,7 @@ However, in certain scenarios, pushing the model to a central repository and pul ### 1. Load and Shuffle the Dataset In this section, we begin by loading the FinGPT sentiment analysis dataset using the `pgml.load_dataset` function. The dataset is then processed and organized into a shuffled view (pgml.fingpt_sentiment_shuffled_view), ensuring a randomized order of records. This step is crucial for preventing biases introduced by the original data ordering and enhancing the training process. -```sql +```postgresql -- Load the dataset SELECT pgml.load_dataset('FinGPT/fingpt-sentiment-train'); @@ -1262,7 +1262,7 @@ SELECT * FROM pgml."FinGPT/fingpt-sentiment-train" ORDER BY RANDOM(); ### 2. Explore Class Distribution Once the dataset is loaded and shuffled, we delve into understanding the distribution of sentiment classes within the data. By querying the shuffled view, we obtain valuable insights into the number of instances for each sentiment class. 
This exploration is essential for gaining a comprehensive understanding of the dataset and its inherent class imbalances. -```sql +```postgresql -- Explore class distribution SELECTpgml=# SELECT output, @@ -1288,7 +1288,7 @@ ORDER BY output; ### 3. Create Training and Test Views To facilitate the training process, we create distinct views for training and testing purposes. The training view (pgml.fingpt_sentiment_train_view) contains 80% of the shuffled dataset, enabling the model to learn patterns and associations. Simultaneously, the test view (pgml.fingpt_sentiment_test_view) encompasses the remaining 20% of the data, providing a reliable evaluation set to assess the model's performance. -```sql +```postgresql -- Create a view for training data (e.g., 80% of the shuffled records) CREATE VIEW pgml.fingpt_sentiment_train_view AS SELECT * @@ -1306,7 +1306,7 @@ OFFSET (SELECT COUNT(*) * 0.8 FROM pgml.fingpt_sentiment_shuffled_view); ### 4. Fine-Tune the Model for 9 Classes In the final section, we kick off the fine-tuning process using the `pgml.tune` function. The model will be internally configured for sentiment analysis with 9 classes. The training is executed on the 80% of the train view and evaluated on the remaining 20% of the train view. The test view is reserved for evaluating the model's accuracy after training is completed. Please note that the option `hub_private_repo: true` is used to push the model to a private Hugging Face repository. -```sql +```postgresql -- Fine-tune the model for 9 classes without HUB token SELECT pgml.tune( 'fingpt_sentiement', @@ -1387,7 +1387,7 @@ What makes the conversation task truly remarkable is its remarkable versatility. ### Fine-tuning Llama2-7b model using LoRA In this section, we will explore how to fine-tune the Llama2-7b-chat large language model for the financial sentiment data discussed in the previous [section](#text-classification-9-classes) utilizing the pgml.tune function and employing the LoRA approach. LoRA is a technique that enables efficient fine-tuning of large language models by only updating a small subset of the model's weights during fine-tuning, while keeping the majority of the weights frozen. This approach can significantly reduce the computational requirements and memory footprint compared to traditional full model fine-tuning. -```sql +```postgresql SELECT pgml.tune( 'fingpt-llama2-7b-chat', task => 'conversation', diff --git a/pgml-cms/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md b/pgml-cms/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md index 6242776db..70f0202e0 100644 --- a/pgml-cms/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md +++ b/pgml-cms/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md @@ -41,7 +41,7 @@ You can select the data type for torch tensors in PostgresML by setting the `tor !!! code\_block time="4584.906 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "model": "tiiuae/falcon-7b-instruct", @@ -102,7 +102,7 @@ PostgresML will automatically use GPTQ or GGML when a HuggingFace model has one !!! code\_block time="281.213 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -136,7 +136,7 @@ SELECT pgml.transform( !!! code\_block time="252.213 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -167,7 +167,7 @@ SELECT pgml.transform( !!! 
code\_block time="279.888 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -204,7 +204,7 @@ We can specify the CPU by passing a `"device": "cpu"` argument to the `task`. !!! code\_block time="266.997 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -236,7 +236,7 @@ SELECT pgml.transform( !!! code\_block time="33224.136 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -274,7 +274,7 @@ HuggingFace and these libraries have a lot of great models. Not all of these mod !!! code\_block time="3411.324 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -306,7 +306,7 @@ SELECT pgml.transform( !!! code\_block time="4198.817 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -338,7 +338,7 @@ SELECT pgml.transform( !!! code\_block time="4198.817 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -372,7 +372,7 @@ Many of these models are published with multiple different quantization methods !!! code\_block time="6498.597" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -410,7 +410,7 @@ Shoutout to [Tostino](https://github.com/Tostino/) for the extended example belo !!! code\_block time="3784.565" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", diff --git a/pgml-cms/blog/how-to-improve-search-results-with-machine-learning.md b/pgml-cms/blog/how-to-improve-search-results-with-machine-learning.md index 5ee950918..074d431ea 100644 --- a/pgml-cms/blog/how-to-improve-search-results-with-machine-learning.md +++ b/pgml-cms/blog/how-to-improve-search-results-with-machine-learning.md @@ -36,7 +36,7 @@ Our search application will start with a **documents** table. Our documents have !!! code\_block time="10.493 ms" -```sql +```postgresql CREATE TABLE documents ( id BIGSERIAL PRIMARY KEY, title TEXT, @@ -54,7 +54,7 @@ We can add new documents to our _text corpus_ with the standard SQL `INSERT` sta !!! code\_block time="3.417 ms" -```sql +```postgresql INSERT INTO documents (title, body) VALUES ('This is a title', 'This is the body of the first document.'), ('This is another title', 'This is the body of the second document.'), @@ -79,7 +79,7 @@ You can configure the grammatical rules in many advanced ways, but we'll use the !!! code\_block time="0.651 ms" -```sql +```postgresql SELECT * FROM documents WHERE to_tsvector('english', body) @@ to_tsquery('english', 'second'); @@ -109,7 +109,7 @@ The first step is to store the `tsvector` in the table, so we don't have to gene !!! code\_block time="17.883 ms" -```sql +```postgresql ALTER TABLE documents ADD COLUMN title_and_body_text tsvector GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body )) STORED; @@ -125,7 +125,7 @@ One nice aspect of generated columns is that they will backfill the data for exi !!! code\_block time="5.145 ms" -```sql +```postgresql CREATE INDEX documents_title_and_body_text_index ON documents USING GIN (title_and_body_text); @@ -141,7 +141,7 @@ And now, we'll demonstrate a slightly more complex `tsquery`, that requires both !!! code\_block time="3.673 ms" -```sql +```postgresql SELECT * FROM documents WHERE title_and_body_text @@ to_tsquery('english', 'another & second'); @@ -171,7 +171,7 @@ With multiple query terms OR `|` together, the `ts_rank` will add the numerators !!! 
code\_block time="0.561 ms" -```sql +```postgresql SELECT ts_rank(title_and_body_text, to_tsquery('english', 'second | title')), * FROM documents ORDER BY ts_rank DESC; @@ -201,7 +201,7 @@ A quick improvement we could make to our search query would be to differentiate !!! code\_block time="0.561 ms" -```sql +```postgresql SELECT ts_rank(title, to_tsquery('english', 'second | title')) AS title_rank, ts_rank(body, to_tsquery('english', 'second | title')) AS body_rank, @@ -230,7 +230,7 @@ First things first, we need to record some user clicks on our search results. We !!! code\_block time="0.561 ms" -```sql +```postgresql CREATE TABLE search_result_clicks ( title_rank REAL, body_rank REAL, @@ -250,7 +250,7 @@ I've made up 4 example searches, across our 3 documents, and recorded the `ts_ra !!! code\_block time="2.161 ms" -```sql +```postgresql INSERT INTO search_result_clicks (title_rank, body_rank, clicked) VALUES @@ -289,7 +289,7 @@ Here goes some machine learning: !!! code\_block time="6.867 ms" -```sql +```postgresql SELECT * FROM pgml.train( project_name => 'Search Ranking', task => 'regression', @@ -336,7 +336,7 @@ Once a model is trained, you can use `pgml.predict` to use it on new inputs. `pg !!! code\_block time="3.119 ms" -```sql +```postgresql SELECT clicked, pgml.predict('Search Ranking', array[title_rank, body_rank]) @@ -389,7 +389,7 @@ It's nice to organize the query into logical steps, and we can use **Common Tabl !!! code\_block time="2.118 ms" -```sql +```postgresql WITH first_pass_ranked_documents AS ( SELECT -- Compute the ts_rank for the title and body text of each document diff --git a/pgml-cms/blog/mindsdb-vs-postgresml.md b/pgml-cms/blog/mindsdb-vs-postgresml.md index 9b92bd851..6459d2d9e 100644 --- a/pgml-cms/blog/mindsdb-vs-postgresml.md +++ b/pgml-cms/blog/mindsdb-vs-postgresml.md @@ -94,7 +94,7 @@ For both implementations, we can just pass in our data as part of the query for !!! code\_block time="4769.337 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I am so excited to benchmark deep learning models in SQL. I can not wait to see the results!' @@ -124,7 +124,7 @@ The first time `transform` is run with a particular model name, it will download !!! code\_block time="45.094 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I don''t really know if 5 seconds is fast or slow for deep learning. How much time is spent downloading vs running the model?' @@ -154,7 +154,7 @@ SELECT pgml.transform( !!! code\_block time="165.036 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'Are GPUs really worth it? Sometimes they are more expensive than the rest of the computer combined.' @@ -209,7 +209,7 @@ psql postgres://mindsdb:123@127.0.0.1:55432 And turn timing on to see how long it takes to run the same query: -```sql +```postgresql \timing on ``` diff --git a/pgml-cms/blog/pg-stat-sysinfo-a-postgres-extension-for-querying-system-statistics.md b/pgml-cms/blog/pg-stat-sysinfo-a-postgres-extension-for-querying-system-statistics.md index bb14ff2dd..b50572ea0 100644 --- a/pgml-cms/blog/pg-stat-sysinfo-a-postgres-extension-for-querying-system-statistics.md +++ b/pgml-cms/blog/pg-stat-sysinfo-a-postgres-extension-for-querying-system-statistics.md @@ -62,7 +62,7 @@ All system statistics are stored together in this one structure. !!! 
code\_block -```sql +```postgresql SELECT * FROM pg_stat_sysinfo WHERE metric = 'load_average' AND at BETWEEN '2023-04-07 19:20:09.3' @@ -97,7 +97,7 @@ In the case of the load average, we could handle this situation by having a tabl !!! code\_block -```sql +```postgresql CREATE TABLE load_average ( at timestamptz NOT NULL DEFAULT now(), "1m" float4 NOT NULL, @@ -112,7 +112,7 @@ This structure is fine for `load_average` but wouldn't work for CPU, disk, RAM o !!! code\_block -```sql +```postgresql CREATE TABLE load_average ( at timestamptz NOT NULL DEFAULT now(), "1m" float4 NOT NULL, @@ -132,7 +132,7 @@ This has the disadvantage of baking in a lot of keys and the overall structure o !!! code\_block -```sql +```postgresql CREATE TABLE load_average ( at timestamptz NOT NULL DEFAULT now(), "1m" float4 NOT NULL, diff --git a/pgml-cms/blog/postgres-full-text-search-is-awesome.md b/pgml-cms/blog/postgres-full-text-search-is-awesome.md index c1cab12b5..4ef6e9db8 100644 --- a/pgml-cms/blog/postgres-full-text-search-is-awesome.md +++ b/pgml-cms/blog/postgres-full-text-search-is-awesome.md @@ -54,7 +54,7 @@ These queries can execute in milliseconds on large production-sized corpora with The following full blown example is for demonstration purposes only of a 3rd generation search engine. You can test it for real in the PostgresML Gym to build up a complete understanding. -```sql +```postgresql WITH query AS ( -- construct a query context with arguments that would typically be -- passed in from the application layer diff --git a/pgml-cms/blog/postgresml-is-moving-to-rust-for-our-2.0-release.md b/pgml-cms/blog/postgresml-is-moving-to-rust-for-our-2.0-release.md index 623d3e006..eff3ee084 100644 --- a/pgml-cms/blog/postgresml-is-moving-to-rust-for-our-2.0-release.md +++ b/pgml-cms/blog/postgresml-is-moving-to-rust-for-our-2.0-release.md @@ -27,7 +27,7 @@ Python is generally touted as fast enough for machine learning, and is the de fa To illustrate our motivation, we'll create a test set of 10,000 random embeddings with 128 dimensions, and store them in a table. Our first benchmark will simulate semantic ranking, by computing the dot product against every member of the test set, sorting the results and returning the top match. -```sql +```postgresql -- Generate 10,000 embeddings with 128 dimensions as FLOAT4[] type. 
CREATE TABLE embeddings AS SELECT ARRAY_AGG(random())::FLOAT4[] AS vector @@ -39,7 +39,7 @@ Spoiler alert: idiomatic Rust is about 10x faster than native SQL, embedded PL/p {% tabs %} {% tab title="SQL" %} -```sql +```postgresql CREATE OR REPLACE FUNCTION dot_product_sql(a FLOAT4[], b FLOAT4[]) RETURNS FLOAT4 LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS @@ -49,7 +49,7 @@ $$ $$; ``` -```sql +```postgresql WITH test AS ( SELECT ARRAY_AGG(random())::FLOAT4[] AS vector FROM generate_series(1, 128) i @@ -62,7 +62,7 @@ LIMIT 1; {% endtab %} {% tab title="PL/pgSQL" %} -```sql +```postgresql CREATE OR REPLACE FUNCTION dot_product_plpgsql(a FLOAT4[], b FLOAT4[]) RETURNS FLOAT4 LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS @@ -74,7 +74,7 @@ $$ $$; ``` -```sql +```postgresql WITH test AS ( SELECT ARRAY_AGG(random())::FLOAT4[] AS vector FROM generate_series(1, 128) i @@ -87,7 +87,7 @@ LIMIT 1; {% endtab %} {% tab title="Python" %} -```sql +```postgresql CREATE OR REPLACE FUNCTION dot_product_python(a FLOAT4[], b FLOAT4[]) RETURNS FLOAT4 LANGUAGE plpython3u IMMUTABLE STRICT PARALLEL SAFE AS @@ -96,7 +96,7 @@ $$ $$; ``` -```sql +```postgresql WITH test AS ( SELECT ARRAY_AGG(random())::FLOAT4[] AS vector FROM generate_series(1, 128) i @@ -109,7 +109,7 @@ LIMIT 1; {% endtab %} {% tab title="NumPy" %} -```sql +```postgresql CREATE OR REPLACE FUNCTION dot_product_numpy(a FLOAT4[], b FLOAT4[]) RETURNS FLOAT4 LANGUAGE plpython3u IMMUTABLE STRICT PARALLEL SAFE AS @@ -119,7 +119,7 @@ $$ $$; ``` -```sql +```postgresql WITH test AS ( SELECT ARRAY_AGG(random())::FLOAT4[] AS vector FROM generate_series(1, 128) i @@ -144,7 +144,7 @@ fn dot_product_rust(vector: Vec, other: Vec) -> f32 { } ``` -```sql +```postgresql WITH test AS ( SELECT ARRAY_AGG(random())::FLOAT4[] AS vector FROM generate_series(1, 128) i @@ -203,7 +203,7 @@ The results are somewhat staggering. We didn't spend any time intentionally opti ## Preserving Backward Compatibility -```sql +```postgresql SELECT pgml.train( project_name => 'Handwritten Digit Classifier', task => 'classification', @@ -213,7 +213,7 @@ SELECT pgml.train( ); ``` -```sql +```postgresql SELECT pgml.predict('Handwritten Digit Classifier', image) FROM pgml.digits; ``` diff --git a/pgml-cms/blog/sentiment-analysis-using-express-js-and-postgresml.md b/pgml-cms/blog/sentiment-analysis-using-express-js-and-postgresml.md index 73407bcf7..56f836db3 100644 --- a/pgml-cms/blog/sentiment-analysis-using-express-js-and-postgresml.md +++ b/pgml-cms/blog/sentiment-analysis-using-express-js-and-postgresml.md @@ -65,7 +65,7 @@ We also have three endpoints to hit: * `app.get(“/", async (req, res, next)` which returns all the notes for that day and the daily summary. * `app.post(“/add", async (req, res, next)` which accepts a new note entry and performs a sentiment analysis. We simplify the score by converting it to 1, 0, -1 for positive, neutral, negative and save it in our notes table. -```sql +```postgresql WITH note AS ( SELECT pgml.transform( inputs => ARRAY['${req.body.note}'], @@ -88,7 +88,7 @@ INSERT INTO notes (note, score) VALUES ('${req.body.note}', (SELECT score FROM s * `app.get(“/analyze”, async (req, res, next)` which takes the daily entries, produces a summary and total sentiment score, and places that into our days table. 
-```sql +```postgresql WITH day AS ( SELECT note, diff --git a/pgml-cms/blog/the-1.0-sdk-is-here.md b/pgml-cms/blog/the-1.0-sdk-is-here.md index 185f0d438..94464d566 100644 --- a/pgml-cms/blog/the-1.0-sdk-is-here.md +++ b/pgml-cms/blog/the-1.0-sdk-is-here.md @@ -119,7 +119,7 @@ print(results) The SQL for the vector\_search is actually just: -```sql +```postgresql WITH "pipeline" ( "schema" ) AS ( diff --git a/pgml-cms/blog/which-database-that-is-the-question.md b/pgml-cms/blog/which-database-that-is-the-question.md index 2f9908807..bc0835a27 100644 --- a/pgml-cms/blog/which-database-that-is-the-question.md +++ b/pgml-cms/blog/which-database-that-is-the-question.md @@ -57,7 +57,7 @@ Most importantly though, Postgres allows you to understand your data and your bu Understanding your business is good, but what if you could improve it too? Most are tempted to throw spaghetti against the wall (and that's okay), but machine learning allows for a more scientific approach. Traditionally, ML has been tough to use with modern data architectures: using key-value databases makes data virtually inaccessible in bulk. With PostgresML though, you can train an XGBoost model directly on your orders table with a single SQL query: -```sql +```postgresql SELECT pgml.train( 'Orders Likely To Be Returned', -- name of your model 'regression', -- objective (regression or classification) diff --git a/pgml-cms/docs/README.md b/pgml-cms/docs/README.md index 1d993a933..fe5f9df15 100644 --- a/pgml-cms/docs/README.md +++ b/pgml-cms/docs/README.md @@ -4,7 +4,7 @@ description: The key concepts that make up PostgresML. # Overview -PostgresML is a complete MLOps platform built on PostgreSQL. Our operating principle is: +PostgresML is a complete MLOps platform built inside PostgreSQL. Our operating principle is: > _Move models to the database, rather than constantly moving data to the models._ diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 686688728..5fe8dad33 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -55,8 +55,8 @@ * [Embeddings](guides/embeddings/README.md) * [In-database Generation](guides/embeddings/in-database-generation.md) - * [Dimensionality Reduction]() - * [Re-ranking nearest neighbors]() + * [Dimensionality Reduction](guides/embeddings/dimensionality-reduction.md) + * [Re-ranking nearest neighbors](guides/embeddings/re-ranking-nearest-neighbors.md) * [Indexing w/ pgvector]() * [Aggregation](guides/embeddings/vector-aggregation.md) * [Similarity](guides/embeddings/vector-similarity.md) diff --git a/pgml-cms/docs/api/sql-extension/pgml.chunk.md b/pgml-cms/docs/api/sql-extension/pgml.chunk.md index fa2380057..897889f89 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.chunk.md +++ b/pgml-cms/docs/api/sql-extension/pgml.chunk.md @@ -8,7 +8,7 @@ Chunks are pieces of documents split using some specified splitter. 
This is typi ## API -```sql +```postgresql pgml.chunk( splitter TEXT, -- splitter name text TEXT, -- text to embed @@ -18,21 +18,21 @@ pgml.chunk( ## Example -```sql +```postgresql SELECT pgml.chunk('recursive_character', 'test'); ``` -```sql +```postgresql SELECT pgml.chunk('recursive_character', 'test', '{"chunk_size": 1000, "chunk_overlap": 40}'::jsonb); ``` -```sql +```postgresql SELECT pgml.chunk('markdown', '# Some test'); ``` Note that the input text for those splitters is so small it isn't splitting it at all, a real world example would look more like: -```sql +```postgresql SELECT pgml.chunk('recursive_character', content) FROM documents; ``` diff --git a/pgml-cms/docs/api/sql-extension/pgml.decompose.md b/pgml-cms/docs/api/sql-extension/pgml.decompose.md index 9e1f37484..94db1ac91 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.decompose.md +++ b/pgml-cms/docs/api/sql-extension/pgml.decompose.md @@ -8,7 +8,7 @@ Matrix decomposition reduces the number of dimensions in a vector, to improve re ## API -```sql +```postgresql pgml.decompose( project_name TEXT, -- project name vector REAL[] -- features to decompose @@ -24,6 +24,6 @@ pgml.decompose( ## Example -```sql +```postgresql SELECT pgml.decompose('My PCA', ARRAY[0.1, 2.0, 5.0]); ``` diff --git a/pgml-cms/docs/api/sql-extension/pgml.deploy.md b/pgml-cms/docs/api/sql-extension/pgml.deploy.md index 3181f9d51..645d99e6e 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.deploy.md +++ b/pgml-cms/docs/api/sql-extension/pgml.deploy.md @@ -12,7 +12,7 @@ A model is automatically deployed and used for predictions if its key metric (_R ## API -```sql +```postgresql pgml.deploy( project_name TEXT, strategy TEXT DEFAULT 'best_score', @@ -46,7 +46,7 @@ The default deployment behavior allows any algorithm to qualify. It's automatica #### SQL -```sql +```postgresql SELECT * FROM pgml.deploy( 'Handwritten Digit Image Classifier', strategy => 'best_score' @@ -55,7 +55,7 @@ SELECT * FROM pgml.deploy( #### Output -```sql +```postgresql project | strategy | algorithm ------------------------------------+------------+----------- Handwritten Digit Image Classifier | best_score | xgboost @@ -68,7 +68,7 @@ Deployment candidates can be restricted to a specific algorithm by including the #### SQL -```sql +```postgresql SELECT * FROM pgml.deploy( project_name => 'Handwritten Digit Image Classifier', strategy => 'best_score', @@ -78,7 +78,7 @@ SELECT * FROM pgml.deploy( #### Output -```sql +```postgresql project_name | strategy | algorithm ------------------------------------+----------------+---------------- Handwritten Digit Image Classifier | classification | svm @@ -91,7 +91,7 @@ In case the new model isn't performing well in production, it's easy to rollback #### Rollback -```sql +```postgresql SELECT * FROM pgml.deploy( 'Handwritten Digit Image Classifier', strategy => 'rollback' @@ -100,7 +100,7 @@ SELECT * FROM pgml.deploy( #### Output -```sql +```postgresql project | strategy | algorithm ------------------------------------+----------+----------- Handwritten Digit Image Classifier | rollback | linear @@ -111,7 +111,7 @@ SELECT * FROM pgml.deploy( Rollbacks are actually new deployments, so issuing two rollbacks in a row will leave you back with the original model, making rollback safely undoable. 
-```sql +```postgresql SELECT * FROM pgml.deploy( 'Handwritten Digit Image Classifier', strategy => 'rollback' @@ -120,7 +120,7 @@ SELECT * FROM pgml.deploy( #### Output -```sql +```postgresql project | strategy | algorithm ------------------------------------+----------+----------- Handwritten Digit Image Classifier | rollback | xgboost @@ -133,13 +133,13 @@ In the case you need to deploy an exact model that is not the `most_recent` or ` #### SQL -```sql +```postgresql SELECT * FROM pgml.deploy(12); ``` #### Output -```sql +```postgresql project | strategy | algorithm ------------------------------------+----------+----------- Handwritten Digit Image Classifier | specific | xgboost diff --git a/pgml-cms/docs/api/sql-extension/pgml.embed.md b/pgml-cms/docs/api/sql-extension/pgml.embed.md index 1f9b946b5..43da6120e 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.embed.md +++ b/pgml-cms/docs/api/sql-extension/pgml.embed.md @@ -10,7 +10,7 @@ The `pgml.embed()` function generates [embeddings](/docs/use-cases/embeddings/) ## API -```sql +```postgresql pgml.embed( transformer TEXT, "text" TEXT, @@ -34,10 +34,10 @@ Creating an embedding from text is as simple as calling the function with the te {% tab title="SQL" %} ```postgresql -SELECT * FROM pgml.embed( - 'intfloat/e5-small', +SELECT pgml.embed( + 'intfloat/e5-small-v2', 'No, that''s not true, that''s impossible.' -) AS star_wars_embedding; +); ``` {% endtab %} @@ -47,52 +47,35 @@ SELECT * FROM pgml.embed( SQL functions can be used as part of a query to insert, update, or even automatically generate column values of any table: -{% tabs %} -{% tab title="SQL" %} - ```postgresql CREATE TABLE star_wars_quotes ( quote TEXT NOT NULL, embedding vector(384) GENERATED ALWAYS AS ( - pgml.embed('intfloat/e5-small', quote) + pgml.embed('intfloat/e5-small-v2', quote) ) STORED ); -INSERT INTO - star_wars_quotes (quote) +INSERT INTO star_wars_quotes (quote) VALUES -('I find your lack of faith disturbing'), -('I''ve got a bad feeling about this.'), -('Do or do not, there is no try.'); + ('I find your lack of faith disturbing'), + ('I''ve got a bad feeling about this.'), + ('Do or do not, there is no try.'); ``` -{% endtab %} -{% endtabs %} - In this example, we're using [generated columns](https://www.postgresql.org/docs/current/ddl-generated-columns.html) to automatically create an embedding of the `quote` column every time the column value is updated. #### Using embeddings in queries Once you have embeddings, you can use them in queries to find text with similar semantic meaning: -{% tabs %} -{% tab title="SQL" %} - ```postgresql -SELECT - quote -FROM - star_wars_quotes -ORDER BY - pgml.embed( - 'intfloat/e5-small', +SELECT quote +FROM star_wars_quotes +ORDER BY pgml.embed( + 'intfloat/e5-small-v2', 'Feel the force!', - ) <=> embedding - DESC + ) <=> embedding DESC LIMIT 1; ``` -{% endtab %} -{% endtabs %} - This query will return the quote with the most similar meaning to `'Feel the force!'` by generating an embedding of that quote and comparing it to all other embeddings in the table, using vector cosine similarity as the measure of distance. diff --git a/pgml-cms/docs/api/sql-extension/pgml.predict/README.md b/pgml-cms/docs/api/sql-extension/pgml.predict/README.md index ffcfc3043..71fed7a6c 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.predict/README.md +++ b/pgml-cms/docs/api/sql-extension/pgml.predict/README.md @@ -10,7 +10,7 @@ description: >- The `pgml.predict()` function is the key value proposition of PostgresML. 
It provides online predictions using the best, automatically deployed model for a project. The API for predictions is very simple and only requires two arguments: the project name and the features used for prediction. -```sql +```postgresql select pgml.predict ( project_name TEXT, features REAL[] @@ -26,7 +26,7 @@ select pgml.predict ( ### Regression Example -```sql +```postgresql SELECT pgml.predict( 'My Classification Project', ARRAY[0.1, 2.0, 5.0] @@ -37,7 +37,7 @@ where `ARRAY[0.1, 2.0, 5.0]` is the same type of features used in training, in t !!! example -```sql +```postgresql SELECT *, pgml.predict( 'Buy it Again', @@ -59,7 +59,7 @@ LIMIT 25; If you've already been through the [pgml.train](../pgml.train "mention") examples, you can see the predictive results of those models: -```sql +```postgresql SELECT target, pgml.predict('Handwritten Digit Image Classifier', image) AS prediction @@ -67,7 +67,7 @@ FROM pgml.digits LIMIT 10; ``` -```sql +```postgresql target | prediction --------+------------ 0 | 0 @@ -87,11 +87,11 @@ LIMIT 10; Since it's so easy to train multiple algorithms with different hyperparameters, sometimes it's a good idea to know which deployed model is used to make predictions. You can find that out by querying the `pgml.deployed_models` view: -```sql +```postgresql SELECT * FROM pgml.deployed_models; ``` -```sql +```postgresql id | name | task | algorithm | runtime | deployed_at ----+------------------------------------+----------------+-----------+---------+---------------------------- 4 | Handwritten Digit Image Classifier | classification | xgboost | rust | 2022-10-11 13:06:26.473489 @@ -106,7 +106,7 @@ Take a look at [pgml.deploy.md](../pgml.deploy.md "mention") for more details. You may also specify a model\_id to predict rather than a project name, to use a particular training run. You can find model ids by querying the `pgml.models` table. -```sql +```postgresql SELECT models.id, models.algorithm, models.metrics FROM pgml.models JOIN pgml.projects @@ -114,7 +114,7 @@ JOIN pgml.projects WHERE projects.name = 'Handwritten Digit Image Classifier'; ``` -```sql +```postgresql id | algorithm | metrics ----+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------- @@ -125,7 +125,7 @@ recision": 0.9175060987472534, "score_time": 0.019625699147582054} For example, making predictions with `model_id = 1`: -```sql +```postgresql SELECT target, pgml.predict(1, image) AS prediction diff --git a/pgml-cms/docs/api/sql-extension/pgml.predict/batch-predictions.md b/pgml-cms/docs/api/sql-extension/pgml.predict/batch-predictions.md index 3f45c71c3..442454c27 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.predict/batch-predictions.md +++ b/pgml-cms/docs/api/sql-extension/pgml.predict/batch-predictions.md @@ -10,7 +10,7 @@ Many machine learning algorithms can benefit from calculating predictions in one The API for batch predictions is very similar to individual predictions, and only requires two arguments: the project name and the _aggregated_ features used for predictions. -```sql +```postgresql pgml.predict_batch( project_name TEXT, features REAL[] @@ -26,7 +26,7 @@ pgml.predict_batch( !!! 
example -```sql +```postgresql SELECT pgml.predict_batch( 'My First PostgresML Project', array_agg(ARRAY[0.1, 2.0, 5.0]) @@ -44,7 +44,7 @@ Batch predictions have to be fetched in a subquery or a CTE because they are usi \=== "SQL" -```sql +```postgresql WITH predictions AS ( SELECT pgml.predict_batch( 'My Classification Project', @@ -62,7 +62,7 @@ LIMIT 10; \=== "Output" -```sql +```postgresql prediction | target ------------+-------- 0 | 0 @@ -88,7 +88,7 @@ To perform a join on batch predictions, it's necessary to have a uniquely identi **Example** -```sql +```postgresql WITH predictions AS ( SELECT -- diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/README.md b/pgml-cms/docs/api/sql-extension/pgml.train/README.md index ec49916cc..9a8507ea9 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/README.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/README.md @@ -12,7 +12,7 @@ The training function is at the heart of PostgresML. It's a powerful single mech Most parameters are optional and have configured defaults. The `project_name` parameter is required and is an easily recognizable identifier to organize your work. -```sql +```postgresql pgml.train( project_name TEXT, task TEXT DEFAULT NULL, @@ -48,7 +48,7 @@ pgml.train( !!! example -```sql +```postgresql SELECT * FROM pgml.train( project_name => 'My Classification Project', task => 'classification', @@ -67,7 +67,7 @@ The first time it's called, the function will also require a `relation_name` and !!! tip -```sql +```postgresql SELECT * FROM pgml.train( 'My Classification Project', algorithm => 'xgboost' diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/classification.md b/pgml-cms/docs/api/sql-extension/pgml.train/classification.md index 24df21c49..82cc2f967 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/classification.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/classification.md @@ -10,7 +10,7 @@ description: >- This example trains models on the sklean digits dataset which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for classification. You could do something similar with a vector column. 
-```sql +```postgresql -- load the sklearn digits dataset SELECT pgml.load_dataset('digits'); @@ -46,7 +46,7 @@ We currently support classification algorithms from [scikit-learn](https://sciki #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}'); @@ -66,7 +66,7 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'catboost', hyperpar #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'ada_boost'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'bagging'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}'); @@ -85,7 +85,7 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'hist_gradient_boost #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'svm'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'nu_svm'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'linear_svm'); @@ -103,7 +103,7 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'linear_svm'); #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'ridge'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'stochastic_gradient_descent'); SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'perceptron'); @@ -118,6 +118,6 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'passive_aggressive' #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}'); ``` diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md b/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md index 5ecf0b552..5c0558dd7 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md @@ -6,7 +6,7 @@ Models can be trained using `pgml.train` on unlabeled data to identify groups wi This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for clustering. You could do something similar with a vector column. -```sql +```postgresql SELECT pgml.load_dataset('digits'); -- create an unlabeled table of the images for unsupervised learning @@ -38,7 +38,7 @@ All clustering algorithms implemented by PostgresML are online versions. 
You may ### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation'); SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}'); SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}'); diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/data-pre-processing.md b/pgml-cms/docs/api/sql-extension/pgml.train/data-pre-processing.md index 683343309..551e287f3 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/data-pre-processing.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/data-pre-processing.md @@ -31,7 +31,7 @@ There are 3 steps to preprocessing data: These preprocessing steps may be specified on a per-column basis to the [train()](./) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations. -```sql +```postgresql SELECT pgml.train( project_name => 'preprocessed_model', task => 'classification', @@ -60,7 +60,7 @@ In some cases, it may make sense to use multiple steps for a single column. For A model that has been trained with preprocessors should use a Postgres tuple for prediction, rather than a `FLOAT4[]`. Tuples may contain multiple different types (like `TEXT` and `BIGINT`), while an ARRAY may only contain a single type. You can use parenthesis around values to create a Postgres tuple. -```sql +```postgresql SELECT pgml.predict('preprocessed_model', ('jan', 'nimbus', 0.5, 7)); ``` @@ -79,7 +79,7 @@ Encoding categorical variables is an O(N log(M)) where N is the number of rows, Target encoding is a relatively efficient way to represent a categorical variable. The average value of the target is computed for each category in the training data set. It is reasonable to `scale` target encoded variables using the same method as other variables. -```sql +```postgresql preprocess => '{ "clouds": {"encode": "target" } }' @@ -131,7 +131,7 @@ preprocess => '{ | `max` | the maximum value of the variable in the training data set | | `zero` | replaces all missing values with 0.0 | -```sql +```postgresql preprocess => '{ "temp": {"impute": "mean"} }' @@ -149,7 +149,7 @@ Scaling all variables to a standardized range can help make sure that no feature | `max_abs` | Scales data from -1.0 to +1.0. Data will not be centered around 0, unless abs(min) == abs(max). | | `robust` | Scales data as a factor of the first and third quartiles. This method may handle outliers more robustly than others. | -```sql +```postgresql preprocess => '{ "temp": {"scale": "standard"} }' diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/decomposition.md b/pgml-cms/docs/api/sql-extension/pgml.train/decomposition.md index be8420df2..abe3b88ef 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/decomposition.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/decomposition.md @@ -6,7 +6,7 @@ Models can be trained using `pgml.train` on unlabeled data to identify important This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). 
This demonstrates using a table with a single array feature column for principal component analysis. You could do something similar with a vector column. -```sql +```postgresql SELECT pgml.load_dataset('digits'); -- create an unlabeled table of the images for unsupervised learning @@ -37,6 +37,6 @@ All decomposition algorithms implemented by PostgresML are online versions. You ### Examples -```sql +```postgresql SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'pca', hyperparams => '{"n_components": 10}'); ``` diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/hyperparameter-search.md b/pgml-cms/docs/api/sql-extension/pgml.train/hyperparameter-search.md index 4461963f1..8b0788f98 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/hyperparameter-search.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/hyperparameter-search.md @@ -12,7 +12,7 @@ The parameters passed to `pgml.train()` easily allow one to perform hyperparamet | `search_params` | `{"alpha": [0.1, 0.2, 0.5] }` | | `search_args` | `{"n_iter": 10 }` | -```sql +```postgresql SELECT * FROM pgml.train( 'Handwritten Digit Image Classifier', algorithm => 'xgboost', diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/joint-optimization.md b/pgml-cms/docs/api/sql-extension/pgml.train/joint-optimization.md index b65812045..3ad397249 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/joint-optimization.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/joint-optimization.md @@ -4,7 +4,7 @@ Some algorithms support joint optimization of the task across multiple outputs, To leverage multiple outputs in PostgresML, you'll need to substitute the standard usage of `pgml.train()` with `pgml.train_joint()`, which has the same API, except the notable exception of `y_column_name` parameter, which now accepts an array instead of a simple string. -```sql +```postgresql SELECT * FROM pgml.train_join( 'My Joint Project', task => 'regression', diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/regression.md b/pgml-cms/docs/api/sql-extension/pgml.train/regression.md index eb1a1d4de..9e9e8332c 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/regression.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/regression.md @@ -12,7 +12,7 @@ We currently support regression algorithms from [scikit-learn](https://scikit-le This example trains models on the sklean [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load\_diabetes.html#sklearn.datasets.load\_diabetes). This example uses multiple input features to predict a single output variable. 
-```sql +```postgresql -- load the dataset SELECT pgml.load_dataset('diabetes'); @@ -41,7 +41,7 @@ LIMIT 10; #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}'); @@ -61,7 +61,7 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperp #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}'); @@ -80,7 +80,7 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boo #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}'); @@ -108,7 +108,7 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hype #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso'); @@ -135,7 +135,7 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile'); #### Examples -```sql +```postgresql SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process'); ``` diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/fill-mask.md b/pgml-cms/docs/api/sql-extension/pgml.transform/fill-mask.md index 07775258f..6202b59b5 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/fill-mask.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/fill-mask.md @@ -11,7 +11,7 @@ Fill-Mask is a task where certain words in a sentence are hidden or "masked", an {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "fill-mask" diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/question-answering.md b/pgml-cms/docs/api/sql-extension/pgml.transform/question-answering.md index 9dfd41246..861a5afc3 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/question-answering.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/question-answering.md @@ -11,7 +11,7 @@ Question answering models are designed to retrieve the answer to a question from {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( 'question-answering', inputs => ARRAY[ diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/summarization.md b/pgml-cms/docs/api/sql-extension/pgml.transform/summarization.md index b37a406ec..ec0171a17 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/summarization.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/summarization.md @@ -11,7 +11,7 @@ Summarization involves creating a condensed version of a 
document that includes {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "summarization", diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md b/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md index eb670b267..e53f4952e 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md @@ -15,7 +15,7 @@ Sentiment analysis is a type of natural language processing technique which anal {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => 'text-classification', inputs => ARRAY[ @@ -50,7 +50,7 @@ For example, if you want to use a RoBERTa model trained on around 40,000 English {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -86,7 +86,7 @@ By selecting a model that has been specifically designed for a particular subjec {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -131,7 +131,7 @@ If you want to use an NLI model, you can find them on the Hugging Face. When sea {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -164,7 +164,7 @@ If you want to use an QNLI model, you can find them on the Hugging Face, by look {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -197,7 +197,7 @@ If you want to use an QQP model, you can find them on Hugging Face, by looking f {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", @@ -230,7 +230,7 @@ If you want to use a grammatical correctness model, you can find them on the Hug {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-classification", diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md b/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md index 8d84ca762..d04ba910b 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md @@ -6,7 +6,7 @@ description: Task of producing new text Text generation is the task of producing new text, such as filling in incomplete sentences or paraphrasing existing text. It has various use cases, including code generation and story generation. Completion generation models can predict the next word in a text sequence, while text-to-text generation models are trained to learn the mapping between pairs of texts, such as translating between languages. Popular models for text generation include GPT-based models, T5, T0, and BART. These models can be trained to accomplish a wide range of tasks, including text classification, summarization, and translation. -```sql +```postgresql SELECT pgml.transform( task => 'text-generation', inputs => ARRAY[ @@ -29,7 +29,7 @@ _Result_ To use a specific model from :hugging: model hub, pass the model name along with task name in task. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -53,7 +53,7 @@ _Result_ To make the generated text longer, you can include the argument `max_length` and specify the desired maximum length of the text. 
-```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -80,7 +80,7 @@ _Result_ If you want the model to generate more than one output, you can specify the number of desired output sequences by including the argument `num_return_sequences` in the arguments. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -111,7 +111,7 @@ _Result_ Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high probability word combinations. Beam search achieves this by retaining the num\_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses reached the EOS token. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -143,7 +143,7 @@ You can pass `do_sample = True` in the arguments to use sampling methods. It is ### _Temperature_ -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", @@ -167,7 +167,7 @@ _Result_ ### _Top p_ -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text-generation", diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/text-to-text-generation.md b/pgml-cms/docs/api/sql-extension/pgml.transform/text-to-text-generation.md index dc97021c7..76ea9cf8d 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/text-to-text-generation.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/text-to-text-generation.md @@ -4,7 +4,7 @@ Text-to-text generation methods, such as T5, are neural network architectures de _Translation_ -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text2text-generation" @@ -25,7 +25,7 @@ _Result_ Similar to other tasks, we can specify a model for text-to-text generation. -```sql +```postgresql SELECT pgml.transform( task => '{ "task" : "text2text-generation", diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/token-classification.md b/pgml-cms/docs/api/sql-extension/pgml.transform/token-classification.md index 6f90a04fb..ed1e73507 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/token-classification.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/token-classification.md @@ -10,7 +10,7 @@ Token classification is a task in natural language understanding, where labels a Named Entity Recognition (NER) is a task that involves identifying named entities in a text. These entities can include the names of people, locations, or organizations. The task is completed by labeling each token with a class for each named entity and a class named "0" for tokens that don't contain any entities. In this task, the input is text, and the output is the annotated text with named entities. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I am Omar and I live in New York City.' @@ -36,7 +36,7 @@ PoS tagging is a task that involves identifying the parts of speech, such as nou Look for models with `pos` to use a zero-shot classification model on the :hugs: Hugging Face model hub. -```sql +```postgresql select pgml.transform( inputs => array [ 'I live in Amsterdam.' 
diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/translation.md b/pgml-cms/docs/api/sql-extension/pgml.transform/translation.md index 874467b2f..0c0de9f2f 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/translation.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/translation.md @@ -6,7 +6,7 @@ description: Task of converting text written in one language into another langua Translation is the task of converting text written in one language into another language. You have the option to select from over 2000 models available on the Hugging Face [hub](https://huggingface.co/models?pipeline\_tag=translation) for translation. -```sql +```postgresql select pgml.transform( inputs => array[ 'How are you?' diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/zero-shot-classification.md b/pgml-cms/docs/api/sql-extension/pgml.transform/zero-shot-classification.md index 8d7e272e3..f0190e262 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.transform/zero-shot-classification.md +++ b/pgml-cms/docs/api/sql-extension/pgml.transform/zero-shot-classification.md @@ -10,7 +10,7 @@ In the example provided below, we will demonstrate how to classify a given sente Look for models with `mnli` to use a zero-shot classification model on the :hugs: Hugging Face model hub. -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I have a problem with my iphone that needs to be resolved asap!!' diff --git a/pgml-cms/docs/api/sql-extension/pgml.tune.md b/pgml-cms/docs/api/sql-extension/pgml.tune.md index 4c874893a..ec07b1242 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.tune.md +++ b/pgml-cms/docs/api/sql-extension/pgml.tune.md @@ -16,7 +16,7 @@ The [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) organization provides mo The [kde4](https://huggingface.co/datasets/kde4) dataset contains many language pairs. Subsets can be loaded into your Postgres instance with a call to `pgml.load_dataset`, or you may wish to create your own fine tuning dataset with vocabulary specific to your domain. -```sql +```postgresql SELECT pgml.load_dataset('kde4', kwargs => '{"lang1": "en", "lang2": "es"}'); ``` @@ -24,13 +24,13 @@ You can view the newly loaded data in your Postgres database: \=== "SQL" -```sql +```postgresql SELECT * FROM pgml.kde4 LIMIT 5; ``` \=== "Result" -```sql +```postgresql id | translation @@ -49,7 +49,7 @@ This HuggingFace dataset stores the data as language key pairs in a JSON documen \=== "SQL" -```sql +```postgresql CREATE OR REPLACE VIEW kde4_en_to_es AS SELECT translation->>'en' AS "en", translation->>'es' AS "es" FROM pgml.kde4 @@ -58,7 +58,7 @@ LIMIT 10; \=== "Result" -```sql +```postgresql CREATE VIEW ``` @@ -68,13 +68,13 @@ Now, we can see the data in more normalized form. The exact column names don't m \=== "SQL" -```sql +```postgresql SELECT * FROM kde4_en_to_es LIMIT 10; ``` \=== "Result" -```sql +```postgresql en | es --------------------------------------------------------------------------------------------+-------------------------------------------------------------------------- @@ -100,7 +100,7 @@ o de traducción de Babelfish. Tuning is very similar to training with PostgresML, although we specify a `model_name` to download from Hugging Face instead of the base `algorithm`. 
-```sql +```postgresql SELECT pgml.tune( 'Translate English to Spanish', task => 'translation', @@ -130,7 +130,7 @@ Translations use the `pgml.generate` API since they return `TEXT` rather than nu \=== "SQL" -```sql +```postgresql SELECT pgml.generate('Translate English to Spanish', 'I love SQL') AS spanish; @@ -138,7 +138,7 @@ AS spanish; \=== "Result" -```sql +```postgresql spanish ---------------- Me encanta SQL @@ -165,7 +165,7 @@ Once our model has been fine tuned on the dataset, it'll be saved and deployed w The IMDB dataset has 50,000 examples of user reviews with positive or negative viewing experiences as the labels, and is split 50/50 into training and evaluation datasets. -```sql +```postgresql SELECT pgml.load_dataset('imdb'); ``` @@ -173,13 +173,13 @@ You can view the newly loaded data in your Postgres database: \=== "SQL" -```sql +```postgresql SELECT * FROM pgml.imdb LIMIT 1; ``` \=== "Result" -```sql +```postgresql text | label -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------- This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny. | 1 @@ -192,7 +192,7 @@ SELECT * FROM pgml.imdb LIMIT 1; Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models) to start with, rather than training an algorithm from scratch. -```sql +```postgresql SELECT pgml.tune( 'IMDB Review Sentiment', task => 'text-classification', @@ -215,14 +215,14 @@ SELECT pgml.tune( \=== "SQL" -```sql +```postgresql SELECT pgml.predict('IMDB Review Sentiment', 'I love SQL') AS sentiment; ``` \=== "Result" -```sql +```postgresql sentiment ----------- 1 @@ -237,14 +237,14 @@ The default for predict in a classification problem classifies the statement as \=== "SQL" -```sql +```postgresql SELECT pgml.predict_proba('IMDB Review Sentiment', 'I love SQL') AS sentiment; ``` \=== "Result" -```sql +```postgresql sentiment ------------------------------------------- [0.06266672909259796, 0.9373332858085632] @@ -267,7 +267,7 @@ At a high level, summarization uses similar techniques to translation. Both use [BillSum](https://huggingface.co/datasets/billsum) is a dataset with training examples that summarize US Congressional and California state bills. You can pass `kwargs` specific to loading datasets, in this case we'll restrict the dataset to California samples: -```sql +```postgresql SELECT pgml.load_dataset('billsum', kwargs => '{"split": "ca_test"}'); ``` @@ -275,7 +275,7 @@ You can view the newly loaded data in your Postgres database: \=== "SQL" -```sql +```postgresql SELECT * FROM pgml.billsum LIMIT 1; ``` @@ -362,14 +362,14 @@ This act provides for a tax levy within the meaning of Article IV of the Constit This dataset has 3 fields, but summarization transformers only take a single input to produce their output. 
We can create a view that simply omits the `title` from the training data: -```sql +```postgresql CREATE OR REPLACE VIEW billsum_training_data AS SELECT "text", summary FROM pgml.billsum; ``` Or, it might be interesting to concat the title to the text field to see how relevant it actually is to the bill. If the title of a bill is the first sentence, and doesn't appear in summary, it may indicate that it's a poorly chosen title for the bill: -```sql +```postgresql CREATE OR REPLACE VIEW billsum_training_data AS SELECT title || '\n' || "text" AS "text", summary FROM pgml.billsum LIMIT 10; @@ -379,7 +379,7 @@ LIMIT 10; Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models) to start with, rather than training an algorithm from scratch. -```sql +```postgresql SELECT pgml.tune( 'Legal Summarization', task => 'summarization', @@ -403,13 +403,13 @@ SELECT pgml.tune( \=== "SQL" -```sql +```postgresql SELECT pgml.predict('IMDB Review Sentiment', 'I love SQL') AS sentiment; ``` \=== "Result" -```sql +```postgresql sentiment ----------- 1 @@ -424,13 +424,13 @@ The default for predict in a classification problem classifies the statement as \=== "SQL" -```sql +```postgresql SELECT pgml.predict_proba('IMDB Review Sentiment', 'I love SQL') AS sentiment; ``` \=== "Result" -```sql +```postgresql sentiment ------------------------------------------- [0.06266672909259796, 0.9373332858085632] @@ -447,7 +447,7 @@ See the [task documentation](https://huggingface.co/tasks/text-classification) f ### Text Generation -```sql +```postgresql SELECT pgml.load_dataset('bookcorpus', "limit" => 100); SELECT pgml.tune( diff --git a/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md new file mode 100644 index 000000000..7c25516f4 --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md @@ -0,0 +1,6 @@ + +# Dimensionality Reduction + +## Introduction + +## Principal Component Analysis diff --git a/pgml-cms/docs/guides/embeddings/in-database-generation.md b/pgml-cms/docs/guides/embeddings/in-database-generation.md index c4badaa7e..7f885cbe9 100644 --- a/pgml-cms/docs/guides/embeddings/in-database-generation.md +++ b/pgml-cms/docs/guides/embeddings/in-database-generation.md @@ -1,328 +1,221 @@ # In-database Embedding Generation -PostgresML makes it easy to generate embeddings from text in your database using a large selection of state-of-the-art models with one simple call to **`pgml.embed`**`(model_name, text)`. +Embedding generation is a process of transforming high-dimensional data into dense vectors of fixed size, which can be used for various machine learning tasks. PostgresML makes it easy to generate embeddings from text in your database using state-of-the-art models with the native function **`pgml.embed`**`(model_name, text)`, leveraging the computational power of local GPUs. ## Introduction -Different models have been trained on different types of text and with different algorithms. Each one has its own tradeoffs, generally latency vs quality, although recent progress in the LLMs . +Different models have been trained on different types of text and with different algorithms. Each one has its own tradeoffs, generally latency vs quality, although recent progress in the LLMs. 
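To get a concrete feel for the latency side of that tradeoff, you can time the same input against a small and a large model directly from `psql`. This is a minimal sketch rather than a benchmark: it assumes both models from the selection table below are available on your instance, and the sample sentence is only an illustration.

```postgresql
-- Turn on client-side timing in psql, then embed the same text with two
-- differently sized models to compare end-to-end latency.
\timing on

SELECT pgml.embed('intfloat/e5-small-v2', 'A short sentence to embed');
SELECT pgml.embed('mixedbread-ai/mxbai-embed-large-v1', 'A short sentence to embed');
```

Latency is only half of the story; for quality comparisons, the Massive Text Embedding Benchmark referenced below is a better guide.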
-You can train your own model from scratch, or download

+## Benefits of in-database processing
+PostgresML cloud databases include GPU hardware to run state-of-the-art models for embedding generation within the database environment, among other ML/AI workloads. This contrasts with network calls, where data must be sent to an external service for processing. If you're running PostgresML on your own hardware, it's important to configure it correctly, or to choose an embedding model that will run efficiently on a CPU.

-## It always starts with data

+- **Reduced Latency**: Local computation eliminates the need for network calls, significantly reducing latency.
+- **Enhanced Security**: Data remains within the database, enhancing security by minimizing exposure.
+- **Cost-Effectiveness**: Utilizing local hardware can be more cost-effective than relying on external services, especially for large-scale operations.

-Most general purpose databases are full of all sorts of great data for machine learning use cases. Text data has historically been more difficult to deal with using complex Natural Language Processing techniques, but embeddings created from open source models can effectively turn unstructured text into structured features, perfect for more straightforward implementations.

+GPU-accelerated models can compute embeddings in sub-millisecond timeframes when batching, which means that even _in-datacenter_ processing is orders of magnitude more expensive than _in-database_ processing, in both latency and cost, because of the networking overhead. Using a hosted service to generate embeddings outside of your datacenter is even less efficient, given the additional transport overhead.

-In this example, we'll demonstrate how to generate embeddings for products on an e-commerce site. We'll use a public dataset of millions of product reviews from the [Amazon US Reviews](https://huggingface.co/datasets/amazon\_us\_reviews). It includes the product title, a text review written by a customer and some additional metadata about the product, like category. With just a few pieces of data, we can create a full-featured and personalized product search and recommendation engine, using both generic embeddings and later, additional fine-tuned models trained with PostgresML.

+## Model Selection

-PostgresML includes a convenience function for loading public datasets from [HuggingFace](https://huggingface.co/datasets) directly into your database. To load the DVD subset of the Amazon US Reviews dataset into your database, run the following command:

+There are many excellent pre-trained open-weight models available for download from HuggingFace. PostgresML serverless instances run with the following models available with instant autoscaling:

-!!! code\_block

+| Model                                                                                            | Parameters (M) | Strengths                      |
+|--------------------------------------------------------------------------------------------------|----------------|--------------------------------|
+| [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2)                              | 33.4           | High quality, lowest latency   |
+| [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)  | 335            | Higher quality, higher latency |
+| [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)            | 434            | Supports up to 8k token inputs |

-```postgresql
-SELECT *
-FROM pgml.load_dataset('amazon_us_reviews', 'Video_DVD_v1_00');
-```
-!!!

+If you'd like to use a different model, you can also provision dedicated resources for it.
The [Massive Text Embedding Benchmark](https://huggingface.co/spaces/mteb/leaderboard) is a helpful resource provided by HuggingFace that maintains up-to-date rankings on the latest models. -It took about 23 minutes to download the 7.1GB raw dataset with 5,069,140 rows into a table within the `pgml` schema (where all PostgresML functionality is name-spaced). Once it's done, you can see the table structure with the following command: +## Creating Embeddings + +You can generate embeddings using [pgml.embed(model_name, text)](../../api/sql-extension/pgml.embed.md). For example: !!! generic -!!! code\_block +!!! code_block time="12.029 ms" ```postgresql -\d pgml.amazon_us_reviews +SELECT pgml.embed('intfloat/e5-small-v2', 'This is some text to embed'); ``` !!! !!! results -| Column | Type | Collation | Nullable | Default | -| ------------------ | ------- | --------- | -------- | ------- | -| marketplace | text | | | | -| customer\_id | text | | | | -| review\_id | text | | | | -| product\_id | text | | | | -| product\_parent | text | | | | -| product\_title | text | | | | -| product\_category | text | | | | -| star\_rating | integer | | | | -| helpful\_votes | integer | | | | -| total\_votes | integer | | | | -| vine | bigint | | | | -| verified\_purchase | bigint | | | | -| review\_headline | text | | | | -| review\_body | text | | | | -| review\_date | text | | | | - -!!! - -!!! - -Let's take a peek at the first 5 rows of data: - -!!! code\_block - ```postgresql -SELECT * -FROM pgml.amazon_us_reviews -LIMIT 5; +{-0.080910146,0.033980247,0.052564066,0.0020346553,-0.03936229,0.031479727,0.0685036,-0.06294509,-0.024574954,0.040237393,0.051508162,0.0038814095,-0.010645757,0.020144403,0.031223888,-0.04440482,0.020333821,0.07103317,-0.12705344,0.030591827,0.07019173,-0.036886554,-0.012233759,-0.07092232,0.0027690812,-0.0020539823,0.040779375,0.05908495,-0.026025668,-0.08242788,-0.018558107,-0.0094666025,0.059807047,-0.02525427,0.103207916,-0.068966456,-0.039847758,0.04071019,0.04450286,0.03424993,-0.06227554,-0.055733517,0.054585237,-0.060373828,-0.024653753,0.009867895,-0.041141387,-0.08721736,0.08264962,-0.0031608255,-0.012134463,-0.014921003,0.04267465,0.029093502,0.058714338,0.023871746,0.027041607,0.05843493,0.04142925,0.09514731,-0.030493727,0.07500542,-0.11280806,0.10281551,0.055736117,0.061823647,-0.020118464,0.014440284,-0.08269981,0.0040008957,-0.018531831,-0.008568512,-0.046970874,0.04578424,-0.039577056,0.08775033,-0.008210567,0.051924113,-0.04171466,-0.0367731,-0.01827072,0.0069318637,-0.047051124,0.033687923,0.0075546373,-0.037275027,0.043123465,-0.045893792,-0.036658753,-0.040635854,-0.03440536,0.0011549098,0.042740136,-0.025120102,-0.017873302,-0.039899718,0.031648446,0.0068402113,0.02402832,0.089285314,0.017456057,0.012008715,0.0076218387,-0.07197755,-0.038144454,-0.05969434,0.0389503,-0.0058245854,0.01937407,-0.018212182,-0.06195428,-0.038283527,-0.01753182,-0.023789542,0.07097847,0.04855445,-0.05200343,-0.009433737,-0.010195946,0.00442146,0.043388885,-0.013206756,0.03384104,0.0052567925,0.10585855,-0.08633147,0.05733634,0.046828035,0.111744046,-0.016215837,0.031619936,-0.0007159129,-0.0209652,-0.015532438,-0.06690792,-0.0091873575,-0.044681326,-0.007757966,0.053561073,-0.011261849,-0.03140146,-0.050118096,-0.031356297,-0.124189764,0.024152948,0.02993825,-0.07240996,0.01793813,-0.070896275,-0.024419364,-0.040071633,-0.026535412,0.027830372,0.021783136,-0.0075028464,0.014013486,-0.005176842,0.044899847,-0.068266265,-0.024272943,-0.104513876,-0.007814491,0.06390731,0.10318874,0.0824
9727,-0.092428714,0.0062611965,-0.0115522025,0.056004044,-0.043695573,-0.0010207174,0.013102924,-0.0035022667,0.0025919478,0.12973104,-0.053112745,-0.008374208,-0.022599943,0.04597443,-0.074845895,0.07259128,-0.062168732,-0.03033916,0.03646452,0.033044446,-0.040221635,-0.060735658,-0.040255345,0.013989559,-0.026528435,-0.059659433,-0.0010745272,-0.02860176,0.073617734,0.009127505,0.012357427,-0.024373775,-0.07039051,-0.038225688,-0.07232986,0.06928063,0.06729482,-0.07500053,0.0036577163,-0.03904865,0.09585222,0.035453793,-0.0061846063,-0.05000263,-0.050227694,-0.022932036,-0.0073578595,-0.034768302,-0.038604897,-0.01470962,-0.04274356,-0.01689811,0.04931222,0.010990732,0.019879386,0.01243605,-0.07632878,-0.070137314,-0.15282577,-0.020428825,-0.030160243,-0.0050396603,0.007732285,-0.032149784,-0.015778365,0.07480648,0.017192233,0.024550207,0.06951421,-0.014848112,-0.05396024,-0.03223639,0.04666939,0.012844642,-0.05892448,-0.030294335,0.06794056,-0.063875966,-0.046530016,-0.07084713,-0.031829637,-0.047059055,0.08617301,-0.05032479,0.118310556,0.04755146,-0.028393123,-0.024320556,0.030537084,0.020449162,0.05665035,-0.075432904,0.07822404,-0.07196871,0.010495469,0.05382172,-0.0016319404,-0.087258086,0.0930253,-0.01846083,0.0033103244,-0.08890738,0.071200974,-0.03997286,-0.005042026,0.011910354,-0.025650134,0.054577664,-0.0014927471,-0.047521923,0.049124297,0.006342861,-0.089150384,-0.0073342607,-0.07849969,0.0010329112,-0.038727123,0.016429648,-0.086470395,-4.8742084e-05,0.060051307,0.0033317064,0.006863758,0.0446841,-0.031092882,0.017449407,-0.07479843,-0.058406148,-0.012044445,0.08927765,-0.04008159,0.05227031,0.021864118,0.054245688,0.027357962,0.02569578,-0.06151034,-0.05588746,-0.034790445,-0.020313034,0.03713666,0.025836824,0.039398894,0.02515884,-0.008512022,-0.014856683,0.037740804,-0.06471344,0.029907772,0.0077477624,0.061302595,0.037709966,-0.032406874,-0.049870085,-0.15800017,-0.014624413,0.018514019,-0.010369406,-0.022790398,0.009587365,0.03241724,-0.02795245,-0.05280684,-0.031362813,0.047515675,0.009669598,0.09689132,-0.038499177,-0.019239947,0.06885492,0.08843166,-0.027636368,-0.058589518,-0.11492329,0.036349587,0.03926196,0.16907486,0.036197387,-0.0128475325,0.05160944,0.0034505632,0.016367715,0.068978526,0.0676247,0.0064224014,-0.06316567,0.11720159,0.005348484,0.05403974,0.061581556,-0.027833184,0.05563025,0.03337182,-0.030032963,0.06838953,0.08052612,-0.01996433,0.006692282,0.11277913,0.03004468,-0.063005574,-0.024108425,-0.03547973,0.0060482216,-0.0032331524,-0.038302638,0.083412275,0.07387719,0.052097928,-0.037775334,-0.05458932,0.0004270608,-0.034030076,-0.07965879,0.012511749,-0.028165875,0.03768439,0.00082042674,0.053660177} ``` -!!! 
results - -| marketplace | customer\_id | review\_id | product\_id | product\_parent | product\_title | product\_category | star\_rating | helpful\_votes | total\_votes | vine | verified\_purchase | review\_headline | review\_body | review\_date | -| ----------- | ------------ | -------------- | ----------- | --------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------ | -------------- | ------------ | ---- | ------------------ | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | -| US | 27288431 | R33UPQQUZQEM8 | B005T4ND06 | 400024643 | Yoga for Movement Disorders DVD: Rebuilding Strength, Balance, and Flexibility for Parkinson's Disease and Dystonia | Video DVD | 5 | 3 | 3 | 0 | 1 | This was a gift for my aunt who has Parkinson's ... | This was a gift for my aunt who has Parkinson's. While I have not previewed it myself, I also have not gotten any complaints. My prior experiences with yoga tell me this should be just what the doctor ordered. | 2015-08-31 | -| US | 13722556 | R3IKTNQQPD9662 | B004EPZ070 | 685335564 | Something Borrowed | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Teats my heart out. | 2015-08-31 | -| US | 20381037 | R3U27V5QMCP27T | B005S9EKCW | 922008804 | Les Miserables (2012) \[Blu-ray] | Video DVD | 5 | 1 | 1 | 0 | 1 | Great movie! | Great movie. | 2015-08-31 | -| US | 24852644 | R2TOH2QKNK4IOC | B00FC1ZCB4 | 326560548 | Alien Anthology and Prometheus Bundle \[Blu-ray] | Video DVD | 5 | 0 | 1 | 0 | 1 | Amazing | My husband was so excited to receive these as a gift! Great picture quality and great value! | 2015-08-31 | -| US | 15556113 | R2XQG5NJ59UFMY | B002ZG98Z0 | 637495038 | Sex and the City 2 | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Love this series. | 2015-08-31 | - !!! !!! -## Generating embeddings from natural language text - -PostgresML provides a simple interface to generate embeddings from text in your database. You can use the [`pgml.embed`](https://postgresml.org/docs/transformers/embeddings) function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can see a list of potential good candidate models to generate embeddings on the [Massive Text Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard). - -Since our corpus of documents (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small`](https://huggingface.co/intfloat/e5-small) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models. - -It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast. - -Note how we prefix the text we want to embed with either `passage:` or `query:` , the e5 model requires us to prefix our data with `passage:` if we're generating embeddings for our corpus and `query:` if we want to find semantically similar content. +A database typically holds the text data used to generate the embeddings in a table. 
We'll use `documents` as an example. ```postgresql -SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom'); +CREATE TABLE documents ( + id SERIAL PRIMARY KEY, + body TEXT +); ``` -This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres: +Inserting some example data: ```postgresql -\timing on +INSERT INTO documents (body) +VALUES + ('Example text data'), + ('Another example document'), + ('Some other thing'); ``` -Aside from using this function with strings passed from a client, we can use it on strings already present in our database tables by calling **pgml.embed** on columns. For example, we can generate an embedding for the first review using a pretty simple query: +Passing the data from the table to the embedding function: !!! generic -!!! code\_block time="54.820 ms" +!!! code_block time="50.001 ms" ```postgresql -SELECT - review_body, - pgml.embed('intfloat/e5-small', 'passage: ' || review_body) -FROM pgml.amazon_us_reviews -LIMIT 1; +SELECT id, pgml.embed('intfloat/e5-small-v2', body) +FROM documents; ``` !!! !!! results -``` -CREATE INDEX -``` - -!!! - -!!! - -Time to generate an embedding increases with the length of the input text, and varies widely between different models. If we up our batch size (controlled by `LIMIT`), we can see the average time to compute an embedding on the first 1000 reviews is about 17ms per review: - -!!! code\_block time="17955.026 ms" - ```postgresql -SELECT - review_body, - pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding -FROM pgml.amazon_us_reviews -LIMIT 1000; + id | embed +---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + 1 | 
{-0.09234577,0.037487056,-0.03421769,-0.033738457,-0.042548284,-0.0015319627,0.042109113,0.011365055,-0.018372666,0.020417988,0.061961487,-0.022707041,0.015810987,0.03675479,0.001995532,-0.04197657,-0.034883354,0.07871886,-0.11676137,0.06141681,0.08321331,-0.03457781,-0.013248807,-0.05802344,-0.039144825,-0.015038275,0.020686107,0.08593334,-0.041029375,-0.13210341,-0.034079146,0.016687978,0.06363906,-0.05279167,0.10102262,-0.048170853,-0.014849669,0.03523273,0.024248678,0.031341534,-0.021447029,-0.05781338,0.039722513,-0.058294114,-0.035174508,-0.056844078,-0.051775914,-0.05822031,0.083022244,0.027178412,0.0032413877,0.023898097,0.023951318,0.0565093,0.036267336,0.049430914,0.027110789,0.05017207,0.058326595,0.040568575,0.014855128,0.06272174,-0.12961388,0.0998898,0.014964503,0.07735804,-0.028795758,0.026889611,-0.0613238,-0.004798127,0.009027658,0.046634953,-0.034936648,0.076499216,-0.03855506,0.08894715,-0.0019889707,0.07027481,-0.04624302,-0.048422314,-0.02444203,-0.0442959,-0.028878363,0.04586853,-0.004158767,-0.0027680802,0.029728336,-0.06130052,-0.028088963,-0.050658133,-0.024370935,-0.0030779864,0.018137587,-0.029853988,-0.06877675,-0.001238518,0.025249483,-0.0045243553,0.07250941,0.12831028,0.0077543575,0.012130527,-0.0006014347,-0.027807593,-0.011226617,-0.04837827,0.0376276,-0.058811083,0.020967057,-0.021439878,-0.0634577,-0.029189702,-0.040197153,-0.01993339,0.0899751,-0.014370172,0.0021994617,-0.0759979,-0.010541287,0.034424484,0.030067233,0.016858222,0.015223163,0.021410512,0.072372325,-0.06270684,0.09666927,0.07237114,0.09372637,-0.027058149,0.06319879,-0.03626834,-0.03539027,0.010406426,-0.08829164,-0.020550422,-0.043701466,-0.018676292,0.038060706,-0.0058152666,-0.04057362,-0.06266247,-0.026675962,-0.07610313,-0.023740835,0.06968648,-0.076157875,0.05129882,-0.053703927,-0.04906172,-0.014506706,-0.033226766,0.04197027,0.009892002,-0.019509513,0.020975547,0.015931072,0.044290986,-0.048697367,-0.022310019,-0.088599496,-0.0371257,0.037382104,0.14381507,0.07789086,-0.10580675,0.0255245,0.014246269,0.01157928,-0.069586724,0.023313843,0.02494169,-0.014511085,-0.017566541,0.0865948,-0.012115137,0.024397936,-0.049067125,0.03300015,-0.058626212,0.034921415,-0.04132337,-0.025009211,0.057668354,0.016189015,-0.04954466,-0.036778226,-0.046015732,-0.041587763,-0.03449501,-0.033505566,0.019262834,-0.018552447,0.019556912,0.01612039,0.0026575527,-0.05330489,-0.06894643,-0.04592849,-0.08485257,0.12714563,0.026810834,-0.053618323,-0.029470881,-0.04381535,0.055211045,-0.0111715235,-0.004484313,-0.02654065,-0.022317547,-0.027823675,0.0135190515,0.001530742,-0.04323672,-0.028350104,-0.07154715,-0.0024147208,0.031836234,0.03476004,0.033611998,0.038179073,-0.087631755,-0.048408568,-0.11773682,-0.019127818,0.013682835,-0.02015055,0.01888005,-0.03280704,0.0076310635,0.074330166,-0.031277154,0.056628436,0.119448215,-0.0012235055,-0.009727585,-0.05459528,0.04298459,0.054554865,-0.027898816,0.0040641865,0.08585007,-0.053415824,-0.030528797,-0.08231634,-0.069264784,-0.08337459,0.049254872,-0.021684796,0.12479715,0.053940497,-0.038884085,-0.032209005,0.035795107,0.0054665194,0.0085438965,-0.039386917,0.083624765,-0.056901276,0.022051739,0.06955752,-0.0008329906,-0.07959222,0.075660035,-0.017008293,0.015329365,-0.07439257,0.057193674,-0.06564091,0.0007063081,-0.015799401,-0.008529507,0.027204275,0.0076780985,-0.018589584,0.065267086,-0.02026929,-0.0559547,-0.035843417,-0.07237942,0.028072618,-0.048903402,-0.027478782,-0.084877744,-0.040812787,0.026713751,0.016210195,-0.039116003,0.03572044,-0.014964189,0.0
26315138,-0.08638934,-0.04198059,-0.02164005,0.09299754,-0.047685668,0.061317034,0.035914674,0.03533252,0.0287274,-0.033809293,-0.046841178,-0.042211317,-0.02567011,-0.048029255,0.039492987,0.04906847,0.030969618,0.0066106897,0.025528666,-0.008357054,0.04791732,-0.070402496,0.053391967,-0.06309544,0.06575766,0.06522203,0.060434356,-0.047547556,-0.13597175,-0.048658505,0.009734684,-0.016258504,-0.034227647,0.05382081,0.001330341,0.011890187,-0.047945525,-0.031132223,0.0010349775,0.030007072,0.12059559,-0.060273632,-0.010099646,0.055261053,0.053757478,-0.045518342,-0.041972063,-0.08315036,0.049884394,0.037543204,0.17598632,-0.0027433096,0.015989233,0.017486975,0.0059954696,-0.022668751,0.05677827,0.029728843,0.0011321013,-0.051546678,0.1113402,0.017779723,0.050953783,0.10342974,0.04067395,0.054890294,0.017487328,-0.020321153,0.062171113,0.07234749,-0.06777497,-0.03888628,0.08744684,0.032227095,-0.04398878,-0.049698275,-0.0018518695,-0.015967874,-0.0415869,-0.022655524,0.03596353,0.07130526,0.056296617,-0.06720573,-0.092787154,0.021057911,0.015628621,-0.04396636,-0.0063872878,-0.0127499355,0.01633339,-0.0006204544,0.0438727} + 2 | {-0.11384405,0.067140445,0.004428383,-0.019213142,0.011713443,0.009808596,0.06439777,-0.014959955,-0.03600561,0.01949383,0.04094742,0.030407589,-0.026018979,0.044171993,0.022412317,-0.057937913,-0.05182386,0.07793179,-0.109105654,0.057499174,0.102279164,-0.04705679,0.0010215766,-0.052305017,-0.0064890077,-0.019298203,0.0027092565,0.07363092,-0.010116459,-0.12196041,-0.025577176,0.010314696,0.031369787,-0.020949671,0.08722754,-0.051809352,0.0007810379,0.07672705,-0.008455481,0.06511949,-0.021327827,-0.060510863,0.044916406,-0.08674781,-0.047401372,-0.01868107,-0.075262256,-0.055392392,0.072947465,-0.01151735,-0.0072187134,0.015544381,0.039965566,0.020232335,0.04894269,0.04900096,0.05358905,0.032501124,0.053288646,0.07584814,0.031957388,0.05976136,-0.12726106,0.103460334,0.06346268,0.06554993,-0.045167506,0.012330433,-0.062929176,0.043507233,-0.008544882,0.027812833,-0.040016085,0.055822216,-0.03835489,0.040096387,0.018063055,0.060356017,-0.0726533,-0.0671456,-0.05047295,-0.042710193,-0.042777598,-0.006822609,0.012524907,-0.032105528,0.026691807,-0.05756205,0.015424967,-0.04767447,-0.036748573,-0.02527533,0.025934244,-0.033328723,-4.1858173e-05,-0.027706677,0.047805857,0.00018475522,0.050902035,0.1352519,0.005388455,0.029921843,-0.02537518,-0.058101207,-0.021984883,-0.059336115,0.03498545,-0.052446626,0.022411253,0.0060822135,-0.068493545,-0.013820616,-0.03522277,-0.018971028,0.07487064,-0.0009035772,-0.009381329,-0.04850395,0.001105027,0.016467793,0.0268643,0.0013964645,0.043346133,-0.009041368,0.07489963,-0.07887815,0.068340026,0.03767777,0.11665796,-0.025433592,0.062018104,-0.030672694,-0.012993033,0.0068405713,-0.03688894,-0.022034604,-0.040981747,-0.033101898,0.071058825,-0.0017327801,-0.021141728,-0.07144207,-0.02906128,-0.095396295,0.006055787,0.08500532,-0.031142898,0.055712428,-0.041926548,-0.042101618,-0.013311086,-0.046836447,0.023902802,0.031264246,-0.012085872,0.042904463,0.011645057,0.049069524,-0.0039288886,-0.014362478,-0.06809574,-0.038734697,0.028410498,0.12843607,0.090781115,-0.119838186,0.016676102,0.0009924435,0.0314442,-0.040607806,0.0020882064,0.044765383,0.01829387,-0.05677682,0.08415222,-0.06399008,-0.010945022,-0.024140757,0.046428833,-0.0651499,0.041250102,-0.06294866,-0.032783676,0.047456875,0.034612734,-0.021892011,-0.050926965,-0.06388983,-0.031164767,0.053277884,-0.069394015,0.03465082,-0.0410735,0.03736871,0.010950864,0.01830701,-0.07006
3934,-0.06988279,-0.03560967,-0.05519299,0.07882521,0.05533408,-0.02321644,0.007326457,-0.05126765,0.045479607,0.01830127,-0.037239183,-0.08015762,-0.056017533,-0.07647084,-0.0065865014,-0.027235825,-0.039984804,-0.0156225115,-0.014561295,0.024489071,0.009097713,0.04265267,-0.003169223,0.010329996,-0.078917705,-0.026417341,-0.13925064,-0.009786513,-0.037679326,-0.023494951,0.016230932,-0.010068113,0.008919443,0.05672694,-0.0647096,0.0074613485,0.0856074,-0.0072963624,-0.04508945,-0.027654354,0.031864826,0.046863783,-0.032239847,-0.024967564,0.065593235,-0.05142123,-0.011477745,-0.083396286,-0.036403924,-0.030264381,0.060208946,-0.037968345,0.13118903,0.055968005,-0.02204113,-0.00871512,0.06265703,0.024767108,0.06307163,-0.093918525,0.06388834,-0.027308429,0.028177679,0.046643235,-0.008643308,-0.08599351,0.08742052,-0.0045658057,0.009925819,-0.061982065,0.06666853,-0.085638665,-0.008682048,0.016528588,-0.015443429,0.040419903,0.0059123226,-0.04848474,0.026133329,-0.042095724,-0.06860905,-0.033551272,-0.06492134,0.019667841,-0.04917464,-0.0096588,-0.10072659,-0.07769663,0.03221359,0.019174514,0.039727442,0.025392585,-0.016384942,0.0024048705,-0.09175566,-0.03225071,0.0066428655,0.10759633,-0.04011207,0.031578932,0.06299788,0.061487168,0.048043367,-0.0047893273,-0.054848563,-0.06647676,-0.027905045,-0.055799212,0.028914401,0.04013868,0.050728165,-0.0063177645,-0.018899892,0.008193828,0.025991635,-0.08009935,0.044058595,-0.046858713,0.072079815,0.046664152,0.019002488,-0.018447064,-0.15560018,-0.050175466,0.001016439,-0.0035773942,-0.025972001,0.047064543,0.01866733,0.0049167247,-0.052880444,-0.029235922,-0.024581103,0.040634423,0.095990844,-0.019483034,-0.02325509,0.056078408,0.09241045,-0.03079215,-0.023518562,-0.08394134,0.03326668,0.008070111,0.14776507,0.030338759,-0.01846056,0.009517991,0.0034727904,0.007246884,0.015436005,0.058226254,-0.037932027,-0.04309255,0.09766471,0.014914252,0.03149386,0.10146584,0.009303289,0.05649276,0.04743103,-0.016993523,0.054828145,0.033858124,-0.059207607,-0.027288152,0.09254907,0.07817234,-0.047911037,-0.023988279,-0.067968085,-0.03140125,-0.02434741,-0.017226815,0.050405838,0.048384074,0.10386314,-0.05366119,-0.048218876,0.022471255,-0.04470827,-0.055776954,0.0146418335,-0.03505756,0.041757654,0.0076765255,0.0637766} + 3 | 
{-0.06530473,0.043326367,0.027487691,-0.012605501,-0.003679171,0.0068843057,0.093755856,-0.018192727,-0.038994554,0.060702052,0.047350235,0.0015797003,-0.026038624,0.029946782,0.053223953,-0.009188536,-0.012273773,0.07512682,-0.1220027,0.024623549,0.040207546,-0.061494265,-0.0016338134,-0.096063755,-0.020626824,-0.0008177105,0.025736991,0.08205663,-0.064413406,-0.10329614,-0.050153203,0.022038238,-0.011629073,-0.03142779,0.09684598,-0.045188677,-0.032773193,0.041901052,0.032470446,0.06218501,0.00056252955,-0.03571358,0.030095506,-0.09239761,-0.020187493,-0.00932361,-0.08373726,-0.053929392,0.09724756,-0.032078817,0.02658544,0.009965162,0.07477913,0.05487153,0.023828406,0.06263976,0.06882497,0.08249143,0.062069558,0.08915651,-0.005154778,0.056259956,-0.13729677,0.08404741,0.07149277,0.04482675,-0.058625933,0.0034976404,-0.030747578,0.004520399,0.0007449915,9.660358e-05,-0.022526976,0.11449101,-0.043607008,0.026769284,0.021050733,0.05854427,-0.042627476,-0.022924222,-0.059794623,-0.037738875,-0.018500011,0.017315088,-0.00020744087,-0.0016206327,0.013337528,-0.022439854,-0.0042932644,-0.04706647,-0.06771751,-0.040391076,0.0638978,-0.031776994,0.011536817,-0.04593729,0.08626801,0.0016808647,-0.0046028513,0.13702579,0.02293593,0.043189116,-0.0073873955,-0.06097065,-0.019305069,-0.025651531,0.043129053,-0.033460874,0.03261353,-0.022361644,-0.07769732,-0.021210406,-0.020294553,-0.044899672,0.083500296,0.038056396,-0.052046232,-0.03215008,-0.028185,0.041909587,0.016012225,-0.0058236965,0.021344814,-0.037620485,0.07454872,-0.03517924,0.086520284,0.096695796,0.0937938,-0.04190071,0.072271764,-0.07022541,0.01583733,-0.0017275782,-0.05280332,-0.005904967,-0.046241984,-0.024421731,0.09988276,-0.0077029592,-0.04107849,-0.091607556,0.033811443,-0.1323201,-0.015927043,0.011014193,-0.039773338,0.033963792,-0.053305525,-0.005038948,-0.024107914,-0.0079898145,0.039604105,0.009226985,0.0010978039,-0.015565131,-0.0002796709,0.037623808,-0.059376597,0.015390821,-0.07600872,-0.008280972,0.023050148,0.0777234,0.061332665,-0.13979945,-0.009342198,0.012803744,0.049805813,-0.03578894,-0.05038778,0.048912454,0.032017626,0.015345221,0.10369494,-0.048897773,-0.054201737,-0.015793057,0.08130064,-0.064783126,0.074246705,-0.06964914,-0.025839275,0.030869238,0.06357789,-0.028754702,-0.02960897,-0.04956488,0.030501548,0.005857936,-0.023547728,0.03717584,0.0024309678,0.066338174,-0.009775384,-0.030799516,-0.028462514,-0.058787093,-0.051071096,-0.048674088,0.011397957,0.07817651,-0.03227047,0.027149512,-0.0030777291,0.061677814,0.0025318298,-0.027110869,-0.0691719,-0.033963803,-0.0648151,-0.033951994,-0.0478505,0.0016642202,-0.019602248,-0.030472266,0.015889537,-0.0009066139,0.032841947,0.021004336,-0.029254122,-0.09597239,-0.04359093,-0.15422617,-0.016366383,-0.059343938,-0.064871244,0.07659653,0.023196936,-0.021893008,0.080793895,-0.05248758,0.018764181,0.0008353451,-0.03318359,-0.04830206,-0.05518034,0.038481984,0.06544077,0.019498836,-0.054670736,0.040052623,-0.028875519,-0.047129385,-0.03614192,-0.012638911,-0.0042204396,0.013685266,-0.047130045,0.11024768,0.07135732,-0.017937008,-0.040911496,0.09008783,0.039298594,0.042975742,-0.08974752,0.08711358,-0.021977019,0.051495675,0.0140351625,-0.053809136,-0.08241595,0.04982693,-0.020355707,0.017629888,-0.039196398,0.08688628,-0.051167585,-0.029257154,0.009161573,-0.0021740724,0.027258197,0.015352816,-0.07426982,0.022452697,-0.041628033,-0.023250584,-0.051996145,-0.031867135,-0.01930267,-0.05257186,0.032619886,-0.08220233,-0.017010445,0.038414452,-0.02268424,0.007727591,0.00640
41745,-0.024256933,0.0028989788,-0.06191567,-0.020444075,-0.010515549,0.08980986,-0.020033991,0.009208651,0.044014987,0.067944355,0.07915397,0.019362122,-0.010731527,-0.057449125,-0.007854527,-0.067998596,0.036500365,0.037355963,-0.0011789168,0.030410502,-0.012768641,-0.03281059,0.026916556,-0.052477527,0.042145997,-0.023683913,0.099338256,0.035008017,-0.029086927,-0.032222193,-0.14743629,-0.04350868,0.030494612,-0.013000542,0.021753347,0.023393912,0.021320568,0.0031570331,-0.06008047,-0.031103736,0.030275675,0.015258714,0.09004704,0.0033432578,-0.0045539658,0.06602429,0.072156474,-0.0613405,-0.047462273,-0.057639644,-0.008026253,0.03090332,0.12396069,0.04592149,-0.053269017,0.034282286,-0.0045666047,-0.026025562,0.004598449,0.04304216,-0.02252559,-0.040372007,0.08094969,-0.021883471,0.05903653,0.10130699,0.001840184,0.06142003,0.004450253,-0.023686321,0.014760433,0.07669066,-0.08392746,-0.028447477,0.08995419,0.028487092,-0.047503598,-0.026627144,-0.0475691,-0.069141485,-0.039571274,-0.054866526,0.04417342,0.08155949,0.065555565,-0.053984754,-0.04142323,-0.023902748,0.0066344747,-0.065118864,0.02183451,-0.06479133,0.010425607,-0.010283142,0.0940532} ``` !!! -## Comparing different models and hardware performance +!!! -This database is using a single GPU with 32GB RAM and 8 vCPUs with 16GB RAM. Running these benchmarks while looking at the database processes with `htop` and `nvidia-smi`, it becomes clear that the bottleneck in this case is actually tokenizing the strings which happens in a single thread on the CPU, not computing the embeddings on the GPU which was only 20% utilized during the query. +We can store embeddings in the database as well. Here's an example of creating a temporary table to hold all the embeddings during the current transaction. -We can also do a quick sanity check to make sure we're really getting value out of our GPU by passing the device to our embedding function: +!!! generic -!!! code\_block time="30421.491 ms" +!!! code_block time="54.123 ms" ```postgresql -SELECT - reviqew_body, - pgml.embed( - 'intfloat/e5-small', - 'passage: ' || review_body, - '{"device": "cpu"}' - ) AS embedding -FROM pgml.amazon_us_reviews -LIMIT 1000; +CREATE TEMPORARY TABLE embeddings AS +SELECT id AS document_id, + pgml.embed('intfloat/e5-small-v2', body) +FROM documents; ``` !!! -Forcing the embedding function to use `cpu` is almost 2x slower than `cuda` which is the default when GPUs are available. - -If you're managing dedicated hardware, there's always a decision to be made about resource utilization. If this is a multi-workload database with other queries using the GPU, it's probably great that we're not completely hogging it with our multi-decade-Amazon-scale data import process, but if this is a machine we've spun up just for this task, we can up the resource utilization to 4 concurrent connections, all running on a subset of the data to more completely utilize our CPU, GPU and RAM. - -Another consideration is that GPUs are much more expensive right now than CPUs, and if we're primarily interested in backfilling a dataset like this, high concurrency across many CPU cores might just be the price-competitive winner. - -With 4x concurrency and a GPU, it'll take about 6 hours to compute all 5 million embeddings, which will cost $72 on PostgresML Cloud. If we use the CPU instead of the GPU, we'll probably want more cores and higher concurrency to plug through the job faster. 
A 96 CPU core machine could complete the job in half the time our single GPU would take and at a lower hourly cost as well, for a total cost of $24. It's overall more cost-effective and faster in parallel, but keep in mind if you're interactively generating embeddings for a user facing application, it will add double the latency, 30ms CPU vs 17ms for GPU. - -For comparison, it would cost about $299 to use OpenAI's cheapest embedding model to process this dataset. Their API calls average about 300ms, although they have high variability (200-400ms) and greater than 1000ms p99 in our measurements. They also have a default rate limit of 200 tokens per minute which means it would take 1,425 years to process this dataset. You better call ahead. - -| Processor | Latency | Cost | Time | -| --------- | ------- | ---- | --------- | -| CPU | 30ms | $24 | 3 hours | -| GPU | 17ms | $72 | 6 hours | -| OpenAI | 300ms | $299 | millennia | - -You can also find embedding models that outperform OpenAI's `text-embedding-ada-002` model across many different tests on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case. - -> _HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage._ - -## Instructor embedding models - -The current leading model is `hkunlp/instructor-xl`. Instructor models take an additional `instruction` parameter which includes context for the embeddings use case, similar to prompts before text generation tasks. - -Instructions can provide a "classification" or "topic" for the text: - -#### Classification - -!!! code\_block time="17.912ms" +!!! results ```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'The Federal Reserve on Wednesday raised its benchmark interest rate.', - kwargs => '{"instruction": "Represent the Financial statement:"}' -); +SELECT 3 ``` !!! -They can also specify particular use cases for the embedding: - -#### Querying +!!! -!!! code\_block time="24.263 ms" +Another way would be to generated and store the embeddings any time a document is updated: ```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'where is the food stored in a yam plant', - kwargs => '{ - "instruction": "Represent the Wikipedia question for retrieving supporting documents:" - }' +CREATE TABLE documents_with_embeddings ( +id SERIAL PRIMARY KEY, +body TEXT, +embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED ); ``` -!!! - -#### Indexing +!!! generic -!!! code\_block time="30.571 ms" +!!! code_block time="46.823" ```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. 
Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.', - kwargs => '{"instruction": "Represent the Wikipedia document for retrieval:"}' -); +INSERT INTO documents_with_embeddings (body) +VALUES -- embedding vectors are automatically generated + ('Example text data'), + ('Another example document'), + ('Some other thing'); ``` !!! -#### Clustering - -!!! code\_block time="18.986 ms" +!!! results ```postgresql -SELECT pgml.embed( - transformer => 'hkunlp/instructor-xl', - text => 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity"}', - kwargs => '{"instruction": "Represent the Medicine sentence for clustering:"}' -); +INSERT 0 3 ``` !!! -Performance remains relatively good, even with the most advanced models. +!!! -## Generating embeddings for a large dataset +You could also use a Common Table Expression to generate an embedding on the fly and then reference it later in the SQL statement. For example, to generate a search embedding, and compare it to all existing embeddings in a table to find the nearest neighbors: -For our use case, we want to generate an embedding for every single review in the dataset. We'll use the `vector` datatype available from the `pgvector` extension to store (and later index) embeddings efficiently. All PostgresML cloud installations include [pgvector](https://github.com/pgvector/pgvector). To enable this extension in your database, you can run: +!!! generic +!!! code_block time="25.688 ms" ```postgresql -CREATE EXTENSION vector; +WITH query AS ( + SELECT pgml.embed('intfloat/e5-small-v2', 'An example search query') AS embedding +) +SELECT id, pgml.distance_l2(query.embedding, documents_with_embeddings.embedding) +FROM documents_with_embeddings, query +ORDER BY distance_l2; ``` -Then we can add a `vector` column for our review embeddings, with 384 dimensions (the size of e5-small embeddings): +!!! results ```postgresql -ALTER TABLE pgml.amazon_us_reviews -ADD COLUMN review_embedding_e5_large vector(1024); + id | distance_l2 +----+--------------------- + 1 | 0.45335962377530326 + 2 | 0.49441662560530825 + 3 | 0.632445005046323 ``` -It's best practice to keep running queries on a production database relatively short, so rather than trying to update all 5M rows in one multi-hour query, we should write a function to issue the updates in smaller batches. To make iterating over the rows easier and more efficient, we'll add an `id` column with an index to our table: +!!! -```postgresql -ALTER TABLE pgml.amazon_us_reviews -ADD COLUMN id SERIAL PRIMARY KEY; -``` +!!! -Every language/framework/codebase has its own preferred method for backfilling data in a table. The 2 most important considerations are: +## Batching -1. Keep the number of rows per query small enough that the queries take less than a second -2. More concurrency will get the job done faster, but keep in mind the other workloads on your database +PostgresML supports batching embeddings. It turns out, a lot of the cost of generating an embedding is streaming the model weights for each layer from memory to the processors, rather than performing the actual calculations. By batching embeddings, we can reuse the weights for each layer on multiple inputs, before loading the next layer and continuing, which amortizes the RAM latency across all embeddings. 
-Here's an example of a very simple back-fill job implemented in pure PGSQL, but I'd also love to see example PRs opened with your techniques in your language of choice for tasks like this. +!!! generic + +!!! code_block time="21.204 ms" ```postgresql -DO $$ -BEGIN - FOR i in 1..(SELECT max(id) FROM pgml.amazon_us_reviews) by 10 LOOP - BEGIN RAISE NOTICE 'updating % to %', i, i + 10; END; - - UPDATE pgml.amazon_us_reviews - SET review_embedding_e5_large = pgml.embed( - 'intfloat/e5-large', - 'passage: ' || review_body - ) - WHERE id BETWEEN i AND i + 10 - AND review_embedding_e5_large IS NULL; - - COMMIT; - END LOOP; -END; -$$; +SELECT pgml.embed('intfloat/e5-small-v2', array_agg(body)) AS embedding +FROM documents; ``` -## What's next? - -That's it for now. We've got an Amazon scale table with state-of-the-art machine learning embeddings. As a premature optimization, we'll go ahead and build an index on our new column to make our future vector similarity queries faster. For the full documentation on vector indexes in Postgres see the [pgvector docs](https://github.com/pgvector/pgvector). +!!! -!!! code\_block time="4068909.269 ms (01:07:48.909)" +!!! results ```postgresql -CREATE INDEX CONCURRENTLY index_amazon_us_reviews_on_review_embedding_e5_large -ON pgml.amazon_us_reviews -USING ivfflat (review_embedding_e5_large vector_cosine_ops) -WITH (lists = 2000); + id | embed +---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + 1 | 
{-0.09234577,0.037487056,-0.03421769,-0.033738457,-0.042548284,-0.0015319627,0.042109113,0.011365055,-0.018372666,0.020417988,0.061961487,-0.022707041,0.015810987,0.03675479,0.001995532,-0.04197657,-0.034883354,0.07871886,-0.11676137,0.06141681,0.08321331,-0.03457781,-0.013248807,-0.05802344,-0.039144825,-0.015038275,0.020686107,0.08593334,-0.041029375,-0.13210341,-0.034079146,0.016687978,0.06363906,-0.05279167,0.10102262,-0.048170853,-0.014849669,0.03523273,0.024248678,0.031341534,-0.021447029,-0.05781338,0.039722513,-0.058294114,-0.035174508,-0.056844078,-0.051775914,-0.05822031,0.083022244,0.027178412,0.0032413877,0.023898097,0.023951318,0.0565093,0.036267336,0.049430914,0.027110789,0.05017207,0.058326595,0.040568575,0.014855128,0.06272174,-0.12961388,0.0998898,0.014964503,0.07735804,-0.028795758,0.026889611,-0.0613238,-0.004798127,0.009027658,0.046634953,-0.034936648,0.076499216,-0.03855506,0.08894715,-0.0019889707,0.07027481,-0.04624302,-0.048422314,-0.02444203,-0.0442959,-0.028878363,0.04586853,-0.004158767,-0.0027680802,0.029728336,-0.06130052,-0.028088963,-0.050658133,-0.024370935,-0.0030779864,0.018137587,-0.029853988,-0.06877675,-0.001238518,0.025249483,-0.0045243553,0.07250941,0.12831028,0.0077543575,0.012130527,-0.0006014347,-0.027807593,-0.011226617,-0.04837827,0.0376276,-0.058811083,0.020967057,-0.021439878,-0.0634577,-0.029189702,-0.040197153,-0.01993339,0.0899751,-0.014370172,0.0021994617,-0.0759979,-0.010541287,0.034424484,0.030067233,0.016858222,0.015223163,0.021410512,0.072372325,-0.06270684,0.09666927,0.07237114,0.09372637,-0.027058149,0.06319879,-0.03626834,-0.03539027,0.010406426,-0.08829164,-0.020550422,-0.043701466,-0.018676292,0.038060706,-0.0058152666,-0.04057362,-0.06266247,-0.026675962,-0.07610313,-0.023740835,0.06968648,-0.076157875,0.05129882,-0.053703927,-0.04906172,-0.014506706,-0.033226766,0.04197027,0.009892002,-0.019509513,0.020975547,0.015931072,0.044290986,-0.048697367,-0.022310019,-0.088599496,-0.0371257,0.037382104,0.14381507,0.07789086,-0.10580675,0.0255245,0.014246269,0.01157928,-0.069586724,0.023313843,0.02494169,-0.014511085,-0.017566541,0.0865948,-0.012115137,0.024397936,-0.049067125,0.03300015,-0.058626212,0.034921415,-0.04132337,-0.025009211,0.057668354,0.016189015,-0.04954466,-0.036778226,-0.046015732,-0.041587763,-0.03449501,-0.033505566,0.019262834,-0.018552447,0.019556912,0.01612039,0.0026575527,-0.05330489,-0.06894643,-0.04592849,-0.08485257,0.12714563,0.026810834,-0.053618323,-0.029470881,-0.04381535,0.055211045,-0.0111715235,-0.004484313,-0.02654065,-0.022317547,-0.027823675,0.0135190515,0.001530742,-0.04323672,-0.028350104,-0.07154715,-0.0024147208,0.031836234,0.03476004,0.033611998,0.038179073,-0.087631755,-0.048408568,-0.11773682,-0.019127818,0.013682835,-0.02015055,0.01888005,-0.03280704,0.0076310635,0.074330166,-0.031277154,0.056628436,0.119448215,-0.0012235055,-0.009727585,-0.05459528,0.04298459,0.054554865,-0.027898816,0.0040641865,0.08585007,-0.053415824,-0.030528797,-0.08231634,-0.069264784,-0.08337459,0.049254872,-0.021684796,0.12479715,0.053940497,-0.038884085,-0.032209005,0.035795107,0.0054665194,0.0085438965,-0.039386917,0.083624765,-0.056901276,0.022051739,0.06955752,-0.0008329906,-0.07959222,0.075660035,-0.017008293,0.015329365,-0.07439257,0.057193674,-0.06564091,0.0007063081,-0.015799401,-0.008529507,0.027204275,0.0076780985,-0.018589584,0.065267086,-0.02026929,-0.0559547,-0.035843417,-0.07237942,0.028072618,-0.048903402,-0.027478782,-0.084877744,-0.040812787,0.026713751,0.016210195,-0.039116003,0.03572044,-0.014964189,0.0
26315138,-0.08638934,-0.04198059,-0.02164005,0.09299754,-0.047685668,0.061317034,0.035914674,0.03533252,0.0287274,-0.033809293,-0.046841178,-0.042211317,-0.02567011,-0.048029255,0.039492987,0.04906847,0.030969618,0.0066106897,0.025528666,-0.008357054,0.04791732,-0.070402496,0.053391967,-0.06309544,0.06575766,0.06522203,0.060434356,-0.047547556,-0.13597175,-0.048658505,0.009734684,-0.016258504,-0.034227647,0.05382081,0.001330341,0.011890187,-0.047945525,-0.031132223,0.0010349775,0.030007072,0.12059559,-0.060273632,-0.010099646,0.055261053,0.053757478,-0.045518342,-0.041972063,-0.08315036,0.049884394,0.037543204,0.17598632,-0.0027433096,0.015989233,0.017486975,0.0059954696,-0.022668751,0.05677827,0.029728843,0.0011321013,-0.051546678,0.1113402,0.017779723,0.050953783,0.10342974,0.04067395,0.054890294,0.017487328,-0.020321153,0.062171113,0.07234749,-0.06777497,-0.03888628,0.08744684,0.032227095,-0.04398878,-0.049698275,-0.0018518695,-0.015967874,-0.0415869,-0.022655524,0.03596353,0.07130526,0.056296617,-0.06720573,-0.092787154,0.021057911,0.015628621,-0.04396636,-0.0063872878,-0.0127499355,0.01633339,-0.0006204544,0.0438727} + 2 | {-0.11384405,0.067140445,0.004428383,-0.019213142,0.011713443,0.009808596,0.06439777,-0.014959955,-0.03600561,0.01949383,0.04094742,0.030407589,-0.026018979,0.044171993,0.022412317,-0.057937913,-0.05182386,0.07793179,-0.109105654,0.057499174,0.102279164,-0.04705679,0.0010215766,-0.052305017,-0.0064890077,-0.019298203,0.0027092565,0.07363092,-0.010116459,-0.12196041,-0.025577176,0.010314696,0.031369787,-0.020949671,0.08722754,-0.051809352,0.0007810379,0.07672705,-0.008455481,0.06511949,-0.021327827,-0.060510863,0.044916406,-0.08674781,-0.047401372,-0.01868107,-0.075262256,-0.055392392,0.072947465,-0.01151735,-0.0072187134,0.015544381,0.039965566,0.020232335,0.04894269,0.04900096,0.05358905,0.032501124,0.053288646,0.07584814,0.031957388,0.05976136,-0.12726106,0.103460334,0.06346268,0.06554993,-0.045167506,0.012330433,-0.062929176,0.043507233,-0.008544882,0.027812833,-0.040016085,0.055822216,-0.03835489,0.040096387,0.018063055,0.060356017,-0.0726533,-0.0671456,-0.05047295,-0.042710193,-0.042777598,-0.006822609,0.012524907,-0.032105528,0.026691807,-0.05756205,0.015424967,-0.04767447,-0.036748573,-0.02527533,0.025934244,-0.033328723,-4.1858173e-05,-0.027706677,0.047805857,0.00018475522,0.050902035,0.1352519,0.005388455,0.029921843,-0.02537518,-0.058101207,-0.021984883,-0.059336115,0.03498545,-0.052446626,0.022411253,0.0060822135,-0.068493545,-0.013820616,-0.03522277,-0.018971028,0.07487064,-0.0009035772,-0.009381329,-0.04850395,0.001105027,0.016467793,0.0268643,0.0013964645,0.043346133,-0.009041368,0.07489963,-0.07887815,0.068340026,0.03767777,0.11665796,-0.025433592,0.062018104,-0.030672694,-0.012993033,0.0068405713,-0.03688894,-0.022034604,-0.040981747,-0.033101898,0.071058825,-0.0017327801,-0.021141728,-0.07144207,-0.02906128,-0.095396295,0.006055787,0.08500532,-0.031142898,0.055712428,-0.041926548,-0.042101618,-0.013311086,-0.046836447,0.023902802,0.031264246,-0.012085872,0.042904463,0.011645057,0.049069524,-0.0039288886,-0.014362478,-0.06809574,-0.038734697,0.028410498,0.12843607,0.090781115,-0.119838186,0.016676102,0.0009924435,0.0314442,-0.040607806,0.0020882064,0.044765383,0.01829387,-0.05677682,0.08415222,-0.06399008,-0.010945022,-0.024140757,0.046428833,-0.0651499,0.041250102,-0.06294866,-0.032783676,0.047456875,0.034612734,-0.021892011,-0.050926965,-0.06388983,-0.031164767,0.053277884,-0.069394015,0.03465082,-0.0410735,0.03736871,0.010950864,0.01830701,-0.07006
3934,-0.06988279,-0.03560967,-0.05519299,0.07882521,0.05533408,-0.02321644,0.007326457,-0.05126765,0.045479607,0.01830127,-0.037239183,-0.08015762,-0.056017533,-0.07647084,-0.0065865014,-0.027235825,-0.039984804,-0.0156225115,-0.014561295,0.024489071,0.009097713,0.04265267,-0.003169223,0.010329996,-0.078917705,-0.026417341,-0.13925064,-0.009786513,-0.037679326,-0.023494951,0.016230932,-0.010068113,0.008919443,0.05672694,-0.0647096,0.0074613485,0.0856074,-0.0072963624,-0.04508945,-0.027654354,0.031864826,0.046863783,-0.032239847,-0.024967564,0.065593235,-0.05142123,-0.011477745,-0.083396286,-0.036403924,-0.030264381,0.060208946,-0.037968345,0.13118903,0.055968005,-0.02204113,-0.00871512,0.06265703,0.024767108,0.06307163,-0.093918525,0.06388834,-0.027308429,0.028177679,0.046643235,-0.008643308,-0.08599351,0.08742052,-0.0045658057,0.009925819,-0.061982065,0.06666853,-0.085638665,-0.008682048,0.016528588,-0.015443429,0.040419903,0.0059123226,-0.04848474,0.026133329,-0.042095724,-0.06860905,-0.033551272,-0.06492134,0.019667841,-0.04917464,-0.0096588,-0.10072659,-0.07769663,0.03221359,0.019174514,0.039727442,0.025392585,-0.016384942,0.0024048705,-0.09175566,-0.03225071,0.0066428655,0.10759633,-0.04011207,0.031578932,0.06299788,0.061487168,0.048043367,-0.0047893273,-0.054848563,-0.06647676,-0.027905045,-0.055799212,0.028914401,0.04013868,0.050728165,-0.0063177645,-0.018899892,0.008193828,0.025991635,-0.08009935,0.044058595,-0.046858713,0.072079815,0.046664152,0.019002488,-0.018447064,-0.15560018,-0.050175466,0.001016439,-0.0035773942,-0.025972001,0.047064543,0.01866733,0.0049167247,-0.052880444,-0.029235922,-0.024581103,0.040634423,0.095990844,-0.019483034,-0.02325509,0.056078408,0.09241045,-0.03079215,-0.023518562,-0.08394134,0.03326668,0.008070111,0.14776507,0.030338759,-0.01846056,0.009517991,0.0034727904,0.007246884,0.015436005,0.058226254,-0.037932027,-0.04309255,0.09766471,0.014914252,0.03149386,0.10146584,0.009303289,0.05649276,0.04743103,-0.016993523,0.054828145,0.033858124,-0.059207607,-0.027288152,0.09254907,0.07817234,-0.047911037,-0.023988279,-0.067968085,-0.03140125,-0.02434741,-0.017226815,0.050405838,0.048384074,0.10386314,-0.05366119,-0.048218876,0.022471255,-0.04470827,-0.055776954,0.0146418335,-0.03505756,0.041757654,0.0076765255,0.0637766} + 3 | 
{-0.06530473,0.043326367,0.027487691,-0.012605501,-0.003679171,0.0068843057,0.093755856,-0.018192727,-0.038994554,0.060702052,0.047350235,0.0015797003,-0.026038624,0.029946782,0.053223953,-0.009188536,-0.012273773,0.07512682,-0.1220027,0.024623549,0.040207546,-0.061494265,-0.0016338134,-0.096063755,-0.020626824,-0.0008177105,0.025736991,0.08205663,-0.064413406,-0.10329614,-0.050153203,0.022038238,-0.011629073,-0.03142779,0.09684598,-0.045188677,-0.032773193,0.041901052,0.032470446,0.06218501,0.00056252955,-0.03571358,0.030095506,-0.09239761,-0.020187493,-0.00932361,-0.08373726,-0.053929392,0.09724756,-0.032078817,0.02658544,0.009965162,0.07477913,0.05487153,0.023828406,0.06263976,0.06882497,0.08249143,0.062069558,0.08915651,-0.005154778,0.056259956,-0.13729677,0.08404741,0.07149277,0.04482675,-0.058625933,0.0034976404,-0.030747578,0.004520399,0.0007449915,9.660358e-05,-0.022526976,0.11449101,-0.043607008,0.026769284,0.021050733,0.05854427,-0.042627476,-0.022924222,-0.059794623,-0.037738875,-0.018500011,0.017315088,-0.00020744087,-0.0016206327,0.013337528,-0.022439854,-0.0042932644,-0.04706647,-0.06771751,-0.040391076,0.0638978,-0.031776994,0.011536817,-0.04593729,0.08626801,0.0016808647,-0.0046028513,0.13702579,0.02293593,0.043189116,-0.0073873955,-0.06097065,-0.019305069,-0.025651531,0.043129053,-0.033460874,0.03261353,-0.022361644,-0.07769732,-0.021210406,-0.020294553,-0.044899672,0.083500296,0.038056396,-0.052046232,-0.03215008,-0.028185,0.041909587,0.016012225,-0.0058236965,0.021344814,-0.037620485,0.07454872,-0.03517924,0.086520284,0.096695796,0.0937938,-0.04190071,0.072271764,-0.07022541,0.01583733,-0.0017275782,-0.05280332,-0.005904967,-0.046241984,-0.024421731,0.09988276,-0.0077029592,-0.04107849,-0.091607556,0.033811443,-0.1323201,-0.015927043,0.011014193,-0.039773338,0.033963792,-0.053305525,-0.005038948,-0.024107914,-0.0079898145,0.039604105,0.009226985,0.0010978039,-0.015565131,-0.0002796709,0.037623808,-0.059376597,0.015390821,-0.07600872,-0.008280972,0.023050148,0.0777234,0.061332665,-0.13979945,-0.009342198,0.012803744,0.049805813,-0.03578894,-0.05038778,0.048912454,0.032017626,0.015345221,0.10369494,-0.048897773,-0.054201737,-0.015793057,0.08130064,-0.064783126,0.074246705,-0.06964914,-0.025839275,0.030869238,0.06357789,-0.028754702,-0.02960897,-0.04956488,0.030501548,0.005857936,-0.023547728,0.03717584,0.0024309678,0.066338174,-0.009775384,-0.030799516,-0.028462514,-0.058787093,-0.051071096,-0.048674088,0.011397957,0.07817651,-0.03227047,0.027149512,-0.0030777291,0.061677814,0.0025318298,-0.027110869,-0.0691719,-0.033963803,-0.0648151,-0.033951994,-0.0478505,0.0016642202,-0.019602248,-0.030472266,0.015889537,-0.0009066139,0.032841947,0.021004336,-0.029254122,-0.09597239,-0.04359093,-0.15422617,-0.016366383,-0.059343938,-0.064871244,0.07659653,0.023196936,-0.021893008,0.080793895,-0.05248758,0.018764181,0.0008353451,-0.03318359,-0.04830206,-0.05518034,0.038481984,0.06544077,0.019498836,-0.054670736,0.040052623,-0.028875519,-0.047129385,-0.03614192,-0.012638911,-0.0042204396,0.013685266,-0.047130045,0.11024768,0.07135732,-0.017937008,-0.040911496,0.09008783,0.039298594,0.042975742,-0.08974752,0.08711358,-0.021977019,0.051495675,0.0140351625,-0.053809136,-0.08241595,0.04982693,-0.020355707,0.017629888,-0.039196398,0.08688628,-0.051167585,-0.029257154,0.009161573,-0.0021740724,0.027258197,0.015352816,-0.07426982,0.022452697,-0.041628033,-0.023250584,-0.051996145,-0.031867135,-0.01930267,-0.05257186,0.032619886,-0.08220233,-0.017010445,0.038414452,-0.02268424,0.007727591,0.00640
41745,-0.024256933,0.0028989788,-0.06191567,-0.020444075,-0.010515549,0.08980986,-0.020033991,0.009208651,0.044014987,0.067944355,0.07915397,0.019362122,-0.010731527,-0.057449125,-0.007854527,-0.067998596,0.036500365,0.037355963,-0.0011789168,0.030410502,-0.012768641,-0.03281059,0.026916556,-0.052477527,0.042145997,-0.023683913,0.099338256,0.035008017,-0.029086927,-0.032222193,-0.14743629,-0.04350868,0.030494612,-0.013000542,0.021753347,0.023393912,0.021320568,0.0031570331,-0.06008047,-0.031103736,0.030275675,0.015258714,0.09004704,0.0033432578,-0.0045539658,0.06602429,0.072156474,-0.0613405,-0.047462273,-0.057639644,-0.008026253,0.03090332,0.12396069,0.04592149,-0.053269017,0.034282286,-0.0045666047,-0.026025562,0.004598449,0.04304216,-0.02252559,-0.040372007,0.08094969,-0.021883471,0.05903653,0.10130699,0.001840184,0.06142003,0.004450253,-0.023686321,0.014760433,0.07669066,-0.08392746,-0.028447477,0.08995419,0.028487092,-0.047503598,-0.026627144,-0.0475691,-0.069141485,-0.039571274,-0.054866526,0.04417342,0.08155949,0.065555565,-0.053984754,-0.04142323,-0.023902748,0.0066344747,-0.065118864,0.02183451,-0.06479133,0.010425607,-0.010283142,0.0940532} ``` !!! -!!! tip +!!! -Create indexes `CONCURRENTLY` to avoid locking your table for other queries. +You can see the near 2.5x speedup when generating 3 embeddings in a batch, because the model weights only need to be streamed from GPU RAM to the processors a single time. You should consider batch sizes from 10-100 embeddings at a time when do bulk operations to improve throughput and reduce costs. -!!! +## Scalability -Building a vector index on a table with this many entries takes a while, so this is a good time to take a coffee break. In the next article we'll look at how to query these embeddings to find the best products and make personalized recommendations for users. We'll also cover updating an index in real time as new data comes in. +PostgresML serverless instances have access to multiple GPUs that be used simultaneously across different PostgreSQL connections. For large jobs, you may want to create multiple worker threads/processes that operate across your dataset in batches on their own Postgres Connection. 
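A minimal sketch of that worker pattern, assuming a `documents` table with an integer `id` and a nullable `embedding` column to backfill, and a pool of 4 workers (all placeholder names and sizes): each worker opens its own connection and runs the same statement with a different remainder, so the slices of the id space never overlap.

```postgresql
-- Sketch of a per-worker backfill statement; run one copy per connection,
-- using remainders 0, 1, 2 and 3 for a pool of 4 workers.
UPDATE documents
SET embedding = pgml.embed('intfloat/e5-small-v2', body)
WHERE id % 4 = 0          -- this worker's slice of the id space
  AND embedding IS NULL;  -- only touch rows that still need an embedding
```

Each worker can also batch rows within its slice using the `array_agg` pattern shown above, and because every worker holds its own connection, the work can be spread across the available GPUs.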
diff --git a/pgml-cms/docs/guides/embeddings/indexing-w-pgvector.md b/pgml-cms/docs/guides/embeddings/indexing-w-pgvector.md new file mode 100644 index 000000000..e361d5aff --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/indexing-w-pgvector.md @@ -0,0 +1 @@ +# Indexing w/ pgvector diff --git a/pgml-cms/docs/guides/embeddings/re-ranking-nearest-neighbors.md b/pgml-cms/docs/guides/embeddings/re-ranking-nearest-neighbors.md new file mode 100644 index 000000000..a8945376a --- /dev/null +++ b/pgml-cms/docs/guides/embeddings/re-ranking-nearest-neighbors.md @@ -0,0 +1,3 @@ +# Re-ranking Nearest Neighbors + +## Introduction diff --git a/pgml-cms/docs/guides/embeddings/vector-aggregation.md b/pgml-cms/docs/guides/embeddings/vector-aggregation.md index e5f1cd721..2b6e09209 100644 --- a/pgml-cms/docs/guides/embeddings/vector-aggregation.md +++ b/pgml-cms/docs/guides/embeddings/vector-aggregation.md @@ -34,7 +34,7 @@ Vector aggregation is extensively used across various machine learning applicati ## Available Methods of Vector Aggregation ### Example Data -```sql +```postgresql CREATE TABLE documents ( id SERIAL PRIMARY KEY, body TEXT, @@ -44,7 +44,7 @@ CREATE TABLE documents ( Example of inserting text and its corresponding embedding -```sql +```postgresql INSERT INTO documents (body) VALUES -- embedding vectors are automatically generated ('Example text data'), @@ -55,7 +55,7 @@ VALUES -- embedding vectors are automatically generated ### Summation Adding up all the vectors element-wise. This method is simple and effective, preserving all the information from the original vectors, but can lead to large values if many vectors are summed. -```sql +```postgresql SELECT id, pgml.sum(embedding) FROM documents GROUP BY id; @@ -64,7 +64,7 @@ GROUP BY id; ### Averaging (Mean) Computing the element-wise mean of the vectors. This is probably the most common aggregation method, as it normalizes the scale of the vectors against the number of vectors being aggregated, preventing any single vector from dominating the result. -```sql +```postgresql SELECT id, pgml.divide(pgml.sum(embedding), count(*)) AS avg FROM documents GROUP BY id; @@ -73,7 +73,7 @@ GROUP BY id; ### Weighted Average Similar to averaging, but each vector is multiplied by a weight that reflects its importance before averaging. This method is useful when some vectors are more significant than others. -```sql +```postgresql SELECT id, pgml.divide(pgml.sum(pgml.multiply(embedding, id)), count(*)) AS id_weighted_avg FROM documents GROUP BY id; @@ -82,7 +82,7 @@ GROUP BY id; ### Max Pooling Taking the maximum value of each dimension across all vectors. This method is particularly useful for capturing the most pronounced features in a set of vectors. -```sql +```postgresql SELECT id, pgml.max_abs(embedding) FROM documents GROUP BY id; @@ -91,7 +91,7 @@ GROUP BY id; ### Min Pooling Taking the minimum value of each dimension across all vectors, useful for capturing the least dominant features. 
-```sql +```postgresql SELECT id, pgml.min_abs(embedding) FROM documents GROUP BY id; diff --git a/pgml-cms/docs/guides/embeddings/vector-normalization.md b/pgml-cms/docs/guides/embeddings/vector-normalization.md index b54513e58..31cddab00 100644 --- a/pgml-cms/docs/guides/embeddings/vector-normalization.md +++ b/pgml-cms/docs/guides/embeddings/vector-normalization.md @@ -14,7 +14,7 @@ Vector normalization converts a vector into a unit vector — that is, a vector Assume you've created a table in your database that stores embeddings generated using [pgml.embed()](../../api/sql-extension/pgml.embed.md), although you can normalize any vector. -```sql +```postgresql CREATE TABLE documents ( id SERIAL PRIMARY KEY, body TEXT, @@ -24,7 +24,7 @@ CREATE TABLE documents ( Example of inserting text and its corresponding embedding -```sql +```postgresql INSERT INTO documents (body) VALUES -- embedding vectors are automatically generated ('Example text data'), @@ -34,7 +34,7 @@ VALUES -- embedding vectors are automatically generated You could create a new table from your documents and their embeddings, that uses normalized embeddings. -```sql +```postgresql CREATE TABLE documents_normalized_vectors AS SELECT id AS document_id, @@ -44,7 +44,7 @@ FROM documents; Another valid approach would be to just store the normalized embedding in the documents table. -```sql +```postgresql CREATE TABLE documents ( id SERIAL PRIMARY KEY, body TEXT, @@ -57,26 +57,26 @@ CREATE TABLE documents ( - **L1 Normalization (Manhattan Norm)**: This function scales the vector so that the sum of the absolute values of its components is equal to 1. It's useful when differences in magnitude are important but the components represent independent dimensions. - ```sql + ```postgresql SELECT pgml.normalize_l1(embedding) FROM documents; ``` - **L2 Normalization (Euclidean Norm)**: Scales the vector so that the sum of the squares of its components is equal to 1. This is particularly important for cosine similarity calculations in machine learning. - ```sql + ```postgresql SELECT pgml.normalize_l2(embedding) FROM documents; ``` - **Max Normalization**: Scales the vector such that the maximum absolute value of any component is 1. This normalization is less common but can be useful when the maximum value represents a bounded capacity. - ```sql + ```postgresql SELECT pgml.normalize_max(embedding) FROM documents; ``` ## Querying and Using Normalized Vectors After normalization, you can use these vectors for various applications, such as similarity searches, clustering, or as input for further machine learning models within PostgresML. -```sql +```postgresql -- Querying for similarity using l2 normalized dot product, which is equivalent to cosine similarity WITH normalized_vectors AS ( SELECT id, pgml.normalize_l2(embedding) AS norm_vector diff --git a/pgml-cms/docs/guides/embeddings/vector-similarity.md b/pgml-cms/docs/guides/embeddings/vector-similarity.md index 1c6f24596..f0fa07a1e 100644 --- a/pgml-cms/docs/guides/embeddings/vector-similarity.md +++ b/pgml-cms/docs/guides/embeddings/vector-similarity.md @@ -65,7 +65,7 @@ An optimized version is provided by: !!! code_block time="1191.069 ms" -```sql +```postgresql WITH query AS ( SELECT vector FROM test_data @@ -131,7 +131,7 @@ An optimized version is provided by: !!! code_block time="1359.114 ms" -```sql +```postgresql WITH query AS ( SELECT vector FROM test_data @@ -197,7 +197,7 @@ An optimized version is provided by: !!! 
code_block time="498.649 ms" -```sql +```postgresql WITH query AS ( SELECT vector FROM test_data @@ -287,7 +287,7 @@ The optimized version is provided by: !!! code_block time="508.587 ms" -```sql +```postgresql WITH query AS ( SELECT vector FROM test_data @@ -304,7 +304,7 @@ Or you could reverse order by `cosine_similarity` for the same ranking: !!! code_block time="502.461 ms" -```sql +```postgresql WITH query AS ( SELECT vector FROM test_data @@ -325,7 +325,7 @@ You should benchmark and compare the computational cost of these distance metric !!! code_block -```sql +```postgresql \timing on ``` @@ -333,7 +333,7 @@ You should benchmark and compare the computational cost of these distance metric !!! code_block -```sql +```postgresql CREATE TABLE test_data ( id BIGSERIAL NOT NULL, vector FLOAT4[] @@ -346,7 +346,7 @@ Insert 10k vectors, that have 1k dimensions each !!! code_block -```sql +```postgresql INSERT INTO test_data (vector) SELECT array_agg(random()) FROM generate_series(1,10000000) i diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md b/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md index 29b22b684..850f73b6e 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md @@ -73,6 +73,6 @@ We took our export command and changed `TO` to `FROM`, and that's it. Make sure If your data changed, repeat this process again. To avoid duplicate entries in your table, you can truncate (or delete) all rows beforehand: -```sql +```postgresql TRUNCATE your_table; ``` diff --git a/pgml-cms/docs/product/vector-database.md b/pgml-cms/docs/product/vector-database.md index 00086dae2..825b24eaa 100644 --- a/pgml-cms/docs/product/vector-database.md +++ b/pgml-cms/docs/product/vector-database.md @@ -48,7 +48,7 @@ At first, the column is empty. To generate embeddings, we can use the PostgresML {% tabs %} {% tab title="SQL" %} -```sql +```postgresql UPDATE usa_house_prices SET embedding = pgml.embed( @@ -72,7 +72,7 @@ That's it. We just created 5,000 embeddings of the values stored in the address {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT address, (embedding::real[])[1:5] @@ -115,7 +115,7 @@ For example, if we wanted to find three closest matching addresses to `1 Infinit {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT address FROM usa_house_prices @@ -206,7 +206,7 @@ CREATE INDEX {% tabs %} {% tab title="SQL" %} -```sql +```postgresql EXPLAIN SELECT address FROM usa_house_prices diff --git a/pgml-cms/docs/resources/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md b/pgml-cms/docs/resources/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md index 86e48ed91..1b74e60e4 100644 --- a/pgml-cms/docs/resources/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md +++ b/pgml-cms/docs/resources/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md @@ -25,7 +25,7 @@ You can select the data type for torch tensors in PostgresML by setting the `tor !!! code\_block time="4584.906 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "model": "tiiuae/falcon-7b-instruct", @@ -86,7 +86,7 @@ PostgresML will automatically use GPTQ or GGML when a HuggingFace model has one !!! code\_block time="281.213 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -117,7 +117,7 @@ SELECT pgml.transform( !!! 
code\_block time="252.213 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -148,7 +148,7 @@ SELECT pgml.transform( !!! code\_block time="279.888 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -185,7 +185,7 @@ We can specify the CPU by passing a `"device": "cpu"` argument to the `task`. !!! code\_block time="266.997 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -217,7 +217,7 @@ SELECT pgml.transform( !!! code\_block time="33224.136 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -255,7 +255,7 @@ HuggingFace and these libraries have a lot of great models. Not all of these mod !!! code\_block time="3411.324 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -287,7 +287,7 @@ SELECT pgml.transform( !!! code\_block time="4198.817 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -319,7 +319,7 @@ SELECT pgml.transform( !!! code\_block time="4198.817 ms" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -353,7 +353,7 @@ Many of these models are published with multiple different quantization methods !!! code\_block time="6498.597" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", @@ -391,7 +391,7 @@ Shoutout to [Tostino](https://github.com/Tostino/) for the extended example belo !!! code\_block time="3784.565" -```sql +```postgresql SELECT pgml.transform( task => '{ "task": "text-generation", diff --git a/pgml-cms/docs/resources/benchmarks/mindsdb-vs-postgresml.md b/pgml-cms/docs/resources/benchmarks/mindsdb-vs-postgresml.md index 414068275..c82d4eea1 100644 --- a/pgml-cms/docs/resources/benchmarks/mindsdb-vs-postgresml.md +++ b/pgml-cms/docs/resources/benchmarks/mindsdb-vs-postgresml.md @@ -82,7 +82,7 @@ For both implementations, we can just pass in our data as part of the query for !!! code\_block time="4769.337 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I am so excited to benchmark deep learning models in SQL. I can not wait to see the results!' @@ -112,7 +112,7 @@ The first time `transform` is run with a particular model name, it will download !!! code\_block time="45.094 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'I don''t really know if 5 seconds is fast or slow for deep learning. How much time is spent downloading vs running the model?' @@ -142,7 +142,7 @@ SELECT pgml.transform( !!! code\_block time="165.036 ms" -```sql +```postgresql SELECT pgml.transform( inputs => ARRAY[ 'Are GPUs really worth it? Sometimes they are more expensive than the rest of the computer combined.' 
@@ -197,7 +197,7 @@ psql postgres://mindsdb:123@127.0.0.1:55432 And turn timing on to see how long it takes to run the same query: -```sql +```postgresql \timing on ``` diff --git a/pgml-cms/docs/resources/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md b/pgml-cms/docs/resources/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md index be782c0b6..c5812fd56 100644 --- a/pgml-cms/docs/resources/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md +++ b/pgml-cms/docs/resources/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md @@ -162,7 +162,7 @@ Data used for training and inference is available [here](https://static.postgres PostgresML model is trained with: -```sql +```postgresql SELECT * FROM pgml.train( project_name => 'r2', algorithm => 'xgboost', diff --git a/pgml-cms/docs/resources/data-storage-and-retrieval/README.md b/pgml-cms/docs/resources/data-storage-and-retrieval/README.md index 1ddb89b90..f3a995a4a 100644 --- a/pgml-cms/docs/resources/data-storage-and-retrieval/README.md +++ b/pgml-cms/docs/resources/data-storage-and-retrieval/README.md @@ -73,7 +73,7 @@ If you're writing your own application to ingest large amounts of data into Post Querying data stored in tables is what makes PostgresML so powerful. Postgres has one of the most comprehensive querying languages of all databases we've worked with so, for our example, we won't have any trouble calculating some statistics: -```sql +```postgresql SELECT count(*), avg("Avg. Area Income"), @@ -97,7 +97,7 @@ The SQL language is expressive and allows to select, filter and aggregate any nu Because databases store data permanently, adding more data to Postgres can be done in many ways. The simplest and most common way is to just insert it into a table you already have. Using the same example dataset, we can add a new row with just one query: -```sql +```postgresql INSERT INTO usa_house_prices ( "Avg. Area Income", "Avg. Area House Age", @@ -159,7 +159,7 @@ Looking at the USA House Prices dataset, we can find its natural key pretty easi To ensure that our table reflects this, let's add a unique index: -```sql +```postgresql CREATE UNIQUE INDEX ON usa_house_prices USING btree("Address"); ``` @@ -182,7 +182,7 @@ Once the dataset gets large enough, and we're talking millions of rows, it's no Postgres automatically uses indexes when possible and optimal to do so. From our example, if we filter the dataset by the "Address" column, Postgres will use the index we created and return a result quickly: -```sql +```postgresql SELECT "Avg. 
Area House Age", "Address" diff --git a/pgml-cms/docs/resources/data-storage-and-retrieval/documents.md b/pgml-cms/docs/resources/data-storage-and-retrieval/documents.md index 2182a8550..e45314c78 100644 --- a/pgml-cms/docs/resources/data-storage-and-retrieval/documents.md +++ b/pgml-cms/docs/resources/data-storage-and-retrieval/documents.md @@ -8,7 +8,7 @@ In Postgres, documents are normally stored in regular tables using the `JSONB` d If you're used to document databases like Mongo or Couch, you can replicate the same format and API in Postgres with just a single table: -```sql +```postgresql CREATE TABLE documents ( id BIGSERIAL PRIMARY KEY, document JSONB @@ -19,7 +19,7 @@ CREATE TABLE documents ( To insert a document into our table, you can just use a regular insert query: -```sql +```postgresql INSERT INTO documents ( document ) VALUES ('{"hello": "world", "values": [1, 2, 3, 4]}') @@ -32,7 +32,7 @@ This query will insert the document `{"hello": "world"}` and return its ID to th To get a document by it's ID, you can just select it from the same table, for example: -```sql +```postgresql SELECT document FROM documents WHERE id = 1; ``` @@ -52,7 +52,7 @@ The `id` column is a primary key, which gives it an index automatically. Any fet For example, if we want to fetch all documents that have a key `hello` and the value of that key `world`, we can do so: -```sql +```postgresql SELECT id, document->>'values' @@ -63,7 +63,7 @@ WHERE or if we wanted to fetch the first value inside an array stored in a `values` key, we can: -```sql +```postgresql SELECT document #>> '{values, 0}' FROM documents @@ -77,13 +77,13 @@ WHERE Most key/value databases expect its users to only use primary keys for retrieval. In the real world, things are not always that easy. Postgres makes very few assumptions about how its users interact with JSON data, and allows indexing its top level data structure for fast access: -```sql +```postgresql CREATE INDEX ON documents USING gin(document jsonb_path_ops); ``` When searching the documents for matches, Postgres will now use a much faster GIN index and give us results quickly: -```sql +```postgresql SELECT * FROM diff --git a/pgml-cms/docs/resources/data-storage-and-retrieval/partitioning.md b/pgml-cms/docs/resources/data-storage-and-retrieval/partitioning.md index 78f91279c..0e12409ed 100644 --- a/pgml-cms/docs/resources/data-storage-and-retrieval/partitioning.md +++ b/pgml-cms/docs/resources/data-storage-and-retrieval/partitioning.md @@ -26,7 +26,7 @@ In Postgres, you can create a partition by range with just a few queries. Partit Let's start with the parent table: -```sql +```postgresql CREATE TABLE energy_consumption ( "Datetime" TIMESTAMPTZ, "AEP_MW" REAL @@ -35,7 +35,7 @@ CREATE TABLE energy_consumption ( Now, let's add a couple child tables: -```sql +```postgresql CREATE TABLE energy_consumption_2004_2011 PARTITION OF energy_consumption FOR VALUES FROM ('2004-01-01') TO ('2011-12-31'); @@ -74,7 +74,7 @@ Postgres allows to query each partition individually, which is nice if we know w To make reading this data user-friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible: -```sql +```postgresql SELECT avg("AEP_MW") FROM energy_consumption @@ -110,7 +110,7 @@ Partitioning by hash, unlike by range, can be applied to any data type, includin To create a table partitioned by hash, the syntax is similar to partition by range. 
Let's use the USA House Prices dataset we used in [Vectors](../../product/vector-database.md) and [Tabular data](README.md), and split that table into two (2) roughly equal parts. Since we already have the `usa_house_prices` table, let's create a new one with the same columns, except this one will be partitioned: -```sql +```postgresql CREATE TABLE usa_house_prices_partitioned ( "Avg. Area Income" REAL NOT NULL, "Avg. Area House Age" REAL NOT NULL, @@ -124,7 +124,7 @@ CREATE TABLE usa_house_prices_partitioned ( Let's add two (2) partitions by hash. Hashing uses modulo arithmetic; when creating a child data table with these scheme, you need to specify the denominator and the remainder: -```sql +```postgresql CREATE TABLE usa_house_prices_partitioned_1 PARTITION OF usa_house_prices_partitioned FOR VALUES WITH (modulus 2, remainder 0); @@ -136,7 +136,7 @@ FOR VALUES WITH (modulus 2, remainder 1); Importing data into the new table can be done with just one query: -```sql +```postgresql INSERT INTO usa_house_prices_partitioned SELECT * FROM usa_houses_prices; ``` @@ -196,7 +196,7 @@ unpigz amazon_reviews_with_embeddings.csv.gz Let's get started by creating a partitioned table with three (3) child partitions. We'll be using hash partitioning on the `review_body` column which should produce three (3) roughly equally sized tables. -```sql +```postgresql CREATE TABLE amazon_reviews_with_embedding ( review_body TEXT, review_embedding_e5_large VECTOR(1024) @@ -232,7 +232,7 @@ If you're doing this with `psql`, open up three (3) terminal tabs, connect to yo {% tabs %} {% tab title="Tab 1" %} -```sql +```postgresql SET maintenance_work_mem TO '2GB'; CREATE INDEX ON @@ -242,7 +242,7 @@ USING hnsw(review_embedding_e5_large vector_cosine_ops); {% endtab %} {% tab title="Tab 2" %} -```sql +```postgresql SET maintenance_work_mem TO '2GB'; CREATE INDEX ON @@ -252,7 +252,7 @@ USING hnsw(review_embedding_e5_large vector_cosine_ops); {% endtab %} {% tab title="Tab 3" %} -```sql +```postgresql SET maintenance_work_mem TO '2GB'; CREATE INDEX ON @@ -268,7 +268,7 @@ This is an example of scaling vector search using partitions. We are increasing To perform an ANN search using the indexes we created, we don't have to do anything special. Postgres will automatically scan all three (3) indexes for the closest matches and combine them into one result: -```sql +```postgresql SELECT review_body, review_embedding_e5_large <=> pgml.embed( diff --git a/pgml-cms/docs/resources/developer-docs/contributing.md b/pgml-cms/docs/resources/developer-docs/contributing.md index b5d53f55d..a739d5ac9 100644 --- a/pgml-cms/docs/resources/developer-docs/contributing.md +++ b/pgml-cms/docs/resources/developer-docs/contributing.md @@ -117,13 +117,13 @@ That's it, PostgresML is ready. 
You can validate the installation by running: {% tabs %} {% tab title="SQL" %} -```sql +```postgresql SELECT pgml.version(); ``` {% endtab %} {% tab title="Output" %} -```sql +```postgresql postgres=# select pgml.version(); version ------------------- @@ -135,7 +135,7 @@ postgres=# select pgml.version(); Basic extension usage: -```sql +```postgresql SELECT * FROM pgml.load_dataset('diabetes'); SELECT * FROM pgml.train('Project name', 'regression', 'pgml.diabetes', 'target', 'xgboost'); SELECT target, pgml.predict('Project name', ARRAY[age, sex, bmi, bp, s1, s2, s3, s4, s5, s6]) FROM pgml.diabetes LIMIT 10; diff --git a/pgml-cms/docs/resources/developer-docs/gpu-support.md b/pgml-cms/docs/resources/developer-docs/gpu-support.md index 0e6e86034..f9176fd17 100644 --- a/pgml-cms/docs/resources/developer-docs/gpu-support.md +++ b/pgml-cms/docs/resources/developer-docs/gpu-support.md @@ -26,7 +26,7 @@ GPU setup for XGBoost is covered in the [documentation](https://xgboost.readthed !!! example -```sql +```postgresql pgml.train( 'GPU project', algorithm => 'xgboost', @@ -42,7 +42,7 @@ GPU setup for LightGBM is covered in the [documentation](https://lightgbm.readth !!! example -```sql +```postgresql pgml.train( 'GPU project', algorithm => 'lightgbm', diff --git a/pgml-cms/docs/resources/developer-docs/self-hosting/replication.md b/pgml-cms/docs/resources/developer-docs/self-hosting/replication.md index 411ed844d..fa189e745 100644 --- a/pgml-cms/docs/resources/developer-docs/self-hosting/replication.md +++ b/pgml-cms/docs/resources/developer-docs/self-hosting/replication.md @@ -50,7 +50,7 @@ archive_command = 'pgbackrest --stanza=main archive-push %p' Postgres requires that a user with replication permissions is used for replicas to connect to the primary. To create this user, login as a superuser and run: -```sql +```postgresql CREATE ROLE replication_user PASSWORD '' LOGIN REPLICATION; ``` diff --git a/pgml-cms/docs/use-cases/embeddings/README.md b/pgml-cms/docs/use-cases/embeddings/README.md index 900ae6c9f..1906c7873 100644 --- a/pgml-cms/docs/use-cases/embeddings/README.md +++ b/pgml-cms/docs/use-cases/embeddings/README.md @@ -18,7 +18,7 @@ For a deeper dive, check out the following articles we've written illustrating t ### API -```sql +```postgresql pgml.embed( transformer TEXT, -- huggingface sentence-transformer name text TEXT, -- input to embed @@ -30,13 +30,13 @@ pgml.embed( Let's use the `pgml.embed` function to generate embeddings for tweets, so we can find similar ones. We will use the `distilbert-base-uncased` model. This model is a small version of the `bert-base-uncased` model. It is a good choice for short texts like tweets. To start, we'll load a dataset that provides tweets classified into different topics. -```sql +```postgresql SELECT pgml.load_dataset('tweet_eval', 'sentiment'); ``` View some tweets and their topics. -```sql +```postgresql SELECT * FROM pgml.tweet_eval LIMIT 10; @@ -44,7 +44,7 @@ LIMIT 10; Get a preview of the embeddings for the first 10 tweets. This will also download the model and cache it for reuse, since it's the first time we've used it. -```sql +```postgresql SELECT text, pgml.embed('distilbert-base-uncased', text) FROM pgml.tweet_eval LIMIT 10; @@ -52,7 +52,7 @@ LIMIT 10; It will take a few minutes to generate the embeddings for the entire dataset. We'll save the results to a new table. 
-```sql +```postgresql CREATE TABLE tweet_embeddings AS SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding FROM pgml.tweet_eval; @@ -60,7 +60,7 @@ FROM pgml.tweet_eval; Now we can use the embeddings to find similar tweets. We'll use the `pgml.cosign_similarity` function to find the tweets that are most similar to a given tweet (or any other text input). -```sql +```postgresql WITH query AS ( SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney') AS embedding ) @@ -75,7 +75,7 @@ On small datasets (<100k rows), a linear search that compares every row to the q * [Cube](https://www.postgresql.org/docs/current/cube.html) is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size. * [PgVector](https://github.com/pgvector/pgvector) supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors. -```sql +```postgresql CREATE EXTENSION vector; CREATE TABLE items (text TEXT, embedding VECTOR(768)); INSERT INTO items SELECT text, embedding FROM tweet_embeddings; diff --git a/pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md b/pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md index 5a6f20cef..0fde75c55 100644 --- a/pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md +++ b/pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md @@ -14,7 +14,7 @@ Our search application will start with a **documents** table. Our documents have !!! code\_block time="10.493 ms" -```sql +```postgresql CREATE TABLE documents ( id BIGSERIAL PRIMARY KEY, title TEXT, @@ -32,7 +32,7 @@ We can add new documents to our _text corpus_ with the standard SQL `INSERT` sta !!! code\_block time="3.417 ms" -```sql +```postgresql INSERT INTO documents (title, body) VALUES ('This is a title', 'This is the body of the first document.'), ('This is another title', 'This is the body of the second document.'), @@ -57,7 +57,7 @@ You can configure the grammatical rules in many advanced ways, but we'll use the !!! code\_block time="0.651 ms" -```sql +```postgresql SELECT * FROM documents WHERE to_tsvector('english', body) @@ to_tsquery('english', 'second'); @@ -87,7 +87,7 @@ The first step is to store the `tsvector` in the table, so we don't have to gene !!! code\_block time="17.883 ms" -```sql +```postgresql ALTER TABLE documents ADD COLUMN title_and_body_text tsvector GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body )) STORED; @@ -103,7 +103,7 @@ One nice aspect of generated columns is that they will backfill the data for exi !!! code\_block time="5.145 ms" -```sql +```postgresql CREATE INDEX documents_title_and_body_text_index ON documents USING GIN (title_and_body_text); @@ -119,7 +119,7 @@ And now, we'll demonstrate a slightly more complex `tsquery`, that requires both !!! code\_block time="3.673 ms" -```sql +```postgresql SELECT * FROM documents WHERE title_and_body_text @@ to_tsquery('english', 'another & second'); @@ -149,7 +149,7 @@ With multiple query terms OR `|` together, the `ts_rank` will add the numerators !!! code\_block time="0.561 ms" -```sql +```postgresql SELECT ts_rank(title_and_body_text, to_tsquery('english', 'second | title')), * FROM documents ORDER BY ts_rank DESC; @@ -179,7 +179,7 @@ A quick improvement we could make to our search query would be to differentiate !!! 
code\_block time="0.561 ms" -```sql +```postgresql SELECT ts_rank(title, to_tsquery('english', 'second | title')) AS title_rank, ts_rank(body, to_tsquery('english', 'second | title')) AS body_rank, @@ -208,7 +208,7 @@ First things first, we need to record some user clicks on our search results. We !!! code\_block time="0.561 ms" -```sql +```postgresql CREATE TABLE search_result_clicks ( title_rank REAL, body_rank REAL, @@ -228,7 +228,7 @@ I've made up 4 example searches, across our 3 documents, and recorded the `ts_ra !!! code\_block time="2.161 ms" -```sql +```postgresql INSERT INTO search_result_clicks (title_rank, body_rank, clicked) VALUES @@ -267,7 +267,7 @@ Here goes some machine learning: !!! code\_block time="6.867 ms" -```sql +```postgresql SELECT * FROM pgml.train( project_name => 'Search Ranking', task => 'regression', @@ -314,7 +314,7 @@ Once a model is trained, you can use `pgml.predict` to use it on new inputs. `pg !!! code\_block time="3.119 ms" -```sql +```postgresql SELECT clicked, pgml.predict('Search Ranking', array[title_rank, body_rank]) @@ -367,7 +367,7 @@ It's nice to organize the query into logical steps, and we can use **Common Tabl !!! code\_block time="2.118 ms" -```sql +```postgresql WITH first_pass_ranked_documents AS ( SELECT -- Compute the ts_rank for the title and body text of each document diff --git a/pgml-cms/docs/use-cases/supervised-learning.md b/pgml-cms/docs/use-cases/supervised-learning.md index 8dcf59dd9..6d7b4dc2d 100644 --- a/pgml-cms/docs/use-cases/supervised-learning.md +++ b/pgml-cms/docs/use-cases/supervised-learning.md @@ -8,7 +8,7 @@ description: A machine learning approach that uses labeled data A large part of the machine learning workflow is acquiring, cleaning, and preparing data for training algorithms. Naturally, we think Postgres is a great place to store your data. For the purpose of this example, we'll load a toy dataset, the classic handwritten digits image collection, from scikit-learn. -```sql +```postgresql SELECT * FROM pgml.load_dataset('digits'); ``` @@ -25,7 +25,7 @@ This `NOTICE` can safely be ignored. PostgresML attempts to do a clean reload by PostgresML loaded the Digits dataset into the `pgml.digits` table. You can examine the 2D arrays of image data, as well as the label in the `target` column: -```sql +```postgresql SELECT target, image @@ -48,7 +48,7 @@ target | Now that we've got data, we're ready to train a model using an algorithm. We'll start with the default `linear` algorithm to demonstrate the basics. See the [Algorithms](../../../docs/training/algorithm\_selection/) for a complete list of available algorithms. -```sql +```postgresql SELECT * FROM pgml.train( 'Handwritten Digit Image Classifier', 'classification', @@ -85,7 +85,7 @@ The output gives us information about the training run, including the `deployed` Now we can inspect some of the artifacts a training run creates. -```sql +```postgresql SELECT * FROM pgml.overview; ``` @@ -105,7 +105,7 @@ The `pgml.predict()` function is the key value proposition of PostgresML. It pro The API for predictions is very simple and only requires two arguments: the project name and the features used for prediction. 
-```sql +```postgresql select pgml.predict ( project_name TEXT, features REAL[] @@ -154,7 +154,7 @@ LIMIT 25; If you've already been through the [Training Overview](../../../docs/training/overview/), you can see the results of those efforts: -```sql +```postgresql SELECT target, pgml.predict('Handwritten Digit Image Classifier', image) AS prediction @@ -182,7 +182,7 @@ LIMIT 10; Since it's so easy to train multiple algorithms with different hyperparameters, sometimes it's a good idea to know which deployed model is used to make predictions. You can find that out by querying the `pgml.deployed_models` view: -```sql +```postgresql SELECT * FROM pgml.deployed_models; ``` @@ -201,7 +201,7 @@ Take a look at [Deploying Models](../../../docs/predictions/deployments/) docume You may also specify a model\_id to predict rather than a project name, to use a particular training run. You can find model ids by querying the `pgml.models` table. -```sql +```postgresql SELECT models.id, models.algorithm, models.metrics FROM pgml.models JOIN pgml.projects @@ -220,7 +220,7 @@ recision": 0.9175060987472534, "score_time": 0.019625699147582054} For example, making predictions with `model_id = 1`: -```sql +```postgresql SELECT target, pgml.predict(1, image) AS prediction diff --git a/pgml-dashboard/static/images/gym/quick_start.md b/pgml-dashboard/static/images/gym/quick_start.md index 3662b0c45..026d8ddf8 100644 --- a/pgml-dashboard/static/images/gym/quick_start.md +++ b/pgml-dashboard/static/images/gym/quick_start.md @@ -25,7 +25,7 @@ Once you have your PostgresML instance running, we'll be ready to get started. The first part of machine learning is getting your data in a format you can use. That's usually the hardest part, but thankfully we have a few example datasets we can use. To load one of them, navigate to the IDE tab and run this query: -```sql +```postgresql SELECT * FROM pgml.load_dataset('diabetes'); ``` @@ -46,7 +46,7 @@ To load them into PostgresML, use the same function above with the desired datas The SQL editor you just used can run arbitrary queries on the PostgresML instance. For example, if we want to see what dataset we just loaded looks like, we can run: -```sql +```postgresql SELECT * FROM pgml.diabetes LIMIT 5; ``` @@ -78,7 +78,7 @@ PostgresML organizes itself into projects. A project is just a name for model(s) Using the IDE, run: -```sql +```postgresql SELECT * FROM pgml.train( 'My First Project', task => 'regression', @@ -106,7 +106,7 @@ Inference is the act of predicting labels that we haven't necessarily used in tr Let's try and predict some new values. Using the IDE, run: -```sql +```postgresql SELECT pgml.predict( 'My First Project', ARRAY[ @@ -130,7 +130,7 @@ You should see something like this: The `prediction` column represents the possible value of the `target` column given the new features we just passed into the `pgml.predict()` function. You can just as easily predict multiple points and compare them to the actual labels in the dataset: -```sql +```postgresql SELECT pgml.predict('My First Project 2', ARRAY[ age, sex, bmi, bp, s1, s3, s3, s4, s5, s6 @@ -151,7 +151,7 @@ As you can see, we automatically performed some analysis on the data. Visualizin XGBoost is a good algorithm, but what if there are better ones? Let's try training a few more using the IDE. Run these one at a time: -```sql +```postgresql -- Simple linear regression. 
SELECT * FROM pgml.train( 'My First Project', diff --git a/pgml-extension/examples/regression.sql b/pgml-extension/examples/regression.sql index e355b6393..5b4a05390 100644 --- a/pgml-extension/examples/regression.sql +++ b/pgml-extension/examples/regression.sql @@ -125,3 +125,11 @@ SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm'); SELECT target, pgml.predict('Diabetes Progression', ARRAY[age, sex, bmi, bp, s1, s2, s3, s4, s5, s6]) AS prediction FROM pgml.diabetes LIMIT 10; + +begin; +delete from pgml.models; +delete from pgml.projects; +delete from pgml.snapshots; +delete from pgml.files; +delete from pgml.deployments; +commit; diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs index 623d8f872..14efde32b 100644 --- a/pgml-extension/src/api.rs +++ b/pgml-extension/src/api.rs @@ -619,7 +619,7 @@ pub fn embed_batch( /// Returns `true` if the GPU cache was successfully cleared, `false` otherwise. /// # Example /// -/// ```sql +/// ```postgresql /// SELECT pgml.clear_gpu_cache(memory_usage => 0.5); /// ``` #[cfg(all(feature = "python", not(feature = "use_as_lib")))] From a424db0ec6924564fe29ac97c42eeb07b6907bf6 Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 24 May 2024 10:31:34 -0700 Subject: [PATCH 09/11] cleanup --- pgml-cms/docs/SUMMARY.md | 33 ++--- .../docs/api/sql-extension/pgml.decompose.md | 10 +- .../{use-cases => guides}/chatbots/README.md | 0 pgml-cms/docs/guides/embeddings/README.md | 15 +- .../embeddings/dimensionality-reduction.md | 140 +++++++++++++++++- .../embeddings/in-database-generation.md | 3 +- .../guides/embeddings/proprietary-models.md | 0 ...ve-search-results-with-machine-learning.md | 0 .../natural-language-processing.md | 0 .../{use-cases => guides}/opensourceai.md | 0 .../supervised-learning.md | 0 pgml-cms/docs/use-cases/README.md | 1 + pgml-cms/docs/use-cases/test-hello.md | 1 - .../navigation/left_nav/docs/template.html | 2 +- pgml-extension/examples/regression.sql | 10 +- 15 files changed, 174 insertions(+), 41 deletions(-) rename pgml-cms/docs/{use-cases => guides}/chatbots/README.md (100%) create mode 100644 pgml-cms/docs/guides/embeddings/proprietary-models.md rename pgml-cms/docs/{use-cases => guides}/improve-search-results-with-machine-learning.md (100%) rename pgml-cms/docs/{use-cases => guides}/natural-language-processing.md (100%) rename pgml-cms/docs/{use-cases => guides}/opensourceai.md (100%) rename pgml-cms/docs/{use-cases => guides}/supervised-learning.md (100%) create mode 100644 pgml-cms/docs/use-cases/README.md delete mode 100644 pgml-cms/docs/use-cases/test-hello.md diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 5fe8dad33..e207aa1be 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -55,12 +55,21 @@ * [Embeddings](guides/embeddings/README.md) * [In-database Generation](guides/embeddings/in-database-generation.md) - * [Dimensionality Reduction](uides/embeddings/dimensionality-reduction.md) - * [Re-ranking nearest neighbors](uides/embeddings/re-ranking-nearest-neighbors.md) - * [Indexing w/ pgvector]() + * [Dimensionality Reduction](guides/embeddings/dimensionality-reduction.md) * [Aggregation](guides/embeddings/vector-aggregation.md) * [Similarity](guides/embeddings/vector-similarity.md) - * [Normalization](guides/embeddings/vector-normalization.md) + * [Normalization](guides/embeddings/vector-normalization.md) + + + +* [Search](guides/improve-search-results-with-machine-learning.md) +* [Chatbots](guides/chatbots/README.md) + * [Example 
Application](use-cases/chatbots.md) +* [Supervised Learning](guides/supervised-learning.md) +* [OpenSourceAI](guides/opensourceai.md) +* [Natural Language Processing](guides/natural-language-processing.md) + + ## Product @@ -113,18 +124,6 @@ * [Installation](product/pgcat/installation.md) * [Configuration](product/pgcat/configuration.md) -## Use Cases - -* [OpenSourceAI](use-cases/opensourceai.md) -* [Chatbots](use-cases/chatbots/README.md) - * [Example Application](use-cases/chatbots.md) -* [Search](use-cases/improve-search-results-with-machine-learning.md) -* [Embeddings](use-cases/embeddings/README.md) - * [Generating LLM embeddings with open source models](use-cases/embeddings/generating-llm-embeddings-with-open-source-models-in-postgresml.md) - * [Tuning vector recall while generating query embeddings in the database](use-cases/embeddings/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md) - * [Personalize embedding results with application data in your database](use-cases/embeddings/personalize-embedding-results-with-application-data-in-your-database.md) -* [Supervised Learning](use-cases/supervised-learning.md) -* [Natural Language Processing](use-cases/natural-language-processing.md) ## Resources diff --git a/pgml-cms/docs/api/sql-extension/pgml.decompose.md b/pgml-cms/docs/api/sql-extension/pgml.decompose.md index 94db1ac91..16d4dfd46 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.decompose.md +++ b/pgml-cms/docs/api/sql-extension/pgml.decompose.md @@ -4,7 +4,7 @@ description: Decompose an input vector into it's principal components # pgml.decompose() -Matrix decomposition reduces the number of dimensions in a vector, to improve relevance and reduce computation required. +Matrix decomposition reduces the number of dimensions in a vector, to improve relevance and reduce computation required. ## API @@ -17,10 +17,10 @@ pgml.decompose( ### Parameters -| Parameter | Example | Description | -|----------------|---------------------------------|----------------------------------------------------------| -| `project_name` | `'My First PostgresML Project'` | The project name used to train models in `pgml.train()`. | -| `vector` | `ARRAY[0.1, 0.45, 1.0]` | The feature vector to transform. | +| Parameter | Example | Description | +|----------------|---------------------------------|-------------------------------------------------------------------------| +| `project_name` | `'My First PostgresML Project'` | The project name used to train a decomposition model in `pgml.train()`. | +| `vector` | `ARRAY[0.1, 0.45, 1.0]` | The feature vector to transform. 
| ## Example diff --git a/pgml-cms/docs/use-cases/chatbots/README.md b/pgml-cms/docs/guides/chatbots/README.md similarity index 100% rename from pgml-cms/docs/use-cases/chatbots/README.md rename to pgml-cms/docs/guides/chatbots/README.md diff --git a/pgml-cms/docs/guides/embeddings/README.md b/pgml-cms/docs/guides/embeddings/README.md index 888582144..39557d79f 100644 --- a/pgml-cms/docs/guides/embeddings/README.md +++ b/pgml-cms/docs/guides/embeddings/README.md @@ -21,12 +21,15 @@ This guide will introduce you to the fundamentals of embeddings within PostgresM In this guide, we will cover: * [In-database Generation](guides/embeddings/in-database-generation.md) -* [Dimensionality Reduction]() -* [Re-ranking nearest neighbors]() -* [Indexing w/ pgvector]() -* [Aggregation](guides/embeddings/vector-aggregation) -* [Similarity](guides/embeddings/vector-similarity) -* [Normalization](guides/embeddings/vector-normalization) +* [Dimensionality Reduction](guides/embeddings/dimensionality-reduction.md) +* [Aggregation](guides/embeddings/vector-aggregation.md) +* [Similarity](guides/embeddings/vector-similarity.md) +* [Normalization](guides/embeddings/vector-normalization.md) + ## Embeddings are vectors diff --git a/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md index 7c25516f4..350e4a59f 100644 --- a/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md +++ b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md @@ -1,6 +1,144 @@ - # Dimensionality Reduction +In the case of embedding models trained on large bodies of text, most of the concepts they learn will be unused when dealing with any single piece of text. For collections of documents that deal with specific topics, only a fraction of the language models learned associations will be relevant. Dimensionality reduction is an important technique to improve performance _on your documents_, both in terms of quality and latency for embedding recall using nearest neighbor search. + +## Why Dimensionality Reduction? + +- **Improved Performance**: Reducing the number of dimensions can significantly improve the computational efficiency of machine learning algorithms. +- **Reduced Storage**: Lower-dimensional data requires less storage space. +- **Enhanced Visualization**: It is easier to visualize data in two or three dimensions. + +## What is Matrix Decomposition? +Dimensionality reduction is a key technique in machine learning and data analysis, particularly when dealing with high-dimensional data such as embeddings. A table full of embeddings can be considered a matrix, aka a 2-dimensional array with rows and columns, where the embedding dimensions are the columns. We can use matrix decomposition methods, such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), to reduce the dimensionality of embeddings. + +Matrix decomposition involves breaking down a matrix into simpler, constituent matrices. The most common decomposition techniques for this purpose are: + +- **Principal Component Analysis (PCA)**: Reduces dimensionality by projecting data onto a lower-dimensional subspace that captures the most variance. +- **Singular Value Decomposition (SVD)**: Factorizes a matrix into three matrices, capturing the essential features in a reduced form. 
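+
+To make the matrix picture concrete, here is a minimal sketch (the `my_embeddings` table and its `embedding FLOAT[]` column are hypothetical placeholders for wherever your vectors live): the row count is the number of matrix rows, and the array length is the number of columns that PCA or SVD will reduce.
+
+```postgresql
+-- Hypothetical embeddings table: one FLOAT[] embedding per row.
+-- count(*) gives the matrix rows; array_length() gives the matrix columns,
+-- i.e. the dimensionality that decomposition will shrink.
+SELECT count(*) AS matrix_rows,
+       max(array_length(embedding, 1)) AS matrix_columns
+FROM my_embeddings;
+```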
+ +## Dimensionality Reduction with PostgresML +PostgresML allows in-database execution of matrix decomposition techniques, enabling efficient dimensionality reduction directly within the database environment. + +## Step-by-Step Guide to Using Matrix Decomposition + +1. **Preparing the data** + We'll create a set of embeddings using modern embedding model with 384 dimensions. + + ```postgresql + CREATE TABLE documents_with_embeddings ( + id SERIAL PRIMARY KEY, + body TEXT, + embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED + ); + ``` + + !!! generic + + !!! code_block time="46.823" + + ```postgresql + INSERT INTO documents_with_embeddings (body) + VALUES -- embedding vectors are automatically generated + ('Example text data'), + ('Another example document'), + ('Some other thing'); + ``` + + !!! + + !!! results + + ```postgresql + INSERT 0 3 + ``` + + !!! + + !!! + +2. + +Ensure that your data is loaded into the Postgres database and is in a suitable format for decomposition. For example, we'll treat the if you have embeddings stored in a table: + +```postgresql +SELECT pgml.load_dataset('digits'); +``` + +-- create an unlabeled table of the images for unsupervised learning +CREATE VIEW pgml.digit_vectors AS +SELECT image FROM pgml.digits; + +-- view the dataset +SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10; + +-- train a simple model to cluster the data +SELECT * FROM pgml.train('Handwritten Digit Components', 'decomposition', 'pgml.digit_vectors', hyperparams => '{"n_components": 3}'); + +-- check out the compenents +SELECT target, pgml.decompose('Handwritten Digit Components', image) AS pca +FROM pgml.digits +LIMIT 10; + + + + + + + + + + + + + + + +```sql +SELECT * FROM embeddings_table; ## Introduction ## Principal Component Analysis + + +# Decomposition + +Models can be trained using `pgml.train` on unlabeled data to identify important features within the data. To decompose a dataset into it's principal components, we can use the table or a view. Since decomposition is an unsupervised algorithm, we don't need a column that represents a label as one of the inputs to `pgml.train`. + +## Example + +This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for principal component analysis. You could do something similar with a vector column. + +```postgresql +SELECT pgml.load_dataset('digits'); + +-- create an unlabeled table of the images for unsupervised learning +CREATE VIEW pgml.digit_vectors AS +SELECT image FROM pgml.digits; + +-- view the dataset +SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10; + +-- train a simple model to cluster the data +SELECT * FROM pgml.train('Handwritten Digit Components', 'decomposition', 'pgml.digit_vectors', hyperparams => '{"n_components": 3}'); + +-- check out the compenents +SELECT target, pgml.decompose('Handwritten Digit Components', image) AS pca +FROM pgml.digits +LIMIT 10; +``` + +Note that the input vectors have been reduced from 64 dimensions to 3, which explain nearly half of the variance across all samples. + +## Algorithms + +All decomposition algorithms implemented by PostgresML are online versions. 
You may use the [pgml.decompose](../../../api/sql-extension/pgml.decompose "mention") function to decompose novel data points after the model has been trained.
+
+| Algorithm                 | Reference                                                                                                             |
+|---------------------------|-----------------------------------------------------------------------------------------------------------------------|
+| `pca`                     | [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)                              |
+
+### Examples
+
+```postgresql
+SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'pca', hyperparams => '{"n_components": 10}');
+```
diff --git a/pgml-cms/docs/guides/embeddings/in-database-generation.md b/pgml-cms/docs/guides/embeddings/in-database-generation.md
index 7f885cbe9..f4e23f174 100644
--- a/pgml-cms/docs/guides/embeddings/in-database-generation.md
+++ b/pgml-cms/docs/guides/embeddings/in-database-generation.md
@@ -214,8 +214,9 @@ FROM documents;
 
 !!!
 
-You can see the near 2.5x speedup when generating 3 embeddings in a batch, because the model weights only need to be streamed from GPU RAM to the processors a single time. You should consider batch sizes from 10-100 embeddings at a time when do bulk operations to improve throughput and reduce costs.
+You can see the near 2.5x speedup when generating 3 embeddings in a batch, because the model weights only need to be streamed from GPU RAM to the processors a single time. You should consider batch sizes from 10-100 embeddings at a time when doing bulk operations to improve throughput and reduce costs. 
 
 ## Scalability
 
 PostgresML serverless instances have access to multiple GPUs that can be used simultaneously across different PostgreSQL connections. For large jobs, you may want to create multiple worker threads/processes that operate across your dataset in batches on their own Postgres Connection. 
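+
+As a minimal sketch of that pattern (the table and column names and the id ranges here are assumptions; adapt them to your schema), each worker opens its own connection and embeds a disjoint slice of rows, so separate connections can be scheduled onto separate GPUs:
+
+```postgresql
+-- Run one statement like this per worker connection, each with its own id range,
+-- e.g. worker 1 takes ids 1-10000, worker 2 takes 10001-20000, and so on.
+UPDATE documents
+SET embedding = pgml.embed('intfloat/e5-small-v2', body)
+WHERE id BETWEEN 1 AND 10000;
+```
+
+Keeping the id ranges disjoint also means the workers never contend for the same row locks while their batches stream through.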
+ diff --git a/pgml-cms/docs/guides/embeddings/proprietary-models.md b/pgml-cms/docs/guides/embeddings/proprietary-models.md new file mode 100644 index 000000000..e69de29bb diff --git a/pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md b/pgml-cms/docs/guides/improve-search-results-with-machine-learning.md similarity index 100% rename from pgml-cms/docs/use-cases/improve-search-results-with-machine-learning.md rename to pgml-cms/docs/guides/improve-search-results-with-machine-learning.md diff --git a/pgml-cms/docs/use-cases/natural-language-processing.md b/pgml-cms/docs/guides/natural-language-processing.md similarity index 100% rename from pgml-cms/docs/use-cases/natural-language-processing.md rename to pgml-cms/docs/guides/natural-language-processing.md diff --git a/pgml-cms/docs/use-cases/opensourceai.md b/pgml-cms/docs/guides/opensourceai.md similarity index 100% rename from pgml-cms/docs/use-cases/opensourceai.md rename to pgml-cms/docs/guides/opensourceai.md diff --git a/pgml-cms/docs/use-cases/supervised-learning.md b/pgml-cms/docs/guides/supervised-learning.md similarity index 100% rename from pgml-cms/docs/use-cases/supervised-learning.md rename to pgml-cms/docs/guides/supervised-learning.md diff --git a/pgml-cms/docs/use-cases/README.md b/pgml-cms/docs/use-cases/README.md new file mode 100644 index 000000000..9b163e6e0 --- /dev/null +++ b/pgml-cms/docs/use-cases/README.md @@ -0,0 +1 @@ +use-cases section is deprecated, and is being refactored into guides, or a new section under product \ No newline at end of file diff --git a/pgml-cms/docs/use-cases/test-hello.md b/pgml-cms/docs/use-cases/test-hello.md deleted file mode 100644 index 18f2549a5..000000000 --- a/pgml-cms/docs/use-cases/test-hello.md +++ /dev/null @@ -1 +0,0 @@ -# Test hekki diff --git a/pgml-dashboard/src/components/navigation/left_nav/docs/template.html b/pgml-dashboard/src/components/navigation/left_nav/docs/template.html index 15390a368..06459e291 100644 --- a/pgml-dashboard/src/components/navigation/left_nav/docs/template.html +++ b/pgml-dashboard/src/components/navigation/left_nav/docs/template.html @@ -3,7 +3,7 @@ match title.to_lowercase().as_str() { "api" => "sdk", "product" => "dashboard", - "use cases" => "account_circle", + "guides" => "menu_book", "resources" => "school", "introduction" => "list_alt", _ => "dashboard", diff --git a/pgml-extension/examples/regression.sql b/pgml-extension/examples/regression.sql index 5b4a05390..dfc469165 100644 --- a/pgml-extension/examples/regression.sql +++ b/pgml-extension/examples/regression.sql @@ -81,7 +81,7 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', h -- gradient boosting SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}'); SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}'); ---SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}'); +-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}'); -- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}'); -- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset -- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}'); @@ -125,11 +125,3 @@ SELECT * 
FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm'); SELECT target, pgml.predict('Diabetes Progression', ARRAY[age, sex, bmi, bp, s1, s2, s3, s4, s5, s6]) AS prediction FROM pgml.diabetes LIMIT 10; - -begin; -delete from pgml.models; -delete from pgml.projects; -delete from pgml.snapshots; -delete from pgml.files; -delete from pgml.deployments; -commit; From 6178765adf595c5c83362ec7c79561709c56e83b Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 24 May 2024 10:53:57 -0700 Subject: [PATCH 10/11] finish article --- .../embeddings/dimensionality-reduction.md | 165 +++++++++--------- 1 file changed, 82 insertions(+), 83 deletions(-) diff --git a/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md index 350e4a59f..ea829ed0b 100644 --- a/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md +++ b/pgml-cms/docs/guides/embeddings/dimensionality-reduction.md @@ -21,124 +21,123 @@ PostgresML allows in-database execution of matrix decomposition techniques, enab ## Step-by-Step Guide to Using Matrix Decomposition -1. **Preparing the data** - We'll create a set of embeddings using modern embedding model with 384 dimensions. - - ```postgresql - CREATE TABLE documents_with_embeddings ( - id SERIAL PRIMARY KEY, - body TEXT, - embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED - ); - ``` - - !!! generic - - !!! code_block time="46.823" - - ```postgresql - INSERT INTO documents_with_embeddings (body) - VALUES -- embedding vectors are automatically generated - ('Example text data'), - ('Another example document'), - ('Some other thing'); - ``` - - !!! - - !!! results - - ```postgresql - INSERT 0 3 - ``` - - !!! - - !!! - -2. - -Ensure that your data is loaded into the Postgres database and is in a suitable format for decomposition. For example, we'll treat the if you have embeddings stored in a table: +### Preparing the data +We'll create a set of embeddings using modern embedding model with 384 dimensions. ```postgresql -SELECT pgml.load_dataset('digits'); +CREATE TABLE documents_with_embeddings ( +id SERIAL PRIMARY KEY, +body TEXT, +embedding FLOAT[] GENERATED ALWAYS AS (pgml.normalize_l2(pgml.embed('intfloat/e5-small-v2', body))) STORED +); ``` + +!!! generic + +!!! code_block time="46.823" + +```postgresql +INSERT INTO documents_with_embeddings (body) +VALUES -- embedding vectors are automatically generated + ('Example text data'), + ('Another example document'), + ('Some other thing'), + ('We need a few more documents'), + ('At least as many documents as dimensions in the reduction'), + ('Which normally isn''t a problem'), + ('Unless you''re typing out a bunch of demo data'); +``` + +!!! + +!!! results + +```postgresql +INSERT 0 3 +``` + +!!! + +!!! --- create an unlabeled table of the images for unsupervised learning -CREATE VIEW pgml.digit_vectors AS -SELECT image FROM pgml.digits; --- view the dataset -SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10; +!!! generic --- train a simple model to cluster the data -SELECT * FROM pgml.train('Handwritten Digit Components', 'decomposition', 'pgml.digit_vectors', hyperparams => '{"n_components": 3}'); +!!! code_block time="14.259ms" --- check out the compenents -SELECT target, pgml.decompose('Handwritten Digit Components', image) AS pca -FROM pgml.digits -LIMIT 10; +```postgresql +CREATE VIEW just_embeddings AS +SELECT embedding +FROM documents_with_embeddings; +``` +!!! +!!! 
results +```postgresql + CREATE VIEW +``` +!!! +!!! +### Decomposition +Models can be trained using `pgml.train` on unlabeled data to identify important features within the data. To decompose a dataset into it's principal components, we can use the table or a view. Since decomposition is an unsupervised algorithm, we don't need a column that represents a label as one of the inputs to `pgml.train`. +Train a simple model to find reduce dimensions for 384, to the 3: +!!! generic +!!! code_block time="48.087 ms" +```postgresql +SELECT * FROM pgml.train('Embedding Components', 'decomposition', 'just_embeddings', hyperparams => '{"n_components": 3}'); +``` +!!! +!!! results +```postgresql +INFO: Metrics: {"cumulative_explained_variance": 0.69496775, "fit_time": 0.008234134, "score_time": 0.001717504} +INFO: Deploying model id: 2 -```sql -SELECT * FROM embeddings_table; -## Introduction + project | task | algorithm | deployed +----------------------+---------------+-----------+---------- + Embedding Components | decomposition | pca | t +``` -## Principal Component Analysis +!!! +!!! -# Decomposition +Note that the input vectors have been reduced from 384 dimensions to 3 that explain 69% of the variance across all samples. That's a more than 100x size reduction, while preserving 69% of the information. These 3 dimensions may be plenty for a course grained first pass ranking with a vector database distance function, like cosine similarity. You can then choose to use the full embeddings, or some other reduction, or the raw text with a reranker model to improve final relevance over the baseline with all the extra time you have now that you've reduced the cost of initial nearest neighbor recall 100x. -Models can be trained using `pgml.train` on unlabeled data to identify important features within the data. To decompose a dataset into it's principal components, we can use the table or a view. Since decomposition is an unsupervised algorithm, we don't need a column that represents a label as one of the inputs to `pgml.train`. +You can check out the components for any vector in this space using the reduction model: -## Example +!!! generic -This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for principal component analysis. You could do something similar with a vector column. +!!! code_block time="14.259ms" ```postgresql -SELECT pgml.load_dataset('digits'); - --- create an unlabeled table of the images for unsupervised learning -CREATE VIEW pgml.digit_vectors AS -SELECT image FROM pgml.digits; - --- view the dataset -SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10; - --- train a simple model to cluster the data -SELECT * FROM pgml.train('Handwritten Digit Components', 'decomposition', 'pgml.digit_vectors', hyperparams => '{"n_components": 3}'); - --- check out the compenents -SELECT target, pgml.decompose('Handwritten Digit Components', image) AS pca -FROM pgml.digits +SELECT pgml.decompose('Embedding Components', embedding) AS pca +FROM just_embeddings LIMIT 10; ``` -Note that the input vectors have been reduced from 64 dimensions to 3, which explain nearly half of the variance across all samples. +!!! -## Algorithms +!!! results -All decomposition algorithms implemented by PostgresML are online versions. 
You may use the [pgml.decompose](../../../api/sql-extension/pgml.decompose "mention") function to decompose novel data points after the model has been trained. +```postgresql + CREATE VIEW +``` -| Algorithm | Reference | -|---------------------------|---------------------------------------------------------------------------------------------------------------------| -| `pca` | [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) | +!!! -### Examples +!!! -```postgresql -SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'pca', hyperparams => '{"n_components": 10}'); -``` +Exercise for the reader: Where is the sweet spot for number of dimensions, yet preserving say, 99% of the relevance data? How much of the cumulative explained variance do you need to preserve 100% to return the top N results for the reranker, if you feed the reranker top K using cosine similarity or another vector distance function? From a6da195d55a1410a316ecb55ffc2b209251ea33f Mon Sep 17 00:00:00 2001 From: Montana Low Date: Fri, 24 May 2024 11:24:52 -0700 Subject: [PATCH 11/11] Update pgml-dashboard/static/css/scss/pages/_docs.scss Co-authored-by: Lev Kokotov --- pgml-dashboard/static/css/scss/pages/_docs.scss | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pgml-dashboard/static/css/scss/pages/_docs.scss b/pgml-dashboard/static/css/scss/pages/_docs.scss index 2fb5aebde..f7a68650e 100644 --- a/pgml-dashboard/static/css/scss/pages/_docs.scss +++ b/pgml-dashboard/static/css/scss/pages/_docs.scss @@ -245,7 +245,7 @@ .cm-gutters { background: inherit; - border-right: 1px solid white; + border-right: 1px solid #{$white}; } .code-highlight {