Commit 81e80d7

announce 2.6.0 (#767)
1 parent 02b8d2c commit 81e80d7

2 files changed: +7 -9 lines changed

pgml-dashboard/content/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md

Lines changed: 6 additions & 8 deletions
@@ -15,13 +15,12 @@ image_alt: Discrete quantization is not a new idea. It's been used by both algor
 </div>
 </div>
 
-Quantization allows PostgresML to fit larger models in less RAM. These algorithms perform inference significantly faster on NVIDIA, Apple and Intel hardware. Half-precision floating point and quantized optimizations are now available for your favorite LLMs downloaded from Huggingface.
 
 ## Introduction
 
 Large Language Models (LLMs) are... large. They have a lot of parameters, which make up the weights and biases of the layers inside deep neural networks. Typically, these parameters are represented by individual 32-bit floating point numbers, so a model like GPT-2 that has 1.5B parameters would need `4 bytes * 1,500,000,000 = 6GB RAM`. The Leading Open Source models like LLaMA, Alpaca, and Guanaco, currently have 65B parameters, which requires about 260GB RAM. This is a lot of RAM, and it's not even counting what's needed to store the input and output data.
 
-Bandwidth between RAM and CPU often becomes a bottleneck for performing inference with these models, rather than the number of processing cores or their speed, because the processors become starved for data. One way to reduce the amount of RAM and memory bandwidth needed is to use a smaller datatype, like 16-bit floating point numbers, which would reduce the model size in RAM by half. There are a couple competing 16-bit standards, but NVIDIA has introduced support for bfloat16 in their latest hardware generation, which keeps the full exponential range of float32, but gives up a 2/3rs of the precision. Most research has shown this is a good quality/performance tradeoff, and that model outputs are not terribly sensitive when truncating the least significant bits.
+Bandwidth between RAM and CPU often becomes a bottleneck for performing inference with these models, rather than the number of processing cores or their speed, because the processors become starved for data. One way to reduce the amount of RAM and memory bandwidth needed is to use a smaller datatype, like 16-bit floating point numbers, which would reduce the model size in RAM by half. There are a couple competing 16-bit standards, but NVIDIA has introduced support for bfloat16 in their latest hardware generation, which keeps the full exponential range of float32, but gives up a 2/3rs of the precision. Most research has shown this is a good quality/performance tradeoff, and that model outputs are not terribly sensitive when truncating the least significant bits.
 
 | Format | Significand | Exponent |
 |----------|-------------|----------|
@@ -120,7 +119,6 @@ SELECT pgml.transform(
120119
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
121120
| ["Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger. The world was a place of great danger. The world"] |
122121

123-
124122
!!!
125123

126124
!!!
@@ -173,10 +171,11 @@ SELECT pgml.transform(
 args => '{"max_new_tokens": 32}'::JSONB
 );
 ```
+
 !!!
 
-!!! results
-
+!!! results
+
 | transform |
 |----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | [[{"Once upon a time, I'd get angry over the fact that my house was going to have some very dangerous things from outside. To be honest, I know it's going to be"}]] |
@@ -210,6 +209,7 @@ SELECT pgml.transform(
 args => '{"max_new_tokens": 32}'::JSONB
 );
 ```
+
 !!!
 
 !!! results
@@ -248,7 +248,7 @@ SELECT pgml.transform(
 
 | transform |
 |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [[{"generated_text": "Once upon a time, we were able, due to our experience at home, to put forward the thesis that we're essentially living life as a laboratory creature with the help of other humans"}]] |
+| [[{"generated_text": "Once upon a time, we were able, due to our experience at home, to put forward the thesis that we're essentially living life as a laboratory creature with the help of other humans"}]] |
 
 !!!
 
@@ -431,12 +431,10 @@ ASSISTANT:$$
 ||
 | [" Meet Sarah, a strong-willed woman who has always had a passion for adventure. Born and raised in the bustling city of New York, she was no stranger to the hustle and bustle of life in the big apple. However, Sarah longed for something more than the monotonous routine that had become her daily life.\n\nOne day, while browsing through a travel magazine, Sarah stumbled upon an ad for a wildlife conservation program in Africa. Intrigued by the opportunity to make a difference in the world and expand her horizons, she decided to take the leap and apply for the position.\n\nTo her surprise, Sarah was accepted into the program and found herself on a plane bound for the African continent. She spent the next several months living and working among some of the most incredible wildlife she had ever seen. It was during this time that Sarah discovered a love for exploration and a desire to see more of the world.\n\nAfter completing her program, Sarah returned to New York with a newfound sense of purpose and ambition. She was determined to use her experiences to fuel her love for adventure and make the most out of every opportunity that came her way. Whether it was traveling to new destinations or taking on new challenges in her daily life, Sarah was not afraid to step outside of her comfort zone and embrace the unknown.\n\nAnd so, Sarah's journey continued as she made New York her home base for all of her future adventures. She became a role model for others who longed for something more out of life, inspiring them to chase their dreams and embrace the exciting possibilities that lay ahead."] |
 
-
 !!!
 
 !!!
 
-
 ### Conclusion
 
 There are many open source LLMs. If you're looking for a list to try, check out [the leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). You can also [search for GPTQ](https://huggingface.co/models?search=gptq) and [GGML](https://huggingface.co/models?search=ggml) versions of those models on the hub to see what is popular in the community. If you're looking for a model that is not available in a quantized format, you can always quantize it yourself. If you're successful, please consider sharing your quantized model with the community!

pgml-extension/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
 LABEL maintainer="team@postgresml.com"
 
 ARG DEBIAN_FRONTEND=noninteractive
-ARG PGML_VERSION=2.5.3
+ARG PGML_VERSION=2.6.0
 ENV TZ=Etc/UTC
 ENV PATH="/usr/local/cuda/bin:${PATH}"
 