You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: pgml-dashboard/content/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers.md
+6-8Lines changed: 6 additions & 8 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -15,13 +15,12 @@ image_alt: Discrete quantization is not a new idea. It's been used by both algor
15
15
</div>
16
16
</div>
17
17
18
-
Quantization allows PostgresML to fit larger models in less RAM. These algorithms perform inference significantly faster on NVIDIA, Apple and Intel hardware. Half-precision floating point and quantized optimizations are now available for your favorite LLMs downloaded from Huggingface.
19
18
20
19
## Introduction
21
20
22
21
Large Language Models (LLMs) are... large. They have a lot of parameters, which make up the weights and biases of the layers inside deep neural networks. Typically, these parameters are represented by individual 32-bit floating point numbers, so a model like GPT-2 that has 1.5B parameters would need `4 bytes * 1,500,000,000 = 6GB RAM`. The Leading Open Source models like LLaMA, Alpaca, and Guanaco, currently have 65B parameters, which requires about 260GB RAM. This is a lot of RAM, and it's not even counting what's needed to store the input and output data.
23
22
24
-
Bandwidth between RAM and CPU often becomes a bottleneck for performing inference with these models, rather than the number of processing cores or their speed, because the processors become starved for data. One way to reduce the amount of RAM and memory bandwidth needed is to use a smaller datatype, like 16-bit floating point numbers, which would reduce the model size in RAM by half. There are a couple competing 16-bit standards, but NVIDIA has introduced support for bfloat16 in their latest hardware generation, which keeps the full exponential range of float32, but gives up a 2/3rs of the precision. Most research has shown this is a good quality/performance tradeoff, and that model outputs are not terribly sensitive when truncating the least significant bits.
23
+
Bandwidth between RAM and CPU often becomes a bottleneck for performing inference with these models, rather than the number of processing cores or their speed, because the processors become starved for data. One way to reduce the amount of RAM and memory bandwidth needed is to use a smaller datatype, like 16-bit floating point numbers, which would reduce the model size in RAM by half. There are a couple competing 16-bit standards, but NVIDIA has introduced support for bfloat16 in their latest hardware generation, which keeps the full exponential range of float32, but gives up a 2/3rs of the precision. Most research has shown this is a good quality/performance tradeoff, and that model outputs are not terribly sensitive when truncating the least significant bits.
|["Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger. The world was a place of great danger. The world"]|
|[[{"Once upon a time, I'd get angry over the fact that my house was going to have some very dangerous things from outside. To be honest, I know it's going to be"}]]|
|[[{"generated_text": "Once upon a time, we were able, due to our experience at home, to put forward the thesis that we're essentially living life as a laboratory creature with the help of other humans"}]]|
251
+
|[[{"generated_text": "Once upon a time, we were able, due to our experience at home, to put forward the thesis that we're essentially living life as a laboratory creature with the help of other humans"}]]|
| [" Meet Sarah, a strong-willed woman who has always had a passion for adventure. Born and raised in the bustling city of New York, she was no stranger to the hustle and bustle of life in the big apple. However, Sarah longed for something more than the monotonous routine that had become her daily life.\n\nOne day, while browsing through a travel magazine, Sarah stumbled upon an ad for a wildlife conservation program in Africa. Intrigued by the opportunity to make a difference in the world and expand her horizons, she decided to take the leap and apply for the position.\n\nTo her surprise, Sarah was accepted into the program and found herself on a plane bound for the African continent. She spent the next several months living and working among some of the most incredible wildlife she had ever seen. It was during this time that Sarah discovered a love for exploration and a desire to see more of the world.\n\nAfter completing her program, Sarah returned to New York with a newfound sense of purpose and ambition. She was determined to use her experiences to fuel her love for adventure and make the most out of every opportunity that came her way. Whether it was traveling to new destinations or taking on new challenges in her daily life, Sarah was not afraid to step outside of her comfort zone and embrace the unknown.\n\nAnd so, Sarah's journey continued as she made New York her home base for all of her future adventures. She became a role model for others who longed for something more out of life, inspiring them to chase their dreams and embrace the exciting possibilities that lay ahead."] |
433
433
434
-
435
434
!!!
436
435
437
436
!!!
438
437
439
-
440
438
### Conclusion
441
439
442
440
There are many open source LLMs. If you're looking for a list to try, check out [the leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). You can also [search for GPTQ](https://huggingface.co/models?search=gptq) and [GGML](https://huggingface.co/models?search=ggml) versions of those models on the hub to see what is popular in the community. If you're looking for a model that is not available in a quantized format, you can always quantize it yourself. If you're successful, please consider sharing your quantized model with the community!
0 commit comments