LLMLingua – Compress Prompts to Speed Up LLMs and Reduce Costs

Have you ever found yourself frustrated by token limits when asking ChatGPT to summarize long texts? Or discouraged by the high costs of the GPT-3.5/4 API despite excellent results? If so, LLMLingua is made for you!

Developed by Microsoft researchers, LLMLingua is a prompt compression tool that accelerates the inference of large language models (LLMs) like GPT-3.5 and GPT-4. Using a small language model to identify and remove non-essential tokens, it can reduce prompt size by up to 20 times while preserving model performance.

Whether you’re a developer looking to optimize API costs or a user wanting to surpass context limits, LLMLingua offers numerous benefits:

💰 Cost Reduction: By compressing both prompts and generated responses, LLMLingua allows significant savings on your API bill.
📝 Extended Context Support: No more “lost in the middle” problem! LLMLingua handles long contexts efficiently and boosts overall performance.
⚖️ Robustness: No additional training of the LLM is required; LLMLingua integrates seamlessly.
🕵️ Knowledge Preservation: All key information from the original prompt, such as in-context learning examples and reasoning steps, is retained.
📜 KV Cache Compression: The inference process is accelerated through key-value cache optimization.
🪃 Full Recovery: GPT-4 can recover all the key information from compressed prompts. Impressive!
Let’s take a simple example and imagine you want to compress the following prompt with LLMLingua:

from llmlingua import PromptCompressor

# By default, this loads a small LLaMA-2-7B model used to score tokens
# (a GPU with enough memory is recommended)
llm_lingua = PromptCompressor()

prompt = "Sam bought a dozen boxes each containing 30 highlighters, for $10 each..."

# compress_prompt returns a dict, not a plain string
result = llm_lingua.compress_prompt(prompt)

print(result["compressed_prompt"])

And there you have it! In just a few lines of code, you get a compressed prompt ready to be sent to your favorite model:

Sam bought boxes each containing 30 highlighters, $10 each.

With a compression rate of 11.2x, the number of tokens goes from 2,365 to just 211! And that’s just the beginning: on more complex examples like Chain-of-Thought prompts, LLMLingua maintains similar performance with compression rates of up to 20x.
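
Those statistics come straight from the object returned by compress_prompt. Here is a quick sketch of how to read them, assuming the dictionary fields documented in the LLMLingua README (check your installed version):

result = llm_lingua.compress_prompt(prompt)

# Besides the compressed text, the result reports token statistics
# (field names assume the LLMLingua README)
print(result["origin_tokens"])      # tokens in the original prompt, e.g. 2365
print(result["compressed_tokens"])  # tokens after compression, e.g. 211
print(result["ratio"])              # overall compression rate, e.g. "11.2x"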

Of course, having tested it thoroughly, I should point out that the response generated from a compressed prompt won’t always be identical to the one from the original prompt. But for a token reduction of 60, 70, or even 80%, the output remains accurate to about 70-80%, which is very good.

To get started with LLMLingua, it’s simple. Install the package with pip:

pip install llmlingua

Then let your creativity flow! Whether you’re a fan of Retrieval Augmented Generation (RAG), online meetings, Chain-of-Thought, or even coding, LLMLingua will meet your needs. Many examples and comprehensive documentation are available to guide you.
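
For instance, in a RAG-style setup you can pass the retrieved documents, the instruction, and the question separately, along with a token budget, so the compressor knows which parts to prioritize. Here is a minimal sketch, assuming the compress_prompt arguments shown in the LLMLingua README; the documents, the question, and the target_token value are placeholders:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# Placeholder retrieved documents; in practice these come from your retriever
context = [
    "...retrieved document 1...",
    "...retrieved document 2...",
]

result = llm_lingua.compress_prompt(
    context,
    instruction="Answer the question based on the context below.",
    question="How many highlighters did Sam buy in total?",
    target_token=200,  # token budget for the compressed prompt
)

print(result["compressed_prompt"])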
