How to Run Quantized AI Models on Precision Workstations



A practical guide to configuring your PC to work effectively with generative AI large language models.

By Matthew Allard

Generative AI (GenAI) has taken the world of computing by storm, and our customers want to start working with large language models (LLMs) to develop innovative new capabilities that drive productivity, efficiency and innovation in their companies. Dell Technologies has the world’s broadest AI infrastructure portfolio that spans from cloud to client devices, all in one place*, providing end-to-end AI solutions and services designed to meet customers wherever they are in their AI journey. Dell also offers hardware solutions engineered to support AI workloads, from workstation PCs (mobile and fixed) to servers for high-performance computing, data storage, cloud-native software-defined infrastructure, networking switches, data protection, HCI and services.

But one of the biggest questions from our customers is how to determine whether a PC can work effectively with a particular LLM. We’ll try to help answer that question and provide some guidance on configuration choices that users should consider when working with GenAI.

First, consider some basics on what it takes to handle an LLM on a PC. While AI routines can be processed on the CPU or on a new class of dedicated AI circuitry called an NPU, NVIDIA RTX GPUs currently hold the pole position for AI processing in PCs with dedicated circuits called Tensor cores. RTX Tensor cores are designed to enable the mixed-precision mathematical computing that is at the heart of AI processing. But performing the math is only part of the story: LLMs add the consideration of available memory space, given their potentially large memory footprint. To maximize AI performance on the GPU, you want the LLM processing to fit into the GPU VRAM. NVIDIA’s line of GPUs scales across both the mobile and fixed workstation offerings to provide options for the number of Tensor cores and the amount of GPU VRAM, so a system can be easily sized to fit. Keep in mind that some fixed workstations can host multiple GPUs, expanding capacities even further.
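To gauge feasibility before downloading anything, you can compare an estimated model footprint against the VRAM actually present. Here is a minimal Python sketch, assuming a CUDA-enabled PyTorch install; the fits_in_vram helper and its 90% headroom factor are illustrative assumptions, not a Dell tool:

```python
# Minimal sketch: compare an estimated model footprint to available GPU VRAM.
# Assumes a CUDA-enabled PyTorch build; `headroom` leaves room for
# activations, the KV cache and CUDA overhead (illustrative, not a rule).
import torch

def fits_in_vram(model_gb: float, device: int = 0, headroom: float = 0.9) -> bool:
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    return model_gb * 1024**3 <= total_bytes * headroom

print(fits_in_vram(model_gb=4.0))  # e.g., a 4-bit 7B-parameter model
```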

An increasing number and variety of LLMs are coming onto the market, but one of the most important considerations for determining hardware requirements is the parameter size of the LLM selected. Take Meta AI’s Llama-2 LLM, which is available in three parameter sizes: 7, 13 and 70 billion parameters. Generally, with higher parameter counts, one can expect greater accuracy from the LLM and greater applicability for general knowledge applications.

Whether a customer’s goal is to take a foundation model and run it as-is for inferencing or to adapt it to a specific use case and data, they need to be aware of the demands the LLM will put on the machine and how best to manage the model. Developing and training a model against a specific use case with customer-specific data is where customers have seen the greatest innovation and return on their AI projects. The largest models can place extreme performance demands on the machine when developing new features and applications, so data scientists have developed approaches that reduce the processing overhead while simultaneously managing the accuracy of the LLM’s output.

Quantization is one of those approaches. It is a technique that reduces the size of an LLM by lowering the numeric precision of its internal parameters (i.e., weights). Reducing the bit precision has two effects: it shrinks the processing footprint and memory requirements, and it can also reduce the output accuracy of the LLM. Quantization is analogous to JPEG image compression: more compression produces smaller, more efficient files, but too much compression can degrade an image beyond what some use cases can tolerate.

Let’s look at an example of how quantizing an LLM can reduce the required GPU memory.
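A back-of-the-envelope estimate is simply the parameter count multiplied by the bits per weight; this counts the weights only, and real-world usage adds activations, the KV cache and framework overhead. A quick sketch of the arithmetic for Llama-2:

```python
# Weight-only memory footprint: parameters × bits per weight.
# Real usage is higher (activations, KV cache, framework overhead).
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

for params in (7, 13, 70):
    print(f"Llama-2 {params}B: "
          f"~{weight_footprint_gb(params, 16):.0f} GB at BF16, "
          f"~{weight_footprint_gb(params, 4):.0f} GB at 4-bit")
# Llama-2 7B: ~13 GB at BF16, ~3 GB at 4-bit
# Llama-2 13B: ~24 GB at BF16, ~6 GB at 4-bit
# Llama-2 70B: ~130 GB at BF16, ~33 GB at 4-bit
```

By this arithmetic, a 4-bit 7B model fits comfortably in an 8 GB mobile GPU, while a 70B model at BF16 calls for multiple high-memory GPUs.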

To put this into practical terms, customers who want to run the Llama-2 model quantized at 4-bit precision have a range of choices across the Dell Precision workstation portfolio.
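As a concrete illustration, here is a minimal sketch of loading Llama-2 7B at 4-bit precision using the Hugging Face transformers library with its bitsandbytes integration. The model ID, the NF4 quantization type and the accelerate-backed device_map setting are assumptions of this sketch, not a Dell-prescribed workflow; note that the Llama-2 weights are gated and require accepting Meta’s license:

```python
# Sketch: load Llama-2 7B with 4-bit weight quantization.
# Assumes transformers, bitsandbytes and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # gated; requires license acceptance
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in BF16 on Tensor cores
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```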

Running at higher precision (BF16) ramps up the requirements, but Dell has solutions that can serve any size of LLM at whatever precision is needed.

Given the potential impact on output accuracy, another technique called fine-tuning can help: it retrains a subset of the LLM’s parameters on your specific data to improve the output for a specific use case. Because fine-tuning adjusts the weights of only some parameters, it can accelerate the training process while improving output accuracy. Combining fine-tuning with quantization can produce application-specific small language models that are ideal to deploy to a broader range of devices with even lower AI processing power requirements. Again, a developer who wants to fine-tune an LLM can be confident using Precision workstations as a sandbox for building GenAI solutions.
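To make that concrete, here is a minimal sketch of LoRA-style parameter-efficient fine-tuning layered on a quantized base model, the combination popularized as QLoRA. It assumes the Hugging Face peft library, and model refers to the 4-bit model loaded in the earlier sketch; the rank, scaling factor and target modules are illustrative defaults, not tuned values:

```python
# Sketch: attach small trainable LoRA adapters to a frozen 4-bit base model.
# Assumes the peft library; `model` is the quantized model loaded earlier.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of weights
```

Only the adapter weights are updated during training; the quantized base stays frozen, which is what keeps the memory and compute demands within workstation reach.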

Another way to manage the output quality of LLMs is a technique called Retrieval-Augmented Generation (RAG). In contrast to conventional AI training, which is static and dated by the information available at training time, RAG provides up-to-date information by creating a dynamic connection between the LLM and relevant information from authoritative, pre-determined knowledge sources. Using RAG, organizations have greater control over the generated output, and users gain a better understanding of how the LLM generates its responses.
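Here is a minimal sketch of the retrieval step, assuming the sentence-transformers library for embeddings; the documents, model name and prompt format are illustrative placeholders:

```python
# Sketch: retrieve the most relevant passage and prepend it to the prompt.
# Assumes sentence-transformers and numpy; documents are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Some Dell Precision fixed workstations can host multiple GPUs.",
    "Quantization lowers LLM weight precision to shrink memory use.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str) -> str:
    # Cosine similarity reduces to a dot product on normalized vectors.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    return docs[int(np.argmax(doc_vecs @ q))]

query = "How many GPUs can a fixed workstation host?"
prompt = f"Answer using this context:\n{retrieve(query)}\n\nQuestion: {query}"
print(prompt)  # feed `prompt` to the LLM instead of the bare question
```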

These techniques for working with LLMs are not mutually exclusive and often deliver greater performance efficiency and accuracy when combined and integrated.

In summary, key decisions about the size of the LLM and the techniques applied to it inform the configuration of the computing system needed to work effectively with LLMs. Dell Technologies is confident that whatever direction our customers want to take on their AI journey, we have solutions, from desktop to data center, to support them.

*Based on Dell analysis, August 2023.

About the Author: Matthew Allard

Matt leads the Strategic Alliances and Solutions team for the Dell Performance PC product group, working closely with Independent Software Vendors (ISVs), customers and technology partners across multiple industries. He has more than 20 years of experience within the tech sector, and prior to Dell, he held marketing and product management roles at Autodesk, Avid Technology, Schneider Electric, Microsoft Softimage, Media 100 and X-Rite. Matthew lives with his family in the greater Boston area and loves movies, seafood and checking out live bands.


