Beginner's Guide | Setting Up an AI Model Development Environment

September 1, 2024

When learning model development, setting up the environment can involve many twists and turns. This article provides some general installation methods so that readers can quickly set up an AI model development and debugging environment.

Installing Graphics Card Drivers and Development Libraries

This article only covers the installation method for NVIDIA graphics card drivers.

There are multiple series of NVIDIA graphics cards, the most common being the Tesla and GeForce RTX series. The driver installation methods for these two types of cards differ; the following sections explain how to install each.

The first step is to check whether the computer correctly recognizes the graphics card and whether a driver is already installed.

Open Device Manager, click on Display adapters, and check if the graphics card is listed in the device list.


If the computer has recognized the graphics card, you can update the driver to the latest version through NVIDIA GeForce Experience or other driver management tools.


Alternatively, you can search the official driver page directly for the driver matching your graphics card model. The official NVIDIA driver search page is: https://www.nvidia.cn/drivers/lookup/


For Tesla Series Graphics Cards

For example, after creating a GPU server on a cloud platform such as Azure, if the machine has a Tesla card, the system may not recognize it on first boot; you need to install the driver before the graphics device appears.

For Windows, refer to this link for installation: https://learn.microsoft.com/zh-CN/azure/virtual-machines/windows/n-series-driver-setup

For Linux, refer to this link for installation: https://learn.microsoft.com/zh-CN/azure/virtual-machines/linux/n-series-driver-setup

For Windows, the installation is relatively simple; just download and run the GRID driver installer described in the documentation.


After installing the driver, use the following command to check the supported CUDA version:

nvidia-smi

You can see from the output that this driver version supports CUDA versions up to 12.2.
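
If you only need specific fields, nvidia-smi can also print them directly. A minimal example using its query options (the available field names are listed by nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=name,driver_version --format=csv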

For GeForce Cards

For consumer cards such as the GeForce RTX 4060 Ti and GeForce RTX 4070, you can download the driver installer directly from the official website:

https://www.nvidia.cn/geforce/drivers/

Home desktops usually ship from the factory with the drivers already installed.

Installing CUDA and cuDNN


CUDA is a parallel computing platform and programming model developed by NVIDIA specifically for general-purpose computing on graphics processing units (GPUs). With CUDA, developers can significantly accelerate computing applications by leveraging the powerful performance of GPUs.

In simple terms, CUDA is a programming model in which the CPU dispatches work and the GPU executes it in parallel. To develop with CUDA, you need to install the CUDA Toolkit.
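
As a concrete illustration of that division of labor, here is a minimal PyTorch sketch (PyTorch itself is installed in a later section): host code on the CPU allocates the data and dispatches the work, and the computation runs as CUDA kernels on the GPU.

import torch

# Host (CPU) code allocates the tensors directly on the GPU...
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# ...and dispatches a matrix multiplication, which runs in parallel on the GPU.
c = a @ b

# Reading the result back to the CPU synchronizes with the GPU.
print(c.sum().item())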

Introduction to CUDA:
https://developer.nvidia.cn/cuda-zone
https://developer.nvidia.com/zh-cn/blog/cuda-intro-cn/

CUDA installation package download address: https://developer.nvidia.com/cuda-downloads

Download the installation package and follow the prompts. The express (simple) installation installs to the C drive, while the custom installation lets you choose the location. The express installation is recommended to avoid extra issues.


After installation, two entries are added to the environment variables: CUDA_PATH and a version-specific entry such as CUDA_PATH_V12_4.

cuDNN is NVIDIA's GPU-accelerated library for deep neural networks. Unlike CUDA, it ships as a compressed archive rather than an installer.

Download address: https://developer.nvidia.com/cudnn-downloads


Open C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\ and locate the version directory (or find the installation directory via the CUDA_PATH environment variable), then copy the contents of the cuDNN archive into that CUDA directory.


Finally, add the five directories bin, lib, lib\x64, include, and libnvvp under the CUDA version directory to the Path environment variable.

Not all of them may be strictly required, but adding them all avoids path problems later.
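
To confirm that the toolkit is reachable through Path, open a new console and print the CUDA compiler version:

nvcc --version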

Installing Miniconda

Miniconda is a minimal distribution of the conda package and environment manager; it can create multiple isolated Python environments on the same system.

Download address: https://docs.anaconda.com/miniconda/

After installation, search the Start menu for the miniconda3 shortcuts and run one as administrator to open a console. The menu contains both cmd and powershell entries; the powershell entry is recommended.

Subsequent conda commands must also be executed with administrator privileges.


Configure a domestic (China) mirror source to speed up downloads:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/  
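
You can verify that the mirror was added with:

conda config --show channels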

Execute conda env list to view the default environment installation directory.


If Python is already installed on the computer and added to the environment variables, do not also add G:\ProgramData\miniconda3 to the environment variables; otherwise the two installations will interfere with each other.

If Python has not been installed on the computer yet, you can directly add G:\ProgramData\miniconda3 and G:\ProgramData\miniconda3\Scripts to the environment variables.

The author uninstalled the manually installed Python and only uses the environment provided by miniconda3.

If python or pip resolves to a separately installed copy, packages installed from a normal terminal will be isolated from the miniconda3 environment. To install dependencies into the miniconda3 environment, open the miniconda3 console and run pip there; the installed packages will then appear in the miniconda3 environment.

Once dependencies are installed in an environment, different projects can share them without each project downloading its own copies.
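
Conversely, if you want per-project isolation, conda can create a dedicated environment for each project. A short sketch (the environment name ai-dev and the Python version are illustrative):

conda create -n ai-dev python=3.10
conda activate ai-dev
conda env list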

Installing PyTorch and Transformers

Flax, PyTorch, and TensorFlow are all deep learning frameworks; Transformers can use any of them as its underlying framework for loading models, training, and other functionality.

For PyTorch installation reference: https://pytorch.org/get-started/locally/

In the selector, choose either the GPU (CUDA) build or the CPU build, then copy the installation command it generates.


conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia  
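
After the installation finishes, a quick sanity check (a minimal sketch) confirms that the GPU build was installed and that it can see the card:

import torch

print(torch.cuda.is_available())  # True means the CUDA build of PyTorch can use the GPU
print(torch.version.cuda)         # the CUDA version PyTorch was built against, e.g. "12.4"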

Next, execute the command to install Transformers and some dependency libraries.

pip install protobuf 'transformers>=4.41.2' cpm_kernels 'torch>=2.0' gradio mdtex2html sentencepiece accelerate  

It may automatically install the latest version of transformers, which may cause issues; the following sections will discuss how to resolve this.

Using ModelScope to Download and Load Models

ModelScope is an AI model community led by Alibaba Cloud, providing models, datasets, and development toolkits. Because Hugging Face can be difficult to access from within China, we will use ModelScope to download and load models.

To install modelscope:

pip install modelscope  

PyCharm Project Configuration

PyCharm is one of the most commonly used Python IDEs, so here we explain how to configure the miniconda3 environment in PyCharm.

Open PyCharm, and add the miniconda3 environment in the settings, as shown in the steps below.


Then create a project and select the conda-based environment for it.


Model Loading and Conversation

Create a main.py file in the project directory.


Copy the following code into main.py, and when you run the code, it will automatically download the model, load the model, and initiate a conversation.

from modelscope import AutoTokenizer, AutoModel, snapshot_download

# Download the model
# "ZhipuAI/chatglm3-6b" is the model repository
# "D:/modelscope" is the local cache directory for model files
model_dir = snapshot_download("ZhipuAI/chatglm3-6b", cache_dir="D:/modelscope", revision="v1.0.0")

# Load the model
# float is 32-bit; half is 16-bit floating point, which halves the memory required
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).half().cuda()
model = model.eval()

# Start a conversation
response, history = model.chat(tokenizer, "你好", history=[])  # "Hello"
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)  # "What should I do if I can't sleep at night?"
print(response)


"ZhipuAI/chatglm3-6b" refers to the chatglm3-6b model in the ZhipuAI repository. You can view various models uploaded by the community on ModelScope:

https://www.modelscope.cn/models

The revision="v1.0.0" argument selects the version to download; it corresponds to a branch (or tag) name in the repository and can be changed to fetch other versions.


CPU and GPU Issues

If you encounter the following error, it is likely that the CPU build of PyTorch is installed rather than the GPU build.

    raise AssertionError("Torch not compiled with CUDA enabled")  
AssertionError: Torch not compiled with CUDA enabled  


Execute the following code to check which build is installed:

import torch
print(torch.__version__)  # a suffix such as "+cpu" indicates a CPU-only build


In the author's experience, if the libraries were installed using pip rather than conda, you should uninstall PyTorch with both of the following commands:

pip uninstall torch torchvision torchaudio  
conda uninstall pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia  

Then execute the command to reinstall PyTorch:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia  

The code should now run successfully again:


Transformers Version Errors

Because the libraries were installed at their latest versions, some of them may be incompatible, and executing the following line of code may throw an error.

response, history = model.chat(tokenizer, "你好", history=[])  

First, you may see the following warning and then encounter an error:

Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)  
  context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,  


You need to install the specific transformers version that the model code requires; here, pin it to 4.41.2:

pip install transformers==4.41.2  
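
You can confirm that the pinned version took effect:

pip show transformers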


After overcoming various hurdles, it finally runs successfully:


TORCH_USE_CUDA_DSA Error

The issue the author encountered appeared on an Azure A10 machine but not on a home RTX 4060 Ti, so it initially seemed to be caused by insufficient GPU performance.

However, it could also be due to inconsistencies between the graphics card driver and the CUDA version.

  File "C:\ProgramData\miniconda3\Lib\site-packages\transformers\generation\utils.py", line 2410, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\miniconda3\Lib\site-packages\transformers\generation\logits_process.py", line 98, in __call__
    scores = processor(input_ids, scores)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 55, in __call__
    if torch.isnan(scores).any() or torch.isinf(scores).any():
       ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
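
As the error message suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace points at the call that actually failed. A minimal sketch; the variable must be set before any CUDA work is performed:

import os

# Make CUDA kernel launches synchronous so errors surface at the failing call.
# Set this before any CUDA work happens in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
x = torch.ones(8, device="cuda")  # from here on, every CUDA call blocks until it completes
print(x.sum().item())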


Running the model on the CPU instead works fine.
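
A minimal sketch of the CPU fallback, assuming the same model_dir and imports as in the earlier snippet; on the CPU the model stays in 32-bit floats instead of calling .half().cuda():

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).float()
model = model.eval()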

As a further check, the author casually ran a sample demo (the PyTorch MNIST example below), and it executed successfully:

https://github.com/pytorch/examples/blob/main/mnist/main.py

The issue might be caused by a mismatch between the installed CUDA Toolkit and the driver. First, run the nvidia-smi command to check the highest CUDA version the installed driver supports.


Download and install the corresponding CUDA version, then extract cuDNN into it again and set the environment variables as before.


Finally, the AI environment on the server is set up successfully.

