# How to Run DeepSeek-R1
In this document, we will show you how to install and run KTransformers on your local machine. There are two versions:

- V0.2 is the current main branch.
- V0.3 is a preview version that, for now, provides only a binary distribution.

To reproduce our DeepSeek-R1/V3 results, please refer to the DeepSeek-R1/V3 Tutorial for more detailed settings after installation.
## Preparation

Some preparation:
- CUDA 12.1 or above; if you don't have it yet, you may install it from here.

  ```bash
  # Add CUDA to PATH
  if [ -d "/usr/local/cuda/bin" ]; then
      export PATH=$PATH:/usr/local/cuda/bin
  fi

  if [ -d "/usr/local/cuda/lib64" ]; then
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
      # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
      # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
      # sudo ldconfig
  fi

  if [ -d "/usr/local/cuda" ]; then
      export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
  fi
  ```
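  After setting these, a quick way to confirm the toolkit is visible (assuming a standard CUDA install):

  ```bash
  # Print the CUDA compiler version to verify the toolkit is on PATH
  nvcc --version
  ```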
- Linux x86_64 with gcc, g++ and cmake (using Ubuntu as an example):

  ```bash
  sudo apt-get update
  sudo apt-get install build-essential cmake ninja-build
  ```
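  You can verify the toolchain is in place before building:

  ```bash
  # Confirm the compilers and build tools installed above are available
  gcc --version
  g++ --version
  cmake --version
  ninja --version
  ```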
- We recommend using Miniconda3 or Anaconda3 to create a virtual environment with Python 3.11 to run our program. Assuming your Anaconda installation directory is `~/anaconda3`, you should ensure that the version identifiers of the GNU C++ standard library used by Anaconda include `GLIBCXX_3.4.32`:

  ```bash
  conda create --name ktransformers python=3.11
  conda activate ktransformers # you may need to run 'conda init' and reopen your shell first

  # Anaconda provides a package called `libstdcxx-ng` that includes a newer version of
  # `libstdc++`, which can be installed via conda-forge.
  conda install -c conda-forge libstdcxx-ng

  strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
  ```
- Make sure that PyTorch, packaging, and ninja are installed. (You can also install previous versions of PyTorch.)

  ```bash
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
  pip3 install packaging ninja cpufeature numpy
  ```
- You should also download and install the corresponding version of flash-attention from https://github.com/Dao-AILab/flash-attention/releases.
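  Pick the prebuilt wheel matching your CUDA, PyTorch, Python, and C++ ABI versions. The filename below only illustrates the release naming scheme; check the releases page for the one that fits your environment:

  ```bash
  # Example only: substitute the wheel that matches your setup
  # (cu12 = CUDA 12.x, torch2.6 = PyTorch 2.6, cp311 = Python 3.11).
  pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
  ```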
## Installation
- Download source code and compile:

  - Init source code:

    ```bash
    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git submodule init
    git submodule update
    ```
  - [Optional] If you want to run with the website, please compile the website before executing `bash install.sh`, as sketched below.
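    A minimal sketch of compiling the website, assuming the web UI sources live under `ktransformers/website` and build as a standard Node.js project (see the project's website docs for the authoritative steps):

    ```bash
    # Assumption: the web UI is an npm project under ktransformers/website.
    cd ktransformers/website
    npm install
    npm run build
    cd ../..
    ```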
  - For Linux:

    - For a simple install:

      ```bash
      bash install.sh
      ```

    - For those who have two CPUs and 1T of RAM:

      ```bash
      # Make sure your system has dual sockets and double the model's size in RAM (e.g. 1T of RAM for a 512G model)
      export USE_NUMA=1
      bash install.sh # or `make dev_install`
      ```
  - For Windows:

    ```
    install.bat
    ```
- If you are a developer, you can make use of the Makefile to compile and format the code. The detailed usage of the Makefile is here.
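  For instance, `make dev_install` is referenced earlier in this document; any other target names here are assumptions, so inspect the Makefile for what is actually available:

  ```bash
  # dev_install is mentioned above; the format target is an assumption —
  # check the Makefile for the real target names.
  make dev_install   # compile and install in development mode
  make format        # hypothetical: run the code formatter
  ```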
## Local Chat
We provide a simple command-line local chat Python script that you can run for testing.

Note: this is a very simple test tool that only supports one round of chat, without any memory of the previous input. If you want to try the full ability of the model, you may go to RESTful API and Web UI.
### Run Example
```bash
# Begin from root of your cloned repo!
# Begin from root of your cloned repo!!
# Begin from root of your cloned repo!!!

# Download mzwing/DeepSeek-V2-Lite-Chat-GGUF from huggingface
mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF

wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf

cd .. # Move to repo's root dir

# Start local chat
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF

# If you see "OSError: We couldn't connect to 'https://huggingface.co' to load this file", try:
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
# python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
```
It features the following arguments:
- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from Hugging Face), or, if you already have local files, that path to initialize the model. Note: `.safetensors` files are not required in the directory; we only need the config files to build the model and tokenizer.
- `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from Hugging Face. Note that the directory should contain only the GGUF files of the current model, which means you need one separate directory for each model.
- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of a YAML file containing optimize rules. There are two rule files pre-written in the ktransformers/optimize/optimize_rules directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
- `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.
- `--cpu_infer`: Int (default=10). The number of CPU cores used for inference. Should ideally be set to (total number of cores - 2). See the combined example after this list.
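Putting these flags together, a full invocation might look like the following. The values are illustrative: `--optimize_rule_path` is omitted because it is not required for DeepSeek-V2 models, and `--cpu_infer 30` assumes a 32-core machine:

```bash
# Illustrative invocation combining the documented flags;
# adjust --cpu_infer to (your total cores - 2).
python -m ktransformers.local_chat \
  --model_path deepseek-ai/DeepSeek-V2-Lite-Chat \
  --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF \
  --max_new_tokens 1000 \
  --cpu_infer 30
```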
## Supported Models/Quantization
Supported models include:

| ✅ Supported Models | ❌ Deprecated Models |
|---|---|
| DeepSeek-R1 | |
| DeepSeek-V3 | |
| DeepSeek-V2 | |
| DeepSeek-V2.5 | |
| Qwen2-57B | |
| DeepSeek-V2-Lite | |
| Mixtral-8x7B | |
| Mixtral-8x22B | |
Supported quantization formats:

| ✅ Supported Formats | ❌ Deprecated Formats |
|---|---|
| Q2_K_L | |
| Q2_K_XS | |
| Q3_K_M | |
| Q4_K_M | |
| Q5_K_M | |
| Q6_K | |
| Q8_0 | |
## Suggested Models

| Model Name | Model Size | VRAM | Minimum DRAM | Recommended DRAM |
|---|---|---|---|---|
| DeepSeek-R1-q4_k_m | 377G | 14G | 382G | 512G |
| DeepSeek-V3-q4_k_m | 377G | 14G | 382G | 512G |
| DeepSeek-V2-q4_k_m | 133G | 11G | 136G | 192G |
| DeepSeek-V2.5-q4_k_m | 133G | 11G | 136G | 192G |
| DeepSeek-V2.5-IQ4_XS | 117G | 10G | 107G | 128G |
| Qwen2-57B-A14B-Instruct-q4_k_m | 33G | 8G | 34G | 64G |
| DeepSeek-V2-Lite-q4_k_m | 9.7G | 3G | 13G | 16G |
| Mixtral-8x7B-q4_k_m | 25G | 1.6G | 51G | 64G |
| Mixtral-8x22B-q4_k_m | 80G | 4G | 86.1G | 96G |
| InternLM2.5-7B-Chat-1M | 15.5G | 15.5G | 8G (32K context) | 150G (1M context) |
More will come soon. Please let us know which models you are most interested in.

Be aware that you need to comply with the corresponding model licenses when using DeepSeek and Qwen.
<details>
<summary>Click to show how to run other examples</summary>
- Qwen2-57B

  ```bash
  pip install flash_attn # For Qwen2

  mkdir Qwen2-57B-GGUF && cd Qwen2-57B-GGUF

  wget https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GGUF/resolve/main/qwen2-57b-a14b-instruct-q4_k_m.gguf?download=true -O qwen2-57b-a14b-instruct-q4_k_m.gguf

  cd ..

  python -m ktransformers.local_chat --model_path Qwen/Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF

  # If you see "OSError: We couldn't connect to 'https://huggingface.co' to load this file", try:
  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct
  # python -m ktransformers.local_chat --model_path ./Qwen2-57B-A14B-Instruct --gguf_path ./Qwen2-57B-GGUF
  ```
- DeepSeek-V2

  ```bash
  mkdir DeepSeek-V2-Chat-0628-GGUF && cd DeepSeek-V2-Chat-0628-GGUF

  # Download weights
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00002-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00003-of-00004.gguf
  wget https://huggingface.co/bartowski/DeepSeek-V2-Chat-0628-GGUF/resolve/main/DeepSeek-V2-Chat-0628-Q4_K_M/DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf -O DeepSeek-V2-Chat-0628-Q4_K_M-00004-of-00004.gguf

  cd ..

  python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF

  # If you see "OSError: We couldn't connect to 'https://huggingface.co' to load this file", try:
  # GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat-0628
  # python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Chat-0628 --gguf_path ./DeepSeek-V2-Chat-0628-GGUF
  ```
| model name | weights download link |
|---|---|
| Qwen2-57B | Qwen2-57B-A14B-gguf-Q4K-M |
| DeepseekV2-coder | DeepSeek-Coder-V2-Instruct-gguf-Q4K-M |
| DeepseekV2-chat | DeepSeek-V2-Chat-gguf-Q4K-M |
| DeepseekV2-lite | DeepSeek-V2-Lite-Chat-GGUF-Q4K-M |
| DeepSeek-R1 | DeepSeek-R1-gguf-Q4K-M |

</details>
## RESTful API and Web UI
Start without website:

```bash
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
```

Start with website:

```bash
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
```

Or, if you want to start the server with transformers, the model_path should include safetensors:

```bash
ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
```
Access the website with the URL http://localhost:10002/web/index.html#/chat :
More information about the RESTful API server can be found here. You can also find an example of integrating with Tabby here.
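To smoke-test the server from the command line, something like the following should work, assuming the server exposes an OpenAI-style chat completions endpoint (check the RESTful API docs linked above for the exact routes and payload):

```bash
# Assumption: /v1/chat/completions is available and OpenAI-compatible;
# the model name in the payload is illustrative.
curl http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V2-Lite-Chat",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```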