KT-FT Fine-Tuning and Inference Loop

Last updated: 2026-06-01

This guide documents the current KT-FT loop for Qwen3.5 MoE: train with KT SFT, convert the output once, and serve the fine-tuned result through SGLang with a single merged adapter path.

KT SFT raw output
  -> convert_kt_to_sglang_adapter.py
  -> <MERGED_ADAPTER_DIR>
  -> sglang --lora-paths <name>=<MERGED_ADAPTER_DIR>
  -> server auto-splits expert / non-expert internally
  -> request model=<served_model>:<name>

Training-side KT SFT is documented separately. This page covers only the bridge from trained LoRA artifacts to online inference.

1. Scope

Currently supported and validated workflow:

  • Base model: Qwen3.5 MoE, for example Qwen3.5-35B-A3B
  • KT expert weights: AMX/BF16 SFT-compatible KT CPU expert path
  • User-facing serving input: one converted merged adapter directory
  • Runtime split: expert LoRA goes to the KT CPU expert path; non-expert LoRA goes to SGLang’s LoRA manager. This split happens automatically at server startup.

2. Artifacts At Each Stage

Raw KT SFT output

After LLaMA-Factory + KT training, the output directory contains two LoRA artifacts:

<KT_SFT_OUTPUT_DIR>/
  adapter_model.safetensors      # non-expert LoRA
  fused_expert_lora.safetensors  # expert LoRA in KT fused format
  adapter_config.json

Do not pass this raw directory to SGLang directly. Convert it first.

Converted merged adapter

Run the converter once to produce the serving input:

<MERGED_ADAPTER_DIR>/
  adapter_config.json
  adapter_model.safetensors

This merged directory contains both expert and non-expert LoRA tensors in one PEFT-style adapter. Pass only this directory to --lora-paths.
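
Because the raw and merged directories share two file names, it is easy to pass the wrong one. The sketch below is a file-name heuristic for telling them apart before launch. The helper name classify_adapter_dir is hypothetical and not part of KTransformers or SGLang. The sketch assumes the converter does not copy fused_expert_lora.safetensors into the merged directory, and it does not inspect tensor contents, so an ordinary PEFT adapter would also be reported as "merged".

```python
from pathlib import Path

# File names taken from the directory listings above.
RAW_ONLY_FILE = "fused_expert_lora.safetensors"
ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def classify_adapter_dir(path):
    """Return 'raw', 'merged', or 'unknown' for a LoRA output directory.

    Hypothetical helper: it only checks file names, not tensor contents.
    """
    root = Path(path)
    if not all((root / name).is_file() for name in ADAPTER_FILES):
        return "unknown"
    if (root / RAW_ONLY_FILE).is_file():
        # Assumption: only raw KT SFT output carries the fused expert file.
        return "raw"
    return "merged"
```

A result of "raw" means the directory still needs the conversion step; only a "merged" directory is a candidate for --lora-paths.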

3. Convert Once

python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
  <KT_SFT_OUTPUT_DIR> \
  <MERGED_ADAPTER_DIR> \
  --base-model-name-or-path /path/to/Qwen3.5-35B-A3B \
  --overwrite

Example:

python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
  saves/KT_FT_qwen35B_Moe_nekoqa_eod_240 \
  saves/KT_FT_qwen35B_Moe_nekoqa_eod_240_sglang \
  --base-model-name-or-path /mnt/data3/models/Qwen3.5-35B-A3B \
  --overwrite

The converter reads fused_expert_lora.safetensors and the existing non-expert adapter_model.safetensors, then writes one merged adapter directory.

Optional split outputs for debugging:

python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
  <KT_SFT_OUTPUT_DIR> \
  <MERGED_ADAPTER_DIR> \
  --base-model-name-or-path /path/to/Qwen3.5-35B-A3B \
  --expert-output-dir <EXPERT_ADAPTER_DIR> \
  --nonexpert-output-dir <NONEXPERT_ADAPTER_DIR> \
  --overwrite

For normal serving, only <MERGED_ADAPTER_DIR> is needed.

4. Launch SGLang

Use the KTransformers SGLang fork from this repository and point PYTHONPATH at both kt-kernel/python and third_party/sglang/python.

cd /path/to/ktransformers

PYTHONPATH=/path/to/ktransformers/kt-kernel/python:/path/to/ktransformers/third_party/sglang/python:$PYTHONPATH \
python -m sglang.launch_server \
  --host 127.0.0.1 \
  --port 30006 \
  --model-path /path/to/Qwen3.5-35B-A3B \
  --tokenizer-path /path/to/Qwen3.5-35B-A3B \
  --kt-weight-path /path/to/Qwen3.5-35B-A3B-AMXINT4 \
  --kt-method AMXINT4 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-numa-nodes 0 1 \
  --kt-num-gpu-experts 0 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 2 \
  --max-total-tokens 32000 \
  --served-model-name qwen3.5-kt-ft \
  --enable-mixed-chunk \
  --tensor-parallel-size 4 \
  --enable-p2p-check \
  --disable-cuda-graph \
  --disable-custom-all-reduce \
  --enable-lora \
  --lora-backend triton \
  --lora-paths qwen35b_neko=/path/to/KT_FT_qwen35B_Moe_nekoqa_eod_240_sglang \
  --log-level info

Important points:

  • Pass only one merged adapter through --lora-paths.
  • Do not also pass --kt-expert-lora-path in the normal user workflow.
  • At startup, the server detects the merged KT MoE adapter, splits it internally, and writes runtime cache directories under $TMPDIR/sglang_kt_lora_cache/ (or $SGLANG_KT_LORA_CACHE_DIR if set).
  • Prefer --lora-backend triton for Qwen3.5 full-LoRA generation.
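
When debugging the internal split, it helps to know where the runtime cache lands. The sketch below resolves the cache directory from the two environment variables named above. The function name kt_lora_cache_dir is hypothetical. Two details are assumptions rather than documented behavior: that SGLANG_KT_LORA_CACHE_DIR replaces the whole cache path, and that the platform temp directory is used when TMPDIR is unset.

```python
import os
import tempfile
from pathlib import Path


def kt_lora_cache_dir(env=None):
    """Guess where the server writes its split-adapter cache.

    Hypothetical helper mirroring the rule described in this guide.
    """
    env = os.environ if env is None else env
    override = env.get("SGLANG_KT_LORA_CACHE_DIR")
    if override:
        # Assumption: the override is the cache directory itself.
        return Path(override)
    # Assumption: fall back to the platform temp dir when TMPDIR is unset.
    tmp = env.get("TMPDIR") or tempfile.gettempdir()
    return Path(tmp) / "sglang_kt_lora_cache"
```

Calling kt_lora_cache_dir() with no argument reads the current process environment, so run it in the same shell environment used to launch the server.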

Current constraints:

  • Serve a single merged KT composite adapter only.
  • Set --kt-num-gpu-experts 0.
  • Do not enable --kt-enable-dynamic-expert-update.
  • Do not use --kt-gpu-prefill-token-threshold.
  • Use an AMX/BF16 SFT-compatible KT method such as AMXINT4, AMXINT8, AMXBF16, or BF16.
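
The constraints above can be checked mechanically before launching. The following is a minimal sketch, not part of the SGLang CLI: check_launch_args and flag_values are hypothetical helpers that scan a space-separated argument list (the --flag=value form is not handled). The method set contains only the names listed in this guide, so a method outside it is flagged for review rather than proven unsupported. The sketch also requires --kt-num-gpu-experts 0 to be explicit, since this guide does not state the default.

```python
# KT methods named in this guide; the real supported list may be longer.
LISTED_KT_METHODS = {"AMXINT4", "AMXINT8", "AMXBF16", "BF16"}
UNSUPPORTED_FLAGS = (
    "--kt-enable-dynamic-expert-update",
    "--kt-gpu-prefill-token-threshold",
)


def flag_values(argv, flag):
    """Return the values following `flag` up to the next `--option`."""
    if flag not in argv:
        return []
    values = []
    for token in argv[argv.index(flag) + 1:]:
        if token.startswith("--"):
            break
        values.append(token)
    return values


def check_launch_args(argv):
    """Return a list of problems for the merged-adapter workflow."""
    problems = []
    if flag_values(argv, "--kt-num-gpu-experts") != ["0"]:
        problems.append("set --kt-num-gpu-experts 0")
    for flag in UNSUPPORTED_FLAGS:
        if flag in argv:
            problems.append(f"remove {flag}")
    if "--kt-expert-lora-path" in argv:
        problems.append("remove --kt-expert-lora-path (manual split only)")
    if len(flag_values(argv, "--lora-paths")) != 1:
        problems.append("pass exactly one merged adapter to --lora-paths")
    method = flag_values(argv, "--kt-method")
    if len(method) != 1 or method[0] not in LISTED_KT_METHODS:
        problems.append("check --kt-method against the supported list")
    return problems
```

An empty result means the arguments are consistent with the constraints in this section; it does not guarantee that the server will start.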

5. Request Semantics

The OpenAI-compatible request model field uses names, not paths.

--served-model-name qwen3.5-kt-ft
--lora-paths qwen35b_neko=/path/to/merged_adapter

Request behavior in the current single-adapter implementation:

model=qwen3.5-kt-ft
=> base + KT expert LoRA

model=qwen3.5-kt-ft:qwen35b_neko
=> base + KT expert LoRA + SGLang non-expert LoRA

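The suffix after : must match the left-side name in --lora-paths. Note that a request without the suffix still runs with the KT expert LoRA applied, so in this setup the plain served model name does not return pure base-model output.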
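
The naming rule can be sketched on the client side as follows. Both helpers are hypothetical and only build strings; they do not talk to the server. parse_lora_paths splits name=path entries the way they are written on the command line, and request_model refuses adapter names that were never registered.

```python
def parse_lora_paths(entries):
    """Map adapter name -> path from `name=path` strings."""
    adapters = {}
    for entry in entries:
        name, sep, path = entry.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {entry!r}")
        adapters[name] = path
    return adapters


def request_model(served_model_name, adapters, adapter_name=None):
    """Build the value for the request `model` field."""
    if adapter_name is None:
        # Plain served name: no non-expert LoRA is selected.
        return served_model_name
    if adapter_name not in adapters:
        raise KeyError(f"adapter {adapter_name!r} is not in --lora-paths")
    return f"{served_model_name}:{adapter_name}"


adapters = parse_lora_paths(["qwen35b_neko=/path/to/merged_adapter"])
print(request_model("qwen3.5-kt-ft", adapters, "qwen35b_neko"))
```

Validating the name on the client catches a mismatched suffix before the request reaches the server.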

6. Smoke Test

curl -sS http://127.0.0.1:30006/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5-kt-ft:qwen35b_neko",
    "messages": [{"role": "user", "content": "我回来了,你在干嘛?"}],
    "temperature": 0.7,
    "max_tokens": 160,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Startup logs should include lines similar to:

Prepared merged KT LoRA adapter ... for runtime: expert=... nonexpert=...
Loaded KT expert LoRA for layer ...
Using triton as backend of LoRA kernels.
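
If the server output is captured to a file, the check for these lines can be scripted. The sketch below searches for the fixed prefixes of the three lines above. The function name missing_startup_markers is hypothetical, and because the guide describes the log lines as "similar to" these, the exact wording may differ between versions; treat a miss as a hint to read the log, not as proof of failure.

```python
# Fixed prefixes of the startup lines quoted in this guide.
EXPECTED_MARKERS = (
    "Prepared merged KT LoRA adapter",
    "Loaded KT expert LoRA for layer",
    "Using triton as backend of LoRA kernels",
)


def missing_startup_markers(log_text):
    """Return the expected markers that do not appear in the log text."""
    return [marker for marker in EXPECTED_MARKERS if marker not in log_text]
```

For example, after redirecting server output to a file of your choosing, pass its contents to missing_startup_markers and inspect any names it returns.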

7. Advanced: Manual Split Serving

The older split-runtime contract is still available for debugging:

--kt-expert-lora-path <EXPERT_ADAPTER_DIR> \
--enable-lora \
--lora-paths <NONEXPERT_LORA_NAME>=<NONEXPERT_ADAPTER_DIR>

This is not the recommended user-facing path. Normal users should pass one merged adapter directory through --lora-paths only.

8. Troubleshooting

Got LoRA adapter that has never been loaded: lora0

The adapter name in the request must match the left side of --lora-paths. If you launched with qwen35b_neko=..., request model=qwen3.5-kt-ft:qwen35b_neko, not :lora0.

No visible adapter effect

Make sure you are serving the intended merged adapter directory. For example, use the Neko adapter at ..._nekoqa_eod_240_sglang, not a generic sanity adapter such as ..._Moe_sglang.

connection refused

Check that the server is listening on the port you pass to curl. The example above binds to 127.0.0.1, not 0.0.0.0, so it accepts connections only from the same machine.
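
A quick way to separate "server not up yet" from a wrong host or port is a plain TCP probe. This is a generic sketch using the Python standard library; is_listening is a hypothetical helper, and a successful connect only shows that something accepts connections on that port, not that it is the SGLang server.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

With the launch command from this guide, the probe would target host 127.0.0.1 and port 30006, run from the same machine as the server.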

Server resolves upstream SGLang instead of this checkout

python - <<'PY'
# Print which file the Qwen3.5 model module is imported from.
import inspect
import sglang.srt.models.qwen3_5 as qwen3_5
print(inspect.getfile(qwen3_5))
PY

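The printed path should be inside this repository’s third_party/sglang. If it points elsewhere, recheck that PYTHONPATH includes both kt-kernel/python and third_party/sglang/python, as in the launch command.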