Running Qwen3.5-27B Multimodal on a Single A100 GPU


The Qwen3.5-27B model card on Hugging Face is clear about the intended deployment: tensor-parallel-size 8, meaning eight A100 GPUs. Every official vLLM and SGLang command in the model card targets an 8-GPU cluster for the full 262,144-token context window. That's great if you have a multi-GPU node sitting around. Most of us don't.

So I tried the opposite extreme: a single NVIDIA A100-SXM4-80GB. I wanted to know whether Qwen3.5-27B could actually serve real multimodal image-question requests on that one card. Not just load, but work. The answer turned out to be yes, with some important caveats.

This post covers the full path: environment setup, the problems I hit, the launch configurations that worked, real GPU memory measurements, latency numbers, context scaling behavior, and a bit more.


Why This Matters

Qwen3.5-27B is not a small model. At 27B parameters with a vision encoder, it uses early fusion training on multimodal tokens and features a hybrid architecture combining Gated Delta Networks with sparse attention. Its native context length is 262,144 tokens, extensible up to 1,010,000 with YaRN. It scores competitively against GPT-5-mini and Claude Sonnet 4.5 across vision-language benchmarks.

The problem is that nearly every deployment example assumes you have multiple GPUs. If you're a researcher with a single rented A100-SXM4, an engineer prototyping a multimodal pipeline, or anyone who just wants to test this model without a cluster, the official docs leave you guessing.

I ran the experiment so you don't have to guess.


Hardware and Software

| Component | Value |
|---|---|
| GPU | NVIDIA A100-SXM4-80GB (81,920 MiB) |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| OS | Ubuntu, bare metal |
| Framework | vLLM nightly/dev build |
| Model | Qwen/Qwen3.5-27B |
| Precision | BF16 |
| Attention backend | FlashAttention-2 (auto-enabled by vLLM) |
| API | OpenAI-compatible endpoint |

I did not use any PyTorch or CUDA Docker images, just bare Ubuntu. I set up the driver, CUDA, Python, vLLM, and everything else from scratch.

The nightly vLLM build was essential. Older stable releases failed with an error saying Qwen3_5ForConditionalGeneration was not a supported architecture. The Qwen model card now explicitly points users toward the latest vLLM recipe for Qwen3.5 support.


Setting Up the Environment

The setup is straightforward on paper, but a couple of traps will waste your time. Here's the full path from bare Ubuntu.

Update the system and grab the basics:

sudo apt update && sudo apt upgrade -y
sudo apt install -y git wget curl build-essential python3 python3-pip python3-venv

Run nvidia-smi and confirm the A100 shows up. Then spin up a venv — I keep LLM installs isolated:

python3 -m venv qwen-env
source qwen-env/bin/activate
pip install --upgrade pip

PyTorch next. For A100 I used the CUDA 12.1 wheels:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Quick sanity check that CUDA sees the GPU:

python - <<'PY'
import torch
print("CUDA:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))
PY

vLLM needs the nightly build — stable won't work for Qwen3.5. Then the usual client libs:

pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install openai pillow transformers accelerate

Install python3-dev and build-essential before your first vLLM launch. Without them, Triton bombs on kernel compilation because Python.h is missing. The error is opaque and easy to miss. Cost me half an hour.

sudo apt install -y python3-dev python3-venv build-essential

For gated models, run huggingface-cli login (after pip install huggingface_hub).

For the first launch, use a small context so you don't hit OOM while the model is still loading:

vllm serve Qwen/Qwen3.5-27B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --reasoning-parser qwen3

First run downloads ~45–50GB, loads the vision encoder, builds the KV cache. Expect 2–5 minutes. When it's up you'll get Uvicorn running on http://0.0.0.0:8000. Hit curl http://localhost:8000/v1/models from another terminal to confirm the model is registered.


Launching with Full Context

Once the basic launch works, you can bump --max-model-len to 262,144 for the full context window. The command is otherwise identical:

vllm serve Qwen/Qwen3.5-27B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Two parameters are easy to confuse here:

--max-model-len is a per-request limit: the total context of a single request, prompt plus generated output, counted together.

--max-num-batched-tokens is a scheduler limit: how many tokens vLLM will process per engine step across all in-flight requests.

Mixing these up is a common source of cryptic failures. The vLLM docs define --max-model-len as the full prompt-plus-output context length.

Once the server starts, it exposes an OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions.
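With the endpoint up, a minimal stdlib-only client looks like the sketch below. The payload shape follows the OpenAI chat-completions format that vLLM exposes; the image filename and question are placeholders you'd swap for your own.

```python
import base64
import json
from urllib import request

API_URL = "http://localhost:8000/v1/chat/completions"  # the vllm serve endpoint

def encode_image(path: str) -> str:
    """Base64-encode a local image file for an inline data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_payload(image_b64: str, question: str,
                  model: str = "Qwen/Qwen3.5-27B") -> dict:
    """OpenAI-style chat payload with one inline image and one text question."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }

def ask(payload: dict, url: str = API_URL) -> str:
    """POST the payload and return the model's text answer."""
    req = request.Request(url, data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running and a local image, e.g. food.jpg:
# print(ask(build_payload(encode_image("food.jpg"), "What dish is shown here?")))
```

You could equally use the openai Python package pointed at the same base URL; the raw-urllib version just avoids an extra dependency.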


What the GPU Memory Actually Looks Like

This is the part the official docs don't give you for a single-GPU setup. From the vLLM server logs during the 262,144 context launch:

| Component | Memory |
|---|---|
| Model weights | 51.1 GB |
| Available KV cache | 18.3 GB |
| CUDA graph capture | ~1 GB |
| Peak observed GPU usage | ~74 GB |

That leaves roughly 6 GiB of headroom on the 81,920 MiB card. It's tight but stable. The runtime reported a KV cache capacity of 74,480 tokens — well under the configured 262,144 ceiling. This is an important distinction: the server initializes at the full context setting, but the KV cache can only physically hold about 74k tokens at once. I'll come back to why this matters.
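A back-of-envelope check on the two logged figures shows why the gap exists. This ignores vLLM's paged-block granularity, so treat it as a rough estimate, not the allocator's exact math:

```python
# Implied per-token KV cache cost from the logged figures:
# 18.3 GB of cache holding 74,480 tokens of capacity.
CACHE_GIB = 18.3
CAPACITY_TOKENS = 74_480

bytes_per_token = CACHE_GIB * 1024**3 / CAPACITY_TOKENS
print(f"~{bytes_per_token / 1024:.0f} KiB of KV cache per token")

# At that rate, a full 262,144-token window would need:
full_window_gib = 262_144 * bytes_per_token / 1024**3
print(f"~{full_window_gib:.0f} GiB just for KV cache")
```

That works out to roughly a quarter-megabyte per token, so the full 262k window would need on the order of 64 GiB of KV cache alone, far more than the 18.3 GB left after weights.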


Context Scaling: What Actually Launches

I tested four context configurations, stopping and restarting the server each time. The benchmark script automated the full cycle: kill any existing server, launch with the new context, wait for the health check endpoint, then run real image-VQA requests.

| Configured context | Server started | Real multimodal requests succeeded |
|---|---|---|
| 4,096 | Yes | Yes |
| 65,536 | Yes | Yes |
| 131,072 | Yes | Yes |
| 262,144 | Yes | Yes |

All four configurations launched cleanly and served real image requests. This was a stronger result than I expected. The official examples don't suggest this is possible on a single card.


Latency Numbers

I measured both warmup and steady-state latency across a few food images.

| Scenario | Latency |
|---|---|
| Warmup request (includes graph compilation) | ~20 seconds |
| Steady-state requests | ~0.45 seconds per image |

The first request is dominated by CUDA graph compilation and model warmup. Once that's done, multimodal inference is fast. For a 27B-parameter model doing vision-language reasoning on a single GPU, sub-500ms per image is quite usable.

From the vLLM runtime logs, prompt throughput was approximately 286 tokens per second, with generation throughput in the range of 1.6–4 tokens per second depending on request timing and prompt length.
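The warmup vs steady-state split was measured with a plain timing wrapper along these lines, where request_fn stands in for whatever callable actually sends an image request:

```python
import time
import statistics

def timed(request_fn, *args):
    """Run one request and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = request_fn(*args)
    return result, time.perf_counter() - start

def measure(request_fn, n: int = 10):
    """Treat the first call as warmup; report the median of the rest."""
    _, warmup_s = timed(request_fn)
    steady = [timed(request_fn)[1] for _ in range(n - 1)]
    return warmup_s, statistics.median(steady)

# warmup_s, steady_s = measure(send_image_request)  # your request callable
```

Using the median rather than the mean for the steady-state figure keeps one slow outlier from skewing the number.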


Long-Context Stress Test

Launching at a given context setting is not the same as actually using that much context. To test this, I sent increasingly long text prefixes (a repeated sentence, thousands of times) alongside an image, and asked the model to describe the food in the image.
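The padded prompts were built along these lines. The filler sentence here is illustrative, and the character-to-token ratio will vary with the tokenizer; the real script used the same repeat-count ballparks shown in the table below.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # illustrative filler
QUESTION = "Now describe the food shown in the attached image."

def build_long_prompt(repeats: int, filler: str = FILLER) -> str:
    """Pad the prompt with a repeated sentence, then append the real question."""
    return filler * repeats + "\n" + QUESTION

for repeats in (2_000, 8_000, 16_000):
    prompt = build_long_prompt(repeats)
    print(f"{repeats:>6} repeats -> {len(prompt):>8} characters")
```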

| Context setting | Prompt size | Success | Latency |
|---|---|---|---|
| 4,096 | large | No | exceeds limit |
| 65,536 | medium (~2k repeats) | Yes | 5.1 s |
| 65,536 | very large (~8k repeats) | Yes | 18.6 s |
| 65,536 | extremely large (~16k repeats) | No | exceeds limit |
| 131,072 | extremely large | Yes | 44.7 s |
| 262,144 | extremely large | Yes | 44.8 s |

The results confirm what you'd expect: requests that fit within the configured context succeed, and latency scales with input length. The server handles genuinely long multimodal prompts correctly at higher context settings.


Lessons

Three things will save you time:

Use a nightly vLLM build. The stable release at the time of writing does not support Qwen3_5ForConditionalGeneration. The model card points to the latest vLLM path for a reason.

Install python3-dev before launching. The missing Python.h error during Triton compilation is not intuitive, and it happens after you've already waited for the model to download.

Separate "server configured at X" from "I ran an X-token request." The server can initialize at 262,144 context, but the KV cache on a single A100-SXM4-80GB physically holds about 74k tokens. Requests exceeding that will queue or fail at runtime. This distinction matters when making claims about long-context capability.


Conclusion

A single A100-SXM4-80GB can load the model, initialize at the full 262k context configuration, serve real image-VQA requests with sub-500ms steady-state latency, and handle long multimodal prompts at higher context settings. The total GPU memory footprint peaks around 74 GB, leaving a workable margin.

The most interesting takeaway isn't about hardware at all. It's that long text prefixes appear to degrade multimodal grounding, even when the request fits within the context budget. For anyone building multimodal applications, this means context management isn't just a memory problem; it's a quality problem. That said, this is an impression from a handful of runs, not a rigorous evaluation, so treat it as a hypothesis worth testing rather than a conclusion.