Local AI Dev Environment on Windows 11

vLLM - I have played with Hugging Face Transformers, Ollama, and a couple of others, but I picked this one because it exposes OpenAI API compatible endpoints and is an accepted platform for production use. It also supports a vast array of model formats (including SafeTensors from Hugging Face), which means anything I do can be easily migrated to Azure as-is, or swapped out for Azure AI or OpenAI services directly, etc. So yeah, wish me luck!

Python - don't fight it, it is what it is. If it makes you feel any better, the core maths libraries that underpin all of this are written in well-optimised C; just think of Python as the nice friendly layer that calls all of the hard-working compiled code. Python has also long been adopted for scientific and mathematical work, which is why the vast majority of AI/ML development so far has been in Python, and that doesn't seem likely to change any time soon.

Linux - because this is the OS of the internet, there's very little Windows native support out there. So get used to it, or use some of the easier 'out of the box' solutions like Ollama, and live with the limitations.

But the title of this article mentions Windows 11, so fear not, I'm not going to tell you to wipe your desktop just to install Ubuntu 🤓. There is another solution.

Windows Subsystem for Linux

To be fair, this has got pretty decent lately, and is more suited to what I'm doing than running Docker on Windows. So as long as you're running a fairly up to date build of Windows (i.e., 24H2) then just typing wsl into a command prompt will get you started.

If you've used WSL before, it may have been an older version. Run wsl --update to ensure you're on the latest v2, and then wsl --set-default-version 2, otherwise it may still default to creating new distros using WSL 1:
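
wsl --update
wsl --set-default-version 2

Then install the distro itself: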

wsl --install ubuntu-22.04

Once that's completed, it'll open up a WSL terminal.

Why Ubuntu 22.04?

I've asked myself this and similar questions so many times during this process... the short and annoying answer is...

Because all this AI lark is pretty new, people have picked stuff that's stable and works, and often re-used existing build environments they're used to. So Ubuntu 22.04 because, well, just because everybody else used that.

The same goes for Python versions.. I picked 3.10.12 because I was told to, and I am an obedient sheep🐑.

There are so many dependencies involved that sticking strictly to the releases each module or package requires will lead to an easier life.

The only place I veered from the well-trodden path was with the CUDA Toolkit version, purely because my GPU is one of the new ones and is only supported from v12.8+. And believe me, this whole thing would have been way easier with an older GPU.

A lot of the later steps in this post were based on this very helpful repo, which includes a Dockerfile for building vllm with the correct pre-reqs for supporting the 50-series GPUs. I ended up adapting it slightly in places to work with WSL instead of Docker.

Get Ubuntu Ready

Like every good adventure, it starts with one thing (cue Linkin Park songs in my head for the rest of the week).

I'm going to guess you are a rebel like me and always log in as root, so I've excluded sudo from all of these commands.

Update the package manager's source list, and install any known upgrades to the packages that are already installed.

apt-get update && apt-get upgrade

Install all these wonderful things that may or may not be needed - there's no way to ever tell, because I'm not going through trial and error or reading docs. Pfft.

apt-get install build-essential \
  git \
  wget \
  curl \
  python3-dev \
  python3-pip \
  python3-venv \
  ninja-build \
  cmake \
  libopenblas-dev \
  libssl-dev \
  libffi-dev \
  libxml2-dev \
  libxslt1-dev \
  zlib1g-dev \
  unzip

From this point onwards, I'm going to stick as much as I can into a folder and set up a Python virtual environment. That way everything is kept together, and all the annoyingly strict version requirements won't interfere if I use this 'machine' for other AI projects using different tools.

If you've not touched Python virtual environments before, then you've probably not run any Python before either. These days they're pretty much essential, otherwise it's a minefield of dependencies and errors. To be clear, this isn't any sort of 'virtualisation'; it's just a bunch of environment variables that set the path to the specified Python binary, and any packages get installed relative to the folder the environment was created in. So it's more like a 'scope' than anything else.

mkdir llm-dev && cd llm-dev
python3 -m venv .venv
source .venv/bin/activate

Now that I'm actually inside my Python virtual environment (you can tell because the prompt changes to include the environment name), I'm going to start installing packages and other pre-reqs. The first line tells pip it's allowed to install packages even though the distro marks its Python as externally managed (i.e., normally handled via apt-get).

pip config set global.break-system-packages true
pip install --no-cache-dir --upgrade --ignore-installed pip setuptools wheel

PyTorch

The first thing I'll need is PyTorch. This is a framework for working with deep learning neural networks. PyTorch has been picked by many researchers and developers thanks to its Python-first design, whereas alternatives like TensorFlow have been falling out of fashion and have dropped native GPU support on Windows, amongst other things.

This installs the latest stable version of PyTorch that also supports CUDA v12.8 - which is the earliest release that supports my GPU.

pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
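
It's worth a quick sanity check before going any further - if PyTorch can't see the GPU through WSL, nothing downstream will work. This should print the Torch version along with True:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"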

NVidia CUDA Toolkit

This will install all of the CUDA tools and libraries that you'll need to do pretty much anything.

Various ways to install the CUDA Toolkit can be found on NVIDIA's website.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get -y install cuda-toolkit-12-8
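
On my setup the toolkit lands under /usr/local/cuda-12.8 (symlinked to /usr/local/cuda), but it doesn't put itself on the PATH. Adding it - to ~/.bashrc if you want it to stick - avoids the 'wrong nvcc' problem I hit later, and nvcc --version confirms the right one is being picked up:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
nvcc --version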

xFormers

This is a library of optimised building blocks for Transformers, the magic that kicked off LLMs such as GPT and defined the neural network structure we use today. It also implements memory-efficient attention (and can use FlashAttention), which as the name suggests, is faster.

pip install --no-cache-dir git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

Bits and Bytes

This library provides optimisation through quantising models: essentially converting 32-bit floating point values into 8-bit or 4-bit representations, while still being GPU-friendly. It reduces memory requirements and is quicker to process, with 'minimal' accuracy loss (although some loss is inevitable). It means vLLM can use models that would otherwise be larger than the available VRAM, by reducing their 'resolution' as it loads them on the fly.

git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git ./bitsandbytes
cd bitsandbytes
cmake -DCOMPUTE_BACKEND=cuda -S . \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
  -DCMAKE_PREFIX_PATH=/usr/local/cuda
make -j$(nproc)
pip install -e .

For some reason cmake kept picking up an old version of nvcc in /usr/bin but that's not where it should be looking based on the environment variables. Adding the extra parameters solved it, although they shouldn't have been needed. Ah well, it was only my own time wasted 😡.
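
As a sanity check that the CUDA backend actually got built and found, bitsandbytes ships its own diagnostic, which can be run from inside the venv:

python3 -m bitsandbytes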

FlashInfer

This actually isn't needed. But it helps with speed and efficiency by doing some really clever stuff with multiple approaches to handling attention (which is quite involved), such as paged handling of the attention KV cache. Again, this helps vLLM make the most of models that would otherwise struggle to fit in a consumer-grade GPU's VRAM.

If this isn't installed it will use PyTorch's functions instead.

But I want it to be fast, and use every CUDA core of my silly new GPU. So I had to build from source so it included the correct versions of PyTorch and CUDA libraries.

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive ./flashinfer
cd flashinfer
pip install --no-cache-dir ninja build packaging "setuptools>=75.6.0"
python3 -m flashinfer.aot
python3 -m build --no-isolation --wheel
pip install dist/flashinfer*.whl
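
Assuming the wheel built and installed cleanly, a quick import check should confirm that vLLM will be able to find it:

python3 -c "import flashinfer; print(flashinfer.__version__)"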

vLLM

And here we are, at the main event. Starting off with some additional pre-reqs to ensure that the expected versions are installed. scipy is the exception: it's left unpinned so that a compatible version matching the distro and Python version gets installed - leave that line out entirely and the later install will try to pull in a release that needs a newer version of Python. Such a pain.

pip install aiohttp==3.11.18
pip install protobuf==5.29.4
pip install click==8.1.8
pip install rich==13.7.1
pip install starlette==0.46.2
pip install scipy

And then on to building vLLM with the previously installed PyTorch and other libraries and frameworks above.

git clone https://github.com/vllm-project/vllm.git ./vllm
cd vllm
python3 use_existing_torch.py
pip install --no-cache-dir -r requirements/build.txt
pip install --no-cache-dir setuptools_scm
python3 setup.py develop
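
If the build makes it through without errors, a quick check from inside the venv confirms the package is importable and reports its version:

python3 -c "import vllm; print(vllm.__version__)"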

Accelerate

This is a library for running and training models across whatever devices are available, rather than having to account for each hardware setup in your code.

pip install --no-cache-dir accelerate

Getting Models

For vLLM to download models from Hugging Face you need to provide a token; the easiest way is to log in using the huggingface_hub package's CLI.

pip install --upgrade huggingface_hub
huggingface-cli login

Depending on the model you may even need to accept the licence for the model or the collection.

💡
TIP! It's worth keeping an eye on the models you've tried, delete ones you are not using - as they all get cached in ~/.cache/huggingface/hub
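
The huggingface_hub CLI can also report what's sitting in that cache and interactively delete model revisions you no longer need (delete-cache may need the extra CLI dependencies, i.e. pip install huggingface_hub[cli]):

huggingface-cli scan-cache
huggingface-cli delete-cache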

Finally

Start the vLLM service with its OpenAI API compatible endpoints.

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-32B-AWQ --max-model-len=2048 --gpu-memory-utilization=0.9

and we have an OpenAI API compatible service running on http://localhost:8000
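
A quick way to confirm it's up before writing any client code - the model listed in the response should match the one passed on the command line:

curl http://localhost:8000/v1/models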

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Streamed chat completion
stream = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",
    messages=[
        {"role": "system", "content":"You're a frustrated friend, trying to explain movie plots to me, but I just don't get it. So you always try to oversimplyfy sometimes to comic effect. No thinking or reasoning. Answer in 100 words or less."},
        {"role": "user", "content": "Explain the plot of Bladerunner."}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Which streams back something like:

Okay, so Blade Runner is about these replicants, which are like fake humans, right? They’re used for stuff on other planets. The main guy, Deckard, hunts them down because they’re breaking the rules. The twist is he might be one too. It’s all super confusing but kinda cool. Like, who are you if you look and feel human? And there’s rain, a lot of rain. And a unicorn. Yeah, a unicorn dream. It’s cyberpunk. It’s sad. It’s... deep? I don’t get it either, but it’s famous.

And on that bombshell! Thanks for reading. I hope you've managed to get this working as smoothly as I did (laughs in blatant lie).