In the rapidly evolving world of Large Language Models (LLMs), the ability to run these powerful tools locally on your own hardware is an appealing prospect. Running AI locally gives you full control, extensive customization, and complete data privacy. This guide will walk you through building your personal AI playground, assembling an open-source stack from the model server to a feature-rich user interface and Integrated Development Environment (IDE) support.

Ollama: Your Local LLM Workhorse

At the core of our local LLM stack is Ollama, a powerful and easy-to-use tool for managing and running LLMs. On Linux, you can use the provided one-line script to install Ollama, which sets up the dependencies and GPU drivers. Alternatively, you can follow the manual installation instructions; under the hood, installing Ollama is as simple as unpacking the provided archive. The more challenging part is ensuring it uses the available GPU, a process that depends heavily on the operating system, GPU model, and driver availability.
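
For reference, the Linux one-liner is the script published on ollama.com (worth a quick read before piping it into a shell):

> curl -fsSL https://ollama.com/install.sh | sh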

Running Ollama

Once installed, start the Ollama server via ollama serve. The most critical step is verifying that it picks up your available GPU. If the GPU is detected correctly, you should see log output referencing it:

> ollama serve
source=gpu.go:217 msg="looking for compatible GPUs"
source=types.go:130 msg="inference compute" id=GPU-9ee00d0c-6562-af57-fa91-f023dae63b1e library=cuda variant=v12 compute=7.5 driver=12.9 name="NVIDIA T1200 Laptop GPU" total="3.6 GiB" available="3.6 GiB"

You can think of the Ollama server as a daemon similar to Docker's: once it is running, you can pull and run models much like Docker images. To download a model and start interacting with it, use:

> ollama run phi3

>>> Send a message (/? for help)

This will download the Phi-3 model and start a chat session. The ollama serve command runs the server that downloads and loads the model, while the ollama run command starts a client that allows you to interact with it.
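
That server also exposes a local REST API (on port 11434 by default), which is what the graphical tools later in this article talk to. A quick sanity check from another terminal:

> curl http://localhost:11434/api/generate -d '{"model": "phi3", "prompt": "Why is the sky blue?", "stream": false}'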

You can use various ollama commands to manage the downloaded models:

> ollama list
NAME                ID              SIZE      MODIFIED
deepseek-r1:1.5b    e0979632db5a    1.1 GB    10 hours ago
phi3:latest         4f2222927938    2.2 GB    23 hours ago
gemma3:latest       a2af6cc3eb7f    3.3 GB    23 hours ago

> ollama rm deepseek-r1:1.5b

GPU vs. CPU: A Tale of Two Speeds

The performance of your local LLM is heavily dependent on your hardware.

  • With a GPU: If you have a supported GPU, such as one from NVIDIA, Ollama will automatically leverage it to run the models. This results in a significant performance boost, making responses fast enough for interactive use.
  • Without a GPU: If no GPU is available, Ollama falls back to the CPU. While functional, response times will be noticeably slower, making it less practical for real-time interaction.
  • Hybrid Approach: If a model is too large to fit entirely into your GPU’s VRAM, Ollama is smart enough to split it between GPU VRAM and system RAM, running the overflow layers on the CPU. This allows you to run larger models than your GPU could handle alone, albeit with a performance trade-off.

The nvidia-smi command is a quick way to check whether your NVIDIA GPU and drivers are configured correctly.

Screenshot of the nvidia-smi command output in a terminal, showing NVIDIA driver and CUDA versions.

The btop tool is also handy for monitoring GPU versus CPU usage while running inference. Notice the heavy GPU usage and the light CPU usage in the screenshot below.

Screenshot of the btop system monitor showing CPU usage at 10% and GPU usage at 97% during LLM inference, demonstrating how Ollama offloads work to the GPU.

Finally, the ollama ps command will tell you which models are loaded and how they are utilizing the hardware (CPU or GPU):

> ollama ps
NAME           ID              SIZE      PROCESSOR          UNTIL
phi3:latest    4f2222927938    6.5 GB    42%/58% CPU/GPU    4 minutes from now

Pay attention to the model sizes in relation to your available memory. Start small and work your way up.

Open WebUI: A Rich Interface for Your LLMs

While the command line is great for quick interactions, a dedicated web interface provides a much richer user experience. This is where Open WebUI comes in.

Deploying Open WebUI is a breeze with Docker. A simple docker run command is all it takes to get a beautiful and powerful UI for your local LLMs. The GPU can be made available to the Docker container and will be leveraged for more advanced features such as voice support and Retrieval-Augmented Generation (RAG).
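
As a sketch, a typical GPU-enabled invocation looks like the following (image tag and flags taken from the Open WebUI documentation at the time of writing; the --gpus all flag assumes the NVIDIA Container Toolkit is installed, and ports and volume names can be adjusted to taste):

> docker run -d -p 3000:8080 --gpus all \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui --restart always \
    ghcr.io/open-webui/open-webui:cuda

Once the container is up, the UI is available at http://localhost:3000.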

Screenshot of the Open WebUI chat interface showing model selection, chat history, as well as the voice mode button.

Features Galore

Open WebUI is packed with features that enhance your interaction with LLMs, including:

  • Chat History: Keep track of your conversations and revisit them later.
  • Model Management: Easily switch between different models you have downloaded with Ollama.
  • RAG: This powerful feature allows your LLM to access external knowledge sources, such as your documents. By providing relevant context, RAG enables your models to answer questions about specific data with high accuracy.
  • Voice Support: Bring your conversations to life via voice input and output.

Many of the features can be configured via the rich admin UI.

Screenshot of the Open WebUI admin settings page, showing the comprehensive list of components that can be configured.

In order for the containerized Open WebUI to reach the Ollama server running on your host machine, Ollama needs to listen to all network interfaces, not just localhost. You can achieve this by starting the server with: OLLAMA_HOST=0.0.0.0 ollama serve.
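
If you installed Ollama via the Linux script, it typically runs as a systemd service, in which case the variable should be set on the service rather than in your shell. A minimal sketch, assuming the service is named ollama:

> sudo systemctl edit ollama.service
# in the override file that opens, add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"

> sudo systemctl restart ollama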

Continue: Supercharge Your Coding with Local LLMs

For developers, the ability to integrate LLMs directly into their IDE is a game-changer. Continue is an open-source VSCode extension that brings the power of your local Ollama server to your coding workflow.

With Continue, you can:

  • Get Code Suggestions: Receive intelligent code completions as you type.
  • Refactor Code: Ask your local LLM to refactor a block of code for better readability or performance.
  • Explain Code: Highlight a piece of code and get a clear explanation of what it does.
  • Generate Unit Tests: Automatically generate unit tests for your functions.

Continue allows you to select a Local Assistant setup (no login required), and it can connect to your Ollama server and detect the available models.

Screenshot of the Continue extension setup in VSCode. The "Local Assistant" provider is highlighted, which allows using local models without needing to login.

Screenshot of the Continue extension in VSCode showing a list of available LLMs from a local Ollama server, including models like phi3, gemma3, and deepseek-r1.
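
If you prefer to pin models explicitly rather than rely on auto-detection, Continue also reads a configuration file. The following is a rough sketch based on the older JSON-based configuration (newer versions use a YAML file with a slightly different schema, so treat the exact keys as illustrative):

> cat ~/.continue/config.json
{
  "models": [
    {
      "title": "phi3 (Ollama)",
      "provider": "ollama",
      "model": "phi3"
    }
  ]
}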

For Continue to be effective, especially for tasks like explaining or refactoring large blocks of code, you should increase Ollama’s context window. This allows the model to consider more of your code at once. Start the server with a larger context length like so: OLLAMA_CONTEXT_LENGTH=8192 ollama serve.
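
Alternatively, the larger context window can be baked into a model variant with a Modelfile; a minimal sketch (the phi3-8k tag is just an illustrative name):

> cat Modelfile
FROM phi3
PARAMETER num_ctx 8192

> ollama create phi3-8k -f Modelfile
> ollama run phi3-8k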

Side Note: WebLLM Chat - Local LLMs in Your Browser

As a final note, it’s worth mentioning the WebLLM Chat application. This innovative project allows you to run open-source models directly in a WebGPU-capable browser such as Chrome, entirely client-side.

Screenshot of the WebLLM Chat application running in a Chrome browser. It shows a chat with the Phi-3-mini-4k-instruct-q4f16_1-MLC-1B model, demonstrating an LLM running entirely client-side.

Once the first prompt is submitted, the application downloads the selected model and runs inference in your browser. By leveraging WebGPU, WebLLM Chat can tap into your local GPU for accelerated inference, all without installing any software. This is a fantastic demonstration of the power and flexibility of modern web technologies.

Conclusion

Setting up a personal, high-performance LLM stack on your own hardware is more accessible than ever. We’ve walked through a powerful, open-source toolkit that puts you in control:

  • Ollama provides the engine for running models efficiently.
  • Open WebUI delivers a polished, feature-rich interface for interaction.
  • Continue seamlessly integrates AI capabilities directly into your coding workflow.

These tools form the building blocks of a robust local AI ecosystem, offering a sandbox for experimentation, development, and learning — all on your machine. The world of open-source AI is evolving at a breakneck pace, and with this setup, you are perfectly positioned to explore it. What cool projects would you build with a local LLM stack?

Additional Resources