Introduction

In recent years, Apple Silicon Macs (especially the M4 Max and newer chips) have emerged as powerful platforms for local AI model inference.

This article presents a performance comparison of LLM (Large Language Model) runtime environments on macOS, specifically Ollama, LMStudio, MLX-LM, and llama.cpp, to understand their efficiency and behavior on Apple Silicon.

🔧 Test Environment

OS: macOS Sequoia 15.6.1
CPU: Apple M4 Max
Memory: 48 GiB

🧠 Models Used

Model                                       | Framework         | Notes
openai/gpt-oss-20b                          | LMStudio, Ollama  | Standard model
InferenceIllusionist/gpt-oss-20b-MLX-4bit   | MLX-LM            | MLX-optimized version for Apple Silicon

Model parameters

{
  "temperature": 0.8,
  "top_p": 0.95,
  "min_p": 0.05,
  "top_k": 40,
  "num_ctx": 4096,
  "num_batch": 64
}
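
These parameters map directly onto the "options" field of Ollama's REST API. The following is a minimal sketch, assuming a default local Ollama install on port 11434 and a hypothetical model tag "gpt-oss:20b"; it is not the exact harness used for the benchmark.

# Minimal sketch: sending the sampling parameters above to a local Ollama server.
import requests

options = {
    "temperature": 0.8,
    "top_p": 0.95,
    "min_p": 0.05,
    "top_k": 40,
    "num_ctx": 4096,
    "num_batch": 64,
}

resp = requests.post(
    "http://localhost:11434/api/generate",    # assumed default local endpoint
    json={
        "model": "gpt-oss:20b",               # assumed local model tag
        "prompt": "Summarize the MLX framework in two sentences.",
        "stream": False,
        "options": options,
    },
    timeout=300,
)
print(resp.json()["response"])

LMStudio instead exposes an OpenAI-compatible local server, so the same values have to be mapped onto that API's parameter names.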

Performance Evaluation Metrics

Two primary performance metrics were measured:
TTFT (Time To First Token) — the time between sending a prompt and receiving the first token of output.
TPS (Tokens Per Second) — the number of tokens generated per second.
In addition, CPU and memory usage were continuously monitored throughout the tests.
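
To make these two metrics concrete, the sketch below measures TTFT and TPS against a local Ollama server through its streaming API. It is illustrative only, assuming the same endpoint and model tag as above rather than reproducing the actual test harness.

import json
import time

import requests

URL = "http://localhost:11434/api/generate"    # assumed default Ollama endpoint

payload = {
    "model": "gpt-oss:20b",                    # assumed local model tag
    "prompt": "Explain the benefits of unified memory for LLM inference.",
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
tokens = 0

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)               # one JSON object per streamed line
        if chunk.get("response"):
            if first_token_at is None:
                first_token_at = time.perf_counter()   # first token arrived: TTFT
            tokens += 1                        # rough count: one chunk per token
        if chunk.get("done"):
            break

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.2f} s")
print(f"TPS:  {tokens / (end - first_token_at):.1f} tokens/s")

Ollama's final streamed object also includes eval_count and eval_duration fields, which give an exact token count and generation time if the per-chunk approximation above is too coarse.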

🧩 Observations and Results

🧱 LMStudio + gpt-oss-20b
Low CPU load
Memory usage roughly proportional to model size
Memory usage temporarily increases during warm-up (initialization) and cool-down (idle)

⚙️ MLX-LM + gpt-oss-20b-MLX
Optimized for Apple Silicon; the highest performance observed
Excellent CPU efficiency with consistently high token throughput

🧰 Ollama
MLX not yet supported (as of Sept 2025, a related pull request is pending)
Stable inference performance, but slower than the MLX-optimized model

⚡ With Apple’s MLX library, even a standalone Mac can deliver practical large language model inference.
Across 1,200 test runs, MLX consistently showed superior CPU and memory efficiency, along with the best throughput.
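
For readers who want to try the MLX path themselves, the MLX-quantized model from the table above can be driven from Python with the mlx-lm package. This is a rough sketch under the assumption of a standard mlx-lm install; the exact API surface can differ between mlx-lm releases.

# Rough sketch: running the 4-bit MLX build of gpt-oss-20b via mlx-lm.
# Requires an Apple Silicon Mac and `pip install mlx-lm`.
from mlx_lm import load, generate

# Downloads the model from Hugging Face on first use.
model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain why unified memory helps local LLM inference.",
    max_tokens=256,
    verbose=True,   # prints generation statistics, including tokens per second
)
print(text)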

The accompanying slides (in Japanese) are available here: https://file.m-cloud.dev/index.php/s/FbMXMj8Yqf8BbbG?openfile=true
