Introduction
In recent years, Apple Silicon Macs (especially the M4 Max and newer chips) have emerged as powerful platforms for local AI model inference.
This article presents a performance comparison of LLM (Large Language Model) runtimes on macOS, specifically Ollama, LMStudio, MLX-ML, and Llama, to understand their efficiency and behavior on Apple Silicon.
🔧 Test Environment
OS: macOS Sequoia 15.6.1
CPU: Apple M4 Max
Memory: 48 GiB
🧠 Models Used
| Model | Framework | Notes |
| --- | --- | --- |
| openai/gpt-oss-20b | LMStudio, Ollama | Standard model |
| InferenceIllusionist/gpt-oss-20b-MLX-4bit | MLX-ML | MLX-optimized version for Apple Silicon |
Model Parameters
```json
{
  "temperature": 0.8,
  "top_p": 0.95,
  "min_p": 0.05,
  "top_k": 40,
  "num_ctx": 4096,
  "num_batch": 64
}
```
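For illustration only (this is an assumed harness, not necessarily the exact one used in the tests), the sketch below shows one way these options can be passed to a locally running Ollama server through its `/api/generate` REST endpoint; the model tag and prompt are placeholders.

```python
import json
import urllib.request

# Hypothetical sketch: send the parameters above to a local Ollama server.
# Assumes Ollama is running on its default port (11434) and the model has
# already been pulled; the model tag and prompt are placeholders.
payload = {
    "model": "gpt-oss:20b",
    "prompt": "Explain Apple MLX in one sentence.",
    "stream": False,
    "options": {
        "temperature": 0.8,
        "top_p": 0.95,
        "min_p": 0.05,
        "top_k": 40,
        "num_ctx": 4096,
        "num_batch": 64,
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```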
Performance Evaluation Metrics
Two primary performance metrics were measured:

- TTFT (Time To First Token): the time between sending a prompt and receiving the first token of output.
- TPS (Tokens Per Second): the number of tokens generated per second.

In addition, CPU and memory usage were continuously monitored throughout the tests (a sketch of how TTFT and TPS can be measured follows below).
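As a rough illustration of how these metrics can be collected, the sketch below streams a completion from a local Ollama endpoint and derives TTFT and an approximate TPS from the timing of the streamed chunks. The endpoint, model tag, and the simplification of treating each streamed chunk as one token are assumptions, not the exact measurement harness used for the results in this article.

```python
import json
import time
import urllib.request

def measure_ttft_tps(prompt: str, model: str = "gpt-oss:20b") -> tuple[float, float]:
    """Measure TTFT and approximate TPS against a local Ollama streaming endpoint.

    Hypothetical sketch: assumes Ollama's default port (11434) and treats each
    streamed chunk as one token, which is an approximation.
    """
    payload = {"model": model, "prompt": prompt, "stream": True}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line while streaming
            chunk = json.loads(line)
            if chunk.get("response"):
                if first_token_at is None:
                    first_token_at = time.perf_counter()  # first token arrived
                token_count += 1
            if chunk.get("done"):
                break
    end = time.perf_counter()
    ttft = first_token_at - start
    tps = token_count / (end - first_token_at)
    return ttft, tps

if __name__ == "__main__":
    ttft, tps = measure_ttft_tps("Explain Apple MLX in three sentences.")
    print(f"TTFT: {ttft:.2f} s, TPS: {tps:.1f} tokens/s")
```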
🧩 Observations and Results
| 🧱 LMStudio + gpt-oss-20b | ⚙️ MLX-ML + gpt-oss-20b-MLX | 🧰 Ollama |
| --- | --- | --- |
| Low CPU load. Memory usage roughly proportional to model size. During warm-up (initialization) and cool-down (idle), memory usage temporarily increases. | Optimized for Apple Silicon; highest performance observed. Excellent CPU efficiency with consistently high token throughput. | MLX not yet supported (as of Sept 2025, a related pull request is pending). Stable inference performance, but slower than MLX-enabled models. |
⚡ With Apple’s MLX library, even a standalone Mac can deliver practical large language model inference.
Across 1,200 test runs, MLX consistently showed superior CPU and memory efficiency, along with the best throughput.
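For readers who want to try the MLX path themselves, the minimal sketch below loads the 4-bit model from the models table using the `mlx-lm` Python package. The package choice and the exact call signatures are assumptions (the API may vary between versions), not a description of the article's own setup.

```python
# Minimal sketch, assuming the `mlx-lm` package (pip install mlx-lm).
# The model identifier is the Hugging Face repo listed in the models table.
from mlx_lm import load, generate

# Downloads (if needed) and loads the 4-bit MLX-quantized weights.
model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

# verbose=True prints generation statistics, including tokens per second.
text = generate(
    model,
    tokenizer,
    prompt="Why does 4-bit quantization help on Apple Silicon?",
    max_tokens=256,
    verbose=True,
)
print(text)
```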
Slides (in Japanese) are available here: https://file.m-cloud.dev/index.php/s/FbMXMj8Yqf8BbbG?openfile=true