Basics
What is a Large Language Model (LLM)?
How language models work, why they sometimes hallucinate, and what you need to run them locally.
At a Glance
A Large Language Model (LLM) is a neural network trained on massive amounts of text that can understand, generate, and translate language. LLMs generate text by predicting, token by token, what should come next. They can run locally on your own hardware: no cloud, no external dependency.
What exactly is an LLM?
A Large Language Model is a form of artificial intelligence based on the Transformer architecture. The model was trained on billions of texts from the internet (books, Wikipedia, forums, scientific papers) and learned statistical patterns in language.
The core idea: An LLM calculates a probability for each possible next word and then selects the most likely one. That sounds simple, but with 70 billion parameters and 128,000 tokens of context, the results are surprisingly good.
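A minimal sketch of that idea in Python (the vocabulary and scores are invented for illustration; real models repeat this step for every token they generate, over a vocabulary of tens of thousands of tokens):

```python
import numpy as np

# Toy illustration: the model assigns a score (logit) to every candidate
# next token and picks the most likely one. Vocabulary and scores are made up.
vocab = ["mat", "roof", "moon", "table"]
logits = np.array([4.2, 2.1, 0.3, 1.8])

probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probability distribution
next_token = vocab[int(np.argmax(probs))]        # greedy choice: most likely token

print(dict(zip(vocab, probs.round(3))), "->", next_token)
```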
The Transformer Architecture (Simplified)
Since the paper "Attention is All You Need" (2017), all modern LLMs are based on Transformers. The core idea: every word in a text "looks at" all other words simultaneously and learns which connections matter.
(Figure: simplified Transformer architecture)
What is Self-Attention?
Self-attention is the mechanism that helps the model understand which words in a sentence belong together. When you write "The dog chased the cat because he was hungry", attention helps the model work out that "he" refers to "dog", not "cat".
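A stripped-down sketch of the attention formula from the 2017 paper, with tiny random matrices standing in for learned weights (illustration only, not production code):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention, the core of the Transformer."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token attends to the others
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                           # each token becomes a weighted mix of all tokens

# Three tokens represented by random 4-dimensional vectors (stand-ins for real embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (3, 4): one updated vector per token
```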
Tokens: How LLMs Process Text
LLMs do not read words; they read tokens. A token is a text fragment, often a word or part of a word. "Datenschutzgrundverordnung" (German for GDPR), for example, is split into 3-4 tokens. English texts require fewer tokens than German because most models were trained primarily on English data.
Rule of Thumb for Tokens
One token is roughly three quarters of an English word; in German, expect closer to two tokens per word. A typical English paragraph (100 words) is approximately 130-150 tokens. A context window of 128K tokens corresponds to roughly 200 pages of text.
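If you want to check these numbers yourself, any tokenizer library makes the split visible. A quick sketch using OpenAI's tiktoken (local models such as Llama and Mistral ship their own tokenizers, so exact counts differ slightly, but the order of magnitude holds):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The quick brown fox jumps over the lazy dog.",
             "Die Datenschutzgrundverordnung regelt den Umgang mit Daten."]:
    tokens = enc.encode(text)
    print(f"{len(text.split()):2d} words -> {len(tokens):2d} tokens | {text}")
```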
LLM vs. Search Engine
A common misconception: LLMs are not search engines. They do not "know" anything; they calculate which answer is statistically most likely.
| Property | Search Engine (Google) | LLM (ChatGPT, Llama) |
|---|---|---|
| Data Source | Live index of the internet | Training data (cutoff date) |
| Timeliness | Real-time | Knowledge ends at training cutoff |
| Response Format | Links to websites | Flowing text, code, tables |
| Accuracy | Source verifiable | Can hallucinate (made-up facts) |
| Personalization | Based on search history | Based on conversation |
| Cost | Free (with ads) | API costs or local hardware |
Hallucinations: When LLMs Make Things Up
LLMs can invent facts that sound convincing but are wrong. This happens because they do not "know" anything; they calculate statistical probabilities. When no good answer exists in the training data, they still generate something plausible.
Hallucination Risk
LLMs invent citations, laws, URLs, and statistics. Especially dangerous for: legal texts, medical advice, historical facts, and technical specifications. ALWAYS verify output before using it.
Reducing Hallucinations
RAG (Retrieval Augmented Generation) is the best approach: instead of letting the model "guess", you feed it real documents as context. The model then responds based on your data rather than its training data. More in the RAG Complete Guide.
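A deliberately tiny sketch of the idea against a local Ollama instance (retrieval is reduced to naive keyword matching; a real setup would use embeddings and a vector store, and the example documents and model name are placeholders):

```python
import requests

# Two example "documents" standing in for your real data.
documents = [
    "Our office is open Monday to Friday, 8:00 to 16:00.",
    "Support tickets are answered within one business day.",
]

def retrieve(question: str) -> str:
    # Naive retrieval: keep documents that share at least one word with the question.
    words = set(question.lower().split())
    return "\n".join(d for d in documents if words & set(d.lower().split()))

def ask(question: str, model: str = "llama3.1") -> str:
    prompt = (
        "Answer ONLY from the context below. If the answer is not in the context, say so.\n\n"
        f"Context:\n{retrieve(question)}\n\nQuestion: {question}"
    )
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    return r.json()["response"]

print(ask("When is the office open?"))
```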
Model Sizes: From 7B to 70B
"B" stands for billion parameters. More parameters means more "knowledge" and better quality β but also more VRAM and slower responses. The art lies in finding the right trade-off.
| Size | VRAM (Q4) | Speed (RTX 3090) | Quality | Example Models |
|---|---|---|---|---|
| 7-8B | ~5 GB | ~112 tok/s | Good for simple tasks | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B |
| 13-14B | ~10 GB | 43-57 tok/s | Solid all-rounders | Qwen3 14B, DeepSeek R1 14B |
| 24-32B | ~16-20 GB | ~20-30 tok/s | Near cloud quality | Mistral Small 3.1 24B, Qwen 2.5 32B |
| 70B | ~40 GB | Does NOT fit on 24 GB GPU | Best local quality | Llama 3.3 70B, Qwen 2.5 72B |
70B Needs More Than 24 GB VRAM
A 70B model in Q4_K_M quantization requires about 40 GB VRAM. That does NOT fit on a single RTX 3090 or RTX 4090 (each 24 GB). For 70B you need 48 GB+ (e.g., 2x RTX 3090 or an RTX 6000 Ada). With 24 GB VRAM, the limit is around 34B models.
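A back-of-the-envelope way to check whether a model fits your card (a rough sketch; the 4.85 bits per weight average for Q4_K_M is an approximation, and longer contexts add to the total):

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Very rough weight footprint in GB for a quantized model.

    Q4_K_M averages about 4.85 bits per weight. Add a few GB on top for the
    KV cache and runtime overhead, and always check the real file size."""
    return params_billion * bits_per_weight / 8

for size in (8, 14, 32, 70):
    print(f"{size}B -> ~{vram_estimate_gb(size):.0f} GB for the weights alone (Q4_K_M)")
```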
Hardware Recommendation
RTX 4060 (8 GB VRAM): runs 7B models without problems. RTX 4070 Ti Super (16 GB): up to 14B comfortably. RTX 3090/4090 (24 GB): up to 32-34B quantized. A used RTX 3090 (EUR 750-1,123) remains the value king for local AI.
Quantization: Large Models on Small Hardware
Quantization reduces the precision of the model weights from 16- or 32-bit floating point numbers to 4 or 8 bits. This shrinks the VRAM requirement to a fraction of the original (a Q4 model needs roughly a quarter of the FP32 size) with minimal quality loss.
| Format | Size vs. Original | Quality | Recommendation |
|---|---|---|---|
| FP16 / BF16 | 50% | 100% (lossless) | When VRAM is not an issue |
| Q5_K_M | ~35% | ~99% | Highest quality with compression |
| Q4_K_M | ~25% | ~95% | Best trade-off (standard) |
| Q3_K_M | ~20% | ~85% | Only when VRAM is extremely tight |
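To see why the quality loss stays small, here is a toy sketch of the basic idea: store blocks of weights as small integers plus one scale factor instead of full floats. Real formats such as Q4_K_M work per block with additional refinements.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=32).astype(np.float32)   # one block of FP32 weights

scale = np.abs(weights).max() / 7                               # map the block onto -7..7
quantized = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)

restored = quantized * scale                                    # what the model computes with
print(f"max rounding error in this block: {np.abs(weights - restored).max():.5f}")
```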
With Ollama, most models default to Q4_K_M quantization. You do not need to configure anything extra: just `ollama run llama3.1` and go.
Local vs. Cloud: Where Should the LLM Run?
The key question for every business: own hardware or cloud API? Both have their place.
| Criterion | Cloud API | Local (Self-hosted) |
|---|---|---|
| Quality | Best available models | For simple tasks ~95% equivalent, reasoning 20-25% weaker |
| Privacy | Data goes to third parties (US) | Data stays with you (GDPR) |
| Monthly Cost | EUR 50-500+ (usage-based) | ~EUR 49 electricity (AT, 50% load) + EUR 750-2,000 hardware one-time |
| Hardware Needed | No | GPU from EUR 350, used RTX 3090 from EUR 750 |
| Availability | Internet required | Runs offline |
| Maintenance | None | Updates, monitoring (~1h/month) |
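The electricity figure in the table can be reproduced with a quick calculation (the 200 W average system draw for a workstation with an RTX 3090 at ~50% load is our assumption; measure your own machine for real numbers):

```python
avg_draw_kw = 0.200             # assumed average system draw at ~50% load
hours_per_month = 24 * 30
price_per_kwh = 0.34            # Austrian residential price used in this article

print(f"~EUR {avg_draw_kw * hours_per_month * price_per_kwh:.0f} per month")   # ~EUR 49
```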
Honest Benchmark: Cloud vs. Local
The quality gap between cloud models and local models is real. Here are honest comparison numbers (as of March 2026):
| Benchmark | GPT-4o (Cloud) | Llama 3.3 70B (Local) | Source |
|---|---|---|---|
| MMLU (Knowledge) | 85.9% | 86.0% | Vellum |
| HumanEval (Code) | 84% | 88.4% | Vellum / Bind AI |
| IFEval (Instructions) | 84.6 | 92.1 | Vellum |
| MATH (Mathematics) | -- | 77% | Vellum |
Reading Benchmarks Correctly
Llama 3.3 70B surpasses GPT-4o in some benchmarks (MMLU, HumanEval, IFEval). But: 70B does NOT fit on a single 24 GB GPU. For local use, 8B-34B models are realistic, and the gap to cloud models is larger there, especially for complex reasoning.
The Quality Gap is REAL
Especially for complex reasoning (logical deductions, multi-step analysis, legal argumentation), cloud is clearly ahead. Local models are not "almost as good"; they are measurably worse. Hiding this would be dishonest.
Where Local is Still Enough
For 80% of everyday tasks (data extraction, classification, simple Q&A, summaries), local models are sufficient. For complex reasoning, use a cloud API as backup. The most honest approach is hybrid: local where it works, cloud where it counts.
Our Recommendation
Start local with Ollama and a 7B or 14B model. For tasks where quality is critical (e.g., contracts, complex analyses), use a cloud API as backup. This saves money and keeps your data under control. More in Local vs. Cloud: The TCO Comparison.
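What that hybrid setup can look like in code, as a rough sketch (the task categories, the model name, and the cloud fallback are placeholders, not a finished router):

```python
import requests

# Simple tasks go to the local Ollama model, everything else falls back to a cloud API.
SIMPLE_TASKS = {"classification", "extraction", "summary"}

def ask_local(prompt: str, model: str = "llama3.1") -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    return r.json()["response"]

def ask_cloud(prompt: str) -> str:
    raise NotImplementedError("plug in your cloud provider's SDK here")

def ask(prompt: str, task_type: str) -> str:
    return ask_local(prompt) if task_type in SIMPLE_TASKS else ask_cloud(prompt)

print(ask("Summarize: ...", task_type="summary"))
```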
Get Started in 5 Minutes
You do not need an ML degree to run an LLM locally. With Ollama it takes 3 steps:
Install Ollama
Download from ollama.com; available for Windows, Mac, and Linux.
Start a model
`ollama run llama3.1`
Ask questions
The model runs on your GPU. No cloud, no API keys, no costs. The REST API is available at http://localhost:11434.
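Calling that REST API from Python looks like this (a minimal sketch; the model name assumes you pulled llama3.1 in the previous step):

```python
import requests

# Same server the CLI talks to: no API key, everything stays on your machine.
# /api/chat is Ollama's chat endpoint; /api/generate works for single prompts.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.1",
          "messages": [{"role": "user", "content": "Explain tokens in one sentence."}],
          "stream": False},
)
print(response.json()["message"]["content"])
```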
Key Takeaways
- LLMs predict the next token; they do not "know" anything, they calculate probabilities.
- More parameters = better quality, but more VRAM and slower. Q4_K_M quantization is the best trade-off.
- LLMs hallucinate. Always verify critical outputs. RAG significantly reduces the risk.
- Local LLMs on your own hardware (Ollama) are GDPR-compliant. RTX 3090 at 50% load: ~EUR 49/month electricity (AT: EUR 0.34/kWh).
- To get started: install Ollama, run a small model such as `llama3.1`, and you are up and running in 5 minutes.
Sources
- Vellum: Llama 3.3 70B vs GPT-4o - MMLU, HumanEval, IFEval, MATH benchmark data
- Bind AI: Llama 3.3 70B vs GPT-4o Coding - HumanEval comparison
- IntuitionLabs: 24 GB GPU Optimization - VRAM limit 24 GB, max ~34B quantized
- LocalAIMaster: Best GPUs for AI - inference speed of the RTX 3090 (tok/s)
- CoreLab: LLM GPU Benchmarks - 8B models at ~112 tok/s on an RTX 3090
- GlobalPetrolPrices: Austria Electricity Prices - residential electricity price in Austria, EUR 0.34/kWh (2026)
- BestValueGPU: RTX 3090 Price History - used prices EUR 750-1,123
Next step: move from knowledge to implementation
If you want more than theory: setups, workflows, and templates from real-world operations, for teams that want local, documented AI systems.
- Local and self-hosted by default
- Documented and auditable
- Built from our own runtime
- Made in Austria