AI Engineering Wiki


What is a Large Language Model (LLM)?

How language models work, why they sometimes hallucinate, and what you need to run them locally.

Reading time: 12 min · Last updated: March 2026

At a Glance

A Large Language Model (LLM) is a neural network trained on massive amounts of text that understands, generates, and translates language. LLMs predict, token by token, what should come next. They can run locally on your own hardware: no cloud, no dependency.

What exactly is an LLM?

A Large Language Model is a form of artificial intelligence based on the Transformer architecture. The model was trained on billions of texts from the internet (books, Wikipedia, forums, scientific papers) and learned statistical patterns in language.

The core idea: An LLM calculates a probability for each possible next word and then selects the most likely one. That sounds simple, but with 70 billion parameters and 128,000 tokens of context, the results are surprisingly good.
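That core idea can be sketched in a few lines. The probability table below is hand-made and purely illustrative; a real model computes these probabilities with billions of parameters:

```python
# Toy next-token predictor. The probability table is hypothetical and
# stands in for what a real model computes from its parameters.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.4, "dog": 0.35, "car": 0.25},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"barked": 0.7, "sat": 0.3},
    "sat": {".": 1.0},
    "barked": {".": 1.0},
}

def next_token(token: str) -> str:
    """Greedy decoding: always pick the most probable next token."""
    probs = NEXT_TOKEN_PROBS[token]
    return max(probs, key=probs.get)

def generate(start: str, max_tokens: int = 10) -> list[str]:
    """Repeat the prediction step until no continuation is known."""
    out = [start]
    while out[-1] in NEXT_TOKEN_PROBS and len(out) < max_tokens:
        out.append(next_token(out[-1]))
    return out
```

Here `generate("the")` yields `["the", "cat", "sat", "."]`: each step is nothing more than a lookup of the most likely continuation. Real models additionally sample with a temperature instead of always taking the maximum.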

The Transformer Architecture (Simplified)

Since the paper "Attention is All You Need" (2017), all modern LLMs are based on Transformers. The core idea: every word in a text "looks at" all other words simultaneously and learns which connections matter.

Simplified Transformer Architecture

Input Text → Tokenizer → Embedding → Self-Attention → Feed-Forward → Output (next token)

What is Self-Attention?

Self-Attention is the mechanism that helps the model understand which words in a sentence belong together. When you write "The dog chased the cat because he was hungry", attention helps the model understand that "he" refers to "dog", not "cat".
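A minimal sketch of scaled dot-product attention, the formula behind this mechanism (one head, plain Python; real models first project each token into query, key, and value vectors with learned weight matrices, which is omitted here):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q: list[list[float]],
                   k: list[list[float]],
                   v: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention: every query vector scores every key,
    and the output is the weight-averaged value vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        # How strongly does this token attend to each other token?
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```

With a single token, the output is exactly its value vector; with two identical keys, the output is the average of both values, which is precisely the "looking at all other words" described above.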

Tokens: How LLMs Process Text

LLMs do not read words; they read tokens. A token is a text fragment, often a word or part of a word. "Datenschutzgrundverordnung" (German for GDPR), for example, is split into 3-4 tokens. English texts require fewer tokens than German because most models were trained primarily on English data.

Rule of Thumb for Tokens

1 token is roughly 3/4 of an English word. For German, count about 1 token per half word. A typical paragraph (100 words) is approximately 130-150 tokens. A context window of 128K equals roughly 200 pages of text.
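The rule of thumb translates directly into a small estimator. The per-word factors below are the rough values from the paragraph above, not measurements from a real tokenizer:

```python
def estimate_tokens(word_count: int, language: str = "en") -> int:
    """Rough token estimate: ~4/3 tokens per English word
    (1 token is about 3/4 of a word), ~2 tokens per German word."""
    tokens_per_word = {"en": 4 / 3, "de": 2.0}[language]
    return round(word_count * tokens_per_word)
```

For a 100-word English paragraph this gives 133 tokens, inside the 130-150 range above. For precise counts, use the tokenizer of the specific model.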

LLM vs. Search Engine

A common misconception: LLMs are not search engines. They do not "know" anything β€” they calculate which answer is statistically most likely.

Property | Search Engine (Google) | LLM (ChatGPT, Llama)
Data Source | Live index of the internet | Training data (cutoff date)
Timeliness | Real-time | Knowledge ends at training cutoff
Response Format | Links to websites | Flowing text, code, tables
Accuracy | Source verifiable | Can hallucinate (made-up facts)
Personalization | Based on search history | Based on conversation
Cost | Free (with ads) | API costs or local hardware

Hallucinations: When LLMs Make Things Up

LLMs can invent facts that sound convincing but are wrong. This happens because they do not "know"; they calculate statistical probabilities. When no good answer exists in the training data, they still generate something plausible.

Hallucination Risk

LLMs invent citations, laws, URLs, and statistics. Especially dangerous for: legal texts, medical advice, historical facts, and technical specifications. ALWAYS verify output before using it.

Reducing Hallucinations

RAG (Retrieval Augmented Generation) is the best approach: instead of letting the model "guess", you feed it real documents as context. The model then responds based on your data rather than its training data. More in the RAG Complete Guide.
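A toy sketch of the retrieval step: real RAG systems rank documents by embedding similarity, but simple keyword overlap shows the principle. The documents and prompt wording here are made up for illustration:

```python
def score(query: str, doc: str) -> int:
    """Naive relevance score: shared lowercase words.
    Real RAG uses embedding similarity instead."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(query: str, docs: list[str], top_k: int = 1) -> str:
    """Retrieve the best-matching documents and stuff them into the prompt,
    so the model answers from your data instead of guessing."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our office is open Monday to Friday, 9 to 17.",
    "The cafeteria serves lunch from 11:30 to 14:00.",
]
prompt = build_rag_prompt("When is the office open?", docs)
```

The resulting prompt contains only the office-hours document; the model is then explicitly instructed to stay within that context, which is what reduces hallucinations.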

Model Sizes: From 7B to 70B

"B" stands for billion parameters. More parameters means more "knowledge" and better quality, but also more VRAM and slower responses. The art lies in finding the right trade-off.

Size | VRAM (Q4) | Speed (RTX 3090) | Quality | Example Models
7-8B | ~5 GB | ~112 tok/s | Good for simple tasks | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B
13-14B | ~10 GB | 43-57 tok/s | Solid all-rounders | Qwen3 14B, DeepSeek R1 14B
24-32B | ~16-20 GB | ~20-30 tok/s | Near cloud quality | Mistral Small 3.1 24B, Qwen 2.5 32B
70B | ~40 GB | Does NOT fit on 24 GB GPU | Best local quality | Llama 3.3 70B, Qwen 2.5 72B

70B Needs More Than 24 GB VRAM

A 70B model in Q4_K_M quantization requires about 40 GB VRAM. That does NOT fit on a single RTX 3090 or RTX 4090 (each 24 GB). For 70B you need 48 GB+ (e.g., 2x RTX 3090 or an RTX 6000 Ada). With 24 GB VRAM, the limit is around 34B models.
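The numbers above follow from simple arithmetic: parameters times bits per weight, divided by 8, gives the bytes needed for the weights alone (KV cache and activations need extra on top). A quick estimator, assuming roughly 4.5 bits per weight for Q4_K_M quantization:

```python
def vram_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate VRAM for the model weights alone, in GB.
    ~4.5 bits/weight is a rough average for Q4_K_M; KV cache
    and activations come on top of this."""
    # params * bits / 8 = bytes; billions of params -> GB directly.
    return round(params_billion * bits_per_weight / 8, 1)
```

`vram_gb(70)` gives 39.4 GB (matching the "about 40 GB" above, and clearly over a single 24 GB card), while `vram_gb(8)` gives 4.5 GB, comfortable on an 8 GB GPU.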

Hardware Recommendation

RTX 4060 (8 GB VRAM): 7B models, no problem. RTX 4070 Ti Super (16 GB): Up to 14B comfortably. RTX 3090/4090 (24 GB): Up to 32-34B quantized. The RTX 3090 used (EUR 750-1,123) remains the value king for local AI.

Quantization: Large Models on Small Hardware

Quantization reduces the precision of model weights from 16- or 32-bit floating point numbers to 4 or 8 bits. This cuts the VRAM requirement to a fraction of the original with minimal quality loss.

Format | Size vs. Original | Quality | Recommendation
FP16 / BF16 | 50% | 100% (lossless) | When VRAM is not an issue
Q5_K_M | ~35% | ~99% | Highest quality with compression
Q4_K_M | ~25% | ~95% | Best trade-off (standard)
Q3_K_M | ~20% | ~85% | Only when VRAM is extremely tight

With Ollama, most models default to Q4_K_M quantization. You do not need to configure anything extra β€” just ollama run llama3.3 and go.
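What quantization actually does can be demonstrated on a toy weight vector. This is plain symmetric round-to-nearest 4-bit quantization, a deliberate simplification; GGUF formats like Q4_K_M additionally use block-wise scales and other tricks to keep the error low:

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization: map each float to an integer
    in [-8, 7] plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the 4-bit integers."""
    return [x * scale for x in q]

w = [0.12, -0.56, 0.33, 0.7, -0.07]
q, s = quantize_4bit(w)
restored = dequantize(q, s)
```

Each weight now needs 4 bits instead of 16, and every restored value is within half a scale step of the original, which is why quality drops only a few percent.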

Local vs. Cloud: Where Should the LLM Run?

The key question for every business: own hardware or cloud API? Both have their place.

Criterion | Cloud API | Local (Self-hosted)
Quality | Best available models | For simple tasks ~95% equivalent, reasoning 20-25% weaker
Privacy | Data goes to third parties (US) | Data stays with you (GDPR)
Monthly Cost | EUR 50-500+ (usage-based) | ~EUR 49 electricity (AT, 50% load) + EUR 750-2,000 hardware one-time
Hardware Needed | No | GPU from EUR 350, used RTX 3090 from EUR 750
Availability | Internet required | Runs offline
Maintenance | None | Updates, monitoring (~1h/month)
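With the cost figures from the table, the break-even point of buying hardware versus paying for an API is a single division. The EUR 200/month API bill below is an assumed mid-range value from the EUR 50-500 span, not a measurement:

```python
def breakeven_months(hardware_eur: float, local_monthly_eur: float,
                     cloud_monthly_eur: float) -> float:
    """Months until a one-time hardware purchase beats recurring
    cloud fees. Ignores maintenance time; all inputs are estimates."""
    saving = cloud_monthly_eur - local_monthly_eur
    if saving <= 0:
        return float("inf")  # cloud is cheaper per month, never breaks even
    return round(hardware_eur / saving, 1)

# Used RTX 3090 (EUR 750) + ~EUR 49/month electricity vs. EUR 200/month API:
months = breakeven_months(750, 49, 200)
```

Under these assumptions the hardware pays for itself in about five months; with a light EUR 50/month API usage it essentially never does, which is why the decision depends on your workload.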

Honest Benchmark: Cloud vs. Local

The quality gap between cloud models and local models is real. Here are honest comparison numbers (as of March 2026):

Benchmark | GPT-4o (Cloud) | Llama 3.3 70B (Local) | Source
MMLU (Knowledge) | 85.9% | 86.0% | Vellum
HumanEval (Code) | 84% | 88.4% | Vellum / Bind AI
IFEval (Instructions) | 84.6 | 92.1 | Vellum
MATH (Mathematics) | -- | 77% | Vellum

Reading Benchmarks Correctly

Llama 3.3 70B surpasses GPT-4o in some benchmarks (MMLU, HumanEval, IFEval). But: 70B does NOT fit on a single 24 GB GPU. For local use, 8B-34B models are realistic β€” and the gap to cloud models is larger there, especially for complex reasoning.

The Quality Gap is REAL

Especially for complex reasoning (logical deductions, multi-step analysis, legal argumentation), cloud is clearly ahead. Local models are not "almost as good" β€” they are measurably worse. Hiding this would be dishonest.

Where Local is Still Enough

For 80% of everyday tasks (data extraction, classification, simple Q&A, summaries), local models are sufficient. For complex reasoning: use a cloud API as backup. The most honest approach is hybrid β€” local where it works, cloud where it counts.

Our Recommendation

Start local with Ollama + a 7B or 14B model. For tasks where quality is critical (e.g., contracts, complex analyses), use a cloud API as backup. This saves money and keeps your data under control. More: Local vs. Cloud: The TCO Comparison

Get Started in 5 Minutes

You do not need an ML degree to run an LLM locally. With Ollama it takes 3 steps:

1. Install Ollama: download from ollama.com (available for Windows, Mac, and Linux).

2. Start a model: ollama run llama3.3

3. Ask questions: the model runs on your GPU. No cloud, no API keys, no costs. The REST API is available at http://localhost:11434.
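That REST API can be called with nothing but the standard library. The /api/generate endpoint, the "stream" flag, and the "response" field match Ollama's documented API; the example prompt is made up:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # "stream": False returns one JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama and return the answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# ask("llama3.3", "Explain tokens in one sentence.")  # needs Ollama running
```

No API key, no SDK: any language that can send an HTTP POST can talk to the local model.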

Key Takeaways

  • LLMs predict the next token β€” they do not "know" anything, they calculate probabilities.
  • More parameters = better quality, but more VRAM and slower. Q4_K_M quantization is the best trade-off.
  • LLMs hallucinate. Always verify critical outputs. RAG significantly reduces the risk.
  • Local LLMs on your own hardware (Ollama) are GDPR-compliant. RTX 3090 at 50% load: ~EUR 49/month electricity (AT: EUR 0.34/kWh).
  • To get started: install Ollama, run llama3.3, and you are up and running in 5 minutes.



Next step: move from knowledge to implementation

If you want more than theory: setups, workflows and templates from real operations for teams that want local, documented AI systems.
