Basics
What is a Large Language Model (LLM)?
How language models work, why they sometimes hallucinate, and what you need to run them locally.
At a Glance
A Large Language Model (LLM) is a neural network trained on massive amounts of text that can understand, generate, and translate language. LLMs generate text by predicting, token by token, what should come next. They can run locally on your own hardware: no cloud, no external dependency.
What exactly is an LLM?
A Large Language Model is a form of artificial intelligence based on the Transformer architecture. The model was trained on billions of texts from the internet (books, Wikipedia, forums, scientific papers) and learned statistical patterns in language.
The core idea: An LLM calculates a probability for each possible next word and then selects the most likely one. That sounds simple, but with 70 billion parameters and 128,000 tokens of context, the results are surprisingly good.
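A minimal sketch of that idea in Python (the vocabulary and scores are invented for illustration; real models repeat this step for every token they generate, over a vocabulary of tens of thousands of tokens):

```python
import numpy as np

# Toy illustration: the model assigns a score (logit) to every candidate
# next token and picks the most likely one. Vocabulary and scores are made up.
vocab = ["mat", "roof", "moon", "table"]
logits = np.array([4.2, 2.1, 0.3, 1.8])

probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probability distribution
next_token = vocab[int(np.argmax(probs))]        # greedy choice: most likely token

print(dict(zip(vocab, probs.round(3))), "->", next_token)
```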
The Transformer Architecture (Simplified)
Since the paper "Attention is All You Need" (2017), all modern LLMs are based on Transformers. The core idea: every word in a text "looks at" all other words simultaneously and learns which connections matter.
(Figure: simplified Transformer architecture)
What is Self-Attention?
Self-attention is the mechanism that helps the model understand which words in a sentence belong together. When you write "The dog chased the cat because he was hungry", attention helps the model work out that "he" refers to "dog", not "cat".
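A stripped-down sketch of the attention formula from the 2017 paper, with tiny random matrices standing in for learned weights (illustration only, not production code):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention, the core of the Transformer."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token attends to the others
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                           # each token becomes a weighted mix of all tokens

# Three tokens represented by random 4-dimensional vectors (stand-ins for real embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (3, 4): one updated vector per token
```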
Tokens: How LLMs Process Text
LLMs do not read words; they read tokens. A token is a text fragment, often a word or part of a word. "Datenschutzgrundverordnung" (German for GDPR), for example, is split into 3-4 tokens. English texts require fewer tokens than German because most models were trained primarily on English data.
Rule of Thumb for Tokens
One token is roughly three quarters of an English word; in German, expect closer to two tokens per word. A typical English paragraph (100 words) is approximately 130-150 tokens. A context window of 128K tokens corresponds to roughly 200 pages of text.
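If you want to check these numbers yourself, any tokenizer library makes the split visible. A quick sketch using OpenAI's tiktoken (local models such as Llama and Mistral ship their own tokenizers, so exact counts differ slightly, but the order of magnitude holds):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The quick brown fox jumps over the lazy dog.",
             "Die Datenschutzgrundverordnung regelt den Umgang mit Daten."]:
    tokens = enc.encode(text)
    print(f"{len(text.split()):2d} words -> {len(tokens):2d} tokens | {text}")
```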
LLM vs. Search Engine
A common misconception: LLMs are not search engines. They do not "know" anything; they calculate which answer is statistically most likely.
| Property | Search Engine (Google) | LLM (ChatGPT, Llama) |
|---|---|---|
| Data Source | Live index of the internet | Training data (cutoff date) |
| Timeliness | Real-time | Knowledge ends at training cutoff |
| Response Format | Links to websites | Flowing text, code, tables |
| Accuracy | Source verifiable | Can hallucinate (made-up facts) |
| Personalization | Based on search history | Based on conversation |
| Cost | Free (with ads) | API costs or local hardware |
Hallucinations: When LLMs Make Things Up
LLMs can invent facts that sound convincing but are wrong. This happens because they do not "know" anything; they calculate statistical probabilities. When no good answer exists in the training data, they still generate something plausible.
Hallucination Risk
LLMs invent citations, laws, URLs, and statistics. Especially dangerous for: legal texts, medical advice, historical facts, and technical specifications. ALWAYS verify output before using it.
Reducing Hallucinations
RAG (Retrieval Augmented Generation) is the best approach: instead of letting the model "guess", you feed it real documents as context. The model then responds based on your data rather than its training data. More in the RAG Complete Guide.
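A deliberately tiny sketch of the idea against a local Ollama instance (retrieval is reduced to naive keyword matching; a real setup would use embeddings and a vector store, and the example documents and model name are placeholders):

```python
import requests

# Two example "documents" standing in for your real data.
documents = [
    "Our office is open Monday to Friday, 8:00 to 16:00.",
    "Support tickets are answered within one business day.",
]

def retrieve(question: str) -> str:
    # Naive retrieval: keep documents that share at least one word with the question.
    words = set(question.lower().split())
    return "\n".join(d for d in documents if words & set(d.lower().split()))

def ask(question: str, model: str = "llama3.1") -> str:
    prompt = (
        "Answer ONLY from the context below. If the answer is not in the context, say so.\n\n"
        f"Context:\n{retrieve(question)}\n\nQuestion: {question}"
    )
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    return r.json()["response"]

print(ask("When is the office open?"))
```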
Model Sizes: From 7B to 70B
"B" stands for billion parameters. More parameters means more "knowledge" and better quality β but also more VRAM and slower responses. The art lies in finding the right trade-off.
| Size | VRAM (Q4) | Speed (RTX 3090) | Quality | Example Models |
|---|---|---|---|---|
| 7-8B | ~5 GB | ~112 tok/s | Good for simple tasks | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B |
| 13-14B | ~10 GB | 43-57 tok/s | Solid all-rounders | Qwen3 14B, DeepSeek R1 14B |
| 24-32B | ~16-20 GB | ~20-30 tok/s | Near cloud quality | Mistral Small 3.1 24B, Qwen 2.5 32B |
| 70B | ~40 GB | Does NOT fit on 24 GB GPU | Best local quality | Llama 3.3 70B, Qwen 2.5 72B |
70B Needs More Than 24 GB VRAM
A 70B model in Q4_K_M quantization requires about 40 GB VRAM. That does NOT fit on a single RTX 3090 or RTX 4090 (each 24 GB). For 70B you need 48 GB+ (e.g., 2x RTX 3090 or an RTX 6000 Ada). With 24 GB VRAM, the limit is around 34B models.
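A back-of-the-envelope way to check whether a model fits your card (a rough sketch; the 4.85 bits per weight average for Q4_K_M is an approximation, and longer contexts add to the total):

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Very rough weight footprint in GB for a quantized model.

    Q4_K_M averages about 4.85 bits per weight. Add a few GB on top for the
    KV cache and runtime overhead, and always check the real file size."""
    return params_billion * bits_per_weight / 8

for size in (8, 14, 32, 70):
    print(f"{size}B -> ~{vram_estimate_gb(size):.0f} GB for the weights alone (Q4_K_M)")
```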
Hardware Recommendation
RTX 4060 (8 GB VRAM): runs 7B models without problems. RTX 4070 Ti Super (16 GB): up to 14B comfortably. RTX 3090/4090 (24 GB): up to 32-34B quantized. A used RTX 3090 (EUR 750-1,123) remains the value king for local AI.
Quantization: Large Models on Small Hardware
Quantization reduces the precision of the model weights from 16- or 32-bit floating point numbers to 4 or 8 bits. This shrinks the VRAM requirement to a fraction of the original (a Q4 model needs roughly a quarter of the FP32 size) with minimal quality loss.
| Format | Size vs. Original | Quality | Recommendation |
|---|---|---|---|
| FP16 / BF16 | 50% | 100% (lossless) | When VRAM is not an issue |
| Q5_K_M | ~35% | ~99% | Highest quality with compression |
| Q4_K_M | ~25% | ~95% | Best trade-off (standard) |
| Q3_K_M | ~20% | ~85% | Only when VRAM is extremely tight |
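To see why the quality loss stays small, here is a toy sketch of the basic idea: store blocks of weights as small integers plus one scale factor instead of full floats. Real formats such as Q4_K_M work per block with additional refinements.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=32).astype(np.float32)   # one block of FP32 weights

scale = np.abs(weights).max() / 7                               # map the block onto -7..7
quantized = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)

restored = quantized * scale                                    # what the model computes with
print(f"max rounding error in this block: {np.abs(weights - restored).max():.5f}")
```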
With Ollama, most models default to Q4_K_M quantization. You do not need to configure anything extra: just `ollama run llama3.1` and go.
Local vs. Cloud: Where Should the LLM Run?
The key question for every business: own hardware or cloud API? Both have their place.
| Criterion | Cloud API | Local (Self-hosted) |
|---|---|---|
| Quality | Best available models | For simple tasks ~95% equivalent, reasoning 20-25% weaker |
| Privacy | Data goes to third parties (US) | Data stays with you (GDPR) |
| Monthly Cost | EUR 50-500+ (usage-based) | ~EUR 49 electricity (AT, 50% load) + EUR 750-2,000 hardware one-time |
| Hardware Needed | No | GPU from EUR 350, used RTX 3090 from EUR 750 |
| Availability | Internet required | Runs offline |
| Maintenance | None | Updates, monitoring (~1h/month) |
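The electricity figure in the table can be reproduced with a quick calculation (the 200 W average system draw for a workstation with an RTX 3090 at ~50% load is our assumption; measure your own machine for real numbers):

```python
avg_draw_kw = 0.200             # assumed average system draw at ~50% load
hours_per_month = 24 * 30
price_per_kwh = 0.34            # Austrian residential price used in this article

print(f"~EUR {avg_draw_kw * hours_per_month * price_per_kwh:.0f} per month")   # ~EUR 49
```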
Honest Benchmark: Cloud vs. Local
The quality gap between cloud models and local models is real. Here are honest comparison numbers (as of March 2026):
| Benchmark | GPT-4o (Cloud) | Llama 3.3 70B (Local) | Source |
|---|---|---|---|
| MMLU (Knowledge) | 85.9% | 86.0% | Vellum |
| HumanEval (Code) | 84% | 88.4% | Vellum / Bind AI |
| IFEval (Instructions) | 84.6 | 92.1 | Vellum |
| MATH (Mathematics) | -- | 77% | Vellum |
Reading Benchmarks Correctly
Llama 3.3 70B surpasses GPT-4o in some benchmarks (MMLU, HumanEval, IFEval). But: 70B does NOT fit on a single 24 GB GPU. For local use, 8B-34B models are realistic, and the gap to cloud models is larger there, especially for complex reasoning.
The Quality Gap is REAL
Especially for complex reasoning (logical deductions, multi-step analysis, legal argumentation), cloud is clearly ahead. Local models are not "almost as good"; they are measurably worse. Hiding this would be dishonest.
Where Local is Still Enough
For 80% of everyday tasks (data extraction, classification, simple Q&A, summaries), local models are sufficient. For complex reasoning, use a cloud API as backup. The most honest approach is hybrid: local where it works, cloud where it counts.
Our Recommendation
Start local with Ollama and a 7B or 14B model. For tasks where quality is critical (e.g., contracts, complex analyses), use a cloud API as backup. This saves money and keeps your data under control. More in Local vs. Cloud: The TCO Comparison.
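What that hybrid setup can look like in code, as a rough sketch (the task categories, the model name, and the cloud fallback are placeholders, not a finished router):

```python
import requests

# Simple tasks go to the local Ollama model, everything else falls back to a cloud API.
SIMPLE_TASKS = {"classification", "extraction", "summary"}

def ask_local(prompt: str, model: str = "llama3.1") -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    return r.json()["response"]

def ask_cloud(prompt: str) -> str:
    raise NotImplementedError("plug in your cloud provider's SDK here")

def ask(prompt: str, task_type: str) -> str:
    return ask_local(prompt) if task_type in SIMPLE_TASKS else ask_cloud(prompt)

print(ask("Summarize: ...", task_type="summary"))
```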
Get Started in 5 Minutes
You do not need an ML degree to run an LLM locally. With Ollama it takes 3 steps:
Install Ollama
Download from ollama.com; available for Windows, Mac, and Linux.
Start a model
`ollama run llama3.1`
Ask questions
The model runs on your GPU. No cloud, no API keys, no costs. The REST API is available at http://localhost:11434.
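Calling that REST API from Python looks like this (a minimal sketch; the model name assumes you pulled llama3.1 in the previous step):

```python
import requests

# Same server the CLI talks to: no API key, everything stays on your machine.
# /api/chat is Ollama's chat endpoint; /api/generate works for single prompts.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.1",
          "messages": [{"role": "user", "content": "Explain tokens in one sentence."}],
          "stream": False},
)
print(response.json()["message"]["content"])
```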
Key Takeaways
- LLMs predict the next token; they do not "know" anything, they calculate probabilities.
- More parameters = better quality, but more VRAM and slower. Q4_K_M quantization is the best trade-off.
- LLMs hallucinate. Always verify critical outputs. RAG significantly reduces the risk.
- Local LLMs on your own hardware (Ollama) are GDPR-compliant. RTX 3090 at 50% load: ~EUR 49/month electricity (AT: EUR 0.34/kWh).
- To get started: install Ollama, run a small model such as `llama3.1`, and you are up and running in 5 minutes.
Sources
- Vellum: Llama 3.3 70B vs GPT-4o - MMLU, HumanEval, IFEval, MATH benchmark data
- Bind AI: Llama 3.3 70B vs GPT-4o Coding - HumanEval comparison
- IntuitionLabs: 24 GB GPU Optimization - VRAM limit 24 GB, max ~34B quantized
- LocalAIMaster: Best GPUs for AI - inference speed of the RTX 3090 (tok/s)
- CoreLab: LLM GPU Benchmarks - 8B models at ~112 tok/s on an RTX 3090
- GlobalPetrolPrices: Austria Electricity Prices - residential electricity price in Austria, EUR 0.34/kWh (2026)
- BestValueGPU: RTX 3090 Price History - used prices EUR 750-1,123
Next step: move from knowledge to implementation
If you want more than theory: setups, workflows, and templates from real-world operations, for teams that want local, documented AI systems.
- Local and self-hosted by default
- Documented and auditable
- Built from our own runtime
- Made in Austria