LLM

Large Language Model
LLMs shifted the paradigm. Instead of many small, specialized models, we now have one massive model trained on a very large share of publicly available text, much of it from the web.

Pre-training: They are trained on massive amounts of text to predict the next token, which forces them to pick up general patterns of language rather than a single task.
General Capabilities: Because of this scale, a single model (like GPT-4) can perform many of those tasks (summarizing, translating, sentiment analysis) from a prompt alone, without being retrained for each one.

LLMs also have important limitations:

Hallucinations: They can generate incorrect information confidently.
Lack of true understanding: They operate on statistical patterns learned from text rather than on grounded experience of the world.
Bias: They may reproduce biases present in their training data or inputs.
Context windows: They can only attend to a limited number of tokens at once (though this limit keeps growing).
Computational resources: They require significant computational resources to train and to run.

LLM Fundamentals: Transformers, Encoders, & Decoders

1. Core Architecture Concepts

Auto-regressive: A model that generates one token at a time, predicting each next token from everything before it: the prompt plus its own previous outputs (e.g., GPT).
Encoder: The “Reader.” Optimized to understand context and extract features from the full input at once.
Decoder: The “Writer.” Optimized to generate text step-by-step using “masked attention” to look only at past words.
Encoder-Decoder: The “Translator.” The Encoder reads the input, and the Decoder writes the output while “glancing back” at the Encoder’s work (e.g., T5).
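
Sketch: the auto-regressive loop described above, in a few lines of Python. This is a toy illustration, not how GPT is implemented. BIGRAM_SCORES, next_token, and generate are hypothetical names, and the stand-in "model" only looks at the last token, whereas a real LLM conditions on the whole preceding sequence.

```python
# Made-up score table standing in for a trained model.
# Keys are the last token seen; values are scores for the next token.
BIGRAM_SCORES = {
    "BOS": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
    "sat": {"EOS": 1.0},
    "ran": {"EOS": 1.0},
}


def next_token(tokens):
    """Pick the highest-scoring next token (greedy decoding)."""
    scores = BIGRAM_SCORES[tokens[-1]]
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=10):
    """Append one token per step, feeding each output back in as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "EOS":  # stop symbol: the model decided it is done
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens


generated = generate(["BOS"])
print(generated)
```

The key point is the line that appends the new token: each prediction is made from a sequence that already contains the model's earlier predictions.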

2. Self-Attention (The “Context” Builder)

In the Encoder: Every word in the input sentence looks at every other word (past and future) at the same time.
- Example: In “The bank of the river,” the word “bank” attends to “river” to realize it’s a riverbank, not a financial institution.
In the Decoder: Every word looks only at previous words. It is “masked” so it cannot see the future words it hasn’t written yet.
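
Sketch: the difference between Encoder-style and Decoder-style self-attention comes down to a mask on the attention scores. The code below is a simplified illustration under stated assumptions: the token vectors are invented, and they are used directly as queries and keys, whereas a real Transformer first applies learned projection matrices. The names softmax and attention_weights are hypothetical helpers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors, causal):
    """Return one row of attention weights per token.

    causal=False: every token may look at every token (Encoder style).
    causal=True: token i may look only at positions 0..i (Decoder style).
    """
    dim = len(vectors[0])
    rows = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if causal and j > i:
                # Masked: a future position gets -inf, so its weight is 0.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        rows.append(softmax(scores))
    return rows


# Invented 2-dimensional vectors for three tokens.
token_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

encoder_weights = attention_weights(token_vectors, causal=False)
decoder_weights = attention_weights(token_vectors, causal=True)

for row in decoder_weights:
    print([round(w, 3) for w in row])
```

In decoder_weights, everything above the diagonal is zero: the first token can attend only to itself, the second to the first two, and so on. In encoder_weights, every entry is positive.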

3. How They Communicate (Cross-Attention)

The Hand-off: The Encoder sends a sequence of vectors (one per token) to the Decoder, not just one single summary.
The Bridge: The Decoder uses Cross-Attention to “query” those vectors, deciding which parts of the original input are most relevant for the specific word it is currently writing.
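
Sketch: one cross-attention step, where a single Decoder query scores every Encoder output vector and builds a weighted mix of them. This is a minimal illustration, assuming invented vectors and no learned projections (a real model derives queries from the Decoder state and keys and values from the Encoder outputs through separate learned matrices). The name cross_attention is a hypothetical helper.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_query, encoder_outputs):
    """Score each Encoder vector against the query, then blend them."""
    dim = len(decoder_query)
    scores = [
        sum(q * k for q, k in zip(decoder_query, enc)) / math.sqrt(dim)
        for enc in encoder_outputs
    ]
    weights = softmax(scores)
    # Context vector: weighted sum of the Encoder vectors.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
    return weights, context


# One invented vector per input token (the "hand-off" from the Encoder).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Invented state of the Decoder for the word it is currently writing.
decoder_query = [0.0, 4.0]

weights, context = cross_attention(decoder_query, encoder_outputs)
print([round(w, 3) for w in weights])
```

The query here points along the second dimension, so the second input token receives the largest weight. Keeping one vector per input token is what makes this selective lookup possible; a single summary vector would leave nothing to choose between.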

4. Word Meanings & Vectors

Permanent Vectors (Static Embeddings): A “dictionary” of learned vectors, one per vocabulary token, looked up by token ID. The vector is the same in every sentence, so it ends up as a blend of all the senses the word had in the training data.
Temporary Vectors (Hidden States): Computed fresh for each input by the model’s layers (here, the Encoder). Each layer takes the “blurry” permanent vector and uses the surrounding words to “sharpen” the meaning for that specific sentence.
- Example: For “bank,” the Encoder boosts “finance” features if the sentence mentions “money,” or “nature” features if it mentions “river.”
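
Sketch: the “bank” example with numbers. The two vector dimensions are labeled by hand as [finance, nature], which is an assumption for readability; real embedding dimensions have no such clean labels. STATIC_EMBEDDINGS and contextual_vector are hypothetical names, and the single attention-style averaging step stands in for a full Encoder with learned projections, multiple heads, and many layers.

```python
import math

# Static embeddings: one fixed vector per word, [finance, nature].
# "bank" is ambiguous, so its stored vector sits between both senses.
STATIC_EMBEDDINGS = {
    "bank": [0.5, 0.5],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_vector(sentence, position):
    """Mix the word's static vector with its neighbors' vectors."""
    vectors = [STATIC_EMBEDDINGS[word] for word in sentence]
    query = vectors[position]
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, vec)) / math.sqrt(dim)
        for vec in vectors
    ]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(dim)
    ]


bank_in_finance = contextual_vector(["money", "bank"], position=1)
bank_in_nature = contextual_vector(["river", "bank"], position=1)

print("static bank:    ", STATIC_EMBEDDINGS["bank"])
print("bank near money:", bank_in_finance)
print("bank near river:", bank_in_nature)
```

The stored vector for “bank” never changes. What changes is the hidden state computed for it: next to “money” the finance feature ends up larger, and next to “river” the nature feature does.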