Transformer

What is a transformer:
Transformer: A neural network architecture that uses an attention mechanism to process an entire sequence of data at once, which lets it capture the context and relationships between words regardless of how far apart they are.
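
To make the idea of "looking at the whole sequence at once" concrete, here is a minimal sketch of one common form of attention, scaled dot-product attention, in plain Python. The function names and the toy vectors are made up for illustration, and the learned projection matrices and multiple heads of a real transformer are left out.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of equal-length vectors, one per position.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this position with every position, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy 3-position sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
for row in out:
    print(row)
```

Every output row mixes information from all three positions, which is why distance within the sequence does not matter to the mechanism itself.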

Transformer Model: A specific system built using the transformer architecture. It includes both the architecture and the specific “weights” (the learned patterns) acquired during training.
A transformer model is a language model when it is trained on raw text, but it doesn’t have to be trained on language. Vision Transformers (ViT), for example, are trained on images. In that case the system is a transformer model, but not a language model.

Language Model: A system trained on large amounts of raw text to understand statistical patterns in language, typically by predicting missing or subsequent words in a sentence.
Examples of language models that use the transformer architecture (transformer models trained on raw text): GPT, BERT, and Llama.
A language model doesn’t have to be a transformer model, though. Earlier language models used other approaches:
N-gram models: Simple statistical models that look only at the last 2 or 3 words (very old school).
RNNs (Recurrent Neural Networks): These processed words one by one in a chain.
LSTMs (Long Short-Term Memory): An improved version of RNNs that could remember slightly longer sentences.
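
As a rough illustration of the simplest of these, an N-gram model, here is a sketch of a bigram model (N = 2) that predicts the next word from counts of which word followed which. The function names and the tiny corpus are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    follow = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follow[prev][nxt] += 1
    return follow

def next_word_probs(follow, prev):
    """Estimate P(next | prev) from the raw counts."""
    counts = follow[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

The model only ever sees the single previous word, which is the limitation that RNNs, LSTMs, and later transformers were meant to overcome.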

How Transformer/language models are trained:
Transformer models such as GPT, BERT, and T5 have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion.

Self-supervised learning (pre-training):

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and training starts without any prior knowledge. Pretraining is usually done on a very large corpus of data, and training can take up to several weeks.
Pretraining uses self-supervised learning, in which the objective is automatically computed from the inputs of the model. That means humans are not needed to label the data!
This type of model develops a statistical understanding of the language it has been trained on, but it’s less useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning or fine-tuning.

Self-supervised learning methods:
Causal language modeling is used to train the model from scratch.

The Goal: The model is fed vast amounts of raw text. For every sequence of words, it must guess what the next word is.
The “Causal” Part: During this phase, the model is strictly forbidden from looking at “future” words in the sentence. It can only use the context of the words that came before it.
Self-Supervision: This is considered “self-supervised” because the labels (the correct next words) are already present in the text itself—no humans need to manually label the data.
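
The three points above can be sketched in a few lines: the training examples are cut directly out of the raw text, each context contains only earlier words, and the "label" is just the word that comes next. The function name and the sample sentence are hypothetical, and splitting on spaces stands in for whatever tokenization a real model would use.

```python
def next_word_examples(text):
    """Build (context, target) pairs for next-word prediction.

    The target is simply the word that follows the context in the text,
    so the labels come from the data itself.
    """
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[:i]  # only words before position i are visible
        target = words[i]    # the label is the next word
        examples.append((context, target))
    return examples

pairs = next_word_examples("the cat sat down")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

One four-word sentence yields three training examples without anyone writing a label by hand.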

Another self-supervised objective is masked language modeling, in which the model predicts masked words in a sentence. BERT is trained this way.
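
A minimal sketch of how masked training examples could be built is shown below. The function name, the `[MASK]` placeholder string, the masking probability, and the fixed seed are all illustrative assumptions, not the settings of any particular model.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption here

def mask_words(words, mask_prob=0.15, seed=0):
    """Hide some words and keep the originals as prediction targets."""
    rng = random.Random(seed)
    masked = []
    targets = {}  # position -> original word the model must recover
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = word
        else:
            masked.append(word)
    return masked, targets

words = "the cat sat on the mat".split()
masked, targets = mask_words(words, mask_prob=0.5)
print(masked)
print(targets)
```

Unlike next-word prediction, the model here can use the words on both sides of each gap, and the labels still come from the text itself.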

Once the model is trained, it generates text the same way. When you use a model like GPT to write a story or answer a question, it is performing causal language modeling: it predicts one word, adds that word to the sequence, and then predicts the next one based on the new, longer sequence.
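
The predict-append-repeat loop can be sketched as follows. The lookup table is a hypothetical stand-in for a trained model (a real model would condition on the whole sequence, not just the last word), and the `<end>` marker and function names are made up for the example.

```python
# Hypothetical stand-in for a trained model: maps the last word to the
# most likely next word.
NEXT_WORD = {"once": "upon", "upon": "a", "a": "time", "time": "<end>"}

def predict_next(sequence):
    """Return the predicted next word for the sequence so far."""
    return NEXT_WORD.get(sequence[-1], "<end>")

def generate(prompt, max_new_words=10):
    """Repeatedly predict a word and append it to the sequence."""
    sequence = list(prompt)
    for _ in range(max_new_words):
        word = predict_next(sequence)
        if word == "<end>":
            break
        sequence.append(word)  # the prediction becomes part of the context
    return sequence

story = generate(["once"])
print(" ".join(story))  # once upon a time
```

Each pass through the loop sees a sequence that is one word longer than the last, which is exactly the behavior described above.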

Transfer learning or fine-tuning:

The process in which the pretrained model (trained with self-supervised learning) is fine-tuned in a supervised way, that is, using human-annotated labels, on a given task.
A pretrained model develops a statistical understanding of the language it has been trained on, but without fine-tuning it’s less useful for specific practical tasks.
Fine-tuning is done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.
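
The shape of that workflow can be shown with a deliberately tiny toy. This is not a transformer: it is a one-feature linear model, and the "pretrained" weights and the labeled dataset are made-up numbers. The only point is the order of operations: start from existing weights instead of random ones, then keep training on task-specific labeled data.

```python
def mean_squared_error(weights, data):
    """Average squared error of the linear model w * x + b on the data."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(weights, data, steps=200, lr=0.05):
    """Plain gradient descent on a one-feature linear model."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

# Hypothetical weights standing in for the result of pretraining.
pretrained = (1.8, 0.3)

# Small human-labeled dataset for the target task (made-up, y = 2x + 1).
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

before = mean_squared_error(pretrained, task_data)
fine_tuned = train(pretrained, task_data)  # start from pretrained, not random
after = mean_squared_error(fine_tuned, task_data)
print(before, after)
```

The error on the task data drops after the additional training, which mirrors what fine-tuning does for a pretrained language model on a much larger scale.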