Attention Score
The numerical value representing how much one token should attend to another in the attention mechanism, typically computed as the softmax-normalized dot product of the first token's query vector with the second token's key vector. Higher scores mean stronger relationships between tokens.
Why It Matters
Attention scores offer a window into transformer behavior: you can visualize them to see which parts of the input the model is focusing on as it produces each part of the output.
Example
In translating 'black cat' into French as 'chat noir', the word 'noir' ('black') would have a high attention score with 'black' and a low score with 'cat'.
Think of it like...
Like a heat map showing where a student's eyes focus while reading — bright spots indicate the words they are paying most attention to.
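The translation example above can be sketched numerically. This is a minimal NumPy sketch of scaled dot-product attention for a single query; the 4-dimensional "embeddings" for 'black', 'cat', and 'noir' are made-up toy vectors, not real model weights.

```python
import numpy as np

def attention_scores(query, keys):
    """Score one query against a set of keys.

    Scores are scaled dot products passed through softmax, as in the
    standard scaled dot-product attention formulation.
    """
    d_k = keys.shape[-1]
    logits = keys @ query / np.sqrt(d_k)   # similarity of query to each key
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

# Toy 4-dim "embeddings", chosen so 'noir' points roughly like 'black'
black = np.array([1.0, 0.9, 0.1, 0.0])
cat   = np.array([0.0, 0.1, 1.0, 0.8])
noir  = np.array([0.9, 1.0, 0.0, 0.1])

scores = attention_scores(noir, np.stack([black, cat]))
# scores[0] ('black') dominates scores[1] ('cat'), and the scores sum to 1
```

Because the scores come out of a softmax, they form a probability distribution over the input tokens, which is exactly what makes them easy to read as a heat map.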
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Self-Attention
A mechanism where each element in a sequence attends to all other elements to compute a representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
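The definition above can be sketched as a full self-attention step, where every token attends to every other token. The dimensions and the random projection matrices Wq, Wk, Wv below are illustrative placeholders, not learned weights.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head self-attention over a whole sequence.

    X: (seq_len, d_model) token embeddings.
    Returns the attended output and the (seq_len, seq_len) score matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy setup: 3 tokens with 4-dim embeddings, random projections
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
# each row of attn is one token's attention distribution over all 3 tokens
```

Each row of the resulting matrix sums to 1, so row i reads as "how token i distributes its attention across the sequence".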
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Softmax
A function that converts a vector of numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1. It is typically used as the final layer in classification models; in attention, it normalizes the raw scores so they sum to 1 across tokens.
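The softmax step can be sketched in a few lines of NumPy; the input scores below are arbitrary example values.

```python
import numpy as np

def softmax(x):
    """Convert a vector of raw scores into a probability distribution.

    Subtracting the max before exponentiating avoids overflow without
    changing the result.
    """
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is positive, sums to 1, and the largest input gets the largest share
```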