Reference
- Paper: A Closer Look at How Fine-tuning Changes BERT
- Code: BERT-fine-tuning-analysis
Introduction
The architecture of a pretrained language model plus a fine-tuning layer has been universally successful. Yet, how fine-tuning changes the underlying embedding space is much less studied.
The paper proposes a hypothesis: fine-tuning affects classification performance by increasing the distance between examples associated with different labels.
Conclusion: Fine-tuning does not introduce arbitrary changes to representations; instead, it adjusts them to the downstream task while largely preserving the original spatial structure of the data points.
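As a concrete reading of the hypothesis, here is a minimal sketch (not from the paper) of the quantity it refers to: the average distance between examples that carry different labels. `emb_pretrained`, `emb_finetuned`, and `labels` are hypothetical NumPy arrays holding sentence embeddings before and after fine-tuning and the gold labels.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_cross_label_distance(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Average Euclidean distance over all pairs of points with different labels."""
    dists = cdist(embeddings, embeddings)            # (n, n) pairwise distances
    diff_label = labels[:, None] != labels[None, :]  # mask selecting cross-label pairs
    return float(dists[diff_label].mean())

# If the hypothesis holds, the fine-tuned space should give the larger value:
# mean_cross_label_distance(emb_finetuned, labels) > mean_cross_label_distance(emb_pretrained, labels)
```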
Question
- Does fine-tuning always improve performance?
- How does fine-tuning alter the representation to adjust for downstream tasks?
- How does fine-tuning change the geometric structure of different layers?
Contribution
- Fine-tuning introduces a divergence between the training and test sets, which is not severe enough to hurt generalization in most cases. However, we do find one exception where fine-tuning hurts performance; this setting also has the largest divergence between the training and test sets after fine-tuning.
- For a representation where task labels are not linearly separable, we find that fine-tuning adjusts it by grouping points with the same label into a small number of clusters (ideally one), thus simplifying the underlying representation.
- Fine-tuning does not change the higher layers arbitrarily. This confirms previous findings. Additionally, we find that fine-tuning largely preserves the relative positions of the label clusters, while reconfiguring the space to adjust for downstream tasks.
Probing Methods
- Classifiers as Probes: To understand how well a representation encodes the labels of a task, a probing classifier is trained on top of it, with the embeddings kept frozen while the classifier is trained. Classifier probes aim to measure how well a contextualized representation captures a linguistic property (a minimal sketch follows this list).
- DirectProbe: Probes the geometric structure of the representation via three quantities:
  - Number of clusters: The number of clusters indicates how close the representation is to being linearly separable for the task (ideally, one cluster per label).
  - Distance between clusters: The distances between clusters can reveal the internal structure of a representation.
  - Spatial Similarity: Intuitively, if two representations have similar relative distances between clusters, the representations themselves are similar to each other for the task at hand. Each representation is summarized by the vector of its pairwise cluster distances (see the second sketch after this list).
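A minimal sketch of a classifier probe, assuming the Hugging Face `transformers` and scikit-learn libraries: the BERT encoder is kept frozen and only a linear classifier over its embeddings is trained, and the probe's test accuracy is read as how well that layer encodes the task labels. `train_sentences`, `train_labels`, `test_sentences`, and `test_labels` are hypothetical task data.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")
encoder.eval()  # the encoder stays frozen; only the probe below is trained

@torch.no_grad()
def embed(sentences, layer=-1):
    """Return one [CLS] vector per sentence, taken from the chosen hidden layer."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch, output_hidden_states=True).hidden_states
    return hidden[layer][:, 0, :].numpy()

# The probe itself: a linear classifier over the frozen embeddings.
# `train_sentences` etc. are hypothetical lists of inputs and task labels.
probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_sentences), train_labels)
print("probe accuracy:", probe.score(embed(test_sentences), test_labels))
```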
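And a sketch of the geometric quantities above. This is an approximation, not the DirectProbe implementation: clusters are stood in for by per-label centroids, each space is summarized by its vector of pairwise cluster distances, and two spaces are compared by the cosine similarity of those vectors. `emb_pretrained`, `emb_finetuned`, and `labels` are the same hypothetical arrays as before.

```python
import numpy as np
from itertools import combinations

def distance_vector(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Pairwise distances between label-cluster centroids, in a fixed label order."""
    classes = sorted(set(labels.tolist()))
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return np.array([np.linalg.norm(centroids[i] - centroids[j])
                     for i, j in combinations(range(len(classes)), 2)])

def spatial_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine similarity of two distance vectors; 1.0 means identical relative structure."""
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

# e.g. compare the pretrained and fine-tuned spaces of the same layer:
# spatial_similarity(distance_vector(emb_pretrained, labels),
#                    distance_vector(emb_finetuned, labels))
```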