
How Fine-tuning Changes BERT


Reference

A Closer Look at How Fine-tuning Changes BERT
BERT-fine-tuning-analysis

Introduction

The architecture of a pretrained language model plus a task-specific fine-tuning layer has been universally successful. Yet how fine-tuning changes the embedding space is less studied.

The paper proposes a hypothesis: fine-tuning affects classification performance by increasing the distance between examples associated with different labels.
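
This hypothesis is easy to state operationally. Below is a minimal sketch, assuming the [CLS] embeddings of a labeled dataset have already been extracted into NumPy arrays (the names `pretrained_emb`, `finetuned_emb`, and `labels` are hypothetical): it averages the pairwise distances between examples with different labels, a quantity that should grow after fine-tuning if the hypothesis holds.

```python
import numpy as np

def mean_interclass_distance(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Average Euclidean distance between pairs of examples with different labels."""
    dists = [np.linalg.norm(embeddings[i] - embeddings[j])
             for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))
             if labels[i] != labels[j]]
    return float(np.mean(dists))

# Hypothetical usage: `pretrained_emb` and `finetuned_emb` hold [CLS] vectors for
# the same labeled sentences before and after fine-tuning (not defined here).
# If the hypothesis holds, the second number should be larger than the first:
# print(mean_interclass_distance(pretrained_emb, labels))
# print(mean_interclass_distance(finetuned_emb, labels))
```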

Conclusion: Fine-tuning does not introduce arbitrary changes to representations; instead, it adjusts the representations to downstream tasks while largely preserving the original spatial structure of the data points.

Questions

  1. Does fine-tuning always improve performance?
  2. How does fine-tuning alter the representation to adjust for downstream tasks?
  3. How does fine-tuning change the geometric structure of different layers?

Contributions

  1. Fine-tuning introduces a divergence between the training and test sets, which is not severe enough to hurt generalization in most cases. However, we do find one exception where fine-tuning hurts performance; this setting also has the largest divergence between the training and test sets after fine-tuning.
  2. For a representation where task labels are not linearly separable, we find that fine-tuning adjusts it by grouping points with the same label into a small number of clusters (ideally one), thus simplifying the underlying representation.
  3. Fine-tuning does not change the higher layers arbitrarily. This confirms previous findings. Additionally, we find that fine-tuning largely preserves the relative positions of the label clusters, while reconfiguring the space to adjust for downstream tasks.

Probing Methods

  1. Classifiers as Probes: To understand how well a representation encodes the task labels, a probing classifier is trained on top of it while the embeddings themselves are kept frozen. Classifier probes measure how well a contextualized representation captures a linguistic property (a minimal sketch follows this list).
  2. DirectProbe: Probing the geometric structure.
    • Number of clusters: The number of clusters indicates the linearity of the representation for a task.
    • Distance between clusters: Distances between clusters can reveal the internal structure of a representation.
    • Spatial Similarity: Intuitively, if two representations have similar relative distances between their clusters, they are similar to each other for the task at hand. The vector of pairwise cluster distances serves as the signature of a representation (see the sketch after this list).
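
A minimal sketch of the classifier-probe setup, assuming embeddings from a chosen BERT layer have already been extracted into the hypothetical arrays `train_emb`/`test_emb` (a linear probe via scikit-learn; the paper's exact probe configuration may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Train a linear probe on frozen embeddings; only the probe's weights are learned."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return accuracy_score(test_labels, probe.predict(test_emb))

# Hypothetical usage: the arrays hold embeddings from one BERT layer, extracted
# e.g. before vs. after fine-tuning, so the two probing accuracies can be compared.
# acc = probe_accuracy(train_emb, y_train, test_emb, y_test)
```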
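
And a rough sketch of the distance-vector idea behind the last two measures, again with hypothetical array names. It treats all points of a label as one cluster and compares two representations by correlating their cluster-distance vectors; the actual DirectProbe clustering is more involved:

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def distance_vector(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between label-cluster centroids, flattened
    into a vector (a simplification of DirectProbe, which clusters the points
    themselves rather than using one centroid per label)."""
    centroids = {y: embeddings[labels == y].mean(axis=0) for y in np.unique(labels)}
    pairs = combinations(sorted(centroids), 2)
    return np.array([np.linalg.norm(centroids[a] - centroids[b]) for a, b in pairs])

def spatial_similarity(emb_a: np.ndarray, emb_b: np.ndarray, labels: np.ndarray) -> float:
    """Correlate the two distance vectors: a high value means the relative
    positions of the label clusters are similar in both representations."""
    return pearsonr(distance_vector(emb_a, labels), distance_vector(emb_b, labels))[0]
```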