arxiv preprint - Weight subcloning: direct initialization of transformers using larger pretrained ones

In this episode we discuss "Weight subcloning: direct initialization of transformers using larger pretrained ones" by Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, and Mohammad Rastegari. The paper introduces a method called weight subcloning, which speeds up the training of small transformer models by initializing them with weights taken from larger pretrained models. The method ranks neurons by importance to reduce the layer dimensions, then removes blocks so that the depth matches the smaller model’s layer count, and the paper reports significantly faster training as a result. Weight subcloning transfers knowledge from the larger model to the smaller one, improving training speed and potentially accuracy, without requiring a pretrained model of the exact desired size.
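
To make the two steps concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. Each transformer block is reduced to a single square weight matrix, the importance score (summed absolute weight per dimension) and the evenly spaced choice of which blocks to keep are assumptions made for illustration, and the name subclone is hypothetical.

```python
import numpy as np


def subclone(blocks, small_dim, small_depth):
    """Build a smaller toy model from a larger one.

    blocks: list of (d, d) arrays, toy stand-ins for transformer blocks.
    Returns small_depth arrays of shape (small_dim, small_dim).
    """
    large_dim = blocks[0].shape[0]

    # Assumed importance score: total absolute weight touching each
    # dimension (its row plus its column), summed over all blocks.
    scores = np.zeros(large_dim)
    for w in blocks:
        scores += np.abs(w).sum(axis=0) + np.abs(w).sum(axis=1)

    # Keep the same top-scoring dimensions in every block so the
    # shapes of consecutive blocks stay compatible.
    keep_dims = np.sort(np.argsort(scores)[::-1][:small_dim])

    # Assumed block selection: evenly spaced indices across the depth.
    keep_blocks = np.linspace(0, len(blocks) - 1, small_depth).round().astype(int)

    # Copy the selected rows and columns of each retained block.
    return [blocks[i][np.ix_(keep_dims, keep_dims)].copy() for i in keep_blocks]


# Usage: shrink a 6-block, 8-dimensional toy model to 3 blocks of 4 dimensions.
rng = np.random.default_rng(0)
large = [rng.normal(size=(8, 8)) for _ in range(6)]
small = subclone(large, small_dim=4, small_depth=3)
print(len(small), small[0].shape)  # 3 (4, 4)
```

The resulting weights would serve only as the starting point for the small model, which is then trained as usual.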
