Distillation lets you train a smaller model to closely imitate the outputs of a larger model. Basically they’re pirating all of the hard work Anthropic did pirating the entire internet.
Alibaba gets a model that produces basically the same output at a tiny fraction of the operating cost once it’s finished training. Distillation training also takes basically all of its data from the big model (afaik all of it is sourced from the parent model).
It’s like if you took a lump of metal and showed it Porsche 911s until it turned into a 911-shaped chunk of metal with 95% of the performance, except the ingot only cost you $3000 and it costs ⅕ as much in fuel and maintenance.
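
For anyone who wants to see the idea in code, here is a rough sketch of the classic "soft label" flavor of distillation: the small model (student) is scored on how far its output distribution is from the big model's (teacher) and trained to shrink that gap. Everything here is made up for illustration: the function names, the temperature value, and the logits are assumptions, not anything a specific lab is known to use. Also note that if you only have API access you usually just get the teacher's generated text rather than its logits, so in that case the student would simply be trained on that text instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Hypothetical helper: it is zero when the student matches the teacher
    exactly and grows as the two distributions drift apart.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student (learned) distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented logits over a 3-token vocabulary, for illustration only.
teacher = [4.0, 1.0, 0.5]
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher
close_student = [3.8, 1.1, 0.6]  # nearly agrees with the teacher

print(distillation_loss(teacher, far_student))
print(distillation_loss(teacher, close_student))
```

Training the student means nudging its weights so this number goes down across lots of prompts, which is why the student ends up sounding so much like the teacher.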
Ok, thanks for the detailed explanation. I guess if your goal is to make your model sound like another model, that makes perfect sense.