• iocase@lemmy.zip
    2 days ago

    It’s approximate, but yeah, you can get roughly in that ballpark. The biggest benefit is that the model weights get smaller and cheaper to run. You can fit something like 5x as many instances on the same server if you distill down, while getting basically the same output.
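To make the "more instances per server" point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 640 GB of accelerator memory, the 70B and 14B parameter counts, and 2 bytes per parameter (fp16/bf16). It only counts weights and ignores KV cache and activations, so real numbers will be lower.

```python
def instances_per_server(server_mem_gb: float, params_billions: float,
                         bytes_per_param: float = 2.0) -> int:
    """Count how many copies of a model's weights fit in the given memory.

    Weights only: 1e9 params * bytes_per_param is roughly that many GB.
    """
    model_gb = params_billions * bytes_per_param
    return int(server_mem_gb // model_gb)


# Hypothetical sizes, not figures from any real deployment.
teacher_copies = instances_per_server(640, 70)  # 70B "main" model
student_copies = instances_per_server(640, 14)  # 14B distilled model

print(teacher_copies, student_copies)  # prints "4 22"
```

With these made-up sizes the distilled model fits about 5.5x as many copies, which is the same ballpark as the 5x figure above.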

    The main caveat is that you need to absolutely hammer the main model with questions from all angles to get it to surface as much of its internalized knowledge as possible. That’s why Anthropic is pissed about this: they’re barely making money off these prompts, and the answers are being used to train a more efficient competitor. (BTW, this is how “mini” and similar models are trained. They’re distillates.)
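A minimal sketch of the "hammer it from all angles" step, assuming the simplest form of distillation: collect prompt/answer pairs from the big model and fine-tune the small one on them later. The names `make_prompts`, `build_distillation_set`, and `fake_teacher` are hypothetical, and `fake_teacher` is a stand-in for whatever call actually queries the main model. Real pipelines use far more prompts plus filtering.

```python
import json
from itertools import product
from typing import Callable, Iterable


def make_prompts(topics: Iterable[str], templates: Iterable[str]) -> list[str]:
    """Cross every topic with every question template ("from all angles")."""
    return [tpl.format(topic=topic) for topic, tpl in product(topics, templates)]


def build_distillation_set(prompts: Iterable[str],
                           teacher: Callable[[str], str]) -> list[dict]:
    """Ask the teacher each unique prompt once and record its answer."""
    seen = set()
    records = []
    for prompt in prompts:
        if prompt in seen:  # skip duplicates so we don't pay for them twice
            continue
        seen.add(prompt)
        records.append({"prompt": prompt, "completion": teacher(prompt)})
    return records


def fake_teacher(prompt: str) -> str:
    """Placeholder for the real (expensive) model call."""
    return f"answer to: {prompt}"


prompts = make_prompts(
    ["sorting", "caching"],
    ["Explain {topic}.", "When does {topic} fail?", "Give an example of {topic}."],
)
dataset = build_distillation_set(prompts, fake_teacher)

# One JSON object per line is a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
```

The resulting lines are what the smaller model would be trained on. The teacher only ever sees ordinary prompts, which is why this is hard to tell apart from normal usage.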