Models aren’t retrained from zero. They can be fine-tuned, or the vendor could even have added a routine to handle specific cases like this.
For example, Claude used to have a routine that would call external tools embedded in the app to parse structured data and transform it. I'm not sure how it does it now.
It takes months to train a model, and there were already models that got it right when the question was popular, as long as thinking was enabled.
Also, if they were optimising for this question, why not update their lower-end model (Haiku) as well?
The interesting question would be what percentage of humans get it wrong. Smaller than for LLMs, for sure, but I somehow doubt it’s 0.