I'm pretty sure that's how they learn all languages? LLMs don't know what they're saying. A large part of how they function is essentially predicting which word is most likely to come next, based on their training data. Obviously there's more to it, but that's the gist. It's not like they actually understand what any of it means.
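To make "predicting the next word" concrete, here's a toy sketch of the idea (my own illustration; real models are transformers over subword tokens, not lookup tables):

```python
# Toy next-word prediction: a bigram model that, given a word, predicts
# whichever word most often followed it in the training text. The objective
# (maximize the likelihood of the next token) is the same idea LLMs scale up.
from collections import Counter, defaultdict

corpus = "the whale sings and the whale dives and the whale sings again".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the word that most often followed `word` in training, or None."""
    counts = follows.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("whale"))  # -> 'sings' (it followed 'whale' twice, 'dives' once)
```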
I think the tricky part is that there's no way to double-check. We can see whether ChatGPT is making sense in English or another human language.
But if it spits out something in whale, we basically have to shrug and say "yeah, I guess that sounds like whale to me."
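The closest thing to an automatic check is held-out likelihood: how well does the model predict whale recordings it never saw in training? That only tells you the output is statistically whale-like, not that it means anything. A toy sketch with stand-in symbols (nothing here is real whale data):

```python
# Held-out perplexity for a smoothed bigram model: lower = the model finds the
# unseen sequence more "expected". This measures statistical fit, not meaning.
import math
from collections import Counter, defaultdict

train = "A B C A B D A B C".split()   # stand-in for sequences of whale codas
heldout = "A B C A B".split()         # recordings the model never saw

vocab = set(train)
counts = defaultdict(Counter)
for prev, nxt in zip(train, train[1:]):
    counts[prev][nxt] += 1

def prob(prev, nxt):
    # Add-one smoothing so unseen transitions don't get probability zero.
    c = counts[prev]
    return (c[nxt] + 1) / (sum(c.values()) + len(vocab))

log_likelihood = sum(math.log(prob(p, n)) for p, n in zip(heldout, heldout[1:]))
perplexity = math.exp(-log_likelihood / (len(heldout) - 1))
print(f"held-out perplexity: {perplexity:.2f}")  # says "sounds like whale", nothing more
```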
Scanning the New Yorker article, I don't see any technical details of how they intend to do this, just a handwavy "ChatGPT will do it." I can't rule out that there are more details in there, but I'm not hopeful.
The main problem you're going to run into is "aligning" Whalish and English. If you train one LLM on English and another on Whalish, neither has any notion of when the two are actually saying the same thing; plain next-token prediction can't give you that. You'd need some "translated" data, and I'd hazard the guess that even a small set of translated data would go a long way toward aligning much larger sets of unannotated data.
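For a sketch of how a small translated set could anchor the alignment: cross-lingual word-embedding work does this by learning a rotation between two independently trained embedding spaces from a handful of known pairs (orthogonal Procrustes). Everything below is hypothetical stand-in data:

```python
# Align two separately learned embedding spaces using a tiny seed lexicon.
# The rotation W minimizing ||X @ W - Y|| over orthogonal W is U @ Vt,
# where U, S, Vt = svd(X.T @ Y) -- the orthogonal Procrustes solution.
import numpy as np

rng = np.random.default_rng(0)
whale_emb = rng.normal(size=(1000, 64))  # stand-in "Whalish" embeddings
eng_emb = rng.normal(size=(1000, 64))    # stand-in English embeddings

# Hypothetical tiny translated lexicon: (whalish_id, english_id) pairs.
seed_pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
X = whale_emb[[w for w, _ in seed_pairs]]
Y = eng_emb[[e for _, e in seed_pairs]]

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Project every Whalish embedding into the English space; nearest neighbours
# there become candidate translations for all the unannotated data.
aligned = whale_emb @ W
```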
So how do we get translated data? If you were to track whales and annotate their utterances with what's going on - what things are nearby, which other whales are present, what everyone is doing - then you'd have data with a potential mapping to English concepts, and you could actually do something here.
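To make "annotate their utterances with what's going on" concrete, a record might look something like this (every field name and label here is hypothetical):

```python
# One possible schema for context-annotated whale utterances. Pairs of
# (utterance, observable context) are the closest thing to "translated" data
# we could realistically collect.
from dataclasses import dataclass

@dataclass
class AnnotatedUtterance:
    clip_id: str                # reference to the audio recording
    click_sequence: list[str]   # detected vocal units, e.g. coda labels
    whales_present: list[str]   # IDs of individuals in range
    nearby_objects: list[str]   # prey, boats, other pods, ...
    activity: str               # "foraging", "socialising", "diving", ...

example = AnnotatedUtterance(
    clip_id="2023-06-01_dominica_042",
    click_sequence=["coda-17", "coda-03"],
    whales_present=["unit-F", "unit-J"],
    nearby_objects=["squid"],
    activity="foraging",
)
```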
The glib answer is that the procedure described in the article won't work. LLMs are not trained by throwing a gigantic corpus at a machine-learning model and walking away; there's a great deal of additional manual work (supervised fine-tuning, human feedback) required to get the model to produce coherent outputs. Unless you have some entity that speaks fluent whale to refine the model with, you're out of luck.
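Schematically, the pipeline is two stages, and the problem is the second one: stage 1 consumes raw recordings, but stage 2 consumes judgments only a fluent speaker can supply. (ToyModel below is a stand-in, not a real training loop.)

```python
# Sketch of the two-stage pipeline: self-supervised pretraining on raw data,
# then supervised refinement on examples written or ranked by fluent speakers.
class ToyModel:
    def __init__(self):
        self.updates = []

    def next_token_step(self, text: str):
        # Stage 1: needs nothing but raw sequences.
        self.updates.append(("pretrain", text))

    def supervised_step(self, prompt: str, good_response: str):
        # Stage 2: needs a fluent speaker to say what "good" looks like.
        self.updates.append(("fine-tune", prompt, good_response))

model = ToyModel()
for text in ["raw recording 1", "raw recording 2"]:   # we have this for whales
    model.next_token_step(text)
for prompt, response in [("question", "answer a fluent speaker wrote")]:
    model.supervised_step(prompt, response)           # we don't have this
```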
The slightly less glib answer is that no non-human animal speaks (or otherwise conveys) "language". Anyone remember this nugget from a few years ago? The reply is being kind of a dick, but is also right: the original comment is self-refuting. Non-human animals make associations with specific sounds, which they can then use to communicate with each other, but there's little to no capacity for abstraction or composition.

There's a credible argument to be made that Koko the gorilla was the most intelligent (at least linguistically) non-human animal in history (there's also a credible argument that her intelligence was wildly overestimated, but that's a different topic), and she was less linguistically capable than an average human two-year-old.

So: we don't need a whale language model; that's not even a coherent concept. At the absolute most, we need a whale dictionary (and realistically, it'd be more of a pamphlet). An LLM does absolutely nothing to help with that.
In terms of non-English human languages? Yes: there's no technical reason LLMs can't be trained on them (though because they're trained on extremely large volumes of written text, there are likely languages for which not enough training data exists). The limitations are economic: training an LLM is extremely expensive, and English has, by a very wide margin, the most speakers on the planet who can provide a return on that expense.