There are other, better-defined intelligence tests, such as the Winograd Schema Challenge [1], which asks questions that require real-world knowledge and common sense to answer, for example:
"In the sentence 'The trophy would not fit in the suitcase because it was too big/small.' what does big/small refer to?"
But even these types of questions, which were still being described as "The Sentences Computers Can't Understand, But Humans Can" as late as 2020 [2], seem to be easy for LLMs to answer, with GPT-4 topping the leaderboard at near-human accuracy [3].
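It's easy to check this for yourself. Here is a minimal sketch, assuming the openai Python client and a GPT-4-class model (the model name and prompt wording are illustrative, not from the benchmark):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = ("In the sentence 'The trophy would not fit in the suitcase "
              "because it was too {adj},' what does 'it' refer to? "
              "Answer with one word.")

    for adj in ("big", "small"):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT.format(adj=adj)}],
        )
        # A model with common-sense coreference should answer
        # 'trophy' for "big" and 'suitcase' for "small".
        print(adj, "->", resp.choices[0].message.content)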
So, assuming ChatGPT isn't an AGI (yet), how can we prove it to ourselves? At what linguistic tasks are humans still qualitatively better than LLMs?
------
[0] https://en.wikipedia.org/wiki/Eugene_Goostman

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] https://www.youtube.com/watch?v=m3vIEKWrP9Q

[3] https://paperswithcode.com/sota/common-sense-reasoning-on-winogrande