Formal Linguistic Competence, World-Modelling and Embodiment: Threshold Engineering for AGI
- I. Logothetis
- Dec 4, 2025
- 9 min read

In discourse on the potential for intelligence in AI systems, the requisite distinction must be made between narrow AI - task-specific competency in a constrained domain - and artificial general intelligence (AGI), associated with the capacity for domain-general flexible problem-solving (Pennachin & Goertzel, 2007: 1-2). The achievement of the former is uncontentious; there are extant reactive systems optimised for discrete tasks. The latter is a far more demanding status, and devising a threshold for it is a complex undertaking. This essay will argue that evaluative frameworks which foreground language ability as a sufficient or necessary indicator of AGI are inadequate. First, I will outline how the criterion of formal linguistic competence is too permissive, on the grounds that it is not reliably correlated with conceptual understanding as grounded in world-modelling capabilities. I will then problematise the use of language as a necessary criterion for the possession and demonstration of these internal structural representations, and thus AGI. Finally, I will argue that the deficiencies of these approaches are superseded by utilising efficient skill acquisition as an AGI metric, concluding that AI systems can in principle fulfil this criterion.
I will first outline how treating dialogue ability as a sufficient condition for AGI is an unreasonably liberal threshold. A paradigmatic case is the deployment of the Turing Test as a benchmark for machine intelligence. Turing’s imitation game measures natural language competency in machines by gauging a human interrogator’s ability to correctly identify which of two respondents - one human, one machine - is which, based on communication via a text-only interface (Oppy & Dowe, 2021: Section 1). He predicted that after five minutes of conversation, the average interrogator would have at least a 30% chance of misidentifying the computer. While versions of the Turing Test have varied in terms of specifying further constraints (Mitchell, 2024a) - from exclusively expert interrogators to longer discussion times - the core argument for a related AGI criterion is as follows: formal linguistic competence - adherence to the statistical regularities of human language - is sufficient for intelligence candidacy, or even the ability to ‘think’ (Mitchell & Krakauer, 2023: 5). This claim is particularly significant given recent evidence that current LLMs instructed to adopt a human-like persona pass the original three-party Turing Test (Jones & Bergen, 2025: 8). Under this criterion, we should ascribe AGI to GPT-4.5 and LLaMa solely on these grounds.
There are, however, several reasons to distinguish successful imitation of dialogue ability from AGI. It is worth noting that the ascription of human-like intelligence to entities capable of meeting the structural and narrative expectations of natural language is largely intuition-based and susceptible to anthropomorphic bias. For LLMs, the widespread use of anthropomorphic metaphors (e.g. ‘knowledge’ or ‘reading’) (Mitchell, 2024b), their ability to learn linguistic abstractions and syntactic constructions, and their simulation of cognitive empathy (Sorin et al., 2024: 7) render human judgement vulnerable to over-attribution. Because it foregrounds the approximation of higher-order discursive norms, the Turing Test constitutes an indirect measure of intelligence, contingent upon human manipulability. What we intuitively treat as a reliable proxy for intelligence is not an epistemically robust basis for threshold engineering; a successful criterion for AGI must extend beyond the exploitation of human heuristics for attributing theory of mind.
The suggestibility of the intuitions behind, and implementations of, dialogue tests reflects a deeper flaw: the conflation of human-like language processing with conceptual grounding, i.e. the tracking of semantic content (contextual meaning and truth conditions). The presence of the former is not predicated upon the latter. This is the basis of Birch’s (2024: 313-317) gaming problem: LLMs are trained on a vast corpus of natural language data, facilitating the sophisticated mimesis of expressions closely associated with higher-order cognitive states. The inference pipeline from verbal outputs to indicators of general intelligence is therefore significantly attenuated. A particular skill’s correlation with intelligence in humans does not entail that its replication in LLMs should be treated as reliable evidence of AGI. This invites a differentiation between human and LLM language acquisition: the dramatic restructuring of the brain’s visual word form area in the process of learning to read suggests that human language capacity stems from domain-general learning architecture, not task-specific modules (Heyes, 2012: 2182). Conversely, LLMs’ task-oriented programming and access to unlimited priors and training data could enable them to “buy arbitrary levels of skill” (Chollet, 2019: 1) and exploit superficial patterns in input data to emulate language without genuine understanding. This absence of comprehension is exemplified by the hallucination phenomenon, wherein LLMs produce semantically coherent but internally inconsistent or false outputs (Lyre, 2024: 15-16). Language fluency alone cannot, therefore, refute the characterisation of LLMs as stochastic parrots: systems that ‘parrot’ statistically probable phrases based on probability distributions rather than intentional states (Bender et al., 2021: 616).
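To make the ‘stochastic parrot’ characterisation concrete, the toy sketch below (a deliberately crude bigram sampler in Python; the vocabulary and probabilities are invented, and nothing here describes the architecture of any actual LLM) shows how fluent-seeming strings can be produced purely by sampling from conditional probability distributions, with no representation of meaning or truth anywhere in the system.

```python
import random

# Toy next-token table: conditional probabilities of continuations, as if
# estimated from corpus co-occurrence counts. The system tracks frequency,
# not meaning; a false continuation ("on the moon") is as available as a true one.
next_token_probs = {
    ("the", "cat"): {"sat": 0.6, "slept": 0.3, "flew": 0.1},
    ("cat", "sat"): {"on": 0.8, "quietly": 0.2},
    ("sat", "on"): {"the": 0.9, "a": 0.1},
    ("on", "the"): {"mat": 0.7, "moon": 0.3},
}

def generate(context, max_new_tokens=4):
    """Extend a two-word context by repeatedly sampling statistically
    probable continuations from the table above."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_probs.get(tuple(tokens[-2:]))
        if dist is None:
            break
        words, weights = zip(*dist.items())
        tokens.append(random.choices(words, weights=weights)[0])
    return " ".join(tokens)

print(generate(("the", "cat")))  # e.g. "the cat sat on the mat" - or "... on the moon"
```

The point of the sketch is not that LLMs are bigram models - they are vastly more sophisticated - but that statistical fluency of this kind is, in principle, separable from semantic grounding.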
I argue that it is reasonable to assume that genuine understanding is contingent upon the development of an internal world model: a compressed representation of the world which captures its core processes, such as causal relations (LeCun, 2022: 2-4). Such a model underpins several abilities central to domain-general flexible problem-solving, such as abstract planning, predictive capacity and the development of agent-relative goals. Here, I turn to what I consider an excessively restrictive argument for natural language as necessary for AGI, which regards human dialogue behaviour as integral to the construction of world models. This position is exemplified by Landgrebe and Smith’s (2019: 5-6, 8-12) treatment of language as indispensable for the development of intentionality, plan formation, and the simulation of world relations - the “processing of both external and internal reality” (ibid: 11). This dialogic approach to world-modelling foregrounds the reciprocal, context-sensitive feedback loops constitutive of natural language discourse as loci for the deployment and renegotiation of individual goals. The primary consequence is as follows: all non-linguistic AI systems, as well as non-human animals, are solely responsive to particular stimuli, lack world-modelling capacity and flexible intentionality, and thus should be excluded from the domain of general intelligence.
I contend that dialogue is neither necessary for world-model construction, nor the only means of assessing the abilities associated with it. I propose, instead, that embodiment - instantiation in an interactive, feedback-driven environment - forms the foundational basis for world-modelling, even in the absence of language as a symbolic representation of the model (Cowart, 2005: Section 2). This is buttressed by the well-established connection between perception-action loops, which enable an agent to iteratively update its behaviour in response to the effects of its actions, and the tracking of causal dependencies, a prerequisite for counterfactual reasoning. Such internal representations allow an agent to move from reactive to deliberative behaviour, by supporting generalisation from past experiences to novel situations. This predictive model enables the anticipation of future events, supporting the pursuit of goals in dynamic, unpredictable environments.
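As a rough illustration of how such a perception-action loop might ground prediction without any linguistic scaffolding, the sketch below (my own simplified construction in Python; the class and method names are hypothetical and are not drawn from the cited sources) shows an agent that records the observed consequences of its actions and later consults that record to deliberate before acting.

```python
import random
from collections import defaultdict

# Minimal perception-action loop: act, observe the consequence, fold it back
# into an internal transition model, and query that model when deliberating.

class EmbodiedAgent:
    def __init__(self, actions):
        self.actions = actions
        # counts[(state, action)][next_state] -> how often that outcome followed
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, state, action, next_state):
        """Perception step: record the observed effect of an action."""
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Query the internal model: which outcome has most often followed
        this action in this state? Supports 'what if' deliberation without
        having to perform the action again."""
        outcomes = self.counts[(state, action)]
        return max(outcomes, key=outcomes.get) if outcomes else None

    def deliberate(self, state, goal):
        """Prefer an action whose predicted outcome matches the goal;
        fall back to exploration when the model has nothing to say."""
        for action in self.actions:
            if self.predict(state, action) == goal:
                return action
        return random.choice(self.actions)

# Tiny usage: the agent experiences that pushing a lever opens a door,
# and later selects that action deliberately rather than by trial and error.
agent = EmbodiedAgent(actions=["push_lever", "wait"])
agent.update("door_closed", "push_lever", "door_open")
print(agent.deliberate("door_closed", goal="door_open"))  # -> "push_lever"
```

The agent moves from purely reactive to deliberative behaviour once its recorded regularities allow it to anticipate the likely outcome of an action before performing it; language plays no role in the representation.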
The legitimacy of this approach, and the charge that treating language as a prerequisite for general intelligence is excessively restrictive, are both supported by ostensibly intelligent behaviour in non-human animals. The prioritisation of anthropocentric metrics (i.e. dialogue) is contentious on normative and empirical grounds. On the former, Coelho Mollo (2022: 707) astutely notes that privileging typical expressions of the human intellect as a universal desideratum constrains the domain of intelligence without robust justificatory grounds. The risk here is obscuring alternative intelligence signatures which could otherwise support candidacy for non-linguistic systems and animals, disqualifying them on epistemically arbitrary grounds. On the latter, I raise the body of evidence concerning the detection of domain-general flexible problem-solving in non-human animals, which is linked to the presence of structural world models predicting hypothetical future states of the world (Diester et al., 2024: 2265). As demonstration, I will focus on the octopus, a protostome cephalopod mollusc frequently cited as a candidate for non-human - and thus non-linguistic - intelligence (Mather, 2019: 1-2, 19). There is compelling evidence of flexibility across separate contexts; met with identical cases of predation threat, octopuses respond in novel and non-uniform ways (e.g. threat exploration, fleeing), indicative of non-stereotyped behaviour. While the verbal criterion necessarily regards all animals as purely “instinct[ual]” (Landgrebe & Smith, 2019: 9), bound to a narrow, mechanical stimulus-response repertoire, this is contravened by emergent problem-solving capabilities and adaptive learning. When pulling the valves apart failed to open a clam, different octopuses employed non-stereotypical tactics to acquire the food, ranging from drilling holes with their papillae to chipping at the valve margins with their beaks (Mather, 2019: 21). This implies an internal representation of cause-effect relations, evidenced in the sequential learning by which unsuccessful strategies are retired in favour of effective alternatives. Such mapping is unmediated by language; it relies on the octopuses’ ability to ground their representations in embodied experience, allowing them to infer structural properties through direct manipulation and feedback.
Ergo, it is not self-evident that non-linguistic AI systems should automatically be denied AGI status; rather, the understanding of the world underpinning general intelligence rests on extralinguistic grounding-via-embodiment. Assuming embodied feedback as a necessary desideratum, it remains entirely possible for AI systems to satisfy this criterion through virtual embodiment: an agent’s instantiation in an interactive, feedback-rich simulated environment. Superseding the classical paradigm of static learning, Jin & Jia (2025: 4-9) demonstrate how closed-loop dynamical interactions and perception-action loops in a simulated maze environment can produce internalised spatial knowledge in neural networks through meta-reinforcement learning. As one of several successful attempts to deploy an embodied agent in a virtual setting to facilitate goal-fulfilment and object tracking (Xiang et al., 2023: 1-3), such an example clarifies how grounding can arise from computational analogues of physical embodiment.
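By way of a minimal analogue (and only an analogue: what follows is ordinary tabular Q-learning on an invented one-dimensional corridor, not the meta-reinforcement-learning maze setup reported by Jin & Jia), the sketch below shows an agent whose entire ‘world’ is a simulated environment, and whose knowledge of where the goal lies is acquired solely through closed-loop interaction with it.

```python
import random

# Hedged sketch of virtual embodiment: a simulated corridor in which an agent
# learns where the goal is purely from closed-loop interaction. The positions,
# rewards and learning rates are invented for illustration.

CORRIDOR_LENGTH, GOAL = 6, 5
ACTIONS = (-1, +1)  # step left, step right

def step(position, action):
    """The simulated environment: the only 'world' the agent ever contacts."""
    new_position = min(max(position + action, 0), CORRIDOR_LENGTH - 1)
    reward = 1.0 if new_position == GOAL else 0.0
    return new_position, reward, new_position == GOAL

q = {(s, a): 0.0 for s in range(CORRIDOR_LENGTH) for a in ACTIONS}

for episode in range(200):
    position, done = 0, False
    while not done:
        # Explore when the values are tied or (occasionally) at random; otherwise exploit.
        if random.random() < 0.3 or q[(position, -1)] == q[(position, +1)]:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(position, a)])
        new_position, reward, done = step(position, action)
        # Feedback from the simulated environment updates the agent's value estimates.
        best_next = max(q[(new_position, a)] for a in ACTIONS)
        q[(position, action)] += 0.5 * (reward + 0.9 * best_next - q[(position, action)])
        position = new_position

# After training, the greedy action points towards the goal from every non-goal state.
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(CORRIDOR_LENGTH)})
```

Nothing in the loop is physical, yet the learned values are grounded in the same sense that matters for the argument: they are shaped entirely by the consequences of the agent’s own actions within its environment.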
Although I have taken embodiment to be a prerequisite for world-modelling and therefore AGI, it cannot constitute the sole criterion. I anticipate the criticism that this would produce a slippery slope wherein an unjustifiably wide range of embodied animals or agents, however rudimentary, constitute intelligence candidates. Without language, it may seem unclear how we could track flexibility or sequential learning in AI systems, potentially fuelling a concern that characterisations of general intelligence which depart from human parameters are speculative and ungrounded. However, I maintain that the distinction between human intelligence and general intelligence must be retained to avoid the two pitfalls of narrow task-specific tests discussed in this essay: language as a sufficient criterion is a highly gameable ability which AI systems can meet without semantic grounding by virtue of their training data and priors, and language as a necessary criterion constitutes an excessively restrictive anthropocentric desideratum that neglects alternative manifestations of intelligent behaviour. As a solution, I suggest building upon the problem-solving behaviours that ground the attribution of intelligence to the octopus, by measuring the capacity for adaptive learning and broad generalisation when met with novel situations. Chollet’s (2019: 8-15) universal psychometrics framework provides a robust approach to operationalising this capacity, by assessing efficient skill acquisition across unfamiliar tasks. Such acquisition, I argue, presupposes causal understanding, which in turn depends on internal representations of environmental structure and the application of experiential priors, acquired through learning in an interactive environment. The focus on developer-aware generalisation, wherein the relevant problems are known neither to the AI system nor to its developer, eliminates the possibility that hard-coded heuristic rules are serving as a substitute for genuine understanding. This metric, passable independently of language and other acquired human abilities, is arguably the most effective means of ascertaining adaptable goal-oriented planning and novelty robustness. If embodied, there is no reason in principle why an AI system should not successfully meet this criterion and be regarded as intelligent.
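As a loose gloss on this efficiency intuition (and only a gloss: Chollet’s formalism additionally weights tasks by generalisation difficulty and treats priors and experience far more carefully, and the numbers below are invented), the toy function sketches what it means to score skill acquisition rather than skill itself.

```python
def skill_acquisition_efficiency(skill_on_unseen_tasks, priors, experience):
    """Toy ratio: skill attained on tasks unknown to both system and developer,
    relative to the priors built in and the experience consumed.
    A crude stand-in for Chollet's (2019) far richer formalism."""
    return skill_on_unseen_tasks / (priors + experience)

# A system that "buys" high skill with vast priors and training data scores
# lower than one reaching modest skill from little experience (invented numbers).
print(skill_acquisition_efficiency(0.9, priors=100.0, experience=1000.0))  # ~0.0008
print(skill_acquisition_efficiency(0.7, priors=1.0, experience=10.0))      # ~0.064
```

On this way of scoring, the earlier worry about “buying arbitrary levels of skill” is directly penalised, since priors and experience sit in the denominator.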
To conclude, I have argued that language ability is an unsatisfactory criterion in investigating the possibility of AGI, on the grounds that it is neither adequate evidence of nor a necessary precondition for world-modelling and, by extension, understanding. Whether an AI system is considered intelligent should not depend on its capacity for dialogue. I have argued that such internal representations can emerge from grounding in a virtual environment, thereby identifying embodiment as a necessary criterion for intelligence. Finally, I suggested that the requisite test of general intelligence should focus on the identification of its associated abilities - flexible, adaptive approaches to diverse sets of novel problems - and that it is reasonable to assume that AI systems can in principle fulfil this desideratum.
References
Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.610–623. https://doi.org/10.1145/3442188.3445922
Birch, J., 2024. Large language models and the gaming problem. In: The Edge of Sentience, pp.313–322. https://doi.org/10.1093/9780191966729.003.0017
Chollet, F., 2019. On the measure of intelligence. arXiv preprint arXiv:1911.01547. Available at: https://arxiv.org/abs/1911.01547 [Accessed 24 Mar. 2025].
Coelho Mollo, D., 2022. Intelligent behaviour. Erkenntnis, 89(2), pp.705–721. https://doi.org/10.1007/s10670-022-00552-8
Cowart, M., 2005. Embodied cognition. [online] Internet Encyclopedia of Philosophy. Available at: https://iep.utm.edu/embodied-cognition/ [Accessed 15 Apr. 2025].
Diester, I., Bartos, M., Bödecker, J., Kortylewski, A., Leibold, C., Letzkus, J., Nour, M.M., Schönauer, M., Straw, A., Valada, A., Vlachos, A. and Brox, T., 2024. Internal world models in humans, animals, and AI. Neuron, 112(16), pp.2783–2801. https://doi.org/10.1016/j.neuron.2024.06.019
Heyes, C., 2012. Grist and mills: On the cultural origins of cultural learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1599), pp.2181–2191. https://doi.org/10.1098/rstb.2012.0120
Jin, L. and Jia, L., 2025. Embodied world models emerge from navigational task in open-ended environments. arXiv preprint arXiv:2504.11419. https://doi.org/10.48550/arXiv.2504.11419
Jones, C.R. and Bergen, B.K., 2025. Large language models pass the Turing test. arXiv preprint arXiv:2503.23674. Available at: https://arxiv.org/abs/2503.23674 [Accessed 22 Apr. 2025].
Landgrebe, J. and Smith, B., 2019. There is no artificial general intelligence. arXiv preprint arXiv:1906.05833. Available at: https://arxiv.org/abs/1906.05833 [Accessed 24 Mar. 2025].
LeCun, Y., 2022. A path towards autonomous machine intelligence (Version 0.9.2). OpenReview. Available at: https://openreview.net/pdf?id=BZ5a1r-kVsf [Accessed 23 Mar. 2025].
Lyre, H., 2024. "Understanding AI": Semantic grounding in large language models. arXiv preprint arXiv:2402.10992. https://doi.org/10.48550/arXiv.2402.10992
Mather, J.A., 2019. What is in an octopus’s mind? Animal Sentience, 4(26). https://doi.org/10.51291/2377-7478.1370
Mitchell, M., 2024a. Debates on the nature of artificial general intelligence. Science, 383(6689). https://doi.org/10.1126/science.ado7069
Mitchell, M., 2024b. The Turing test and our shifting conceptions of intelligence. Science, 385(6710). https://doi.org/10.1126/science.adq9356
Mitchell, M. and Krakauer, D.C., 2023. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13). https://doi.org/10.1073/pnas.2215907120
Oppy, G. and Dowe, D., 2021. The Turing test. Stanford Encyclopedia of Philosophy. Available at: https://plato.stanford.edu/archives/win2021/entries/turing-test/ [Accessed 15 Apr. 2025].
Pennachin, C. and Goertzel, B., 2007. Contemporary approaches to artificial general intelligence. In: Artificial General Intelligence (Cognitive Technologies), pp.1–30. https://doi.org/10.1007/978-3-540-68677-4_1
Sorin, V., Brin, D., Barash, Y., Konen, E., Charney, A., Nadkarni, G. and Klang, E., 2024. Large language models and empathy: Systematic review. Journal of Medical Internet Research, 26(1), p.e52597. https://doi.org/10.2196/52597
Xiang, J., Tao, T., Gu, Y., Shu, T., Wang, Z., Yang, Z. and Hu, Z., 2023. Language models meet world models: Embodied experiences enhance language models. arXiv preprint arXiv:2305.10626. https://doi.org/10.48550/arXiv.2305.10626


