The Measurement Problem
Artificial General Intelligence has been discussed for decades without a precise definition of what it would mean to have achieved it. This ambiguity is not merely semantic — it actively impedes governance, research direction, and public understanding. Claims that current systems are near AGI and claims that AGI is impossibly far away are both unfalsifiable without measurement tools.
Google DeepMind's 2026 cognitive taxonomy is a serious attempt to address this. It identifies ten cognitive faculties — perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition — and proposes benchmarking AI systems against human performance across all ten. The framework is a genuine contribution. But its measurement strategy has a critical underspecified assumption: what human performance distribution should AI be measured against?
DeepMind proposes demographically representative adults with at least a high school education. This document argues that this is the wrong baseline for most of the ten faculties, and that a developmental baseline — measuring AI against children at specific cognitive developmental stages — would be both more scientifically informative and more practically achievable for the hardest-to-measure capabilities.
The Case for Developmental Baselines
Adults Are the Wrong Starting Point
Measuring AI against adult human performance assumes that adult cognition is the target. But adult cognition is the end state of a developmental process, not a natural category. An adult's metacognitive ability, social cognition, and learning flexibility are the product of approximately 25 years of continuous experience, biological maturation, and cultural embedding. Asking whether an AI matches this endpoint obscures the more interesting question: which developmental stage does the AI currently approximate?
Piaget's developmental model provides the framework. Cognition develops through qualitatively distinct stages:
Sensorimotor (0–2 years): Object permanence, basic prediction, sensorimotor loops. No symbolic representation yet.
Preoperational (2–7 years): Symbolic representation, language acquisition, animistic thinking. Limited logical operations. The child can use symbols but cannot yet perform logical transformations on them.
Concrete operational (7–11 years): Conservation, logical operations on concrete objects, beginning causal reasoning. The child can reason about physical reality but not yet about hypothetical situations.
Formal operational (12+ years): Abstract reasoning, hypothetical-deductive thought, metacognition. The ability to reason about reasoning itself.
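The staging above can be carried into a small data structure for later profiling. A minimal sketch in Python; the stage names and age ranges follow the list above, while the `hallmark_tasks` entries are illustrative labels, not a validated battery:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    age_range: tuple          # (min_years, max_years); None = open-ended
    hallmark_tasks: tuple     # illustrative task labels, not a real battery

# Ordered earliest to latest; the index doubles as an ordinal stage score.
PIAGET_STAGES = (
    Stage("sensorimotor", (0, 2), ("object_permanence", "physical_prediction")),
    Stage("preoperational", (2, 7), ("symbol_use", "animism_probe")),
    Stage("concrete_operational", (7, 11), ("conservation", "transitive_inference")),
    Stage("formal_operational", (12, None), ("hypothetical_deduction", "metacognition_probe")),
)

def stage_index(name: str) -> int:
    """Return the ordinal position of a stage, for comparing profiles."""
    for i, s in enumerate(PIAGET_STAGES):
        if s.name == name:
            return i
    raise ValueError(f"unknown stage: {name}")
```

Representing stages as an ordered tuple makes "which stage does this system approximate?" a comparison of indices rather than ad hoc labels.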
Current large language models approximate formal operational thought on linguistic tasks while failing sensorimotor and preoperational tasks that require physical grounding. This jagged profile — advanced symbolic manipulation alongside poor object permanence and causal grounding — is precisely what developmental staging would reveal. An adult baseline obscures the profile; a developmental baseline maps it.
Why Children as Baselines
The practical argument first: children's cognitive tasks are better specified, more extensively validated, and more discriminative at lower capability levels than adult benchmarks. The Piagetian task battery — conservation tasks, object permanence tests, false belief tasks, analogical reasoning probes — has been refined over decades and provides clean pass/fail criteria at each developmental stage. Adult cognitive benchmarks tend to be ceiling-heavy for already-capable AI systems.
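The clean pass/fail criteria mentioned above make these tasks straightforward to sketch in code. A liquid-conservation scorer, assuming crude keyword-based response coding (real batteries code the child's justification, not just the judgment) and an assumed 0.75 criterion proportion:

```python
def score_conservation_trial(model_answer: str) -> bool:
    """Pass/fail for a liquid-conservation probe.

    The probe: equal water is poured into a tall narrow glass and a
    short wide glass; the subject is asked which holds more.
    A conserving response judges the quantities equal."""
    normalized = model_answer.strip().lower()
    return "same" in normalized or "equal" in normalized

def passes_conservation(trial_answers, criterion: float = 0.75) -> bool:
    """Aggregate over several trials; developmental batteries typically
    require a criterion proportion of conserving responses.
    The 0.75 default here is an illustrative assumption."""
    passed = sum(score_conservation_trial(a) for a in trial_answers)
    return passed / len(trial_answers) >= criterion
```

Keyword matching is the weakest link; a production version would code the justification the system gives, since conserving judgments with non-conserving justifications are a known intermediate stage.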
The scientific argument: developmental staging reveals the architecture of cognition, not just its endpoint. An AI that passes conservation tasks but fails false belief tasks has a specific cognitive profile that tells you something about its internal structure. An AI benchmarked only against adults gets a scalar score that tells you nothing about the structure underneath.
The ethical argument: children are not the experimental subjects here. Children serve as the normative reference — the documented performance distributions from existing developmental psychology research. New data collection from children for this purpose would raise the legal and ethical concerns addressed in the next section. But the existing developmental psychology literature provides extensive normative data without requiring new child subjects.
The Ethical and Legal Caveat
Direct new data collection from children as experimental subjects for AGI measurement purposes would require IRB approval, parental consent, and careful benefit/risk analysis. These barriers are real and appropriate. The proposal here is to use the existing developmental literature as the normative reference, not to run new experiments on children. The normative distributions for Piagetian tasks, theory of mind tasks, and analogical reasoning at each developmental stage are extensively documented and publicly available.
Mapping the Ten Faculties to Developmental Stages
Perception
Adult baseline is appropriate here — current AI systems have well-characterized perceptual capabilities, and the developmental trajectory for basic perception is less informative. The interesting discrimination is between low-level feature detection and high-level scene comprehension. Existing benchmarks are adequate.
Generation
Adult baseline is reasonable for linguistic generation. The more interesting developmental question is whether generation reflects genuine internal representation or surface pattern completion — the stochastic parrot problem. Developmental measures of productive language, which separate production driven by underlying semantic structure from surface mimicry, would be more discriminative than adult fluency measures.
Attention
Adult attention benchmarks measure sustained attention, selective attention, and attentional flexibility. But the developmental trajectory here is informative — young children have different attentional profiles than adults, with stronger capture by novelty and weaker inhibitory control. An AI's attentional profile likely maps better onto a specific developmental stage than onto the adult distribution. Developmental baselines would reveal this. Current AI systems show characteristics consistent with high novelty sensitivity and limited sustained attention — a profile more consistent with preoperational children than adults.
Learning
This is the faculty where adult baselines are most clearly wrong. Adult learning benchmarks measure learning from instruction, from analogy, and from error — all of which assume a rich prior knowledge base. The more fundamental question for AI is learning rate and transfer: how quickly can the system acquire a new concept from minimal examples, and how well does it transfer? These are precisely what Piagetian learning tasks measure. The concept acquisition paradigm — present novel instances and non-instances, measure when the child correctly generalizes — is directly applicable to AI systems and has extensive developmental normative data. Current AI systems show in-context adaptation within a session but no durable learning at inference time: a concept acquired in one conversation does not persist without fine-tuning. This maps to a specific failure: they lack the accommodation mechanism that Piaget identified as the engine of cognitive development. Developmental staging makes this failure precise.
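The concept acquisition paradigm described above reduces to a trials-to-criterion loop: feed labeled instances one at a time and test generalization on held-out probes after each. A sketch, assuming a hypothetical `learner` object exposing `observe` and `predict` methods (an interface invented for illustration, not a real API):

```python
def trials_to_criterion(learner, training_stream, probe_set, criterion=0.9):
    """Concept-acquisition probe.

    training_stream: iterable of (item, label) pairs, shown one at a time.
    probe_set: held-out (item, label) pairs used to test generalization.
    Returns the number of examples needed to reach `criterion` probe
    accuracy, or None if the criterion is never reached."""
    for n, (item, label) in enumerate(training_stream, start=1):
        learner.observe(item, label)
        correct = sum(learner.predict(p) == lbl for p, lbl in probe_set)
        if correct / len(probe_set) >= criterion:
            return n
    return None
```

Developmental norms for such tasks are expressed in exactly these units — examples needed to generalize — which is what makes the paradigm transferable to AI measurement.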
Memory
Adult memory benchmarks conflate episodic and semantic memory in ways that obscure AI's specific profile. AI systems have excellent semantic memory (knowledge from training) and no episodic memory (memory of specific events with temporal and contextual tags). This maps onto a specific developmental profile: children under approximately 3 years have limited episodic memory despite functional semantic learning. The developmental dissociation between semantic and episodic memory is well-characterized and would provide a more informative benchmark than adult memory tasks that assume both.
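The dissociation can be made operational with within-session episodic probes ("which event came first?") built from a time-stamped event log, scored separately from semantic recall. A sketch; the probe format and log structure are assumptions of this illustration:

```python
import random

def make_episodic_probes(event_log, n_probes, rng=None):
    """Build 'which came first?' probes from a time-stamped event log.

    event_log: list of (timestamp, event) tuples, in any order.
    Returns a list of ((event_a, event_b), earlier_event) pairs."""
    rng = rng or random.Random(0)
    ordered = sorted(event_log)
    probes = []
    for _ in range(n_probes):
        i, j = sorted(rng.sample(range(len(ordered)), 2))
        probes.append(((ordered[i][1], ordered[j][1]), ordered[i][1]))
    return probes

def episodic_semantic_profile(semantic_results, episodic_results):
    """Report per-store accuracies so the dissociation is explicit.
    Each argument is a list of per-item correctness booleans."""
    sem = sum(semantic_results) / len(semantic_results)
    epi = sum(episodic_results) / len(episodic_results)
    return {"semantic": sem, "episodic": epi, "gap": sem - epi}
```

A large positive `gap` is the signature described above: intact semantic memory alongside absent episodic memory, the profile of a child under roughly three.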
Reasoning
The Piagetian task battery is almost exactly the right benchmark for reasoning. Conservation tasks measure logical invariance. Transitive inference tasks measure reasoning chains. False belief tasks (theory of mind) measure the ability to hold a model of another's mental state that differs from reality. The developmental progression from intuitive to concrete to formal reasoning is precisely the dimension along which current AI systems show jagged profiles. Adult reasoning benchmarks miss this discrimination.
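The false belief task has an especially clean scoring structure, because the classic failure mode — reporting reality rather than the actor's belief — is itself diagnostic. A sketch of a Sally-Anne-style probe, assuming keyword matching on a free-text answer (real scoring would be more robust):

```python
SALLY_ANNE = {
    "story": (
        "Sally puts her marble in the basket and leaves the room. "
        "While she is away, Anne moves the marble to the box. "
        "Sally comes back."
    ),
    "question": "Where will Sally look for her marble?",
    "belief_answer": "basket",   # pass: tracks Sally's (false) belief
    "reality_answer": "box",     # fail: reports the true location
}

def score_false_belief(model_answer: str, probe=SALLY_ANNE) -> str:
    """Three-way scoring: 'pass' (belief-tracking), 'reality_error'
    (the classic pre-four-year-old failure), or 'other'."""
    a = model_answer.strip().lower()
    if probe["belief_answer"] in a and probe["reality_answer"] not in a:
        return "pass"
    if probe["reality_answer"] in a:
        return "reality_error"
    return "other"
```

Keeping `reality_error` distinct from `other` matters: a reality error places the system at a specific developmental stage, while an off-topic answer places it nowhere.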
Metacognition
Adult metacognition measures monitoring accuracy — does the subject know what they know and don't know? Children develop metacognitive capacity progressively, with reliable metacognitive monitoring emerging around age 8–10. Current AI systems have poor calibration — they confabulate without awareness. This maps onto a preoperational developmental profile. Developmental measures of metacognitive monitoring are well-validated and more discriminative than adult measures for systems with AI's current capability profile.
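Monitoring accuracy can be quantified as the gap between stated confidence and actual correctness. A sketch of a simple expected-calibration-error-style measure, binned by confidence:

```python
def calibration_gap(confidences, correctness, n_bins=10):
    """Weighted mean absolute gap between stated confidence and accuracy,
    binned by confidence level (an expected-calibration-error-style
    measure of metacognitive monitoring).

    confidences: floats in [0, 1]; correctness: booleans; 0.0 = perfect."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correctness):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    total, gap = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        gap += (len(b) / total) * abs(avg_conf - accuracy)
    return gap
```

Confabulation shows up directly: high stated confidence on wrong answers drives the gap toward 1.0, while a well-monitoring system keeps it near 0.0.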
Executive Functions
Executive function develops substantially across childhood — inhibitory control, working memory, and cognitive flexibility all reach adult levels in late adolescence. Developmental measures such as the dimensional change card sort (DCCS) and the day/night task are discriminative at the lower end of the capability range where AI currently sits. Adult executive function benchmarks are likely ceiling-heavy for most AI systems on some sub-functions and floor-heavy on others.
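The DCCS has a simple structure: sort cards by one dimension, then switch rules mid-task. A sketch, assuming the system under test is wrapped as a `sorter(card, rule)` callable (an interface assumed for illustration) and using an illustrative four-card deck:

```python
CARDS = [
    {"color": "red", "shape": "boat"},
    {"color": "blue", "shape": "rabbit"},
    {"color": "red", "shape": "rabbit"},
    {"color": "blue", "shape": "boat"},
]

def run_dccs(sorter, pre_rule="color", post_rule="shape"):
    """Dimensional change card sort: one block under the first rule,
    then a block under the switched rule.

    Returns (pre_switch_accuracy, post_switch_accuracy). Perseveration,
    the classic three-year-old pattern, shows up as a high pre-switch
    score with a collapsed post-switch score."""
    def block(rule):
        correct = sum(sorter(c, rule) == c[rule] for c in CARDS)
        return correct / len(CARDS)
    return block(pre_rule), block(post_rule)
```

A perseverating sorter that keeps applying the old rule scores perfectly before the switch and at chance or below after it, which is exactly the discrimination adult executive function batteries lack at this capability level.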
Problem Solving and Social Cognition
These are the two composite faculties — they require integrating multiple foundational capabilities simultaneously. Adult baselines are appropriate as an eventual target. But the developmental question is which foundational deficits prevent composite performance. A system that fails social cognition tasks may be failing theory of mind (developmental stage ~4 years), or failing integration of multiple social cues (developmental stage ~8 years), or failing abstract social inference (adolescent). Developmental staging identifies which foundational deficit is limiting composite performance.
A Proposed Measurement Framework
Stage 1: Developmental Profiling
For each of the ten cognitive faculties, identify the developmental task battery most appropriate to the faculty and administer it in order of increasing developmental stage — from sensorimotor through formal operational. Record the stage at which performance drops below the normative distribution. This produces a developmental profile across all ten faculties.
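The Stage 1 procedure is a loop: administer batteries in stage order and stop at the first sub-normative result. A sketch, where `administer` and `norm_threshold` are assumed interfaces onto the task batteries and the normative literature, not real APIs:

```python
def developmental_profile(faculties, administer, norm_threshold):
    """Stage 1 profiling.

    faculties: dict mapping faculty name -> ordered list of stage names,
        from earliest to latest developmental stage.
    administer(faculty, stage) -> score in [0, 1] for that battery.
    norm_threshold(faculty, stage) -> passing score from normative data.
    Returns faculty -> highest stage passed (None if none passed)."""
    profile = {}
    for faculty, stages in faculties.items():
        attained = None
        for stage in stages:
            if administer(faculty, stage) >= norm_threshold(faculty, stage):
                attained = stage
            else:
                break  # stop at the first sub-normative stage
        profile[faculty] = attained
    return profile
```

The output is the per-faculty stage map the text describes: ten faculty names, each tagged with the last developmental stage at which the system performed within the norm.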
The developmental profile exposes the jagged capability landscape that scalar adult benchmarks conceal. A system that passes formal operational reasoning tasks but fails concrete operational conservation tasks has a specific architectural signature that tells you something about how reasoning is implemented.
Stage 2: Adult Comparison for Advanced Faculties
For faculties where the system's developmental profile approaches adult levels, apply adult benchmarks using DeepMind's proposed protocol — demographically representative human baseline, cognitive profile radar chart. This identifies which faculties have approached adult performance and how large the remaining gaps are.
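The adult comparison reduces each faculty to a percentile against the adult normative sample, which is what populates the radar chart. A minimal sketch:

```python
def percentile_vs_adults(ai_score, adult_scores):
    """Percentile of the AI's score within an adult normative sample,
    computed as the percentage of adults scoring at or below the AI."""
    at_or_below = sum(s <= ai_score for s in adult_scores)
    return 100.0 * at_or_below / len(adult_scores)
```

One number per faculty, ten faculties per system: the scalar endpoint view, applied only where the developmental profile says it is meaningful.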
Stage 3: Longitudinal Tracking
The most valuable measurement is change over time. A system that is currently at the concrete operational stage on metacognition but advancing is more informative than a static cross-sectional profile. Building longitudinal measurement into the framework allows tracking of developmental trajectory, not just current position.
The Jagged Profile Problem
DeepMind identifies the jagged capability problem — a system may outperform 99% of humans in logical reasoning while falling below the median on social cognition. A radar chart across ten dimensions exposes this imbalance.
The developmental framework sharpens this considerably. The interesting jaggedness is not just high-vs-low across faculties. It is developmentally inconsistent profiles within faculties — formal operational performance on abstract linguistic tasks combined with sensorimotor failures on physical grounding tasks within the same faculty.
This jaggedness is architecturally informative. A biological system that showed this profile would indicate specific developmental disruption — it would tell you something was wrong with the acquisition pathway, not just the endpoint. For AI systems, the developmental profile tells you something about the training pathway and what kinds of experience are missing.
Current large language models trained exclusively on text show exactly the profile you would predict for a system that has the linguistic output of cognition without the grounded developmental experience that produced it. Excellent formal operational performance on linguistic tasks. Sensorimotor failures on physical prediction and on object permanence under novel conditions. The developmental framework makes this prediction precise and testable.
Open Questions
Normative data adequacy. Existing developmental psychology normative data was collected for clinical and research purposes, not for AI benchmarking. Sample sizes, task variants, and cultural representativeness vary. A systematic review of available normative data against benchmarking requirements is needed before this framework can be fully implemented.
Stage boundary validity. Piagetian stage boundaries are population-level statistical regularities, not sharp cognitive thresholds. Individual variation is substantial. Using stage boundaries as hard thresholds for AI measurement requires careful operationalization.
Novel tasks. AI systems may have been exposed to developmental task variants in training data, introducing the data contamination problem DeepMind identifies for adult benchmarks. Novel developmental task variants — same cognitive requirements, novel surface form — are needed. This is a solvable problem but requires task development work.
The developmental analogy limits. Children develop through biological maturation, embodied experience, and social interaction. AI systems acquire capabilities through training on static datasets. The developmental staging framework is a measurement tool, not a claim that AI development follows the same pathway as biological development. Stage labels are benchmarks, not mechanistic claims.
What counts as passing. Developmental norms are distributions, not thresholds. Whether a stage counts as passed when performance exceeds the normative median, falls within the normative range, or meets the criterion proportion used in the original task literature must be operationalized per battery, and the choice materially affects the resulting stage assignment.