23 March 2026

 

Toward a Theory of Measuring AGI

The Measurement Problem

Artificial General Intelligence has been discussed for decades without a precise definition of what it would mean to have achieved it. This ambiguity is not merely semantic — it actively impedes governance, research direction, and public understanding. Claims that current systems are near AGI and claims that AGI is impossibly far away are both unfalsifiable without measurement tools.

Google DeepMind's 2026 cognitive taxonomy is a serious attempt to address this. It identifies ten cognitive faculties — perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition — and proposes benchmarking AI systems against human performance across all ten. The framework is a genuine contribution. But its measurement strategy has a critical underspecified assumption: what human performance distribution should AI be measured against?

DeepMind proposes demographically representative adults with at least a high school education. This document argues that this is the wrong baseline for most of the ten faculties, and that a developmental baseline — measuring AI against children at specific cognitive developmental stages — would be both more scientifically informative and more practically achievable for the hardest-to-measure capabilities.


The Case for Developmental Baselines

Adults Are the Wrong Starting Point

Measuring AI against adult human performance assumes that adult cognition is the target. But adult cognition is the end state of a developmental process, not a natural category. An adult's metacognitive ability, social cognition, and learning flexibility are the product of approximately 25 years of continuous experience, biological maturation, and cultural embedding. Asking whether an AI matches this endpoint obscures the more interesting question: which developmental stage does the AI currently approximate?

Piaget's developmental model provides the framework. Cognition develops through qualitatively distinct stages:

Sensorimotor (0–2 years): Object permanence, basic prediction, sensorimotor loops. No symbolic representation yet.

Preoperational (2–7 years): Symbolic representation, language acquisition, animistic thinking. Limited logical operations. The child can use symbols but cannot yet perform logical transformations on them.

Concrete operational (7–11 years): Conservation, logical operations on concrete objects, beginning causal reasoning. The child can reason about physical reality but not yet about hypothetical situations.

Formal operational (12+ years): Abstract reasoning, hypothetical-deductive thought, metacognition. The ability to reason about reasoning itself.

Current large language models approximate formal operational thought on linguistic tasks while failing preoperational tasks that require physical grounding. This jagged profile — advanced symbolic manipulation, poor object permanence and causal grounding — is precisely what developmental staging would reveal. An adult baseline obscures the profile. A developmental baseline maps it.

Why Children as Baselines

The practical argument first: children's cognitive tasks are better specified, more extensively validated, and more discriminative at lower capability levels than adult benchmarks. The Piagetian task battery — conservation tasks, object permanence tests, false belief tasks, analogical reasoning probes — has been refined over decades and provides clean pass/fail criteria at each developmental stage. Adult cognitive benchmarks tend to be ceiling-heavy for already-capable AI systems.

The scientific argument: developmental staging reveals the architecture of cognition, not just its endpoint. An AI that passes conservation tasks but fails false belief tasks has a specific cognitive profile that tells you something about its internal structure. An AI benchmarked only against adults gets a scalar score that tells you nothing about the structure underneath.

The ethical argument: children are not the experimental subjects here. Children serve as the normative reference, via the documented performance distributions from existing developmental psychology research. New data collection from children for this purpose would raise the legal and ethical concerns addressed in the next section. But the existing developmental psychology literature provides extensive normative data without requiring new child subjects.

The Ethical and Legal Caveat

Direct new data collection from children as experimental subjects for AGI measurement purposes would require IRB approval, parental consent, and careful benefit/risk analysis. These barriers are real and appropriate. The proposal here is to use the existing developmental literature as the normative reference, not to run new experiments on children. The normative distributions for Piagetian tasks, theory of mind tasks, and analogical reasoning at each developmental stage are extensively documented and publicly available.


Mapping the Ten Faculties to Developmental Stages

Perception

Adult baseline is appropriate here — current AI systems have well-characterized perceptual capabilities, and the developmental trajectory for basic perception is less informative. The interesting discrimination is between low-level feature detection and high-level scene comprehension. Existing benchmarks are adequate.

Generation

Adult baseline is reasonable for linguistic generation. The more interesting developmental question is whether generation reflects genuine internal representation or surface pattern completion — the stochastic parrot problem. Developmental measures of generative language — whether production reflects underlying semantic structure versus surface mimicry — would be more discriminative than adult fluency measures.

Attention

Adult attention benchmarks measure sustained attention, selective attention, and attentional flexibility. But the developmental trajectory here is informative — young children have different attentional profiles than adults, with stronger capture by novelty and weaker inhibitory control. An AI's attentional profile likely maps better onto a specific developmental stage than onto the adult distribution. Developmental baselines would reveal this. Current AI systems show characteristics consistent with high novelty sensitivity and limited sustained attention — a profile more consistent with preoperational children than adults.

Learning

This is the faculty where adult baselines are most clearly wrong. Adult learning benchmarks measure learning from instruction, from analogy, and from error — all of which assume a rich prior knowledge base. The more fundamental question for AI is learning rate and transfer: how quickly can the system acquire a new concept from minimal examples, and how well does it transfer? These are precisely what Piagetian learning tasks measure. The concept acquisition paradigm — present novel instances and non-instances, measure when the child correctly generalizes — is directly applicable to AI systems and has extensive developmental normative data. Current AI systems cannot learn at all at inference time without fine-tuning. This maps to a specific failure: they lack the accommodation mechanism that Piaget identified as the engine of cognitive development. Developmental staging makes this failure precise.
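The concept-acquisition paradigm described above can be sketched as a test harness: present novel instances and non-instances one at a time, and record the trial at which the learner first generalizes correctly to a held-out set. This is a minimal sketch under my own assumptions; the hidden concept, the RuleLearner stand-in, and all names are hypothetical, and a real evaluation would swap in the AI system under test.

```python
import random

# Hypothetical harness for the concept-acquisition paradigm: present novel
# instances and non-instances, and record the trial at which the learner
# first generalizes correctly to held-out items. The hidden concept and the
# toy learner are illustrative placeholders, not from the original text.

random.seed(0)

def is_target_concept(obj):
    """Hidden target concept: red AND round (unknown to the learner)."""
    return obj["color"] == "red" and obj["shape"] == "round"

def sample_object():
    return {"color": random.choice(["red", "blue"]),
            "shape": random.choice(["round", "square"])}

class RuleLearner:
    """Toy stand-in for the system under test: memorizes feature pairs
    seen with positive labels and predicts positive only for those."""
    def __init__(self):
        self.positives = set()
    def observe(self, obj, label):
        if label:
            self.positives.add((obj["color"], obj["shape"]))
    def predict(self, obj):
        return (obj["color"], obj["shape"]) in self.positives

def trials_to_criterion(learner, holdout, max_trials=100):
    """Return the first trial after which the learner classifies the
    entire held-out set correctly, or None if it never does."""
    for t in range(1, max_trials + 1):
        obj = sample_object()
        learner.observe(obj, is_target_concept(obj))
        if all(learner.predict(o) == is_target_concept(o) for o in holdout):
            return t
    return None

holdout = [{"color": c, "shape": s} for c in ("red", "blue")
           for s in ("round", "square")]
print(trials_to_criterion(RuleLearner(), holdout))
```

The developmental normative data then enter as the comparison: the trial count at which children of a given stage reach the same criterion.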

Memory

Adult memory benchmarks conflate episodic and semantic memory in ways that obscure AI's specific profile. AI systems have excellent semantic memory (knowledge from training) and no episodic memory (memory of specific events with temporal and contextual tags). This maps onto a specific developmental profile: children under approximately 3 years have limited episodic memory despite functional semantic learning. The developmental dissociation between semantic and episodic memory is well-characterized and would provide a more informative benchmark than adult memory tasks that assume both.

Reasoning

The Piagetian task battery is almost exactly the right benchmark for reasoning. Conservation tasks measure logical invariance. Transitive inference tasks measure reasoning chains. False belief tasks (theory of mind) measure the ability to hold a model of another's mental state that differs from reality. The developmental progression from intuitive to concrete to formal reasoning is precisely the dimension along which current AI systems show jagged profiles. Adult reasoning benchmarks miss this discrimination.

Metacognition

Adult metacognition measures monitoring accuracy — does the subject know what they know and don't know? Children develop metacognitive capacity progressively, with reliable metacognitive monitoring emerging around age 8-10. Current AI systems have poor calibration — they confabulate without awareness. This maps onto a preoperational developmental profile. Developmental measures of metacognitive monitoring are well-validated and more discriminative than adult measures for systems with AI's current capability profile.
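Monitoring accuracy of the kind described above can be scored by comparing a system's stated confidence on each answer with its actual accuracy. Below is a minimal sketch of one such score, in the spirit of expected-calibration-error measures; the data are invented placeholders, not results from any system.

```python
# A minimal sketch of scoring metacognitive monitoring: compare stated
# confidence with actual accuracy, binned by confidence level. The sample
# data below are invented placeholders, not measurements of any real system.

def calibration_gap(results, n_bins=5):
    """Weighted mean absolute gap between mean confidence and accuracy per
    confidence bin (an expected-calibration-error style score; 0 is perfect)."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in results:
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))
    gap, total = 0.0, len(results)
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        gap += abs(mean_conf - accuracy) * len(b) / total
    return gap

# Placeholder data: (stated confidence, whether the answer was correct).
# A confabulating system reports high confidence even when wrong.
overconfident = [(0.95, True), (0.9, False), (0.92, False), (0.97, True)]
well_calibrated = [(0.5, True), (0.5, False), (0.9, True), (0.1, False)]
print(calibration_gap(overconfident), calibration_gap(well_calibrated))
```

A confabulating system scores a large gap; a system that knows what it doesn't know scores near zero, which is the dimension the developmental monitoring tasks track.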

Executive Functions

Executive function develops substantially across childhood — inhibitory control, working memory, and cognitive flexibility all reach adult levels in late adolescence. Developmental measures such as the dimensional change card sort (DCCS) and the day/night task are discriminative at the lower end of the capability range where AI currently sits. Adult executive function benchmarks are likely ceiling-heavy for most AI systems on some sub-functions and floor-heavy on others.

Problem Solving and Social Cognition

These are the two composite faculties — they require integrating multiple foundational capabilities simultaneously. Adult baselines are appropriate as an eventual target. But the developmental question is which foundational deficits prevent composite performance. A system that fails social cognition tasks may be failing theory of mind (developmental stage ~4 years), or failing integration of multiple social cues (developmental stage ~8 years), or failing abstract social inference (adolescent). Developmental staging identifies which foundational deficit is limiting composite performance.


A Proposed Measurement Framework

Stage 1: Developmental Profiling

For each of the ten cognitive faculties, identify the developmental task battery most appropriate to the faculty and administer it in order of increasing developmental stage — from sensorimotor through formal operational. Record the stage at which performance drops below the normative distribution. This produces a developmental profile across all ten faculties.

The developmental profile exposes the jagged capability landscape that scalar adult benchmarks conceal. A system that passes formal operational reasoning tasks but fails concrete operational conservation tasks has a specific architectural signature that tells you something about how reasoning is implemented.
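The Stage 1 loop can be sketched in code. This is a minimal sketch under two assumptions of mine: each faculty has a task battery per developmental stage, and "passing" a stage means scoring at or above a normative threshold. The stage names follow the Piagetian sequence in the text; the task names, threshold, and toy system are hypothetical.

```python
# Sketch of Stage 1 profiling: for each faculty, run task batteries in
# order of increasing developmental stage and record the last stage at
# which performance stayed at or above a normative threshold. Task names
# and the threshold value are hypothetical placeholders.

STAGES = ["sensorimotor", "preoperational", "concrete", "formal"]

def developmental_profile(system, batteries, threshold=0.5):
    """Map each faculty to the highest stage passed (None if none passed)."""
    profile = {}
    for faculty, stage_tasks in batteries.items():
        reached = None
        for stage in STAGES:
            tasks = stage_tasks.get(stage, [])
            if not tasks:
                continue
            score = sum(system(task) for task in tasks) / len(tasks)
            if score < threshold:
                break          # performance dropped below the norm
            reached = stage
        profile[faculty] = reached
    return profile

# Toy system: passes linguistic tasks, fails physically grounded ones --
# the jagged profile the text predicts for text-only models.
def toy_system(task):
    return 1.0 if "verbal" in task else 0.0

batteries = {
    "reasoning": {
        "preoperational": ["verbal-symbol-use"],
        "concrete": ["verbal-transitivity"],
        "formal": ["verbal-hypotheticals"],
    },
    "perception": {
        "sensorimotor": ["object-permanence-occlusion"],
    },
}
print(developmental_profile(toy_system, batteries))
```

The resulting per-faculty stage labels are the developmental profile; the jaggedness shows up directly as faculties stalled at different stages.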

Stage 2: Adult Comparison for Advanced Faculties

For faculties where the system's developmental profile approaches adult levels, apply adult benchmarks using DeepMind's proposed protocol: a demographically representative human baseline and a cognitive profile radar chart. This identifies which faculties have approached adult performance and how large the remaining gaps are.

Stage 3: Longitudinal Tracking

The most valuable measurement is change over time. A system that is currently at the concrete operational stage on metacognition but advancing is more informative than a static cross-sectional profile. Building longitudinal measurement into the framework allows tracking of developmental trajectory, not just current position.


The Jagged Profile Problem

DeepMind identifies the jagged capability problem — a system may outperform 99% of humans in logical reasoning while failing below the median on social cognition. A radar chart across ten dimensions exposes this imbalance.

The developmental framework sharpens this considerably. The interesting jaggedness is not just high-vs-low across faculties. It is developmentally inconsistent profiles within faculties — formal operational performance on abstract linguistic tasks combined with sensorimotor failures on physical grounding tasks within the same faculty.

This jaggedness is architecturally informative. A biological system that showed this profile would indicate specific developmental disruption — it would tell you something was wrong with the acquisition pathway, not just the endpoint. For AI systems, the developmental profile tells you something about the training pathway and what kinds of experience are missing.

Current large language models trained exclusively on text show exactly the profile you would predict for a system that has the encoded output of cognition without the grounded developmental experience that produced it. Excellent formal operational performance on linguistic tasks. Sensorimotor failures on physical prediction tasks. Preoperational failures on object permanence under novel conditions. The developmental framework makes this prediction precise and testable.


Open Questions

Normative data adequacy. Existing developmental psychology normative data was collected for clinical and research purposes, not for AI benchmarking. Sample sizes, task variants, and cultural representativeness vary. A systematic review of available normative data against benchmarking requirements is needed before this framework can be fully implemented.

Stage boundary validity. Piagetian stage boundaries are population-level statistical regularities, not sharp cognitive thresholds. Individual variation is substantial. Using stage boundaries as hard thresholds for AI measurement requires careful operationalization.

Novel tasks. AI systems may have been exposed to developmental task variants in training data, introducing the data contamination problem DeepMind identifies for adult benchmarks. Novel developmental task variants — same cognitive requirements, novel surface form — are needed. This is a solvable problem but requires task development work.

The developmental analogy limits. Children develop through biological maturation, embodied experience, and social interaction. AI systems acquire capabilities through training on static datasets. The developmental staging framework is a measurement tool, not a claim that AI development follows the same pathway as biological development. Stage labels are benchmarks, not mechanistic claims.

What counts as passing. DeepMind proposes comparing AI performance to the distribution of human performance — the proportion of humans the system outperforms. The same approach applies to developmental norms: at what percentile of same-stage children does the system perform? This operationalization is straightforward but requires agreement on which normative distributions to use.
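The percentile operationalization is straightforward to state in code. Below is a sketch: given a normative distribution of same-stage children's scores on a task, report the proportion of that distribution the system outperforms. The sample norms are invented placeholders, not real developmental data.

```python
from bisect import bisect_left, bisect_right

# Sketch of the percentile operationalization: what fraction of a normative
# sample of same-stage children does the system outperform? The norm values
# below are invented placeholders, not real developmental data.

def percentile_vs_norms(system_score, norm_scores):
    """Fraction of the normative sample scoring below the system,
    counting ties as half (the standard mid-rank convention)."""
    ranked = sorted(norm_scores)
    below = bisect_left(ranked, system_score)
    ties = bisect_right(ranked, system_score) - below
    return (below + ties / 2) / len(ranked)

# Placeholder norms: scores on a hypothetical conservation battery.
norms = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
print(percentile_vs_norms(6, norms))
```

The open question in the text remains the hard part: not the arithmetic, but agreeing on which normative distributions to plug in.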

22 March 2026

Exploring AGI Using AI


I have long been interested in Artificial Intelligence (AI) and what is now called Artificial General Intelligence (AGI). How long? Check out this post. If I'd pursued my Master's in Computer Science, it would have been in AI. My bachelor's was in cognitive psychology.

First Use of LLM AI

As 2025 waned, we sold our house, which produced an influx of cash that needed to be invested, and an existing portfolio needed a major revamp. I'd not used the current AIs, specifically the Large Language Models (LLMs), although Grammarly counts as some use. I turned to LLMs to create the portfolio.

I wasn't naive enough to let them do the design and walk away. Instead, it became a collaborative effort among Claude, Gemini, ChatGPT, and me. Sharing the project status among the AIs surfaced information and insights that a single AI wouldn't have found.

The project evolved from a simple list of Exchange-Traded Funds (ETFs) into a portfolio with an operating manual. The manual provides clear guidance on when specific funds should be transferred to other funds based on market rates, market volatility, and the spread between private and public funding. Other guidance indicates when to reinvest dividends rather than draw on them for income. The guidelines have delivered good results when backtested against problematic markets since 2000.

This project ran from the end of December 2025 to mid-February 2026. Two week-long vacations did intervene.

Today is 22 March 2026, and the volatility tracking is getting a workout due to the Iranian military activity. Volatility brushed against the threshold for swapping funds but stopped short of triggering it.

Using AI for AGI

The experience was encouraging. I turned to my interest in AGI.

Previous experience showed the advantage of working with multiple AIs for ongoing project reviews. This project brought a new insight: the chats develop personalities. I'd noticed this in general on the last project, so I created a startup prompt that explains to a new chat how it should interact with me and how I work. The prompt is based on the AIs' own observations. One example is that I 'pivot' in discussions rather than following the linear, logical path that AIs assume; especially in the AGI discussions, I recall a previous thought, or a new one is triggered.

The first AGI attempt evolved to use three versions of Claude: the theoretician, the engineer, and the coder. These three personalities originated from the initial conversations with them. The coder less so, since it is Claude Code, which is specifically meant for C++ development. The theory discussion was the open question: how can we do this? The engineer chat started by creating an architectural system design based on a document from the theoretician. The interesting part about the engineer AI is that it would propose a solution to an issue that would work but didn't follow the underlying theory.

The AI often completes my general observation with details and implications. One of them generously suggested I would have completed the thought myself. In many cases, I didn't have the deeper knowledge the AI is trained on, so no, I wouldn't have completed it.

This work will continue despite the first experiment leading to an insurmountable challenge. That doesn’t discourage me. I subscribe to Edison’s position: “I have not failed. I've just found 10,000 ways that won't work.” Direct work on AGI may be interrupted by a challenge from DeepMind on how to measure progress toward AGI: "Measuring progress toward AGI: A cognitive framework." My exploration, in collaboration with an AI, provides insight into the problem and identifies areas for improvement in this measurement methodology.

05 July 2020

SRC2 - Explicit Steering - Wheel Speed

SRC2 Rover
This fourth post about the qualifying round of the NASA Space Robotics Challenge - Phase 2 (SRC2) addresses the speed of the wheels of the rover, shown to the right, under the various possible motions. The rover uses Explicit Four Wheel Steering which allows the orientation of the wheels to be independently changed. The second and third posts explored the geometry to determine the position of the wheels for a turn, pivoting in place, and crab / straight movement. See the first post for the basics of the competition. 

Wheel Orientation

The orientation of the wheels determines the speed each wheel must turn. In straight or crab movement the speed is the same for all wheels. When turning, as shown in the diagram below, the speeds differ for the inner and outer wheels. The requested overall speed of the rover, measured at the center of the rover, is used to calculate the inner and outer wheel speeds.



 Term        Description
 ICR         Instantaneous Center of Rotation
 Rr          Radius from ICR to center of rover
 Ri, Ro      Radii of the rover's inner (i) and outer (o) sides through the ICR
 Wb, Wt      Wheel base and wheel track of the rover. Lengths are representative of actual size.
 WRi, WRo    Radii of the inner (i) and outer (o) wheels
 δi, δo      Steering angles for the inner (i) and outer (o) wheels

Visualize on the diagram three concentric circles drawn from the ICR. One circle passes through the center of the rover while the others pass through the inner and outer corners, or wheels, of the rover. The rover radius, Rr, is the distance from the ICR to the center of the rover, so the inner and outer sides lie at Ri = Rr - Wt/2 and Ro = Rr + Wt/2. The second post calculated the wheels' turning radii from this geometry:

WRi = sqrt((Rr - Wt/2)² + (Wb/2)²)
WRo = sqrt((Rr + Wt/2)² + (Wb/2)²)

The speed (Sr) and turn radius (Rr) of the rover determine the time (Tr) to complete a full circle:

Tr = 2·π·Rr / Sr

The speed of either set of wheels follows from the circumference of its respective circle over that same time, which simplifies neatly:

Sw = 2·π·WR / Tr = Sr · WR / Rr
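The wheel-radius and wheel-speed relations can be sketched in code. This assumes the geometry from these posts: the ICR lies on the rover's lateral axis, the inner and outer sides sit at Rr minus and plus half the wheel track, and each wheel sits half the wheel base ahead of or behind that axis. Variable names and the sample dimensions are mine, not from the competition code.

```python
import math

# Sketch of the wheel-speed calculation: every wheel completes its circle
# about the ICR in the same time as the rover center, so each wheel's speed
# scales with its turning radius. Names and dimensions are illustrative.

def wheel_radii(Rr, Wb, Wt):
    """Turning radii of the inner and outer wheels about the ICR."""
    WRi = math.hypot(Rr - Wt / 2, Wb / 2)
    WRo = math.hypot(Rr + Wt / 2, Wb / 2)
    return WRi, WRo

def wheel_speeds(Sr, Rr, Wb, Wt):
    """Inner and outer wheel speeds for rover speed Sr at the rover center.
    All wheels complete a circle in the same time Tr = 2*pi*Rr / Sr, so each
    wheel's speed reduces to Sr * WR / Rr."""
    WRi, WRo = wheel_radii(Rr, Wb, Wt)
    return Sr * WRi / Rr, Sr * WRo / Rr

# Illustrative numbers only (not the actual SRC2 rover dimensions).
Si, So = wheel_speeds(Sr=1.0, Rr=4.0, Wb=2.0, Wt=2.0)
print(Si, So)  # inner wheels run slower than the rover center, outer faster
```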



Twist Calculation

The standard ROS movement command is the Twist message. It contains two 3-dimensional vectors. One specifies the linear movement in the x, y, and z dimensions. The other specifies rotation, also as x, y, and z, meaning roll, pitch, and yaw respectively.

The calculations for steering orientation and speed are all based on the radius of the turn. That turn radius needs to be calculated from the X velocity and the yaw rate in the message. Recall from post three that turning during a crab movement is not under consideration, so the linear Y value is ignored.

Getting to the calculation requires some interesting analysis, but the final calculation is extremely simple. The starting point is the yaw rate in radians per second. The first equation determines the time it would take to turn a full 2π radians at that rate, i.e. how long to traverse a full circle:

T = 2·π / yaw

Next, that time is used to determine the circumference of the circle using the X speed, and knowing the circumference, the radius is determined:

C = Sx · T
R = C / (2·π) = Sx / yaw

The individual steps combine and cancel down to a simple calculation: the turn radius is the linear X speed divided by the yaw rate. Everything should reduce to such simplicity. Note that the dimensional units (m/s divided by rad/s gives m) confirm the final units are valid.
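The derivation above can be sketched step by step, showing that the long way and the simplified form agree. The function name is mine; the field semantics follow the ROS Twist convention (linear x in m/s, angular z, i.e. yaw, in rad/s).

```python
import math

# Sketch of the twist-to-turn-radius reduction: time for a full circle,
# then circumference, then radius, which all cancels down to
# radius = linear x speed / yaw rate. Names are illustrative.

def turn_radius(linear_x, yaw_rate):
    """Turn radius implied by a Twist command, via the full derivation."""
    T = 2 * math.pi / yaw_rate        # seconds to turn a full circle
    C = linear_x * T                  # circumference traveled in that time
    return C / (2 * math.pi)          # radius; equals linear_x / yaw_rate

# The long way and the simplified form agree:
print(turn_radius(1.5, 0.5), 1.5 / 0.5)
```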

03 July 2020

SRC2 - Explicit Steering - Crab, Straight, and Pivot Movements


SRC2 Rover
This is the third in a series of posts about my involvement with the qualifying round of the NASA Space Robotics Challenge - Phase 2 (SRC2). The first post introduced the basics of the competition. One aspect of the challenge is that there is no controller for the rover depicted to the right. It uses Explicit Four Wheel Steering, which allows the orientation of each wheel to be changed independently. This provides multiple ways for the rover to move, e.g. straight, crab, turn, pivot.

The second post explored the geometry on positioning the wheels for a turn. This post will address pivoting in place and crab movement, i.e. moving sideways. It also addresses the trivial crab case of moving straight forward or back. 

29 June 2020

SRC2 - Explicit Steering - Wheel Orientation for Turning

SRC2 Rover

The first post in this series explained that I'm currently involved with the qualifying round of the NASA Space Robotics Challenge - Phase 2 (SRC2). The competition requires controlling the rover to the right. It uses Explicit Four Wheel Steering, which allows the orientation of each wheel to be changed independently. This provides multiple ways for the rover to move, e.g. straight, crab, turn, pivot.

The challenge is that there is no controller for the rover in the Robot Operating System (ROS), because the rover wheels are controlled by effort, not the typical speed control.
This article addresses the geometry of controlling the rover when it is turning. The diagram below illustrates the rover making a counter-clockwise turn around the Instantaneous Center of Rotation (ICR). The arrows represent the wheel orientations. The dotted box is drawn proportional to the wheel base and track of the rover. Note the orientation of the X/Y axes, which is the ROS standard for robots.
Explicit Steering - Rover Turning


 Term        Description
 ICR         Instantaneous Center of Rotation
 Rr          Radius from ICR to center of rover
 Ri, Ro      Radii of the rover's inner (i) and outer (o) sides through the ICR
 Wb, Wt      Wheel base and wheel track of the rover. Lengths are representative of actual size.
 WRi, WRo    Radii of the inner (i) and outer (o) wheels
 δi, δo      Steering angles for the inner (i) and outer (o) wheels


NASA Space Robotics Challenge - Phase 2


Another NASA Centennial Challenge began earlier this year. It will be the third Centennial Challenge I've entered; counting the 2019 ARIAC competition, that makes four competitions overall. The current competition is the Space Robotics Challenge - Phase 2 (SRC2). In this competition, robotic mining rovers explore the Moon to detect volatiles and collect them, detect a low-orbiting CubeSat, and align one rover with a fiducial on a processing station.

Links to Follow-on Posts


In the following posts I'll explain what I can about this topic. There are a number of research papers available; however, my posts will provide a simplified explanation.

Explicit Steering

The biggest challenge in this competition is controlling the rover. It is the size of a small SUV with Explicit Four Wheel Steering, i.e. each of the four wheels is steered separately. If you've seen the Mars rovers, it's a similar design. That's the base rover, above.

Explicit steering allows flexibility in the movement of the rover. It can turn all four wheels to the same angle to move sideways in a crab movement; a zero angle gives straight forward movement. By orienting the wheels at different angles, the rover can turn. An extreme example of this is pivoting in place.

Robot Operating System

The base software for the competition is the Robot Operating System (ROS), which consists of the fundamentals for communicating among software nodes plus a large number of packages that provide useful capabilities. Unfortunately, there isn't one for controlling the SRC2 rover.

There are two ways of controlling the wheels for locomotion. The predominant one is issuing a speed command; a number of ROS controller packages provide this capability. The other specifies the effort, or torque. There are effort controllers, but none directly apply to the SRC2 rover.

The location of the rover on the surface is required for reporting the position of volatiles, moving to them, and generally controlling the rover's movement. This requires deriving odometry from wheel movement, vision processing via stereo cameras, and an inertial measurement unit (IMU). Again, this is not provided. It is especially challenging due to the low friction between the wheels and the simulated Moon surface, which allows the wheels to slip.

28 January 2010

Thinking Aloud - Long Ago Neural AI

I got thinking about way back when, in 1972-74, in undergrad school. I was doing some work in AI, albeit within the psych department. This was before the heyday of neural networks, although there was some activity in the area. I ran across the book Intelligence: Its Organization and Development by Michael Cunningham. He proposed a rigorous, testable way in which intelligence organizes in the infant. I guess it didn't work out, since it didn't make the front page of the New York Times as a major breakthrough sometime in the intervening decades.

Interestingly, a web search turns up little information beyond citations. None of the titles in the citations indicate a successful implementation or breakthrough based on the work.

I still have a paper I wrote about the book and a description of a FORTRAN implementation that never got finished.

One of the challenges back then, and remains so somewhat today, is that testing ideas like this requires a simulation environment that can be as complex to produce as the actual ideas you want to test. But I realized that today I do have a physical device, my Create robot, that could be used for testing.

I'm not going to lay out all the details of Cunningham's proposal, since he took a book to develop and describe the idea. I won't even list the roughly two dozen specific assumptions in the model. What I am going to do is walk through some thoughts on how a project might proceed, to see if it is worth pursuing.

You start with input and output elements - sensors and actuators in today's robot parlance. There are some reflex connections between these elements. For example, a pain reaction reflex so if an infant's hand touches something hot it jerks away. Or if the side of the mouth touches something the head turns in that direction in an attempt to suckle.

Jumping over the start-up process (which is always a pain), let's assume the robot is moving forward and hits a wall. The bumper switch closes, but there is no reflex to shut down the motors. The motors keep turning and you get an overload reading. There is a reflex for this, and it stops the motors. Now the motors are stopped and the bump switch is still triggered.

There would be a number of elements. Each sensor input on the Create could have an input element. Each actuator would have an output element. As indicated, the over-current input element could be connected to an output element that stops the motors. Note: a point to consider is that there might be output elements that don't directly connect to actuators but instead inhibit them. Continuing the thought, there might need to be backup, stop, and forward elements for the motors.

In the situation described, these elements would have high levels of activity. Other elements, like a push button, would have no activity. The Cunningham model proposes that the elements with high activity are connected through a new memory element: the inputs to the input side of the memory and the outputs to the output side. What might happen is that a connection is created between the bump switch, the over-current, and the motor-stop elements through the new memory element. In the future, a bump switch closure would stop the motor.

I now recall one result from my work with the FORTRAN implementation: the need for multiple elements to represent the state of input and output elements. My note above reflects this. For example, the bump switch needs two elements, open and closed. The motor needs forward, reverse, and stopped. It may need even more elements indicating speed, although I would first try relating the element activity level to the speed.

The activity level of an element decays if it is not triggered. So the bump switch closing triggers activity that decays over time. The motor activity decreases until the motor stops. An issue would be keeping the bump-switch-closed activity going long enough for the over-current activity to shut down the motor and get the new memory element built. Note: maybe an input triggers again after a period of time?
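The mechanism walked through above can be sketched in code, as I read it: elements carry activity levels that decay over time, and a new memory element is formed linking whichever elements are simultaneously highly active. This is speculative; all names, thresholds, and decay rates are my own guesses, not values from Cunningham's actual model.

```python
# Speculative sketch of the element/activity/memory mechanism described in
# the text. Decay rate, activity threshold, and element names are my own
# illustrative choices, not Cunningham's model parameters.

DECAY = 0.8          # per-step retention of activity
HIGH = 0.5           # activity level treated as "highly active"

class Element:
    def __init__(self, name):
        self.name = name
        self.activity = 0.0
    def trigger(self):
        self.activity = 1.0
    def step(self):
        self.activity *= DECAY

def form_memory(elements):
    """Return a new memory link over all currently high-activity elements,
    or None if fewer than two are active enough to associate."""
    active = [e.name for e in elements if e.activity >= HIGH]
    return tuple(active) if len(active) >= 2 else None

bump = Element("bump-closed")
overcurrent = Element("over-current")
stop = Element("motor-stop")
button = Element("push-button")       # stays inactive throughout
elements = [bump, overcurrent, stop, button]

bump.trigger()                        # rover hits the wall
for e in elements:
    e.step()                          # activity decays one time step
overcurrent.trigger()                 # stalled motor draws excess current
stop.trigger()                        # reflex: over-current stops the motor
print(form_memory(elements))
```

With these numbers the bump-closed activity survives one decay step, so the new memory element links bump, over-current, and motor-stop, while the inactive push button is left out; the timing issue in the text shows up directly as the race between DECAY and the reflex firing.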

How do we get the bump switch open? The only way is by getting the motor to reverse. Infants in a situation like this flail. They randomly move. Sometimes they do this happily while cooing, and sometimes angrily while crying. It appears to be a natural reaction to try something, anything, to make things different. (A really ugly phenomenon in an adult, but you still see it, if not physically then at least mentally. Ever had a boss whose reaction was, "Don't just stand there! Do SOMETHING.") I don't recall the model addressing this situation. (I did find used copies of the book and have one ordered so I can refresh my thinking.)

Somehow some general level of activity has to increase, which can generate activity at outputs. Sometimes this would be through inputs. For an infant these could be sound, pressure on skin, internal senses, and vision. I dislike simply generating random activity levels to cause something to happen. Maybe the general inputs of the Create (power levels, current readings, etc.) are sufficient to generate activity.

Clearly, a dropping charge level in the battery could be tied to a "hunger" reaction which sends the robot searching for its charger. That brings in using the IR sensor to control the drive for the docking station. That probably requires external guidance to train the IR / motor control coordination to execute the docking maneuver. That opens up an entirely different set of thoughts.

Which is enough for today... No conclusion on trying to implement this. But no conclusion not to do so, either.
