It’s easy to infer that AI models will replace your job soon when you see stories like these:
- AI achieved gold-medal level at the International Mathematical Olympiad and the International Collegiate Programming Contest World Finals
- AI scored in the 90th percentile on the bar exam and passed the U.S. Medical Licensing Exam
- AI beat the world chess champion in 1997 and the world’s Go champion in 2017
- AI predicts the form and function of proteins from their amino acid sequences, work that earned its researchers a Nobel Prize
- AI (ChatGPT) amazed people with its “advising” and “assisting,” reaching 100 million users in its first two months
As we have learned, these proxies for human intelligence don’t translate to AI doing things in real life, or what Gen Z calls “irl.”
- Roughly 95% of businesses that invested a combined $40 billion in AI failed to make money, according to an MIT study
- A randomized controlled trial (RCT) found that when developers use AI tools, they take 19% longer than they do without them
- Carnegie Mellon researchers found the best AI agents fail about 70% of the time on real-world corporate tasks.
- A McKinsey survey found that only about 10% of respondents report scaling AI agents beyond pilots
- Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027
Steven Pinker defines intelligence as “the pursuit of goals in the face of obstacles,”[1] which requires doing. Psychologists define intelligence as learning from experience, adapting to new situations, handling abstract concepts, and manipulating the environment.[2] AI struggles with these intelligent behaviors of “doing” in the real world.
While AI struggles with “doing,” it has had great success with “advising” and “assisting” with a human in the loop (including explicitly defined next-action scripts). OpenAI reports that approximately 30% of people use AI chatbots for advising and assisting at work and 70% for non-work uses. Physicians find that ambient AI assists them with drafting medical notes, saving them thirty minutes per day.
The cognitive dissonance of AI lies in its strong advising and assisting performance while it struggles with hallucinations and doing. The inferential leap that AI’s success will translate into AI doing (with massive job layoffs) may be clouded by these human intelligence proxies. Human proxies assume you can achieve goals in the face of obstacles (Pinker), learn from experience, handle abstraction, and manipulate the environment (psychologists). AI researchers have recognized the need for new proxies; OpenAI, for example, released GDPval, which evaluates “doing” across 44 occupations and 1,320 specialized tasks.
A recent AI paper from Stanford and Harvard explains why most ‘Agentic AI’ systems are impressive in demos and then completely fall apart in real use. Here are some of the “doing” areas that researchers are addressing:
On-the-job training – the ability to learn a unique environment, workflow, people, tools, and goals, and to improve over time. The industry calls this recursive self-improvement. Yann LeCun cites a teenager learning to drive in 14 hours while AI-powered autonomous vehicles still struggle. Waymo provided 14 million rides without a driver in 2025, though it lost $1.23 billion on $450 million of revenue. Waymo still requires fleet response agents who view real-time feeds from the vehicles’ exterior cameras. Tesla’s robotaxi has been perpetually one year away since Elon Musk’s 2019 announcement.
Generalization – AI agents are very good at recognizing and reproducing patterns they’ve seen before, but they often fail when a situation looks new even though it is conceptually similar. This makes it difficult for AI agents to make predictions in novel situations or when significant variations exist. Geoffrey Hinton has described the human brain as an analogy machine that helps us decide what to do based on analogies to the past. A toddler needs one taste of a disgusting food to generalize it to new situations. AI’s lack of generalization makes it difficult to interpret causal relationships unless someone stated them on Reddit. AI’s understanding is at the surface level, through text or pixel tokens, not at the conceptual level like humans’.
Tool Use – Agents can call tools (APIs, databases) and use browsers, but they struggle to decide when, why, and how to use tools reliably. AI models are trained via supervised examples rather than the experiential trial and error humans rely on. Small errors on early steps using tools can compound and confuse downstream AI reasoning. AI agents call the same failing tool instead of diagnosing the issue, misinterpret outputs, or assume the tool is always correct. When AI agents use tools, they are susceptible to adversarial attacks, just as humans are to social engineering and phishing.
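To make these failure modes concrete, here is a minimal Python sketch of an agent-side tool call, assuming a hypothetical `lookup_order` tool; the names and the retry cap are illustrative, not any particular framework’s API. The design choice is that failures are capped and returned as structured observations the model can reason over, rather than silently retried.

```python
import json

# Hypothetical tool registry; the tool name and behavior are illustrative only.
def lookup_order(order_id: str) -> dict:
    raise TimeoutError("order database unreachable")  # simulate a failing tool

TOOLS = {"lookup_order": lookup_order}
MAX_ATTEMPTS = 3  # cap retries so the agent cannot loop on a broken tool forever

def call_tool(name: str, args: dict) -> dict:
    """Run one tool call and return a structured result for the model to read."""
    last_error = "not attempted"
    for _ in range(MAX_ATTEMPTS):
        try:
            return {"tool": name, "ok": True, "result": TOOLS[name](**args)}
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    # Surface the failure explicitly so it cannot silently compound downstream.
    return {"tool": name, "ok": False, "error": last_error, "attempts": MAX_ATTEMPTS}

observation = call_tool("lookup_order", {"order_id": "A-123"})
print(json.dumps(observation, indent=2))
# The observation goes back into the agent's next prompt, so the model can
# diagnose or escalate the failure instead of assuming the tool succeeded.
```

Even with scaffolding like this, the hard part is the last step: getting the model to actually change strategy once it reads that error.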
Memory – AI lacks durable, reliable memory across interactions, sessions, and episodes. The memory embedded in pretraining and fine-tuning is expensive to update. Large Language Models (LLMs) supplement this with user prompts, Retrieval-Augmented Generation (RAG) techniques, and context windows that can process one million tokens (10 to 15 books). An AI agent doesn’t know what it should remember or how to prioritize it without explicit instructions. This leads to catastrophic forgetting of user preferences, trained knowledge, or past decisions; relearning the same facts repeatedly; and storing information but failing to retrieve it when relevant.
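As a rough sketch of how RAG stands in for durable memory, the snippet below pulls the most relevant stored notes back into the prompt each session; the notes, the word-overlap scoring, and the `retrieve` helper are toy stand-ins for a real vector store and embedding model.

```python
# Toy "memory store" of notes accumulated across sessions (illustrative only).
MEMORY = [
    "User prefers metric units in all reports.",
    "The quarterly report is due the first Monday of each quarter.",
    "The user's manager is copied on all budget summaries.",
]

def score(query: str, note: str) -> int:
    """Toy relevance score: count of shared lowercase words (stands in for embeddings)."""
    return len(set(query.lower().split()) & set(note.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Pull the k most relevant notes back into the prompt at the start of a session."""
    return sorted(MEMORY, key=lambda note: score(query, note), reverse=True)[:k]

query = "Draft the quarterly budget report"
prompt = "Relevant notes:\n" + "\n".join(retrieve(query)) + f"\n\nTask: {query}"
print(prompt)
# Nothing here decides what was worth remembering in the first place or when a
# note has gone stale; that prioritization is exactly what the agent still lacks.
```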
Context – AI agents have limited ability to track, prioritize, and reinterpret context over time. Context windows are finite, older information gets compressed, and models struggle to distinguish between “important” and “incidental” details. Humans build mental models of the world specific to the goal and environment, including the objects, people, places, abstractions, and analogies involved. This enables humans to mentally test strategies, beliefs, causal effects, and potential futures, and to update them as they learn. Many researchers[3] are focused on developing world models to address these struggles of LLMs, which are essentially next-token (word, pixel) predictors.
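As a naive illustration of the finite-window problem, the sketch below keeps only the most recent turns; the conversation and the three-turn limit are made up for the example.

```python
# Illustrative only: a "keep the most recent turns" context policy.
CONTEXT_LIMIT = 3  # pretend the model can only hold three turns

conversation = [
    "IMPORTANT: never quote prices without legal review.",  # early but load-bearing
    "Customer asked about delivery dates.",
    "Customer asked about the return policy.",
    "Customer asked for a price quote.",
]

def truncate_recent(turns: list[str], limit: int) -> list[str]:
    """Keep the newest turns; nothing here ranks them by importance."""
    return turns[-limit:]

print(truncate_recent(conversation, CONTEXT_LIMIT))
# The price-quote request survives, but the rule that constrains the answer is
# gone. Deciding which details stay relevant over time is the hard part.
```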
Long horizon planning – AI agents have difficulty planning and executing goals that require many steps over time. AI training optimizes for next-token prediction, not multi-step success. AI agents lack a persistent notion of goals, subgoals, and progress. Errors early in a plan propagate, cascade, and derail later steps. AI developers address this with reinforcement learning during fine-tuning; however, unless objectives can be clearly measured (does the software code work, did you win the game, did you pass the test) and rewarded, the model doesn’t learn. AI agents are strong at planning on paper, weak at planning in action.
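To illustrate the “clearly measured and rewarded” point, here is a minimal sketch of a reward signal for a coding task, assuming a project with a pytest test suite; the `reward_for_code_task` helper is hypothetical and not any vendor’s training pipeline.

```python
import subprocess

def reward_for_code_task(repo_path: str) -> float:
    """Binary reward for RL fine-tuning: 1.0 if the project's tests pass, else 0.0."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],  # assumes the repo ships a pytest suite
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    return 1.0 if result.returncode == 0 else 0.0

# In an RL fine-tuning loop, this single score is assigned to the entire
# multi-step trajectory that produced the code. Goals like "write a persuasive
# memo" offer no such computable check, which is why long-horizon behavior for
# them is so much harder to train.
```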
Ben Dickson provides an informative look at how AI researchers plan to address these issues in 2026. AI “doing” progress is something we all need to watch so we get advance notice of when AI might replace our jobs.
[1] Steven Pinker, How the Mind Works (W. W. Norton, 1997), 372.
[2] Jahangir Moini, Anthony LoGolbo, and Raheleh Ahangari, “Understanding Physiological Psychology,” in Foundations of the Mind, Brain, and Behavioral Relationships (Elsevier, 2024), 211–28, https://doi.org/10.1016/B978-0-323-95975-9.00002-0.
[3] Yann LeCun, Fei-Fei Li, Mira Murati.