
MIT Researchers Identified AI Challenges in Healthcare

In a paper presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, MIT researchers identified AI challenges that healthcare systems must overcome.

The best-performing AI models on chest X-rays and cancer histopathology images at one hospital were the worst-performing on up to 75 percent of patients at a second hospital. These failures occur when models are applied to data other than what they were trained on.

“We demonstrate that even when you train models on large amounts of data, and choose the best average model, in a new setting this ‘best model’ could be the worst model for 6-75 percent of the new data,” says Associate Professor Marzyeh Ghassemi.

 

The challenge will be how to take AI models that work at the Mayo Clinic or MD Anderson and make them work across the more than one thousand other healthcare systems, including safety-net hospitals.


The AI Struggle With “Doing”

It’s easy to infer that AI models will replace your job soon when you see the stories.

As we have learned, though, these proxies for human intelligence don’t translate into AI doing things in real life, or “irl,” as Gen Z would say:

  • Roughly 95% of businesses that invested a combined $40 billion in AI failed to make money, according to an MIT study.
  • A randomized controlled trial (RCT) found that when developers use AI tools, they take 19% longer than without them.
  • Carnegie Mellon researchers found the best AI agents fail about 70% of the time on real-world corporate tasks.
  • A McKinsey survey found that only about ten percent of respondents report scaling AI agents beyond pilots.
  • Gartner predicts over 40% of Agentic AI projects will be canceled by the end of 2027.

Steven Pinker defines intelligence as “the pursuit of goals in the face of obstacles,”[1] which requires doing. Psychologists define intelligence as learning from experience, adapting to new situations, handling abstract concepts, and manipulating the environment[2]. AI struggles with these intelligent behaviors of “doing” in the real world.

While AI struggles with “doing,” it has had great success with “advising” and “assisting” with a human-in-the-loop (including explicitly-defined-next-action scripts). OpenAI reports approximately 30% of AI chatbot use is for advising and assisting at work and 70% for non-work. Physicians find ambient AI assists them with drafting medical notes, saving them thirty minutes per day.

The cognitive dissonance of AI is its strong advising and assisting performance alongside its struggles with hallucinations and with doing. The inferred leap that this success will translate into AI doing (with massive job layoffs) may be clouded by these human intelligence proxies. Human proxies assume you can achieve goals in the face of obstacles (Pinker), learn from experience, handle abstraction, and manipulate the environment (psychologists). AI researchers have recognized the need for new proxies that evaluate “doing,” including OpenAI’s release of GDPval, which covers 44 occupations and 1,320 specialized tasks.

A recent AI paper from Stanford and Harvard explains why most ‘Agentic AI’ systems are impressive in demos and then completely fall apart in real use. Here are some of the “doing” areas that researchers are addressing:

On-the-job training – the ability to learn a unique environment, workflow, people, tools, and goals, and to improve over time. The industry calls this recursive self-improvement. Yann LeCun contrasts a teenager learning to drive in 14 hours with AI-powered autonomous vehicles that still struggle. Waymo provided 14 million rides without a driver in 2025, though it lost $1.23 billion on $450 million of revenue, and it still requires fleet-response agents who view real-time feeds from the vehicles’ exterior cameras. Tesla’s robotaxi has been perpetually one year away since Elon Musk’s 2019 announcement.

Generalizations – AI agents are very good at recognizing and reproducing patterns they’ve seen before, but they often fail when a situation looks new even though it is conceptually similar. This makes it difficult for AI agents to make predictions in novel situations or when significant variations exist. Geoffrey Hinton has described the human brain as an analogy machine that helps us decide what to do based on analogies to the past. A toddler needs one taste of a disgusting food to generalize it to new situations. AI’s lack of generalization makes it difficult to interpret causal relationships unless someone stated them on Reddit. AI’s understanding is at the surface level, through text or pixel tokens, not at the conceptual level like humans’.

Tool Use – Agents can call tools (APIs, databases) and use browsers, but they struggle to decide when, why, and how to use tools reliably. AI models are trained via supervised examples rather than experiential trial-and-error like humans. Small errors in early tool-using steps can compound and confuse downstream AI reasoning. AI agents call the same failing tool instead of diagnosing the issue, misinterpret outputs, or assume the tool is always correct. When AI agents use tools, they are susceptible to adversarial attacks, just as humans are to social engineering and phishing.
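To make that failure mode concrete, here is a minimal sketch in Python. The `search_flights` tool and both agent loops are hypothetical stand-ins; the point is the difference between blindly retrying a failing tool and reading the error to repair the call.

```python
# Minimal sketch of the tool-use failure mode described above.
# `search_flights` is a hypothetical tool; the control flow is the point.

class ToolError(Exception):
    pass

def search_flights(origin: str, dest: str) -> list[str]:
    # Hypothetical tool: fails when given city names instead of airport codes.
    if len(origin) != 3 or len(dest) != 3:
        raise ToolError("expected 3-letter IATA codes")
    return [f"{origin}->{dest} 09:15", f"{origin}->{dest} 17:40"]

def naive_agent(origin: str, dest: str) -> list[str]:
    # Failure mode: call the same failing tool again instead of diagnosing it.
    for _ in range(3):
        try:
            return search_flights(origin, dest)
        except ToolError:
            continue  # same inputs, same failure, three times in a row
    return []

def diagnosing_agent(origin: str, dest: str) -> list[str]:
    # Safer pattern: read the error, repair the inputs, then retry once.
    codes = {"boston": "BOS", "denver": "DEN"}
    try:
        return search_flights(origin, dest)
    except ToolError:
        return search_flights(codes.get(origin.lower(), origin),
                              codes.get(dest.lower(), dest))

print(naive_agent("Boston", "Denver"))       # [] after three identical failures
print(diagnosing_agent("Boston", "Denver"))  # two flight options
```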

Memory – AI lacks durable, reliable memory across interactions, sessions, and episodes. The memory embedded in pretraining and fine-tuning is expensive to update. Large Language Models (LLMs) supplement it with user prompts, Retrieval-Augmented Generation (RAG) techniques, and context windows that can process one million tokens (10 to 15 books). Without explicit instructions, the AI agent doesn’t know what it should remember or how to prioritize it. This leads to catastrophic forgetting of user preferences, trained knowledge, or past decisions; relearning the same facts repeatedly; and storing information but failing to retrieve it when relevant.
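A rough sketch of what retrieval-style memory looks like, assuming a plain keyword-overlap score as a stand-in for the embedding similarity a real RAG pipeline would compute (the notes and query are invented for illustration):

```python
# Minimal sketch of retrieval-style memory. Keyword overlap stands in for
# embedding similarity; the notes and query below are hypothetical.

MEMORY: list[str] = []

def remember(note: str) -> None:
    MEMORY.append(note)

def recall(query: str, k: int = 2) -> list[str]:
    # Score each stored note by word overlap with the query and return the top k.
    q = set(query.lower().split())
    ranked = sorted(MEMORY, key=lambda n: len(q & set(n.lower().split())), reverse=True)
    return ranked[:k]

remember("User prefers window seats on morning flights")
remember("User is allergic to peanuts")
remember("User asked about the weather in 2023")

# The stale weather note wins on shallow word overlap ("user", "the") while the
# allergy note is dropped: the agent stored the fact but fails to retrieve it.
print(recall("what should the user order for the in-flight meal"))
```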

Context – AI agents have limited ability to track, prioritize, and reinterpret context over time. Context windows are finite, older information gets compressed, and models struggle to distinguish between “important” and “incidental” details. Humans imagine mental models of the world specific to the goal and environment, including the objects, people, places, abstractions, and analogies involved. This enables humans to mentally test strategies, beliefs, causal effects, and potential futures, and to update them as they learn. Many researchers[3] are focused on developing world models to address these struggles of LLMs, which are essentially next-token (word, pixel) predictors.
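A minimal sketch of the finite-context problem, assuming a toy budget of 40 “tokens” counted as words. Real windows are vastly larger, but the failure mode is the same: the window keeps the most recent turns and silently drops an important early detail while incidental chatter survives.

```python
# Minimal sketch of a finite context window, with a hypothetical 40-word budget.
CONTEXT_BUDGET = 40

def build_context(turns: list[str]) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):              # keep the newest turns first
        cost = len(turn.split())              # crude token count: words
        if used + cost > CONTEXT_BUDGET:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

conversation = [
    "IMPORTANT: the client is in Germany, so all prices must be quoted in euros",
    "Here is some small talk about the weather and weekend plans " * 3,
    "Also the office coffee machine is broken again, very annoying for everyone",
    "Please draft the final pricing email now",
]

# The euro requirement falls outside the budget; the model never sees it,
# while the coffee-machine chatter makes the cut.
print(build_context(conversation))
```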

Long horizon planning – AI agents have difficulty planning and executing goals that require many steps over time. AI training optimizes for next-token prediction, not multi-step success, and AI agents lack a persistent notion of goals, subgoals, and progress. Errors early in a plan propagate, cascade, and derail later steps. AI developers address this with reinforcement learning during fine-tuning; however, unless objectives can be clearly measured (does the software code work, did you win the game, did you pass the test) and rewarded, the model doesn’t learn. AI agents are strong at planning on paper, weak at planning in action.
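Here is a minimal sketch of what a measurable, checkable plan looks like. The subgoals and checks are hypothetical; the point is that when each step has a verifiable objective, a bad early step is caught where it happens instead of silently derailing everything downstream.

```python
# Minimal sketch of long-horizon execution with a verifiable check per subgoal.
# The steps, checks, and state fields are hypothetical.
from typing import Callable

Step = tuple[str, Callable[[dict], None], Callable[[dict], bool]]

def run_plan(steps: list[Step], state: dict) -> dict:
    for name, act, check in steps:
        act(state)
        if not check(state):
            # Without a measurable check, this error would propagate silently.
            raise RuntimeError(f"subgoal failed: {name}")
        state["progress"] = name              # persistent notion of progress
    return state

steps: list[Step] = [
    ("gather data",  lambda s: s.update(rows=120),           lambda s: s["rows"] > 0),
    ("clean data",   lambda s: s.update(rows=s["rows"] - 5),  lambda s: s["rows"] > 100),
    ("write report", lambda s: s.update(report="draft.md"),   lambda s: "report" in s),
]

print(run_plan(steps, {}))  # ends with progress == "write report"
```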

Ben Dickson provides an informative look at how AI researchers plan to address these issues in 2026. AI “doing” progress is something we all need to watch to get advance notice of when AI will replace our jobs.

[1] Steven Pinker, How the Mind Works (W. W. Norton, 1997), 372.

[2] Jahangir Moini, Anthony LoGolbo, and Raheleh Ahangari, “Understanding Physiological Psychology,” in Foundations of the Mind, Brain, and Behavioral Relationships (Elsevier, 2024), 211–28, https://doi.org/10.1016/B978-0-323-95975-9.00002-0.

[3] Yann LeCun, Fei-Fei Li, Mira Murati.


Can You Live Without AI?

A dad, husband, author, and journalist living in New York decided to find out. For 48 hours, A.J. Jacobs would avoid all interactions with A.I. and machine learning.

He woke, picked up his phone, and entered his iPhone passcode like it was 2017. He quickly learned his iPhone would be useless. No AI-curated news, social feeds, or attention-maximizing targeted ads. No email passed through the spam filter, no podcasts cleaned up with AI. He put his iPhone in the drawer.

His wife Julie turned on the lights and A.J. quickly flicked them off. “Are you kidding me?” she asked.

Con Edison uses AI to monitor four million meters to manage the grid. He thought about using rainwater to brush his teeth after realizing New York’s water system uses machine learning to monitor 1,600 sensors. He had to walk or bike to avoid traffic-flow monitoring, Ubers, and the subway. He couldn’t use weather apps, Zoom (which leverages AI for noise suppression), credit card transactions, food services, retail, or television streaming. He ended up watching Brewster McCloud on a twenty-year-old DVD player.

AI and machine learning (ML) models are used to predict, detect, and recognize patterns. Their outputs feed explicitly programmed software scripts that essentially run our world. While people are most familiar with AI through chatbots like ChatGPT, introduced by OpenAI in November 2022, hidden AI has been running things in the background since the 1990s, when it began detecting credit card fraud, managing retail inventory, and sorting ZIP codes for the U.S. Mail.

The fraud detection ML model has milliseconds after a credit card is swiped to either flag the transaction for review (and anger the customer) or clear it (and risk bank losses). These ML models have improved consistently over the decades to find the right balance of customer friction and costly losses. Kroger would make five million ML sales predictions per day, one for each item in each store, to power its supply chain. There’s not enough room in the back for extra Cheerios boxes, nor tolerance for unhappy Cheerios lovers.
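A stripped-down sketch of that trade-off, with hypothetical numbers: the model supplies a fraud score within the latency budget, and a human-chosen threshold decides between angering the customer and risking the loss.

```python
# Minimal sketch of the fraud-decision threshold; the scores and threshold
# below are hypothetical.

def decide(fraud_score: float, threshold: float = 0.92) -> str:
    # Higher threshold: fewer angry customers, more fraud slips through.
    # Lower threshold: fewer losses, more legitimate swipes blocked.
    return "flag_for_review" if fraud_score >= threshold else "clear"

for score in (0.10, 0.85, 0.97):
    print(score, decide(score))  # only the 0.97 swipe gets flagged
```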

Jacobs wrote in his New York Times story, “What I didn’t expect was that my attempt to avoid all interactions with A.I. and machine learning would affect nearly every part of my life — what I ate, what I wore, how I got around.”

Most of us can go forty-eight hours without chatbots; few can go without the hidden AI and ML models that power human society. Jacobs noted that even a goat herder in the mountains checks weather apps.

Photo Source: The New York Times


What’s Up With AI?

As if life wasn’t complex enough, we now must make sense of AI. It hasn’t been easy, though three words may help.

An MIT study found 95% of investments in Gen AI have produced zero returns, while 90% of workers surveyed reported daily use of personal Gen AI tools like ChatGPT or Claude for job tasks.

A study from software company Atlassian found daily usage of AI among individual workers has doubled over the past year, while 96% of businesses “have not seen dramatic improvements in organizational efficiency, innovation, or work quality.”

A survey of 3,700 business executives found 87% said AI will “completely transform roles and responsibilities” within their organizations over the next twelve months, while 29% said their workforces are equipped with the skills and training necessary to leverage the technology.

A Harvard economist found 92% of U.S. GDP growth in the first half of 2025 came from AI investments, yet a Center for Economic Studies (CES) paper found a 1.3% drop in productivity after firms implemented AI, though the authors expect productivity gains later.

It seems clear AI “is” and “will be” transformational, though it is hard to distinguish what “is” versus “will be” or whether we are in an AI bubble. OpenAI CEO Sam Altman, Amazon founder Jeff Bezos, and 54% of fund managers recently indicated that AI stocks were in bubble territory.

Railroads, electricity, and the internet were transformational innovations that created bubbles, went bust, and then faded into the background of normal life. When the internet moved past boom and bust, new business moats emerged: Google (Search, Android, Chrome, Cloud), Meta (social media), TikTok, Amazon (eCommerce, Cloud), Microsoft (Windows, Office, Cloud), and Apple (macOS and iOS). The announced massive AI investments, over one trillion dollars by OpenAI alone, indicate investors expect AI to be more than companions, coders, and search tools: new moats once AI fades into the background.

To help make sense of AI, we may think in terms of advising, assisting, and doing. We must also be clear about what AI “is” versus what it “will be.” Today, AI “is” mostly “advising,” with some exciting new “assisting.” The hype is mostly about what AI “will be,” which is “doing.”

1. Advising – most AI use cases are advising. The model takes inputs and creates inferences such as predicting email spam, loan worthiness, what to wear to a party, or which content (a TikTok video) will maximize your engagement. The AI inferences feed deterministically programmed actions: “if this, then do that” (see the sketch after this list). Advising helps us figure out how to do things and answers our questions. The human-in-the-loop decides what to do next, or the inference result powers explicit programming such as maximizing user engagement. AI is not replacing humans here, though it should help us become more efficient and effective. It is hard to measure the productivity of advising, though if the resulting strategies require fewer actions (efficiency) and produce better outcomes (effectiveness), it must be more productive.

2. Assisting – this is essentially a tool that does stuff in the digital world, like tools (a shovel, a washing machine) in the physical world. It is often called Gen AI. It creates videos and software code, summarizes content, drafts letters, or does homework assignments based on user prompts. It makes us more efficient and effective at creating digital content within individual tasks. It requires a human-in-the-loop to judge the content created and avoid adverse outcomes, like the chatbot that accepted $1 for a new Chevy Tahoe with an MSRP of $58,195. While the terms “AI agents” and “Agentic AI” are used to describe AI that extracts data from documents, engages customers, and curates and summarizes content, the next actions are determined by the human-in-the-loop or are predetermined and executed with explicit software logic (like Siri or Alexa). It’s logical to assume that creating digital content, like using tools in the physical world, will help us become more efficient and effective.

3. Doing – this is goal achievement without a human-in-the-loop. “Doing” is typically a highly efficient, tightly synchronized flywheel of a few to millions of “inferences” and “actions” where the actions are not predetermined. Humans have a tight integration of neurons, synapses, and well-tuned perceptual, motor, learning, memory, and executive neurocognitive functions. This enables flexibility in novel environments based on mentally imagined models of the world. “Doing” is an autonomous vehicle without a human driver; as we have learned, addressing the last 1% to 5% of autonomous-driving edge cases may take a decade or longer. “Doing” is the human immune system, which makes inferences based on inputs and uses its agency to make decisions and take actions to destroy pathogens. A thermostat that automatically turns on the heat is not “doing”; rather, it is “advising,” because a human explicitly programmed the next actions based on inferences. Doing is difficult for AI because, according to Turing Award winner Yann LeCun, it lacks an understanding of the physical world, persistent memory, the ability to remember and retrieve things, the ability to reason, and the ability to plan. While there is no doubt these AI challenges will be addressed, AI is not doing much today.
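The sketch below, with hypothetical scores and rules, illustrates the pattern behind both the spam example under “advising” and the thermostat under “doing”: a model supplies an inference, and a human-written “if this, then do that” rule supplies the action.

```python
# Minimal sketch: the model only advises; humans wrote the rules that act.

def spam_score(email: str) -> float:
    # Stand-in for a trained classifier that outputs a probability.
    return 0.97 if "winner" in email.lower() else 0.02

def route_email(email: str) -> str:
    # A human-written rule consumes the inference.
    return "spam_folder" if spam_score(email) >= 0.9 else "inbox"

def thermostat(predicted_temp_c: float, setpoint_c: float = 20.0) -> str:
    # The inference (a temperature reading or prediction) feeds a fixed rule;
    # the system never chooses its own next action, so it is not "doing".
    return "heat_on" if predicted_temp_c < setpoint_c else "heat_off"

print(route_email("Congratulations, winner! Claim your prize"))  # spam_folder
print(thermostat(17.5))                                          # heat_on
```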

“Advising” and “assisting” are today’s AI reality. “Doing” gets the AI hype, with attention-grabbing headlines about how it will replace our jobs and how superintelligence will rapidly, irreversibly, and uncontrollably take over the world and render humans subservient.

When trying to make sense of AI, begin with who decides and performs the next best actions: the human-in-the-loop, logic predetermined by humans, or AI with agency?


Understanding AI by understanding humans

Are humans underrated? Anthropic (maker of Claude) CEO Dario Amodei predicted in May:

AI could wipe out half of all entry-level white-collar jobs — and spike unemployment to 10-20% in the next one to five years.

Last year, AI startup investor and author of AI Superpowers Kai-Fu Lee predicted AI will displace 50% of jobs by 2027.

Research on humans has begun to put sand in the gears of these bold predictions. Researchers are following entrepreneurs, marketing departments, and shameless blog writers by invoking AI to get attention. Yet improving our understanding of humans may be essential for our lifelong journeys living with AI.

Last week, I saw three studies that illustrate this shift.

Attention – How the Brain Filters Distractions to Stay Focused on a Goal

The Yale University study demonstrated how the human brain allocates limited perceptual resources to focus on goal-relevant information in dynamic environments. The study finds that the brain prioritizes perceptual effort based on goals, filtering out distractions, and that attention shifts rapidly and flexibly in response to changing visual demands. AI struggles with non-relevant information and requires precise language to be effective, as demonstrated in a clinical diagnosis study using chatbots: when physicians were removed from filtering the relevant information and describing it precisely (using long Latin-derived terms), chatbot effectiveness dropped from 94 percent accuracy to 34 percent.

Attention is essentially the processing of bidirectional electrical pulses in neurons between perception and the mental models relevant to the goal-directed strategy. Agentic AI will need to learn attention: to focus on relevant inputs, shift attention rapidly, adapt based on perceptual inputs (learning), and infer futures, without requiring precise prompts to engage LLM token-prediction machines.

Learning – Why Children Learn Language Faster Than AI

Learning (a.k.a. self-correction) may be the most important type of inference for the survival of any form of life. The Max Planck Institute study found that even the smartest machines can’t match young minds at language learning: the researchers estimated that if a human learned language at the same rate as ChatGPT, it would take them 92,000 years. They introduced a new framework and cited three key areas:

  • Embodied Learning: Children use sight, sound, movement, and touch to build language in a rich, interactive world.
  • Active Exploration: Kids create learning moments by pointing, crawling, and engaging with their surroundings.
  • AI vs. Human Learning: Machines process static data; children dynamically adapt in real-time social and sensory contexts.

Next Action – Affordances in the brain: The human superpower AI hasn’t mastered

To achieve a goal, strategy inferences such as perception, imagining, deciding, and predicting must conclude with the next best action(s). The study by University of Amsterdam scientists discovered:

Our brains automatically understand how we can move through different environments—whether it’s swimming in a lake or walking a path—without conscious thought. These “action possibilities,” or affordances, light up specific brain regions independently of what’s visually present. In contrast, AI models like ChatGPT struggle with these intuitive judgments, missing the physical context that humans naturally grasp.

There is no doubt that AI and robots will improve next-best-action inferences as they get widely deployed. For now, they must rely on token-prediction machines (ChatGPT, Claude, or Gemini) based on statistical representations of words or groups of pixels.

Photo Credit: Neuroscience News


Are we ready for Doctor AI?

ChatGPT, Gemini, Claude, and other Large Language Models (LLMs) are impressive at medical diagnosis, with ChatGPT-4 performing better than physicians at diagnosing illness in a small study. A closer look finds that AI in medical diagnosis is another example of the cognitive dissonance of AI.

  • Thought – A paper by researchers at the University of Oxford found LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios.
  • Thought – Human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.

What went wrong?

Looking back at transcripts, researchers found both that participants provided incomplete information to the LLMs and that the LLMs misinterpreted their prompts. For instance, one user who was supposed to exhibit symptoms of gallstones merely told the LLM: “I get severe stomach pains lasting up to an hour. It can make me vomit and seems to coincide with a takeaway,” omitting the location of the pain, the severity, and the frequency.

It appears physicians know how to identify the relevant conditions and how to state them clearly to the chatbot. The Oxford study highlights a problem not with humans or even with LLMs, but with the way we sometimes measure LLM performance.

  • Thought – LLMs can pass medical licensing tests, real estate licensing exams, or state bar exams.
  • Thought – LLMs can often provide poor personal medical, real estate, and legal advice.

The Cognitive Dissonance of AI

In psychology, cognitive dissonance is the discomfort of holding two or more contradictory thoughts. The term describes AI today. To leverage AI and thrive in our AI journeys, we need to live with the discomfort that comes with understanding the strengths and weaknesses of AI.

ChatGPT, Gemini, Claude:

Chatbots for advice:

Large Language Models:

AI Agents:

AI Reasoning Models:

Autonomous Vehicles:

Photo Credit: Author generated with ChatGPT. AI image generation is amazing, though it can be a struggle to get precisely what is wanted.


How Do You Hire a Gen AI Model?

Hilke Schellmann describes how we use AI-powered algorithms to screen resumes, process background checks, facilitate online candidate assessments, and conduct one-way interviews in her book The Algorithm: How AI Decides Who Gets Hired, Monitored, Promoted, and Fired and Why We Need to Fight Back Now.

While the AI-powered algorithms for hiring humans may not work for Large Language Models (Gen AI), we do have insights from Melanie Mitchell, one of the best explainers of AI and author of the bestselling Artificial Intelligence: A Guide for Thinking Humans. She explains very well what AI can and cannot do.

She recently cast doubt on LLM research that stated: “GPT-3 appears to display an emergent ability to reason by analogy, matching or surpassing human performance across a wide range of text-based problem types.”

She replicated the experiments using counterfactual tasks to stress-test claims of reasoning in large language models. While the advances in LLMs have been amazing, we need people like Melanie Mitchell to help us make sense of the hype and sensational claims. Otherwise, how are we going to know how to hire our next assistant?


A Little Earth Day Optimism

The complexity of reducing the CO2 pumped into the atmosphere can feel overwhelming and even hopeless. While we must continue engaging in the many initiatives to make this happen, it is nice to read an optimistic story that could help us improve our future.

That dose of optimism is Jessica Rawnsley’s story “The Rise of the Carbon Farmer” in Wired. She describes the revival of regenerative agriculture, which keeps carbon in the soil rather than in the atmosphere. It even improves soil health and yields.

By some counts, a third of the excess CO2 in the atmosphere started life in the soil, having been released not by burning fossil fuels but by changing how the planet’s land is used.

He (Patrick Holden) is one of a growing number of farmers shaking off conventional methods and harnessing practices to rebuild soil health and fertility—cover crops, minimal tilling, managed grazing, diverse crop rotations. It is a reverse revolution in some ways, taking farming back to what it once was.

 

 


There are few things more complex than managing health conditions. The healthcare system is very good at tracking the prescribing of medicines. It doesn’t track the deprescribing of them.

Seasons change, fashions change, US presidents change, but for many patients, prescriptions never do—except to become more numerous.

Among US adults aged 40 to 79 years, about 22% reported using 5 or more prescription drugs in the previous 30 days. Within that group, people aged 60 to 79 years were more than twice as likely to have used at least 5 prescription drugs in the previous month as those aged 40 to 59 years.

Over time, a drug’s benefit may decline while its harms increase, Johns Hopkins geriatrician Cynthia Boyd, MD, MPH, told JAMA. “There are a pretty limited number of drugs for which the benefit-harm balance never changes.”

Deprescribing requires shared decision-making that considers “what patients value and what patients prioritize.”

Deprescribing lacks proven clinical guidelines, and patient visits leave little time for the discussion: the average visit lasts twelve minutes for new patients and seven for return patients.*

* Eric Topol, Deep Medicine (Basic Books, 2019), 17.
