AI Daily

Thursday, May 21, 2026

OpenAI GPT-next Reportedly Disproves Conjecture on 80-Year-Old Erdős Problem

In a significant milestone for AI in the formal sciences, a model identified as 'GPT-next' has reportedly disproved a conjecture associated with the Erdős planar unit distance problem. The result reportedly required under $1,000 in total compute, signaling a dramatic shift in the cost-efficiency of using large language models for high-level mathematical discovery and conjecture resolution. It suggests that frontier models are evolving beyond assistive roles in coding and writing into autonomous reasoning agents capable of tackling long-standing open problems in mathematics. The community reaction highlights both the immense potential for AI-driven scientific discovery and the rapidly closing gap between human expertise and automated reasoning in specialized domains.

Latent Space

ResearchArena: Benchmarking the Feasibility of Fully Autonomous AI Researchers

The introduction of ResearchArena marks a systematic attempt to evaluate whether AI agents can handle the entire scientific research loop, from ideation and experimentation to paper writing. Using off-the-shelf agents like Claude Code (Opus 4.6) and Codex (GPT-5.4), the benchmark moves beyond simple coding tasks to assess the quality and validity of agent-generated scientific papers. While current systems can technically produce complete manuscripts, ResearchArena identifies critical gaps between functional feasibility and genuine research quality. This work provides a necessary framework for tracking the progress of 'auto-research' systems as they transition from generating hallucinations of science to producing peer-competitive discoveries.

arxiv/cs.AI

Railway Transitions to 'Agent-Native Cloud' Amid Surge in Autonomous Coding

Railway has reported a significant shift in cloud infrastructure usage, with coding agents now accounting for over $200,000 in monthly spend on their platform. The company is pivoting toward an 'Agent-Native' architecture to support 3 million users and 100,000 weekly signups, driven largely by the rise of autonomous development tools that operate directly on cloud metal. This trend suggests a potential 'death of the PR' (Pull Request) as autonomous agents begin to manage continuous deployment and infrastructure updates without traditional human mediation. The move by Railway reflects a broader industry transition where cloud providers must optimize for the high-concurrency, high-compute demands of agentic workflows rather than human developers.

Latent Space

Hallucination as Security Exploit: New Risks in Multimodal Agent Actions

New research formalizes a dangerous failure mode for multimodal agents in which visual hallucinations are treated as authorization for privileged actions. When an agent accepts a false visual claim (for example, about a screenshot or a webpage element) as a valid precondition for a tool call, the result can be an unauthorized data transfer or an unintended click. This 'hallucination-to-action' conversion effectively turns a model's perceptual errors into security vulnerabilities. As enterprise agents gain more agency over financial and personal data, the research argues that evidence-carrying multimodal architectures are required to ensure that actions are backed by verified visual data rather than speculative generations.
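To make the 'evidence-carrying' idea concrete, here is a minimal sketch of a gate that refuses privileged tool calls unless the model's claim is backed by independently extracted evidence. It is an illustration only, not the paper's architecture: the `Evidence` record, the `PRIVILEGED_TOOLS` names, and the substring check are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool names; a real deployment would define its own list.
PRIVILEGED_TOOLS = {"transfer_funds", "send_file"}


@dataclass(frozen=True)
class Evidence:
    claim: str            # what the model says it saw
    image_sha256: str     # digest of the image the claim refers to
    extracted_text: str   # text from an independent extractor (OCR or DOM), not the model


def is_supported(evidence: Evidence, image_bytes: bytes) -> bool:
    """True only if the evidence refers to this exact image and contains the claim."""
    same_image = hashlib.sha256(image_bytes).hexdigest() == evidence.image_sha256
    claim_found = evidence.claim.lower() in evidence.extracted_text.lower()
    return same_image and claim_found


def authorize(tool: str, evidence: Optional[Evidence], image_bytes: bytes) -> bool:
    """Allow unprivileged tools freely; privileged tools need supported evidence."""
    if tool not in PRIVILEGED_TOOLS:
        return True
    return evidence is not None and is_supported(evidence, image_bytes)
```

The substring match is deliberately naive. The point of the sketch is the control flow: a privileged action is blocked by default, and the model's own description of the image never counts as proof.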

arxiv/cs.AI

Formalizing Trust Calibration for Agentic Tool Use

A new framework for 'Progressive Autonomy' addresses the critical challenge of human-in-the-loop governance for AI agents. By modeling trust as a preference-learning problem, the system uses a Gaussian-process posterior to determine when an agent's proposed action requires explicit human approval versus when it can execute autonomously. This approach allows for a dynamic 'policy gateway' that escalates decisions to humans only in high-uncertainty or high-risk scenarios. This formalization is essential for the safe deployment of autonomous agents in real-world environments, ensuring that trust is calibrated based on historical performance and latent human risk tolerance rather than fixed rules.
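As a rough illustration of how a Gaussian-process posterior can drive an escalation decision, the sketch below fits a tiny GP to past human approvals and executes autonomously only when the predicted approval is high and the uncertainty is low. This is a simplified stand-in, not the paper's method: it uses plain GP regression on +1/-1 approval labels rather than a preference-learning likelihood, and the one-dimensional risk score, kernel length scale, noise level, and thresholds are all assumptions.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalar risk scores."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=0.1):
    """Posterior mean and variance of the approval signal at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)


def gate(xs, ys, x_new, min_mean=0.5, max_std=0.5):
    """Execute only if predicted approval is high and uncertainty is low."""
    mean, var = gp_posterior(xs, ys, x_new)
    if mean >= min_mean and math.sqrt(var) <= max_std:
        return "execute"
    return "escalate"
```

With a history of approvals at low risk scores and rejections at high ones, an action near the approved cluster would run autonomously, while an action in unexplored or previously rejected territory would be routed to a human.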

arxiv/cs.AI

Google Faces Criticism over 'Antigravity' IDE Redesign While Meta Trims Staff

Google's latest AI-integrated IDE, Antigravity 2.0, has received significant negative feedback from the developer community for its intrusive design changes. Critics point to a chaotic product ecosystem at Google that often prioritizes rapid AI integration over established developer workflows and user experience standards. Simultaneously, Meta has announced a 10% reduction in staff despite reaching record revenue and profits. This move highlights a continuing trend in the tech industry where even highly profitable firms are restructuring to redirect resources toward intensive AI development and infrastructure, often at the cost of traditional engineering roles.

Pragmatic Engineer

LBW-Guard: Improving LLM Training Stability with Autonomous Control Layers

Training large language models is often plagued by instability and wasted compute during aggressive learning-rate scaling. Researchers have introduced Learn-by-Wire Guard (LBW-Guard), a governance layer that operates above standard optimizers like AdamW to observe training telemetry and interpret instability-sensitive regimes. By providing bounded autonomous control over the training process, LBW-Guard helps prevent degraded runs and optimizes compute efficiency. This development is particularly relevant for organizations training frontier models at scale, where a single unstable run can result in millions of dollars in lost compute time.
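The general pattern of a control layer that sits above the optimizer can be sketched in a few lines. The example below is a hypothetical guard, not LBW-Guard itself: it watches only the loss, treats a jump above a moving average (or a non-finite value) as instability, and scales the learning rate within fixed bounds. The class name, thresholds, and backoff and recovery factors are assumptions chosen for illustration.

```python
import math


class TrainingGuard:
    """Hypothetical guard: observes loss telemetry and returns a bounded learning rate."""

    def __init__(self, base_lr, spike_ratio=1.5, backoff=0.5,
                 recover=1.1, min_scale=0.1, ema_decay=0.9):
        self.base_lr = base_lr
        self.spike_ratio = spike_ratio  # loss / EMA ratio that counts as a spike
        self.backoff = backoff          # multiplicative LR cut on a spike
        self.recover = recover          # multiplicative LR recovery on a calm step
        self.min_scale = min_scale      # lower bound on the LR scale
        self.ema_decay = ema_decay
        self.scale = 1.0                # upper bound is 1.0, i.e. never above base_lr
        self.ema = None

    def observe(self, loss):
        """Record one loss value and return the learning rate for the next step."""
        spike = (not math.isfinite(loss)) or (
            self.ema is not None and loss > self.spike_ratio * self.ema
        )
        if spike:
            self.scale = max(self.min_scale, self.scale * self.backoff)
        else:
            self.scale = min(1.0, self.scale * self.recover)
            if self.ema is None:
                self.ema = loss
            else:
                self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * loss
        return self.base_lr * self.scale
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step. Because the scale is clamped between `min_scale` and 1.0, the guard can slow training down but can never push the learning rate above what the operator configured.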

arxiv/cs.AI

AdventHealth and OpenAI Partner to Deploy Healthcare-Specific ChatGPT

OpenAI and AdventHealth have announced a major partnership to integrate AI into clinical workflows across the healthcare provider's network. The initiative focuses on using ChatGPT for Healthcare to reduce administrative burdens on clinicians and streamline document-heavy processes, with the ultimate goal of increasing the time providers can spend on direct patient care. This collaboration underscores the aggressive push by OpenAI to move into the enterprise healthcare sector, a field requiring rigorous data privacy and reliability. The partnership will serve as a high-profile test case for whether LLMs can effectively handle the complexities of patient-managed records and clinical data at scale.

OpenAI · arxiv/cs.AI

Mechanistic Analysis Identifies Attention Head Imbalance in Multimodal Hallucinations

A new mechanistic interpretability study has identified specific 'attention head imbalances' that cause multimodal large language models (MLLMs) to hallucinate. The research finds that models often prioritize erroneous textual prompts over contradictory visual evidence because certain internal components are biased toward linguistic consistency over perceptual data. Through path patching and causal analysis of five major open-source MLLMs, the study isolates the specific heads responsible for resisting visual evidence. This research provides a roadmap for developers to 'tune' models for better grounding, potentially reducing the frequency of modality-conflict hallucinations in visual-language tasks.

arxiv/cs.AI

Datasette Agent: Open-Source Tooling for LLM Data Interaction

Simon Willison has released early alpha versions of 'datasette-agent,' an open-source tool designed to let AI agents interact more effectively with structured data. The project emphasizes the use of 'small tools'—specialized functions that an agent can invoke to query, filter, and visualize SQLite databases. This release highlights the growing trend toward building modular, open-source utilities that expand the capabilities of general-purpose LLMs. By providing a formal interface for agents to explore data, Datasette Agent enables more reliable and transparent data analysis workflows that avoid the typical 'black box' issues associated with pure text-to-SQL generation.
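For readers unfamiliar with the 'small tools' pattern, the sketch below shows what two narrowly scoped functions over a SQLite database might look like using only Python's standard `sqlite3` module. These are not datasette-agent's actual tools or API: the function names `list_tables` and `run_select`, the row limit, and the SELECT-only check are assumptions made for this example.

```python
import sqlite3


def list_tables(conn):
    """Small tool: return the names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def run_select(conn, sql, params=(), limit=50):
    """Small tool: run a SELECT and return at most `limit` rows as dicts."""
    # Naive guard for illustration; it only inspects the first keyword.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchmany(limit)]


# Example usage against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO papers (title, year) VALUES (?, ?)",
    [("A", 2024), ("B", 2025), ("C", 2026)],
)
print(list_tables(conn))
print(run_select(conn, "SELECT title FROM papers WHERE year >= ?", (2025,)))
```

Each tool does one thing and returns plain data, so the agent's calls can be logged and inspected step by step instead of hiding the whole analysis inside a single generated query.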

Simon Willison