AI Daily

Friday, May 1, 2026

OpenAI GPT-5.5 Cyber Capabilities Evaluation and Safety Benchmark

An evaluation of OpenAI's GPT-5.5 reports significant advances in cyber capabilities, focusing on the model's ability to identify and exploit software vulnerabilities autonomously. The findings indicate that improved reasoning allows the model to execute multi-step cyber-attacks that were previously beyond the reach of AI systems. Researchers emphasize that these advances could support defensive security automation, but that the model is reaching a performance tier that could lower the barrier to entry for sophisticated cyber-attacks. They argue that robust new safety protocols are needed to prevent misuse in offensive operations.

Simon Willison

Discovery of Anthropic Claude Integration in Apple Support Ecosystem

Evidence found in the Apple Support app's configuration files indicates that Apple has been integrating Anthropic's Claude models into its service backend. The presence of 'Claude.md' files suggests that Apple is using these models for support-related tasks or orchestration logic, a notable shift in the company's reliance on third-party foundation models. The discovery offers a rare glimpse into the internal AI stack of one of the world's most secretive tech companies as it extends its AI capabilities beyond proprietary on-device models to external providers of high-reasoning models.

Hacker News

End-to-End Autonomous Scientific Discovery on Physical Platforms

Significant progress has been made in autonomous scientific discovery, with new systems demonstrating the ability to move from natural language goals to real-world physical experimentation. One project successfully used an LLM-based agent to conduct end-to-end research on an optical platform, producing non-trivial results without human intervention. This is complemented by research into 'machine collective intelligence' that enables AI to derive explainable and extrapolatable governing equations from empirical data, addressing a major 'black box' bottleneck in AI-driven science. This integration of physical control and symbolic reasoning represents a paradigm shift for automated laboratories.

arxiv/cs.AI · arxiv/cs.AI

Optimizing Computer-Use Agents via Step-Level Execution Strategies

Step-level optimization targets the high cost and latency of computer-use agents, which typically invoke a large multimodal model at every interaction step. By identifying which steps require complex reasoning and which can be handled by cheaper, specialized modules, researchers have developed a framework that maintains benchmark performance while significantly reducing inference costs. This is a critical step toward making general software automation agents viable for enterprise-scale deployment, where the cost and latency of vision-language model calls are currently the primary barrier to adoption.
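
To make the routing idea concrete, here is a minimal Python sketch, not the paper's framework. It assumes each step already carries a hypothetical needs_reasoning flag (a real system would have to predict this), and the handler names and relative costs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    needs_reasoning: bool  # hypothetical flag; a real system must predict this


# Made-up relative costs per call, for illustration only.
COST = {"cheap": 1, "large": 20}


def cheap_handler(step: Step) -> str:
    """Stand-in for a small specialized module (e.g. a scripted click)."""
    return f"cheap:{step.description}"


def large_model(step: Step) -> str:
    """Stand-in for a call to a large multimodal model."""
    return f"large:{step.description}"


def run(steps: list[Step]) -> tuple[list[str], int]:
    """Route each step to the cheapest handler that can take it."""
    outputs, total_cost = [], 0
    for step in steps:
        if step.needs_reasoning:
            outputs.append(large_model(step))
            total_cost += COST["large"]
        else:
            outputs.append(cheap_handler(step))
            total_cost += COST["cheap"]
    return outputs, total_cost


steps = [
    Step("click OK", needs_reasoning=False),
    Step("plan the refund flow", needs_reasoning=True),
    Step("type customer name", needs_reasoning=False),
]
outputs, routed_cost = run(steps)
baseline_cost = COST["large"] * len(steps)  # large model on every step
print(routed_cost, baseline_cost)
```

With these assumed costs, only one of the three steps pays the large-model price. The hard part in practice is deciding the flag reliably, which this sketch leaves out.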

arxiv/cs.AI

The Inverse-Wisdom Law: Challenging Consensus in Agentic Swarms

A new study formalizes the 'Consensus Paradox' and the 'Inverse-Wisdom Law' in multi-agent systems, challenging the axiomatic assumption that agent collaboration naturally leads to more accurate results. Through 36 experiments and over 12,000 trajectories, researchers found that agentic swarms often prioritize internal architectural agreement over external logical truth. This suggests that simply increasing the number of agents in a workflow can actually reinforce incorrect reasoning if the underlying architectures share similar biases or tribal alignments, highlighting the need for architectural diversity in swarm design.
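
The intuition that more agents do not automatically help can be illustrated with the classical majority-vote (Condorcet jury) calculation. This is a textbook sketch rather than the study's own formalization, and the accuracy values below are arbitrary assumptions.

```python
from math import comb


def majority_accuracy_independent(p: float, n: int) -> float:
    """Probability that a strict majority of n independent agents is correct.

    Each agent is correct with probability p; n is assumed to be odd.
    """
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def majority_accuracy_shared_bias(p: float, n: int) -> float:
    """If all agents make identical mistakes, the vote equals a single agent."""
    return p


for n in (1, 5, 15):
    print(
        n,
        round(majority_accuracy_independent(0.6, n), 3),  # independent, p > 0.5
        round(majority_accuracy_independent(0.4, n), 3),  # independent, p < 0.5
        majority_accuracy_shared_bias(0.6, n),            # fully correlated
    )
```

Under these assumptions, voting only helps when agents are better than chance and err independently. Fully correlated agents gain nothing from larger swarms, and agents that are individually worse than chance get more confidently wrong as the swarm grows.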

arxiv/cs.AI

Real-Time Inference Feedback for Tool-Calling Agents

Tool-calling agents have traditionally been evaluated post-hoc, identifying errors only after a trajectory is complete. The 'Reinforced Agent' framework moves this evaluation into the execution loop at inference time. By providing a feedback signal directly to the model during execution, the system can course-correct in real time, significantly improving parameter accuracy and scope recognition without requiring full retraining or expensive prompt-tuning cycles. This approach closes the gap between evaluation and execution, making agents more reliable in high-stakes tool-use environments.
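
A rough sketch of what in-loop feedback can look like follows. It substitutes a simple rule-based parameter checker for whatever evaluator the framework actually uses, and the schema, the toy agent, and all function names are hypothetical.

```python
from typing import Callable

# Hypothetical tool schema: parameter name -> expected Python type.
WEATHER_SCHEMA = {"city": str, "days": int}


def check_call(args: dict, schema: dict) -> list[str]:
    """Return readable problems with a proposed call; empty means it looks valid."""
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing parameter '{name}'")
        elif not isinstance(args[name], expected):
            problems.append(f"parameter '{name}' should be {expected.__name__}")
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter '{name}'")
    return problems


def run_with_feedback(
    propose: Callable[[list[str]], dict], schema: dict, max_attempts: int = 3
) -> tuple[dict, int]:
    """Ask the agent for a call, check it, and feed problems back before executing."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        args = propose(feedback)
        feedback = check_call(args, schema)
        if not feedback:
            return args, attempt
    raise ValueError(f"no valid call after {max_attempts} attempts: {feedback}")


def toy_agent(feedback: list[str]) -> dict:
    """Stand-in for a model: wrong type at first, fixed once it sees feedback."""
    if not feedback:
        return {"city": "Paris", "days": "3"}
    return {"city": "Paris", "days": 3}


args, attempts = run_with_feedback(toy_agent, WEATHER_SCHEMA)
print(args, attempts)
```

The point of the sketch is the control flow: the error is caught and corrected before the tool runs, rather than being scored after the trajectory ends.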

arxiv/cs.AI

Web2BigTable: Internet-Scale Structured Data Extraction via Multi-Agent Systems

Web2BigTable introduces a bi-level multi-agent architecture designed to handle internet-scale information search and extraction. Current agents often struggle to balance deep reasoning over a single target with the structured aggregation of data across thousands of sources. This system addresses both by using a hierarchical structure that ensures cross-entity consistency and wide coverage, enabling the creation of schema-aligned datasets from unstructured web data at a scale previously unreachable by single-agent workflows. The system is particularly effective for breadth-oriented tasks that require both reasoning and structured outputs.
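
The two-level split can be sketched in a few lines of Python. This is only an illustration of the general pattern (per-entity workers below, a schema-enforcing coordinator above), not Web2BigTable's actual design; the schema, the in-memory "pages", and the function names are all assumptions.

```python
# Hypothetical target schema for the output table.
SCHEMA = ("name", "founded", "hq")

# Hypothetical stand-in for fetched web pages, keyed by entity.
PAGES = {
    "Acme": "Acme | founded: 1999 | hq: Berlin",
    "Globex": "Globex | hq: Springfield",
}


def worker(entity: str) -> dict:
    """Lower level: dig into one entity's page and extract whatever fields exist."""
    row = {"name": entity}
    for part in PAGES[entity].split("|")[1:]:
        key, _, value = part.partition(":")
        row[key.strip()] = value.strip()
    return row


def orchestrator(entities: list[str]) -> list[dict]:
    """Upper level: fan out over entities and align every row to one schema."""
    table = []
    for entity in entities:
        row = worker(entity)
        # Missing fields become None so all rows share the same columns.
        table.append({col: row.get(col) for col in SCHEMA})
    return table


table = orchestrator(["Acme", "Globex"])
for row in table:
    print(row)
```

In a real system each worker would be an agent doing search and reasoning, and the upper level would also handle deduplication and conflict resolution, which are omitted here.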

arxiv/cs.AI

The Evolution of Learning Rate Engineering in Neural Network Training

A systematic review of 'Learning Rate Engineering' traces the evolution of training optimization from simple global fixed rates to complex, joint layer-time scheduling. The research categorizes this evolution into five distinct generations, highlighting how parameter-level adaptation and layer-level differentiation have become essential for training modern foundation models. This framework provides a roadmap for researchers looking to optimize the next generation of massive-scale neural architectures by moving beyond coarse single-parameter tuning to more granular, automated evolution strategies.
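
As an illustration of what joint layer-time scheduling means, the sketch below combines two widely used techniques: layer-wise learning rate decay and linear warmup followed by cosine decay. It is not taken from the review and does not reproduce its five-generation taxonomy; the decay factor and warmup length are arbitrary assumptions.

```python
import math


def layer_time_lr(
    base_lr: float,
    layer: int,
    num_layers: int,
    step: int,
    total_steps: int,
    decay: float = 0.8,   # assumed per-layer decay factor
    warmup: int = 100,    # assumed number of warmup steps
) -> float:
    """Learning rate for one layer (0 = bottom) at one training step."""
    # Layer axis: the top layer uses base_lr; each layer below is scaled by `decay`.
    layer_scale = decay ** (num_layers - 1 - layer)
    # Time axis: linear warmup, then cosine decay to zero.
    if step < warmup:
        time_scale = (step + 1) / warmup
    else:
        progress = (step - warmup) / max(1, total_steps - warmup)
        time_scale = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * layer_scale * time_scale


for layer in (0, 6, 11):
    print(layer, layer_time_lr(1e-3, layer, num_layers=12, step=500, total_steps=1000))
```

A single global fixed rate corresponds to setting decay to 1 and dropping the time factor; the function above varies the rate along both axes at once.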

arxiv/cs.AI

Mechanizing AI Governance via Machine-Checked Proofs and Structural Constraints

Researchers have introduced mechanized foundations for 'structural governance,' utilizing machine-checked proofs in the Coq proof assistant to ensure the safety of cognitive workflows. The core argument is that current AI governance is often 'behavioral' and prone to failure because it creates policies for non-existent capabilities while leaving actual risks ungoverned. By defining safety predicates that are provably true for infinite program behaviors, this framework aims to move AI safety from qualitative policy to rigorous, verifiable engineering that can be enforced at the structural level of the AI system itself.

arxiv/cs.AI · arxiv/cs.AI

Autonomous ML Pipeline Generation via Self-Healing Multi-Agent Systems

The 'Think it, Run it' project introduces a five-agent architecture capable of generating end-to-end machine learning pipelines from natural language goals and raw datasets. By integrating profiling, intent parsing, microservice recommendation, and Directed Acyclic Graph (DAG) construction, the system automates the traditionally manual process of data science. This self-healing multi-agent approach improves the robustness and explainability of generated pipelines, representing a significant step forward in the automation of the machine learning lifecycle through code-grounded retrieval and autonomous execution.
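
The DAG construction step can be illustrated with Python's standard graphlib module. The pipeline below is a hypothetical example, not output from the project, and the step names are made up.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "profile_data": set(),
    "clean_data": {"profile_data"},
    "engineer_features": {"clean_data"},
    "train_model": {"engineer_features"},
    "evaluate_model": {"train_model"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which is a cheap validity check on a generated pipeline.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
print(order)
```

Representing the pipeline as a dependency graph is what lets a system validate it before running anything, and re-plan only the affected downstream steps if one fails.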

arxiv/cs.AI