The ARC Prize organization designs benchmarks specifically crafted around tasks that humans complete easily but that remain difficult for AIs such as LLMs, "reasoning" models, and agentic frameworks.
ARC-AGI-3 is the first fully interactive benchmark in the ARC-AGI series. ARC-AGI-3 represents hundreds of original turn-based environments, each handcrafted by a team of human game designers. There are no instructions, no rules, and no stated goals. To succeed, an AI agent must explore each environment on its own, figure out how it works, discover what winning looks like, and carry what it learns forward across increasingly difficult levels.
Previous ARC-AGI benchmarks predicted and tracked major AI breakthroughs, from reasoning models to coding agents. ARC-AGI-3 points to what's next: the gap between AI that can follow instructions and AI that can genuinely explore, learn, and adapt in unfamiliar situations.
You can try the tasks yourself here: https://arcprize.org/arc-agi/3
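The explore-and-learn loop described above — act, observe, infer the hidden rules, and remember what worked — can be sketched against a toy stand-in environment. The real ARC-AGI-3 environments expose grid observations through an agent API not shown here; everything below (the environment dynamics, the action names, the memory heuristic) is a made-up illustration of the general idea, not the actual benchmark interface.

```python
import random

class ToyEnvironment:
    """A stand-in for an ARC-AGI-3 environment: the agent receives only
    observations (here, an integer state) and is never told the rules or
    the goal (reaching state 5). Purely illustrative."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Hidden dynamics the agent must discover: "up" increments, "down" decrements.
        if action == "up":
            self.state += 1
        elif action == "down":
            self.state = max(0, self.state - 1)
        won = self.state == 5
        return self.state, won

def explore(env, max_steps=200, seed=0):
    """Random exploration plus a trivial memory: prefer the action that
    most recently reached a new state."""
    rng = random.Random(seed)
    actions = ["up", "down"]
    best = -1
    preferred = None
    for t in range(max_steps):
        # Exploit the remembered action 80% of the time, otherwise explore.
        action = preferred if preferred and rng.random() < 0.8 else rng.choice(actions)
        state, won = env.step(action)
        if state > best:  # a new state was reached: remember what got us here
            best, preferred = state, action
        if won:
            return t + 1  # steps taken to discover the hidden goal
    return None  # never discovered what winning looks like

steps = explore(ToyEnvironment())
print(steps)
```

Even this crude agent illustrates the gap the benchmark targets: with no instructions, success depends entirely on exploration and on carrying forward what was learned.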
Here is the current leaderboard for ARC-AGI-3, using state-of-the-art models:
- OpenAI GPT-5.4 High - 0.3% success rate at $5.2K
- Google Gemini 3.1 Pro - 0.2% success rate at $2.2K
- Anthropic Opus 4.6 Max - 0.2% success rate at $8.9K
- xAI Grok 4.20 Reasoning - 0.0% success rate at $3.8K

(Logarithmic cost on the horizontal axis. Note that the vertical scale goes from 0% to 3% in this graph. If human scores were included, they would be at 100%, at the cost of approximately $250.)
https://arcprize.org/leaderboard
Technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
In order for an environment to be included in ARC-AGI-3, it needs to pass the minimum "easy for humans" threshold. Each environment was attempted by 10 people. Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private, and fully-private sets. Many environments were solved by six or more people. As a reminder, an environment is considered solved only if the test taker was able to complete all levels upon seeing the environment for the very first time.

As such, all ARC-AGI-3 environments are verified to be 100% solvable by humans with no prior task-specific training.
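The inclusion rule above amounts to a simple filter over per-environment solve counts. The environment names and counts below are invented for illustration; only the threshold (at least two independent solvers out of ten attempts) comes from the text.

```python
# Hypothetical solve counts: how many of the 10 human participants
# fully solved each environment on first exposure.
solve_counts = {
    "env_a": 7,
    "env_b": 2,
    "env_c": 1,
    "env_d": 0,
    "env_e": 6,
}

# The minimum "easy for humans" threshold described above.
MIN_INDEPENDENT_SOLVERS = 2

included = [name for name, solvers in solve_counts.items()
            if solvers >= MIN_INDEPENDENT_SOLVERS]
print(included)  # -> ['env_a', 'env_b', 'env_e']
```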
Consciousness (the fact of experience) doesn't necessarily need to be linked to intelligence. It might be, but it doesn't have to be. An LLM is almost certainly more intelligent than an insect, but there is most likely nothing it is like to be an LLM, while there probably is something it is like to be an insect.
Isn't it kind of eerie that you can only suppose it must be "like something" to be an insect from the very particular bias of being human? We're projecting the idea that "it's like something to be something [as a human]" onto the experience of other things.
How would we describe what it's like? Would something poetic suffice, such as "it's like being a leaf in the wind, with a weak preference for where you blow but no memory of where you've been"? But all of that is human concepts: human experience decomposed into a subset of more human experiences (the recursive nature of experience and concepts is really weird).
I think the idea of "what it's like…" has some interesting flaws when applied to nonhumans. It kind of presupposes that insects are lesser, in a way, as though we could conceptualize what it's like to be them merely by understanding a stricter subset of what it's like to be human.
I can only suppose that of other people as well. There's no way to measure consciousness. The only evidence of its existence is the fact that it feels like something to be me, from my subjective perspective. Other humans behave the way I do, so I assume they're probably having similar experiences, but I have no idea what it's like to be a bat, for example.
However, answering the question "what it's like to be" is not relevant here. What's relevant is that existence has qualia at all.