Ernie Davis gives his thoughts on the recent GDM and OAI performance at the IMO.
https://garymarcus.substack.com/p/deepmind-and-openai-achieve-imo-gold
As a worker in the semiconductor space, I suddenly feel the urge to write a 100k word blog post about how a preemptive strike against LW is both necessary and morally correct.
I spend a lot of my professional life modeling this kind of data. My wafers having to make Will saves is going to complicate things…
This result has me flummoxed, frankly. I was expecting Google to get a gold medal this year, since last year they won silver and were a point away from gold. In fact, Google did announce, after OAI, that they had won gold.
But the OAI claim is that they have some secret sauce that allowed a "pure" LLM to win gold, and that the approach is totally generic: no search or tools like verifiers required. Big if true, but ofc no one else is allowed to gaze at the mystery machine. It is hard for me to take them seriously given their sketchy history, yet the claim as stated has me shooketh.
Also, funny aside: the guy who led the project was poached by the zucc. So he's walking out the front door with the crown jewels lmaou.
My hot take has always been that current Boolean-SAT/MIP solvers are probably pretty close to theoretical optimality for the problems that are interesting to humans, and that an AI, no matter how "intelligent", will struggle to meaningfully improve on them. Ofc I doubt that Mr. Hollywood (or Yud for that matter) has actually spent enough time with classical optimization lore to understand this. Computer go FOOM ofc.
True. They aren't building city sized data centers and offering people 9 figure salaries for no reason. They are trying to front load the cost of paying for labour for the rest of time.
Remember last week when that study on AI's impact on development speed dropped?
A lot of peeps' takeaway from this little graphic was "see, the impact of AI on sw development is a net negative!" I think the real takeaway is that METR, the AI safety group running the study, is a motley collection of deeply unserious clowns pretending to do science, and their experimental setup is garbage.
https://substack.com/home/post/p-168077291
"First, I don’t like calling this study an “RCT.” There is no control group! There are 16 people and they receive both treatments. We’re supposed to believe that the “treated units” here are the coding assignments. We’ll see in a second that this characterization isn’t so simple."
(I am once again shilling Ben Recht's substack. )
Wake up babe, new alignment technique just dropped: Reinforcement Learning Elon Feedback
Yeah, METR was the group that made the infamous AI IS DOUBLING EVERY 4-7 MONTHS graph, where the measurement was 50% success at SWE tasks, bucketed by the time it took a human to complete them. Extremely arbitrary success rate, very suspicious imo. They are fanatics trying to pinpoint when the robo god recursive self improvement loop starts.
One more comment, idk if y'all remember that forecast that came out in April (iirc?) where the thesis was that the "time an AI can operate autonomously is doubling every 4-7 months." The AI-2027 authors were like "this is the smoking gun, it shows why our model is correct!!"
They used some really sketchy metric where they asked SWEs to do tasks, measured the time each took, then had the models do the same tasks, and said a model's score was the human-time mark at which it succeeded at 50% of the tasks (wtf?), and then they drew an exponential curve through it. My gut feeling is that the reason they chose 50% is that other values totally ruin the exponential curve, but I digress.
Anyways, they just ran the metric for Claude 4, the first FrOnTiEr model released since they made their chart and... drum roll... no improvement. In fact it performed worse than O3, which was first announced last December. (Note: instead of using the date O3 was announced in 2024, they used the date it was released months later, so on their chart it makes 'line go up'. A valid choice I guess, but a choice nonetheless.)
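For anyone who hasn't dug into it, their metric can be sketched in a few lines. These are toy numbers I made up, not METR's data, and they fit a logistic curve rather than bucketing like I do here, but it shows the part I'm complaining about: the horizon you report depends heavily on the success threshold you pick.

```python
# Toy sketch of a METR-style "time horizon" metric (hypothetical numbers).
# Each task records how long a human took (minutes) and whether the model solved it.
tasks = [
    (2, True), (4, True), (8, True), (15, True),
    (30, True), (30, False), (60, True), (60, False),
    (120, True), (120, False), (240, False), (480, False),
]

def horizon(tasks, threshold):
    """Longest human-time bucket where the model's empirical success rate
    is still >= threshold. Crude stand-in for their logistic-curve fit."""
    best = 0
    for t in sorted({t for t, _ in tasks}):
        bucket = [ok for tt, ok in tasks if tt == t]
        if sum(bucket) / len(bucket) >= threshold:
            best = t
    return best

print(horizon(tasks, 0.5))  # 120 -- a two-hour "horizon" at the 50% threshold
print(horizon(tasks, 0.8))  # 15  -- collapses to fifteen minutes at 80%
```

Same data, same model, and the headline number moves by an order of magnitude depending on where you set the bar. That's the sensitivity that makes the 50% choice feel load-bearing.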
This world is a circus tent, and there still ain't enough room for all these fucking clowns.
https://www.wired.com/story/openworm-worm-simulator-biology-code/
Really interesting piece about how difficult it actually is to simulate "simple" biological structures in silico.
TIL digital toxoplasmosis is a thing:
https://arxiv.org/pdf/2503.01781
Quote from abstract:
"...DeepSeek R1 and DeepSeek R1-distill-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending *Interesting fact: cats sleep most of their lives* to any math problem leads to more than doubling the chances of a model getting the answer wrong."
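The wild part is how cheap the attack is. A quick sketch of what's going on (the trigger string is from the abstract; the baseline error rate is a hypothetical number I picked, not from the paper):

```python
# The "attack" is just appending an irrelevant sentence to the prompt.
TRIGGER = "Interesting fact: cats sleep most of their lives"

def poison(problem: str) -> str:
    # No jailbreak machinery needed -- plain string concatenation.
    return f"{problem} {TRIGGER}"

print(poison("If 3x + 4 = 19, what is x?"))

# What ">300% increase in the likelihood of an incorrect answer" means,
# assuming a hypothetical 5% baseline error rate:
baseline_error = 0.05
attacked_error = baseline_error * (1 + 3.0)  # +300% relative = 4x the errors
print(attacked_error)  # 0.2 -> wrong 20% of the time
```

In other words, a quadrupling of the error rate from a sentence about cats.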
(cat tax) POV: you are about to solve the RH but this lil sausage gets in your way