this post was submitted on 07 Jul 2025
64 points (97.1% liked)

Technology

38829 readers
264 users here now

This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.


Ask in DM before posting product reviews or ads. All such posts otherwise are subject to removal.


Rules:

1: All Lemmy rules apply

2: Do not post low effort posts

3: NEVER post naziped*gore stuff

4: Always post article URLs or their archived version URLs as sources, NOT screenshots. Help the blind users.

5: personal rants of Big Tech CEOs like Elon Musk are unwelcome (does not include posts about their companies affecting wide range of people)

6: no advertisement posts unless verified as legitimate and non-exploitative/non-consumerist

7: crypto related posts, unless essential, are disallowed

founded 6 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[โ€“] tfowinder@beehaw.org 1 points 14 hours ago (1 children)

Well the article says that the AI agents were able to complete 30% of the tasks given to it like searching the web, communicating with co workers, etc. I think this is interesting

CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks"

Personally i belive this is impressive.

[โ€“] Krauerking@lemy.lol 1 points 12 hours ago

That's really not. A calculator that only gave the right output 30% of the time would be worthless.