Posted on 2025-8-5 17:00:31
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, ranging from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands all of this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.
This MLLM judge doesn’t just give a vague opinion. Instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This is meant to keep the scoring fair, consistent, and thorough.
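As a rough illustration of checklist-based scoring, here is a minimal Python sketch. The post does not say how the ten metric scores are combined, so the unweighted mean, the 0 to 10 scale, and the function name `aggregate_checklist` are all assumptions made for this example. Only three of the ten metrics are named in the post, so only those appear here.

```python
from statistics import mean


def aggregate_checklist(scores: dict[str, float], max_score: float = 10.0) -> float:
    """Combine per-metric judge scores into a single task score.

    Assumption: an unweighted mean on a 0..max_score scale. The real
    benchmark may weight or combine its metrics differently.
    """
    if not scores:
        raise ValueError("at least one metric score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= max_score:
            raise ValueError(f"score for {name!r} is out of range: {value}")
    return mean(scores.values())


# Hypothetical judge output for one task, using the three metrics the post names.
judge_scores = {
    "functionality": 8.0,
    "user_experience": 7.0,
    "aesthetic_quality": 6.0,
}
task_score = aggregate_checklist(judge_scores)
print(f"task score: {task_score:.2f}")
```

Validating the range before averaging means a malformed judge response fails loudly instead of silently skewing a model's overall score.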
The big question is, does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
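The post does not say how ranking consistency was computed. One common way to compare two rankings is pairwise agreement: the share of model pairs that both rankings put in the same order. The sketch below assumes that measure. The function name and the model names are hypothetical, and the data is made up for illustration.

```python
from itertools import combinations


def pairwise_consistency(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of item pairs that two rankings order the same way.

    Both lists are ordered best-first and must contain the same
    distinct items. Returns a value between 0.0 and 1.0.
    """
    if len(set(rank_a)) != len(rank_a) or len(set(rank_b)) != len(rank_b):
        raise ValueError("rankings must not contain duplicates")
    if set(rank_a) != set(rank_b):
        raise ValueError("rankings must contain the same items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare")

    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agreeing = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agreeing / len(pairs)


# Made-up rankings: the two lists disagree only on the order of model_b and model_c.
benchmark_rank = ["model_a", "model_b", "model_c", "model_d"]
human_rank = ["model_a", "model_c", "model_b", "model_d"]
consistency = pairwise_consistency(benchmark_rank, human_rank)
print(f"pairwise consistency: {consistency:.1%}")
```

With four models there are six pairs, and here five of them agree, so a single swap of neighbours already costs about 17 percentage points. Over a larger pool of models the same swap would matter far less.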
On top of this, the framework’s judgments showed more than 90% agreement with professional human developers.
https://www.artificialintelligence-news.com/