AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such ...
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
If you are interested in learning more about how to benchmark AI large language models (LLMs), a new benchmarking tool, Agent Bench, has emerged as a game-changer. This innovative tool has been ...
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," ...
Companies can evaluate AI models before use.
A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices. When OpenAI unveiled o3 in ...
A new community-driven initiative evaluates large language models using Italian-native tasks, with AI translation among the ...
As more AI models show evidence of being able to deceive their creators, researchers from the Center for AI Safety and Scale AI have developed a first-of-its-kind lie detector. On Wednesday, the ...
A UNSW-led global collaborative study has found most shoreline prediction models are effective at forecasting changes to natural, sandy beaches with an accuracy of approximately 10 meters.