Article | simonwillison.net | 2 min read

Qwen3.6 Triumphs Over Opus 4.7 in Humorous Benchmark Test

By Simon Willison


AI Summary

In a playful but informative comparison, I tested two newly released AI models, Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic, using my unconventional 'pelican riding a bicycle' benchmark. The results were amusing: Qwen3.6 outperformed Opus, which struggled with the bicycle frame. Setting Opus to its maximum thinking level did not improve its result significantly. As a further test, I asked both models to generate an SVG of a flamingo on a unicycle. Qwen impressed again, and even added a witty comment about sunglasses on the flamingo.

While the pelican benchmark started as a joke, over time its results correlated surprisingly well with the models' general utility. Early attempts were poor, but recent models like Gemini 3.1 Pro have produced usable illustrations. Today's results suggest that this correlation may no longer hold: despite my respect for Qwen, it's hard to believe that a 21 GB quantized model surpasses Anthropic's latest release in overall utility.

Yet, if you specifically need an SVG of a pelican on a bicycle, Qwen3.6-35B-A3B is currently your best choice, even running on a laptop. This quirky benchmark highlights the challenges and humor in evaluating AI models, reminding us of the absurdity in comparing their capabilities.

Key Concepts

AI Model Benchmarking

AI model benchmarking involves evaluating and comparing the performance of different AI models using standardized tests or tasks. These benchmarks help determine the models' capabilities and effectiveness in specific applications.

Model Performance Evaluation

Model performance evaluation is the process of assessing how well an AI model performs a given task. This involves measuring accuracy, efficiency, and other relevant metrics to determine the model's effectiveness.
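
As a rough illustration of the metrics mentioned above, here is a minimal Python sketch that measures exact-match accuracy and mean latency for a model over a small set of test cases. The names `evaluate`, `toy_model`, and `CASES` are hypothetical and used only for this example; they are not part of the article or of any real library.

```python
import time
from typing import Callable, Sequence


def evaluate(
    model: Callable[[str], str],
    cases: Sequence[tuple[str, str]],
) -> dict[str, float]:
    """Return exact-match accuracy and mean latency (seconds) over `cases`."""
    if not cases:
        raise ValueError("cases must not be empty")
    correct = 0
    elapsed = 0.0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model(prompt)
        elapsed += time.perf_counter() - start
        # Exact string match is the simplest possible scoring rule.
        if answer.strip() == expected:
            correct += 1
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": elapsed / len(cases),
    }


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (assumption for illustration).
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


# Hypothetical test cases: (prompt, expected answer).
CASES = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

if __name__ == "__main__":
    print(evaluate(toy_model, CASES))
```

Exact-match scoring only works when each task has a single correct answer. A drawing task such as the pelican SVG has no such answer, so it would need a different scoring step, for example human judgment of the rendered image.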

Category

AI

Summarized by Mente
