πŸ† BigO(Bench) Leaderboard πŸ†

BigO(Bench) assesses the capacity of Large Language Models (LLMs) to comprehend time-space computational complexity of input or generated code.

Explore how LLMs perform on the tasks of our benchmark, and how they rank against one another!

Leaderboard views: Prediction, Generation, and Ranking tasks, each reported for Time and Space complexity.

πŸ“ Notes

  1. All models are evaluated using their instruct variant, when available, in a zero-shot fashion.
  2. Models are ranked according to pass@1 using greedy decoding. For sampled generations, all models use temperature 0.8, except DeepSeek-R1 variants, which use temperature 0.6.
  3. GPT-4o and o1-mini do not disclose any estimate of inference compute, nor their number of parameters. In addition, o1-mini returned many empty answers, potentially due to reasoning collapse: we discarded these and computed metrics on non-empty answers only. Its performance should therefore be regarded as an optimistic upper-bound estimate.
  4. DeepSeek-R1 distilled models used substantially more inference compute than Llama 3.1 405B (2x the compute nodes, 5x the compute time, and 16x the generated tokens).
  5. Metrics are macro-averaged, first over the complexity classes within each problem and then across problems (see the sketch after this list).
  6. Pass@k measures the accuracy of finding the correct complexity; Best@k measures accuracy only on the most optimized complexity class of each problem; All@k requires correct output on all complexity classes of a problem at once. For complexity prediction, "finding" means outputting the correct complexity class; for complexity generation and ranking, it means producing code that not only has the correct complexity as measured by the complexity framework, but also passes the problem's correctness tests in the first place.
  7. "Size" here is the amount of activated model weight during inference.