Can LLMs Generate Code with Controlled Time and Space Complexity?

1 FAIR at Meta · 2 Inria · 3 Work done at Meta, now working at Mistral AI

Figure 1 BigO(Bench) framework overview: Given a coding problem and human solutions, the framework evaluates language models on three key tasks: (1) predicting time-space complexities of existing solutions, (2) generating new code that meets specified complexity requirements, and (3) ranking solutions against human-written code with similar complexity profiles. The complexity framework automatically validates model outputs by computing runtime distributions and curve coefficients.

Motivation

We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses a gap in current evaluations, which often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests, annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large range of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.

Contributions

  1. Task Firstly, we introduce three tasks that evaluate a model's understanding of time and space complexities. For a given coding challenge and human solution, the model can be queried to a. predict the time-space complexities, b. generate code that solves the challenge while adhering to a specified (known to be feasible) complexity, and c. generate code that, on top of this, ranks better than human solutions of the same challenge and complexity.
  2. Dataset Secondly, we support training and evaluation on these tasks by releasing a dataset of 3,105 coding problems and 1,190,250 solutions from Code Contests, which includes time-space complexity labels and curve coefficients, as well as runtime and memory profiling measurements.
  3. Framework Thirdly, we release the code of our complexity inference framework, which takes a Python function and returns its time and space complexities. It is a rule-based algorithm based on fuzzing, profiling, and regression over major complexity classes (including multi-dimensional ones). This is what we used to produce the ground-truth labels for BigO(Bench), which are statistically significant measured performance profiles rather than theoretical complexities. The complexity evaluation framework achieves 92% and 84% match with human-annotated theoretical complexity on the time and space complexity test sets, respectively.
  4. Benchmark Fourthly, we evaluate 12 popular models on our benchmark, along with fine-tuned ones, and compare their performance in detail: using our All@1 metric, DeepSeek-R1 Llama 70B achieves the top scores of 41.4% on time complexity prediction, 4.8% on time complexity generation, and 3.4% on space complexity generation, while being outperformed on space complexity prediction by Llama 3.1 405B with 10.3%.

Dynamic Complexity Inference Framework

Figure 2 Outline of the dynamic complexity inference framework. The framework takes a code snippet and a single example of inputs to this code snippet. It then processes the code snippet and proceeds with extensive input generation based on the provided example: inputs are increased in size, independently or interdependently, using several expansion methods that can be the identity or random, among others. This forms a queue of synthetic inputs on which to execute the provided code snippet. These executions happen independently in sandboxes, where runtime and memory footprint measurements are taken. Once all the measurements are collected, the framework models the code snippet's time and space dependencies on the different inputs. Using curve fitting, the time and space complexity of the code is computed on each input separately and then on all inputs together. The global time and space complexity over all inputs is what is returned.



The time-space complexity framework is a rule-based algorithm that can process any Python function in order to infer its time and space complexities dynamically. As inputs, it takes a Python function along with its function inputs and their corresponding dataclass, which are processed and modified before being run while runtime and memory footprints are measured. From a high-level perspective, the framework increases the size of the inputs following various strategies, in order to assess the impact of their size on execution metrics (e.g. execution time, memory used). When the function has several arguments, they can be expanded independently or together to determine the overall complexity of the function, taking into account potential interdependencies.

The prepared code, along with the various sets of expanded inputs, is queued up and run in independent sandboxes, using the Bubblewrap library, to avoid any harmful side effects of the code being run. While running, cProfile is used for execution time measurements and tracemalloc for memory footprint. Using non-negative least squares curve fitting on each set of measures, the coefficients and residuals of each complexity class are computed. The gold complexity class for a given set of measures is chosen as the minimizer of the residuals, taking into account a simplicity bias (the simpler the complexity class, the smaller the bias). This curve fitting is applied to each set of measures, each corresponding to a different subset of arguments being expanded with a different expansion method. Using ensemble methods, the global complexity of the Python function is computed by aggregating the individual complexity outputs across the different sets of measures. Finally, the complexity framework also returns the coefficients of the curve of each elected complexity. These coefficients can be leveraged to rank and classify the optimizations of different Python solutions within the same complexity class. More details and setup instructions are shared on GitHub.
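To make the curve-fitting step more concrete, the sketch below shows how measured runtimes at increasing input sizes could be fit against candidate complexity classes with non-negative least squares, electing the class that minimizes the residual after a simplicity bias. This is only an illustrative approximation of the released framework: the candidate class list, the `fit_complexity` helper, and the bias weight are assumptions of this example, not the framework's actual code.

```python
# Illustrative sketch of the curve-fitting step; NOT the released framework.
import numpy as np
from scipy.optimize import nnls

# Candidate complexity classes, ordered from simplest to most complex.
# Each entry maps a class label to its basis function of the input size n.
CANDIDATE_CLASSES = [
    ("O(1)",       lambda n: np.ones_like(n, dtype=float)),
    ("O(log n)",   lambda n: np.log2(n)),
    ("O(n)",       lambda n: n.astype(float)),
    ("O(n log n)", lambda n: n * np.log2(n)),
    ("O(n^2)",     lambda n: n.astype(float) ** 2),
    ("O(n^3)",     lambda n: n.astype(float) ** 3),
    ("O(2^n)",     lambda n: 2.0 ** np.clip(n, 0, 60)),
]

def fit_complexity(sizes, measures, bias_strength=0.02):
    """Fit each candidate class to (input size -> runtime or memory) measures
    with non-negative least squares, then pick the class minimizing the
    residual after a simplicity bias (simpler classes get a smaller penalty)."""
    sizes = np.asarray(sizes)
    y = np.asarray(measures, dtype=float)
    y = y / y.max()  # normalize so residuals are comparable across runs
    best = None
    for rank, (label, basis) in enumerate(CANDIDATE_CLASSES):
        # Design matrix: a constant offset plus the candidate growth term.
        A = np.column_stack([np.ones_like(y), basis(sizes)])
        coeffs, residual = nnls(A, y)
        penalized = residual * (1.0 + bias_strength * rank)
        if best is None or penalized < best[0]:
            best = (penalized, label, coeffs[1])
    _, label, growth_coefficient = best
    return label, growth_coefficient

# Example: synthetic measures that grow roughly like n log n.
ns = np.array([10, 100, 1_000, 10_000, 100_000])
runtimes = ns * np.log2(ns) * 1e-7 + 1e-4
print(fit_complexity(ns, runtimes))  # typically elects 'O(n log n)'
```

In the actual framework, one such fit is performed per subset of expanded arguments and per expansion method, and the per-fit verdicts are then aggregated into the global complexity reported above.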

Complexity Tasks

Complexity Prediction

The first evaluation task of the benchmark, "Complexity Prediction", consists in predicting the time and space complexity given a problem description and a human solution. Our baseline for this task is the naive model that always returns O(n), the most frequent class. Pass@k measures the accuracy of finding the correct complexity; Best@k measures accuracy only on the most optimized complexity class of each problem; All@k requires a correct complexity output for all complexity classes of a problem at once.
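As a rough illustration of how the three metrics differ, the sketch below computes them from per-attempt correctness booleans, assuming the simplest estimator in which a query counts as solved if any of its k attempts is correct (the benchmark may instead use an unbiased Pass@k estimator); the data layout and function names are assumptions of this example.

```python
# Minimal sketch of Pass@k, Best@k and All@k; data layout is an assumption.
from collections import defaultdict

# results[(problem_id, complexity_class)] = list of booleans, one per attempt.
def pass_at_k(results):
    """Average over all (problem, class) queries, treated independently."""
    return sum(any(attempts) for attempts in results.values()) / len(results)

def best_at_k(results, most_optimized):
    """Only score the most optimized complexity class of each problem."""
    scored = {q: a for q, a in results.items() if q[1] == most_optimized[q[0]]}
    return pass_at_k(scored)

def all_at_k(results):
    """A problem counts only if every one of its complexity classes is solved."""
    per_problem = defaultdict(list)
    for (problem, _cls), attempts in results.items():
        per_problem[problem].append(any(attempts))
    return sum(all(v) for v in per_problem.values()) / len(per_problem)

# Example with k = 2 attempts per query:
results = {
    ("p1", "O(n)"):     [False, True],
    ("p1", "O(n^2)"):   [True, False],
    ("p2", "O(log n)"): [False, False],
}
most_optimized = {"p1": "O(n)", "p2": "O(log n)"}
print(pass_at_k(results))                  # 2/3
print(best_at_k(results, most_optimized))  # 1/2
print(all_at_k(results))                   # 1/2
```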

Complexity Generation

The second task, "Complexity Generation", requires the LLM to generate a correct solution to a given problem description while respecting a feasible time or space complexity requirement. Our baseline for this task is a Llama 3.1 70B model queried with the same prompts but without the complexity requirement. Pass@k measures the accuracy of finding a correct solution, according to public, private and generated tests, that also has the correct complexity as measured by the complexity framework; Best@k and All@k are defined as their counterparts in the first task.

Complexity Coefficient Percentile Ranking

The third task, "Complexity Coefficient Percentile Ranking", measures how a generated solution to a given problem, respecting a complexity requirement, ranks among human solutions of the same complexity class and problem. The ranking is based on the coefficient of the complexity curve, as measured by the framework: the lower the coefficient, the flatter the complexity curve and the more optimized the solution. Ranking results are given as a percentile of the distribution, where a solution in the nth percentile is more optimized than n% of human solutions. The querying is similar to the second task, with the additional requirement "Try to optimize the runtime of your code as much as you can, while respecting the time complexity requirement".
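A minimal sketch of this percentile ranking, assuming we already have the best measured curve coefficient of the model solution and the coefficients of the human solutions from the same problem and complexity class (the helper name and data layout are illustrative, not the benchmark's code):

```python
# Hedged sketch of the percentile ranking over complexity-curve coefficients.
def coefficient_percentile(model_coefficient, human_coefficients):
    """Return the percentage of human solutions whose curve coefficient is
    worse (larger) than the model's; 0 if no model solution passed the tests."""
    if model_coefficient is None:  # no correct model solution
        return 0.0
    worse = sum(c > model_coefficient for c in human_coefficients)
    return 100.0 * worse / len(human_coefficients)

# Example: a model coefficient of 0.8 among human coefficients [0.5, 1.0, 2.0, 4.0]
# lands in the 75th percentile (more optimized than 75% of human solutions).
print(coefficient_percentile(0.8, [0.5, 1.0, 2.0, 4.0]))  # 75.0
```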

Results

Table 1 BigO(Bench) benchmark results for popular LLMs. Program Synthesis checks the correctness of model-generated solutions to given programming problems. Complexity Prediction measures whether a model can find the time-space complexity of a code snippet. Complexity Generation evaluates whether a model outputs a working code snippet that solves a given problem and meets a time-space complexity requirement. Pass@k treats all complexity classes of all problems independently, Best@k only evaluates the most optimized complexity class of each problem, and All@k measures whether all complexity classes of a problem are correct at once.



Looking at the table above, all LLMs undergo a noticeable drop in performance on the combined task "Complexity Generation" compared to the individual tasks "Program Synthesis" and "Complexity Prediction". Across all tasks, the top-performing model remains DeepSeek-R1 Llama 70B, with 64.2 and 29.2 Pass@1 on time prediction and generation respectively, except on space prediction, where models tend to overthink and misunderstand the notion of extra space complexity even though it is explicitly described in the test prompts. Models tend to be even more misled when asked to "Optimize the solution while respecting the complexity requirement", which leads to an average 12% loss of performance on time generation All@1 in the Coefficient Ranking setting (Table 2), and up to ~30% for GPT-4o and o1-mini.



Table 2 Using the complexity framework, the best measured coefficient of the complexity curve, out of 20 attempts, is used to rank LLM-generated code among human solutions of the same problem and time-space complexity class. Ranking is percentile-based: an n% ranking score means that n% of human solutions have a worse complexity coefficient. If no LLM solution passes the correctness tests, the ranking score is set to 0. INTERSEC is the subset of problems on which all starred models have at least one successful solution.

BibTeX


@misc{chambon2025bigobenchllmsgenerate,
  title={BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?}, 
  author={Pierre Chambon and Baptiste Roziere and Benoit Sagot and Gabriel Synnaeve},
  year={2025},
  eprint={2503.15242},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.15242}, 
}