# Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks
**Authors**: Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini
Abstract
Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve; such information is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.
I Introduction
Large Language Models (LLMs) are rapidly transforming software development by automating code generation tasks. Trained on massive corpora of open-source code, these systems are now widely adopted in practice; recent surveys indicate that many developers regularly use AI-assisted coding tools [1]. As the adoption of code-generating LLMs accelerates, rigorous benchmarking becomes increasingly critical. Effective benchmarks provide a standardized framework to evaluate model capabilities, track progress over time, and compare systems on common ground.
A variety of benchmarks have been proposed to evaluate LLM code generation [2, 3, 4, 5, 6]. Early efforts, such as Most Basic Python Problems (MBPP) [3] and HumanEval [2], focus on functional correctness for small Python programs, probing models' understanding of algorithms, language, and basic libraries. Extensions like MBPP+ and HumanEval+ [7] increase test coverage to reveal edge cases, while HumanEval Pro and MBPP Pro [8] introduce compositional tasks requiring multiple function calls, testing reasoning across generation steps. Other benchmarks emphasize scale and realism: APPS [9] offers thousands of competitive programming problems with multiple test cases, and LiveCodeBench [4] continuously updates problems from platforms like LeetCode and AtCoder. BigCodeBench [5] evaluates use of external libraries across domains, while repository-level benchmarks such as SWE-bench [6] assess LLM performance on multi-file software development tasks.
Despite the widespread adoption of these benchmarks, there has been little systematic analysis of their tasks and the cases in which LLMs fail to generate correct code. Identifying failure-inducing tasks and understanding the underlying causes of the failures are essential for guiding both benchmark design and model development. Prior work [10] analyzed the errors that LLMs made when generating code for the HumanEval benchmark and reported both syntactic and semantic mistakes. However, its scope was limited to HumanEval and did not consider more diverse or challenging benchmarks. This paper addresses this gap by conducting an empirical study of popular function-level code generation benchmarks, aiming to answer the following research questions:
- RQ1: Which tasks do LLMs consistently fail to solve?
- RQ2: To what extent can these failures be explained by the complexity of the solution code?
- RQ3: Which additional factors, beyond code complexity, influence LLM failures?
To investigate these questions, we conducted a task-level study of four widely used benchmarks (MBPP, HumanEval, BigCodeBench, and LiveCodeBench) and evaluated six representative advanced LLMs selected to reflect diverse and popular code generation scenarios (Claude Sonnet-4, DeepSeek-V3, Qwen3-Coder, GPT-4o, Llama-3.3-70B, and Mistral-3.2-24B). We additionally implemented a tool to measure task complexity from the corresponding solution code, which we used to examine its correlation with model failures. Interestingly, complexity alone did not fully account for these failures, suggesting that additional factors play a significant role. Analyzing the 114 consistently failed tasks across benchmarks, we identified four recurring failure patterns: wrong problem mapping, flawed or incomplete algorithm design, edge case mishandling, and formatting mistakes. This analysis reveals how factors beyond solution complexity contribute to LLM failures. We further observed that some benchmark tasks are extremely ambiguous, so that certain failures do not necessarily reflect actual model weaknesses. Our contributions are as follows:
- (1) Detailed task-level experimental results for four widely used benchmarks evaluated on six major LLMs.
- (2) A method for measuring solution code complexity and analyzing its correlation with LLM failure rates.
- (3) Fine-grained failure inspection of 114 tasks, revealing patterns of weaknesses in LLMs and challenges arising from benchmark design.
The experimental data and analysis scripts are publicly released on GitHub to support reproducibility and future research. https://github.com/breath24/FailureBench.
II Experimental Setup
Benchmarks: We conducted a survey of existing code generation benchmarks and selected those that are both widely adopted in the research community and equipped with executable test cases, enabling reliable evaluation of generated code. Our final selection comprises HumanEval (5,845 citations as of August 2025), MBPP (2,142), LiveCodeBench (524), and BigCodeBench (240), representing a diverse yet widely used set of benchmarks for code generation. In this paper, we focused on benchmarks that specialize in function generation, where a natural language problem is given (with or without contextual code) and the model is required to generate a function that implements the expected functionality.
We used the latest version of each benchmark available to minimize data contamination, thereby increasing the likelihood of observing failures. The APPS benchmark was excluded, as it is widely used for training and would risk contamination. Due to the large number of tasks in MBPP, we relied on a subset of 378 tasks, used in recent extensions such as MBPP+. For LiveCodeBench, we used LCB-V6, which consists of 175 programming problems released between January and April 2025. For BigCodeBench, we focused on the BCB-Hard subset (148 tasks) using instruct prompts, which contains the most difficult tasks to challenge LLMs. In total, we experimented with 865 tasks across four benchmarks.
Models: We chose six representative advanced LLMs from different, widely used model families: Claude Sonnet-4, DeepSeek-V3, Qwen3-Coder, GPT-4o, Llama-3.3-70B, and Mistral-3.2-24B. We excluded Gemini and Grok because of our limited budget and their costly inference fees.
Evaluation Procedure: To identify which tasks lead to failure, we ran the benchmarks on the selected LLMs and collected the generated code. Each task was evaluated using a single generated solution per model (PASS@1): a task was classified as a success if the solution passed all test cases, and as a failure otherwise. For failed tasks, additional details were recorded for further analysis. Table I reports the number of failures for each model across the benchmarks.
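The per-task check amounts to executing the generated function against the benchmark's test suite. The following is a minimal sketch of such a pass@1 check, assuming each task supplies the generated solution and an executable test snippet as strings; the function and parameter names are illustrative, not our actual harness.

```python
import subprocess
import sys

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 30) -> bool:
    """Pass@1 check: the task counts as solved only if the single generated
    solution passes every test case (simplified, sandbox-free harness)."""
    program = solution_code + "\n\n" + test_code  # tests raise AssertionError on failure
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # time-outs are treated as failures
    return completed.returncode == 0  # non-zero exit = failed test or crash
```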
TABLE I: The number of task failures across benchmarks.
| Model | HumanEval | MBPP | LCB | BCB-Hard |
| --- | --- | --- | --- | --- |
| Claude Sonnet-4 | 2 | 11 | 54 | 109 |
| Qwen3-Coder | 7 | 9 | 73 | 102 |
| DeepSeek-V3 | 14 | 9 | 76 | 107 |
| GPT-4o | 25 | 14 | 85 | 107 |
| Mistral-3.2-24B | 28 | 23 | 110 | 114 |
| Llama-3.3-70B | 32 | 32 | 100 | 101 |
Results: To provide a categorization of task difficulty based on actual LLM performance, we further classified tasks according to the number of models that failed to solve them, resulting in seven categories ranging from 0 to 6, where 0 represents tasks solved by all models, and 6 represents tasks for which all models failed to generate correct code. Table II shows the number of failures in each category.
TABLE II: Distribution of model failures across benchmarks. For each benchmark, columns 0–6 indicate the number of models that failed a task: 0 = solved by all models, 1 = failed by 1 model, …, 6 = failed by all models. This provides a difficulty categorization of tasks.
- HumanEval: Out of 164 tasks, 113 were solved correctly by every model. There was only one task for which none of the models managed to generate the correct code. Claude Sonnet-4 performed exceptionally well with only two failures, while Llama-3.3-70B failed the most tasks (32), many of which other models were able to solve.
- MBPP: Of the 378 tasks, 318 were solved by every model, and two were not solved by any model. Qwen3-Coder and DeepSeek-V3 emerged as the strongest models with nine failures each, while Llama-3.3-70B failed the highest number of tasks (32).
- LiveCodeBench: Of the 175 tasks, 35 could not be solved correctly by any model, while 43 were solved correctly by every model. Claude Sonnet-4 demonstrated the strongest performance with only 54 failures, while Mistral-3.2-24B produced the most failures (110).
- BCB-Hard: In this benchmark, only 14 tasks were correctly solved by all models, while 76 tasks were failed by all of them. Every model produced a large number of failures, with failure rates ranging from 68% to 77%.
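The difficulty categorization used above can be derived directly from per-model pass/fail records; the following is a minimal sketch, assuming a hypothetical mapping from model name to per-task outcomes.

```python
from collections import Counter

def difficulty_categories(results: dict[str, dict[str, bool]]) -> Counter:
    """Group tasks by the number of models that failed them (0..6).

    `results` is a hypothetical structure: model name -> {task id: passed?}.
    """
    task_ids = next(iter(results.values())).keys()
    failures_per_task = {
        task: sum(not outcomes[task] for outcomes in results.values())
        for task in task_ids
    }
    # Counter maps "number of failed models" -> "number of tasks in that bin".
    return Counter(failures_per_task.values())
```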
RQ1: Which tasks do LLMs consistently fail at?
Our experiments identified 114 tasks across the four benchmarks that all models consistently failed. BCB-Hard contributed the most such tasks (76), followed by LiveCodeBench (35).
III Code Complexity Analysis
Solution code complexity provides a quantitative lens on task difficulty, independent of natural language prompts. To explore whether this complexity correlates with LLM performance, we analyzed ground-truth solutions across benchmarks. We first describe the complexity metrics we use, then report descriptive statistics, followed by correlation and regression analyses to examine their relationship with model failures.
III-A Complexity Measurement
We designed and implemented an algorithm to quantitatively measure the complexity of solution code. We used a set of different code complexity dimensions, each of which captures a distinct facet of what makes a code snippet more cognitively or technically challenging. The dimensions are defined as follows:
- Cyclomatic Complexity (CC): Measures the intricacy of control flow constructs, including conditionals (if), loops (for, while), and branching. A higher CC value indicates more involved execution paths and decision points.
- Data Structures: Assesses the extent and variety of data structure usage, such as lists, dictionaries, sets, and user-defined containers.
- Function Calls: Captures the frequency of function invocation.
- Code Length: Serves as a surface-level proxy for overall complexity, as longer code may correspond to richer logic.
- Nesting Depth: Reflects the maximum depth of syntactic nesting, such as loops within conditionals or recursive calls embedded in other structures.
- Recursions: Specifically tracks the presence and frequency of recursive functions.
We use static analysis of abstract syntax trees to compute each of these metrics, averaging values across all tasks in a benchmark; a minimal sketch of such AST-based measurement follows Table III. Table III presents the results of these measurements. As the measurements show, recursion is extremely rare across benchmarks and the use of data structures is relatively limited. LiveCodeBench exhibits greater or nearly equal values for all metrics compared to the other benchmarks, indicating that its solutions demand the highest code complexity.
TABLE III: Average code complexity metric values across benchmarks.
| Metric | HumanEval | MBPP | LCB | BCB-Hard |
| --- | --- | --- | --- | --- |
| CC | 3.34 | 2.58 | 9.00 | 4.66 |
| Data Structures | 0.26 | 0.10 | 0.62 | 0.69 |
| Function Calls | 3.10 | 2.07 | 12.97 | 11.89 |
| Length | 6.30 | 6.68 | 29.54 | 15.91 |
| Nesting Depth | 8.37 | 8.70 | 10.48 | 9.42 |
| Recursions | 0.09 | 0.09 | 0.25 | 0.03 |
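The sketch below illustrates how such metrics can be derived from Python abstract syntax trees. The counting rules (e.g., which node types increment cyclomatic complexity, or measuring length as non-blank lines) are simplifications and may differ from the tool used in our study.

```python
import ast

def complexity_metrics(source: str) -> dict[str, int]:
    """Rough AST-based metrics; counting rules are simplified and may
    differ from the paper's tool (recursion detection omitted)."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    container_nodes = (ast.List, ast.Dict, ast.Set, ast.ListComp, ast.DictComp)
    metrics = {
        # Simplified cyclomatic complexity: 1 + number of branching constructs.
        "cyclomatic": 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(tree)),
        "function_calls": sum(isinstance(n, ast.Call) for n in ast.walk(tree)),
        "data_structures": sum(isinstance(n, container_nodes) for n in ast.walk(tree)),
        # Code length here is simply the number of non-blank source lines.
        "length": sum(1 for line in source.splitlines() if line.strip()),
    }

    def depth(node: ast.AST, current: int = 0) -> int:
        # Maximum syntactic nesting depth of the AST.
        children = list(ast.iter_child_nodes(node))
        if not children:
            return current
        return max(depth(child, current + 1) for child in children)

    metrics["nesting_depth"] = depth(tree)
    return metrics
```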
III-B Correlation Analysis
Figure 1 illustrates how average metric values vary across tasks grouped by the number of LLMs that failed to solve them. LiveCodeBench shows a clear positive correlation between code complexity metrics and failure rate, whereas no such trend is evident in the other benchmarks. To quantify these relationships, we applied Spearman correlation and linear regression to measure the correlation between benchmark failure rates and individual metrics. Spearman's results suggested that LiveCodeBench failures were positively correlated with code complexity metrics. In contrast, correlations remain weak and statistically non-significant for HumanEval, MBPP, and BCB-Hard. Applying regression, the explanatory power of all metrics is negligible in HumanEval and MBPP, with consistently low $R^{2}$ values and non-significant p-values, indicating that these metrics do not meaningfully account for failure rates. BCB-Hard exhibits slightly stronger signals: metrics such as length and nesting depth achieve moderate $R^{2}$ values ($\approx 0.14$–$0.15$) with borderline p-values ($\approx 0.07$–$0.09$), suggesting weak but potentially meaningful associations. However, none reach conventional thresholds for statistical significance. By contrast, LiveCodeBench displays the strongest relationships. Several metrics, including cyclomatic complexity, function calls, and length, show relatively high $R^{2}$ values (up to $0.32$) alongside highly significant p-values ($<10^{-6}$), indicating robust linear associations with failure rate.
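As an illustration of this analysis, the snippet below computes the Spearman correlation and a univariate linear regression for one metric, assuming a hypothetical per-task DataFrame with a complexity column and a `failed_models` column (0-6); it is a sketch, not our exact analysis script.

```python
import pandas as pd
from scipy.stats import linregress, spearmanr

def correlate(tasks: pd.DataFrame, metric: str) -> None:
    """Correlate one complexity metric with failures for one benchmark.

    `tasks` is a hypothetical DataFrame with one row per task, a complexity
    column named `metric`, and a `failed_models` column (0-6)."""
    rho, p_spearman = spearmanr(tasks[metric], tasks["failed_models"])
    reg = linregress(tasks[metric], tasks["failed_models"])
    print(f"{metric}: Spearman rho={rho:.2f} (p={p_spearman:.3g}), "
          f"R^2={reg.rvalue ** 2:.2f} (p={reg.pvalue:.3g})")
```

Repeating this per benchmark and per metric yields the kind of $R^{2}$ and p-values summarized above.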
Panels: (a) Cyclomatic Complexity, (b) Length, (c) Nesting Depth, (d) Function Calls. Each panel plots the average metric value (y-axis) against tasks grouped by the number of LLMs that failed them, 0–6 (x-axis), with one line per benchmark.
Figure 1: Comparison of benchmark code metrics for tasks, grouped by the number of LLMs that failed them. LiveCodeBench shows a positive correlation between code complexity and failure rate, while other benchmarks do not.
RQ2: To what extent are these failures explained by solution complexity?
Failures show only a weak association with solution code complexity across HumanEval, MBPP, and BCB-Hard. These results suggest that code complexity alone cannot systematically explain LLM failures across benchmarks, and that semantic task properties and benchmark-specific factors likely play a more significant role.
IV Failure Inspection
We conducted a detailed task-by-task analysis of consistently failed tasks, examining the code generated by each model together with the corresponding failed test cases. This close inspection allowed us to uncover recurring patterns and better understand systematic challenges faced by LLMs across benchmarks.
LLM Failure Patterns: Across the consistently failed tasks, we identified four main failure modes:
- Wrong problem mapping occurs when models interpret a task as belonging to the wrong problem class. For example, in HumanEval/132, the task is to determine whether a string of square brackets contains a valid subsequence with at least one nested pair. All models incorrectly mapped this to the standard "balanced brackets" class of problems and applied a stack-based early-return strategy: pushing opening brackets, popping on closing brackets, and returning True as soon as a nested configuration was locally detected. This approach failed because it identified only the first instance of nesting rather than ensuring that sufficient nesting occurred, as required by the canonical solution. This illustrates a common bias in LLMs toward familiar problem types, which can lead them to overlook details specified in the prompt (see the sketch after this list).
- Flawed or incomplete algorithm design arises when LLMs take the correct approach but include a flawed or incomplete set of steps. For example, task BCB-Hard/945 requires generating a time series of sales data starting from a specified date, using regression to forecast future sales. LLMs correctly implemented data processing and regression, but did not incorporate mechanisms to handle non-monotonic trends.
- Edge case mishandling refers to failures where the code generated by an LLM cannot correctly handle uncommon or boundary scenarios. For example, BCB-Hard/964 requires converting files with multiple extensions from a source directory into CSV files in a target directory, but all models failed the nested subdirectory test case because their code only iterated over top-level files rather than recursively traversing subfolders.
- Formatting mistakes arise when the underlying algorithm is correct, but solutions are rejected because of strict input/output requirements. For instance, LiveCodeBench/3736 required results to be returned as string literals (e.g., "23"), whereas the models produced unquoted digits.
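To make the wrong-problem-mapping pattern concrete, the sketch below contrasts the failing "balanced brackets" intuition with a counting strategy that matches the behavior of the HumanEval/132 reference solution; it is illustrative and not verbatim model output.

```python
def is_nested(brackets: str) -> bool:
    """HumanEval/132-style check: True iff some valid subsequence contains
    at least one nested pair, i.e. at least two matched '[...]' pairs with
    one inside the other. A naive "balanced brackets" scan that returns
    True on the first locally nested pair misreads the task; counting
    matched pairs, as the reference solution does, captures the intent."""
    opening = [i for i, ch in enumerate(brackets) if ch == "["]
    closing = [i for i, ch in enumerate(brackets) if ch == "]"]
    closing.reverse()
    matched, j = 0, 0
    for idx in opening:
        if j < len(closing) and idx < closing[j]:
            matched += 1
            j += 1
    return matched >= 2  # nesting requires at least two matched pairs

# Examples from the task description:
# is_nested("[[]]") -> True, is_nested("[][]") -> False, is_nested("[]") -> False
```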
Table IV reports the number of patterns observed in each benchmark. In some tasks, more than one pattern appeared to contribute to the failures, as we analyzed the outputs from all LLMs, each producing different solution code.
TABLE IV: Number of failure patterns across benchmarks.
| Failure Pattern | HumanEval | MBPP | LCB | BCB-Hard |
| --- | --- | --- | --- | --- |
| Wrong Problem Mapping | 1 | 0 | 20 | 24 |
| Flawed/Incomplete Algorithm | 0 | 1 | 31 | 35 |
| Formatting Mistakes | 0 | 0 | 10 | 32 |
| Edge Case Mishandling | 0 | 1 | 1 | 27 |
Ambiguous Prompt & Restricted Test: Many failures in BCB-Hard arise from task ambiguity, where prompts are underspecified and tests are overspecified, forcing models to make reasonable assumptions that nonetheless lead to failure. For instance, models sometimes assumed specific column names in the output CSV, which the test cases did not expect. A task is passed only when a model's assumptions happen to match the hidden expectations or when the generated code is flexible enough to satisfy alternative assumptions imposed by the tests.
Additionally, we investigated why Llama-3.3-70B outperformed other models on BCB-Hard, particularly Claude Sonnet-4, which demonstrated stronger performance across the remaining benchmarks. We did this by examining tasks that Llama-3.3-70B solved correctly while others failed. A closer examination of these tasks revealed that Llama's success often stemmed from a simple or literal interpretation of the prompts, whereas stronger models tended to rely on conventional coding practices or added assumptions that were reasonable in general but not aligned with the benchmark's strict test cases. For instance, in BCB-Hard/147, Claude Sonnet-4 skipped network and broadcast IP addresses following standard Python networking conventions, while Llama-3.3-70B iterated over all addresses literally as requested in the prompt. This may illustrate a broader phenomenon: more capable models can sometimes over-optimize for general correctness or practicality, inadvertently violating strict benchmark test cases, whereas simpler models may succeed by adhering more literally to the task instructions.
RQ3: Which additional task characteristics, beyond solution complexity, influence LLM failures?
We identified four recurring failure patterns (wrong problem mapping, flawed or incomplete algorithm design, edge case mishandling, and formatting mistakes) that highlight current LLM weaknesses. Underspecified prompts and overspecified tests in benchmark tasks are additional factors contributing to further failures.
V Limitations and Future Directions
In this section, we briefly discuss several limitations of our study and outline future directions:
1. Our experiments are currently limited to four benchmarks. We plan to extend them with further function-level as well as repository-level benchmarks, such as SWE-bench [6], which requires solving coding issues across multiple files.
2. We will broaden our study to cover a wider range of models, from small to large scale, to examine what types of failures arise with different architectures and sizes.
3. Our complexity measurement focuses on solution code, but it could be extended to include prompt complexity.
4. Recursion and the use of data structures are underrepresented across existing benchmarks. Future work could involve creating specialized benchmarks to address this gap.
5. Building on failure pattern insights (e.g., edge case handling and problem mapping), we plan to design targeted training strategies to improve LLM performance.
6. We also plan to design benchmarks based on common failure patterns to more effectively discriminate between model capabilities.
Acknowledgment
This work was funded by the Hessian Ministry of Higher Education, Research, Science and the Arts within the cluster project The Third Wave of Artificial Intelligence (3AI), by the National Research Center for Applied Cybersecurity ATHENE within the project Foundational Models for Secure Software Development, and by the LOEWE initiative (Hesse, Germany) [LOEWE/4a//519/05/00.002(0013)/95].
References
- [1] J. Ji, J. Jun, M. Wu, and R. Gelles, "Cybersecurity risks of AI-generated code," Center for Security and Emerging Technology, Tech. Rep., November 2024. [Online]. Available: https://cset.georgetown.edu/wp-content/uploads/CSET-Cybersecurity-Risks-of-AI-Generated-Code.pdf
- [2] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," 2021. [Online]. Available: https://arxiv.org/abs/2107.03374
- [3] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, "Program synthesis with large language models," 2021. [Online]. Available: https://arxiv.org/abs/2108.07732
- [4] S. Jain, S. B. Kong, E. Zelikman, A. Chen, T. Chen, A. Svyatkovskiy, and C. Sutton, "LiveCodeBench: Holistic and contamination-free evaluation of LLMs for code," 2024. [Online]. Available: https://arxiv.org/abs/2404.00699
- [5] T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul, S. Brunner, C. Gong, T. Hoang, A. R. Zebaze, X. Hong, W.-D. Li, J. Kaddour, M. Xu, Z. Zhang, P. Yadav, N. Jain, A. Gu, Z. Cheng, J. Liu, Q. Liu, Z. Wang, B. Hui, N. Muennighoff, D. Lo, D. Fried, X. Du, H. de Vries, and L. V. Werra, "BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions," 2025. [Online]. Available: https://arxiv.org/abs/2406.15877
- [6] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" 2024. [Online]. Available: https://arxiv.org/abs/2310.06770
- [7] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, "Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation," in Proceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS '23. Red Hook, NY, USA: Curran Associates Inc., 2023.
- [8] Z. Yu, Y. Zhao, A. Cohan, and X.-P. Zhang, "HumanEval Pro and MBPP Pro: Evaluating large language models on self-invoking code generation," 2024. [Online]. Available: https://arxiv.org/abs/2412.21199
- [9] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, "Measuring coding challenge competence with APPS," 2021. [Online]. Available: https://arxiv.org/abs/2105.09938
- [10] Z. Wang, Z. Zhou, D. Song, Y. Huang, S. Chen, L. Ma, and T. Zhang, "Towards understanding the characteristics of code generation errors made by large language models," in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). Los Alamitos, CA, USA: IEEE Computer Society, May 2025, pp. 2587–2599. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICSE55347.2025.00180