\n
## Performance Comparison of Verification Methods on SAT and UNSAT Properties
### Overview
The image displays two side-by-side line charts comparing the performance of seven different computational methods (BaBSB, BaB, reluBaB, reluplex, MIPplanet, planet, BlackBox) on two distinct tasks: verifying SAT properties and verifying UNSAT properties. The charts plot the percentage of properties verified against the computation time in seconds.
### Components/Axes
* **Chart Type:** Two line charts, labeled (a) and (b).
* **X-Axis (Both Charts):** "Computation time (in s)". The scale runs from 0 to 6000 seconds, with major tick marks at 0, 2000, 4000, and 6000.
* **Y-Axis (Both Charts):** "% of properties verified". The scale runs from 0 to 100, with major tick marks at 0, 20, 40, 60, 80, and 100. A dashed horizontal line is present at the 100% mark.
* **Legend (Both Charts):** Located in the top-right quadrant of each chart. It lists seven methods with corresponding colored lines:
* **BaBSB:** Blue line
* **BaB:** Orange line
* **reluBaB:** Green line
* **reluplex:** Red line
* **MIPplanet:** Purple line
* **planet:** Brown line
* **BlackBox:** Pink line
* **Chart Titles (Sub-captions):**
* (a) *On SAT properties*
* (b) *On UNSAT properties*
### Detailed Analysis
#### Chart (a): On SAT properties
* **BaBSB (Blue):** Rises extremely steeply from the origin, reaching approximately 95% verification within the first few hundred seconds. It then plateaus, maintaining ~95% until the end of the time window (6000s).
* **BaB (Orange):** Follows a nearly identical path to BaBSB, also plateauing at approximately 95%.
* **reluBaB (Green):** Also rises very steeply, reaching a plateau slightly below BaBSB and BaB, at approximately 92-93%.
* **reluplex (Red):** Shows a steady, step-like increase. It starts near 0%, reaches ~40% by 1000s, ~60% by 2000s, ~80% by 4000s, and ends at approximately 85% at 6000s.
* **MIPplanet (Purple):** Rises quickly to about 55% within the first 500 seconds and then forms a flat plateau for the remainder of the time.
* **planet (Brown):** Follows a path very similar to MIPplanet, plateauing at a nearly identical level of approximately 55%.
* **BlackBox (Pink):** Remains near 0% for the entire duration, showing a negligible increase to perhaps 1-2% after 2000s.
#### Chart (b): On UNSAT properties
* **BaBSB (Blue):** Rises very steeply, reaching 100% verification by approximately 1500 seconds and maintaining it.
* **BaB (Orange):** Rises steeply but slightly slower than BaBSB. It reaches ~80% by 2000s and continues a gradual climb to approximately 85% by 6000s.
* **reluBaB (Green):** Follows a trend very similar to BaB, ending at approximately 83-84%.
* **reluplex (Red):** Increases steadily, reaching ~50% by 1000s, ~70% by 3000s, and plateauing at approximately 78% from 4000s onward.
* **MIPplanet (Purple):** Rises quickly to about 55% and plateaus, similar to its behavior on SAT properties.
* **planet (Brown):** Again closely mirrors MIPplanet, plateauing at approximately 55%.
* **BlackBox (Pink):** Shows a slow, steady, linear increase from 0%, reaching approximately 25% by 6000s.
### Key Observations
1. **Method Hierarchy:** BaBSB is the top-performing method on both tasks, being the fastest to reach high verification percentages. BaB and reluBaB form a second tier of strong performers.
2. **SAT vs. UNSAT:** Most methods achieve a higher final percentage of properties verified on the SAT task (chart a) compared to the UNSAT task (chart b). The exception is BaBSB, which reaches 100% on UNSAT but plateaus at ~95% on SAT.
3. **Plateauing Behavior:** MIPplanet and planet exhibit a distinct pattern: a rapid initial rise followed by a long, flat plateau at ~55% for both SAT and UNSAT properties, suggesting a fundamental limitation in their capability.
4. **reluplex Performance:** The reluplex method shows a more gradual, continuous improvement over time compared to the steep-then-flat curves of others. Its final performance is notably better on SAT (~85%) than on UNSAT (~78%).
5. **BlackBox Ineffectiveness:** The BlackBox method performs very poorly on SAT properties (near 0%) and only shows modest, slow progress on UNSAT properties (~25% at 6000s).
### Interpretation
The data demonstrates a clear performance ranking among the tested verification methods for neural network properties. **BaBSB** is the most efficient and effective solver, consistently verifying the most properties in the least time. The near-identical performance of **BaB** and **reluBaB** suggests they may share a similar underlying algorithmic approach.
The stark difference in the curves for **MIPplanet/planet** versus the others indicates a categorical difference in methodology. Their quick plateau implies they can solve an initial subset of "easier" properties but lack the mechanisms to handle more complex ones, regardless of additional computation time.
The contrast between SAT and UNSAT results is significant. The fact that most methods perform better on SAT properties suggests that, for this benchmark set, proving the existence of a satisfying input (SAT) is generally easier than proving that no such input exists (UNSAT). **BaBSB's** ability to reach 100% on UNSAT properties highlights its particular strength in this more challenging proof task.
The **BlackBox** method's poor performance serves as a baseline, indicating that the problem requires the sophisticated techniques employed by the other, more specialized methods. The charts effectively argue for the superiority of the **BaBSB** approach and illustrate the varying capabilities and limitations of different formal verification techniques for neural networks.