## Charts: Performance Metrics of Reinforcement Learning Algorithms
### Overview
This image contains a grid of ten charts, arranged in two rows and five columns. The top row plots "normalized reward" against "round" for five algorithm match-ups; the bottom row plots "percentage of cooperation" against "round" for the same match-ups. Each chart compares two or three reinforcement learning algorithms, and the shaded band around each line represents uncertainty.
### Components/Axes
**General Chart Elements (across all charts):**
* **X-axis:** Labeled "round". The scale ranges from 0 to 50, with tick marks at 0, 10, 20, 30, 40, and 50.
* **Y-axis (Top Row):** Labeled "normalized reward". The scale ranges from 0.0 to 1.0, with tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-axis (Bottom Row):** Labeled "percentage of cooperation". The scale ranges from 0 to 100, with tick marks at 0, 20, 40, 60, 80, and 100.
**Specific Chart Titles and Legends:**
**Top Row (Normalized Reward):**
1. **Chart Title:** "reward feedback: QL vs. CTS"
* **Legend:**
* QL (Purple line with purple shaded uncertainty)
* CTS (Blue line with blue shaded uncertainty)
2. **Chart Title:** "reward feedback: UCB vs. DQL"
* **Legend:**
* UCB (Purple line with purple shaded uncertainty)
* DQL (Orange line with orange shaded uncertainty)
3. **Chart Title:** "reward feedback: DQL vs. Tht4Tat"
* **Legend:**
* DQL (Green line with green shaded uncertainty)
* Tht4Tat (Pink line with pink shaded uncertainty)
4. **Chart Title:** "reward feedback: SARSA vs. LinUCB"
* **Legend:**
* SARSA (Pink line with pink shaded uncertainty)
* LinUCB (Blue line with blue shaded uncertainty)
5. **Chart Title:** "reward feedback: UCB vs. LinUCB vs. QL"
* **Legend:**
* UCB (Black line with black shaded uncertainty)
* LinUCB (Purple line with purple shaded uncertainty)
* QL (Blue line with blue shaded uncertainty)
**Bottom Row (Percentage of Cooperation):**
1. **Chart Title:** "cooperation ratio: QL vs. CTS"
* **Legend:**
* QL (Purple line with purple shaded uncertainty)
* CTS (Blue line with blue shaded uncertainty)
2. **Chart Title:** "cooperation ratio: UCB vs. DQL"
* **Legend:**
* UCB (Purple line with purple shaded uncertainty)
* DQL (Orange line with orange shaded uncertainty)
3. **Chart Title:** "cooperation ratio: DQL vs. Tht4Tat"
* **Legend:**
* DQL (Green line with green shaded uncertainty)
* Tht4Tat (Pink line with pink shaded uncertainty)
4. **Chart Title:** "cooperation ratio: SARSA vs. LinUCB"
* **Legend:**
* SARSA (Pink line with pink shaded uncertainty)
* LinUCB (Blue line with blue shaded uncertainty)
5. **Chart Title:** "cooperation ratio: UCB vs. LinUCB vs. QL"
* **Legend:**
* UCB (Black line with black shaded uncertainty)
* LinUCB (Purple line with purple shaded uncertainty)
* QL (Blue line with blue shaded uncertainty)
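For reference, the layout described above can be reproduced with a short matplotlib sketch. The data below are synthetic (the actual values are not recoverable from the image), each panel is simplified to a single series with its uncertainty band, and all variable names are illustrative:

```python
# Reproduce the 2-row, 5-column grid with shaded uncertainty bands.
# Synthetic data only; requires numpy and matplotlib.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
titles = ["QL vs. CTS", "UCB vs. DQL", "DQL vs. Tht4Tat",
          "SARSA vs. LinUCB", "UCB vs. LinUCB vs. QL"]
rounds = np.arange(0, 51)

fig, axes = plt.subplots(2, 5, figsize=(18, 6), sharex=True)
for col, title in enumerate(titles):
    for row, (prefix, ylab, scale) in enumerate(
            [("reward feedback", "normalized reward", 1.0),
             ("cooperation ratio", "percentage of cooperation", 100.0)]):
        ax = axes[row, col]
        # Fake a wandering mean curve and a constant-width band.
        mean = 0.6 + rng.standard_normal(rounds.size).cumsum() / 100
        std = 0.05 * np.ones_like(mean)
        ax.plot(rounds, scale * mean, label=title.split(" vs. ")[0])
        ax.fill_between(rounds, scale * (mean - std),
                        scale * (mean + std), alpha=0.3)
        ax.set_title(f"{prefix}: {title}")
        ax.set_xlabel("round")
        ax.set_ylabel(ylab)
        ax.legend()
fig.tight_layout()
```

In the real figure each panel carries two or three such series (one per algorithm), which amounts to repeating the `plot`/`fill_between` pair per algorithm within each panel.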
### Detailed Analysis
**Top Row (Normalized Reward):**
1. **QL vs. CTS:**
* **QL (Purple):** Starts around 0.7, fluctuates between 0.6 and 0.8, ending around 0.7.
* **CTS (Blue):** Starts around 0.6, increases to approximately 0.85 by round 10, then fluctuates between 0.75 and 0.85, ending around 0.8.
* **Trend:** CTS rises initially and then levels off at a higher reward than QL, which stays roughly flat with fluctuations.
2. **UCB vs. DQL:**
* **UCB (Purple):** Starts around 0.8, drops to approximately 0.5 by round 5, then fluctuates between 0.45 and 0.6, ending around 0.55.
* **DQL (Orange):** Starts around 0.8, drops to approximately 0.5 by round 5, then fluctuates between 0.45 and 0.6, ending around 0.55.
* **Trend:** Both UCB and DQL show a similar initial drop and then maintain a similar, fluctuating reward level.
3. **DQL vs. Tht4Tat:**
* **DQL (Green):** Starts around 0.6, fluctuates between 0.55 and 0.7, ending around 0.65.
* **Tht4Tat (Pink):** Starts around 0.6, drops to approximately 0.4 by round 5, then fluctuates between 0.4 and 0.5, ending around 0.45.
* **Trend:** DQL maintains a higher and more stable normalized reward compared to Tht4Tat, which experiences a significant drop and remains at a lower level.
4. **SARSA vs. LinUCB:**
* **SARSA (Pink):** Starts around 0.6, increases steadily to approximately 0.75 by round 20, and then fluctuates between 0.7 and 0.8, ending around 0.75.
* **LinUCB (Blue):** Starts around 0.6, increases steadily to approximately 0.7 by round 20, and then fluctuates between 0.65 and 0.75, ending around 0.7.
* **Trend:** Both SARSA and LinUCB show an upward trend in normalized reward, with SARSA generally achieving a slightly higher reward.
5. **UCB vs. LinUCB vs. QL:**
* **UCB (Black):** Starts around 0.6, increases to approximately 0.8 by round 10, then fluctuates between 0.75 and 0.85, ending around 0.8.
* **LinUCB (Purple):** Starts around 0.6, increases to approximately 0.7 by round 10, then fluctuates between 0.6 and 0.7, ending around 0.65.
* **QL (Blue):** Starts around 0.6, drops to approximately 0.4 by round 5, then fluctuates between 0.35 and 0.5, ending around 0.4.
* **Trend:** UCB shows the highest normalized reward, followed by LinUCB, and then QL which has the lowest reward and shows a significant initial drop.
**Bottom Row (Percentage of Cooperation):**
1. **QL vs. CTS:**
* **QL (Purple):** Starts around 60%, drops sharply to approximately 30% by round 10, and then slowly decreases to around 25% by round 50.
* **CTS (Blue):** Starts around 60%, drops to approximately 55% by round 5, and then remains relatively stable around 55-60% until round 50.
* **Trend:** QL shows a significant decrease in cooperation, while CTS maintains a comparatively high level, holding around 55-60%.
2. **UCB vs. DQL:**
* **UCB (Purple):** Starts around 60%, drops sharply to approximately 20% by round 10, and then slowly decreases to around 15% by round 50.
* **DQL (Orange):** Starts around 60%, drops to approximately 20% by round 10, and then fluctuates between 15% and 25%, ending around 20%.
* **Trend:** Both UCB and DQL show a significant initial drop in cooperation, stabilizing at a lower percentage.
3. **DQL vs. Tht4Tat:**
* **DQL (Green):** Starts around 60%, drops to approximately 40% by round 10, and then slowly decreases to around 35% by round 50.
* **Tht4Tat (Pink):** Starts around 60%, drops sharply to approximately 20% by round 10, and then slowly decreases to around 15% by round 50.
* **Trend:** Tht4Tat shows a much steeper and deeper decline in cooperation compared to DQL.
4. **SARSA vs. LinUCB:**
* **SARSA (Pink):** Starts around 60%, drops to approximately 20% by round 10, and then slowly decreases to around 15% by round 50.
* **LinUCB (Blue):** Starts around 60%, drops to approximately 20% by round 10, and then fluctuates between 15% and 25%, ending around 20%.
* **Trend:** Both SARSA and LinUCB show a similar pattern of a sharp initial drop in cooperation, stabilizing at a lower percentage.
5. **UCB vs. LinUCB vs. QL:**
* **UCB (Black):** Starts around 60%, drops sharply to approximately 20% by round 10, and then slowly decreases to around 15% by round 50.
* **LinUCB (Purple):** Starts around 60%, drops to approximately 20% by round 10, and then fluctuates between 15% and 25%, ending around 20%.
* **QL (Blue):** Starts around 60%, drops sharply to approximately 10% by round 10, and then continues to decrease to around 5% by round 50.
* **Trend:** QL exhibits the lowest and most rapidly declining cooperation ratio, while UCB and LinUCB show similar, higher cooperation levels after an initial drop.
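The "Tht4Tat" baseline in the third comparison is presumably a Tit-for-Tat-style strategy (an assumption based on the name, not stated in the figure). A minimal sketch of that rule:

```python
# Tit-for-Tat-style baseline (assumed reading of "Tht4Tat"):
# cooperate on the first round, then mirror the opponent's last move.
def tit_for_tat(opponent_history):
    """opponent_history: list of the opponent's past moves, 'C' or 'D'."""
    if not opponent_history:
        return "C"               # open with cooperation
    return opponent_history[-1]  # echo the opponent's previous move

print(tit_for_tat([]))          # 'C'
print(tit_for_tat(["C", "D"]))  # 'D'
```

Such a reactive baseline would explain the bottom-row pattern: once its learning opponent starts defecting, a mirroring strategy's cooperation ratio falls with it.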
### Key Observations
* **Algorithm Performance Variation:** Different algorithms exhibit distinct performance characteristics in terms of normalized reward and cooperation ratio.
* **Trade-off between Reward and Cooperation:** In some comparisons (e.g., QL vs. CTS, DQL vs. Tht4Tat, UCB vs. LinUCB vs. QL), algorithms that achieve higher normalized rewards tend to have lower cooperation ratios, suggesting a potential trade-off.
* **Convergence:** Most algorithms appear to settle into a stable state for both reward and cooperation within the observed 50 rounds, although the levels at which they stabilize vary considerably.
* **Initial Dynamics:** Many algorithms show a rapid change in both reward and cooperation within the first 10-20 rounds, indicating an initial learning or adaptation phase.
* **Specific Algorithm Behaviors:**
* QL consistently shows lower normalized rewards and the lowest cooperation ratios across multiple comparisons.
* CTS and DQL maintain noticeably higher cooperation ratios than their respective opponents (QL and Tht4Tat), whereas SARSA and LinUCB both settle at similarly low cooperation levels.
* UCB and LinUCB show varied performance depending on the comparison, but generally achieve moderate to high rewards and moderate cooperation.
### Interpretation
The charts collectively illustrate the performance of various reinforcement learning algorithms in a simulated environment, likely involving interactions where cooperation is a factor. The "normalized reward" metric suggests the effectiveness of the algorithms in achieving their objectives, while the "percentage of cooperation" indicates their tendency to engage in cooperative behavior.
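The image does not state how these two metrics are computed. A plausible reading is min-max normalization of raw payoffs and a simple cooperate/defect ratio; the sketch below assumes exactly that, with illustrative names and payoff bounds:

```python
# Hedged sketch of the two plotted metrics, assuming min-max reward
# normalization and binary cooperate ('C') / defect ('D') actions.
# Payoff bounds below are illustrative, not taken from the figure.

def normalized_reward(raw_reward, r_min, r_max):
    """Scale a raw payoff into [0, 1] using known payoff bounds."""
    return (raw_reward - r_min) / (r_max - r_min)

def cooperation_percentage(actions):
    """Percentage of action samples in a round that chose to cooperate."""
    return 100.0 * sum(a == "C" for a in actions) / len(actions)

# Example round with payoffs bounded in [0, 5] (prisoner's-dilemma style):
print(normalized_reward(3.0, 0.0, 5.0))              # 0.6
print(cooperation_percentage(["C", "C", "D", "C"]))  # 75.0
```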
The data suggest that there is no single "best" algorithm across all scenarios. For instance, in the "reward feedback: UCB vs. LinUCB vs. QL" chart, UCB achieves the highest normalized reward, while in the corresponding "cooperation ratio" chart, QL exhibits the lowest cooperation. This highlights a potential trade-off: algorithms optimized solely for reward might not necessarily be cooperative, and vice versa.
The initial sharp drops in cooperation for many algorithms (e.g., QL, UCB, DQL, Tht4Tat, SARSA) suggest that these agents might initially explore non-cooperative strategies or require a period of learning to establish cooperative patterns. The stabilization of these metrics after the initial phase indicates that the algorithms reach a steady state of behavior within the observed timeframe.
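One plausible mechanism for those early dips is epsilon-greedy exploration: the learner sometimes tries defection while its value estimates are still unsettled. A stateless, bandit-style sketch (all names and hyperparameters here are assumptions, not details taken from the figure):

```python
# Epsilon-greedy action selection and a stateless Q-value update for a
# two-action (cooperate/defect) repeated game. Illustrative only.
import random

def choose(q, eps, rng):
    """With probability eps explore a random action; otherwise exploit."""
    if rng.random() < eps:
        return rng.choice(["C", "D"])  # exploration step
    return max(q, key=q.get)           # greedy (exploitation) step

def update(q, action, reward, alpha=0.1):
    """Move the action's value estimate toward the observed reward."""
    q[action] += alpha * (reward - q[action])

rng = random.Random(0)
q = {"C": 0.0, "D": 0.0}
update(q, "D", 1.0)
print(q["D"])  # 0.1
```

Early on, `q` is uninformative, so exploration dominates and cooperation can collapse; as estimates converge, behavior stabilizes, matching the plateau seen after roughly round 10-20.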
Some algorithms are more robust in maintaining cooperation while still achieving reasonable rewards: CTS in the first comparison and DQL against Tht4Tat both sustain noticeably more cooperation than their opponents. SARSA and LinUCB, by contrast, achieve steadily rising rewards even as their cooperation ratios fall to low levels. Algorithms like QL appear either to prioritize individual gain or to struggle to maintain cooperative behavior, ending with the lowest cooperation ratios.
Overall, these charts provide a comparative analysis of different reinforcement learning strategies, demonstrating their effectiveness in achieving rewards and their propensity for cooperation, and revealing potential trade-offs between these two objectives. The shaded areas indicate variability, suggesting that the performance of these algorithms can be sensitive to random factors or specific environmental conditions.
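The shaded bands themselves are most likely aggregates over repeated runs. A minimal sketch assuming a mean ± one-standard-deviation envelope (the figure could equally use standard error or a bootstrap interval):

```python
# Per-round uncertainty band across independent runs, assuming a
# mean +/- one sample standard deviation envelope.
from statistics import mean, stdev

def band(per_run_values):
    """per_run_values: one metric value per run, for a single round.
    Returns (lower, center, upper) bounds for the shaded region."""
    m = mean(per_run_values)
    s = stdev(per_run_values)
    return m - s, m, m + s

lo, center, hi = band([0.55, 0.60, 0.65])
print(round(lo, 2), round(center, 2), round(hi, 2))  # 0.55 0.6 0.65
```

Wider bands in the figure would then indicate rounds where runs diverge more, i.e., where the algorithm's behavior is more sensitive to random seeds or environmental conditions.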