## Diagram: Blocksworld Task State-Action Space
### Overview
This diagram illustrates a state transition system for a Blocksworld task with three states (S₀, S₁, S₂) connected by two actions (a₀, a₁). For S₀ and S₁ it compares expert and amateur logits over the available actions and shows credit-directed (CD) logits, with a checkmark on the selected action. The task goal is to have the red block on top of the yellow block.
### Components/Axes
- **Left Axis (States)**: S₀ (top), S₁ (middle), S₂ (bottom)
- **Right Axis (Actions)**: a₀ (top), a₁ (bottom)
- **Legend**:
  - Red: unstack red
  - Blue: pick-up blue
  - Yellow: pick-up yellow
  - Green: stack on green
  - Orange: stack on yellow
- **Diagram Elements**:
  - Circular nodes representing state-action combinations
  - Arrows showing state transitions
  - Colored blocks representing object positions
### Detailed Analysis
**State S₀**:
- Expert Logits (S_E₁):
  - unstack red: 8 (red bar)
  - pick-up blue: 1 (blue bar)
  - pick-up yellow: 1 (yellow bar)
- Amateur Logits (S_A₁):
  - unstack red: 6 (red bar)
  - pick-up blue: 2 (blue bar)
  - pick-up yellow: 2 (yellow bar)
- CD Logits (S_CD₁):
  - ✅ unstack red (highlighted yellow)

**State S₁**:
- Expert Logits (S_E₂):
  - stack on yellow: 7 (yellow bar)
  - stack on blue: 1 (blue bar)
  - stack on green: 1 (green bar)
  - put-down red: 1 (red bar)
- Amateur Logits (S_A₂):
  - stack on yellow: 3 (yellow bar)
  - stack on blue: 2 (blue bar)
  - stack on green: 2 (green bar)
  - put-down red: 3 (red bar)
- CD Logits (S_CD₂):
  - ✅ stack on yellow (highlighted yellow)

**Action Transitions**:
- a₀ (unstack red) transitions S₀ → S₁
- a₁ (stack on yellow) transitions S₁ → S₂
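
The two transitions listed above can be written as a small lookup table. The following is a minimal sketch, not something taken from the diagram itself: the names `TRANSITIONS` and `rollout` are hypothetical, and only the two transitions the diagram shows are encoded.

```python
# Transition table for the two moves shown in the diagram.
# Keys are (state, action) pairs; values are the resulting state.
TRANSITIONS = {
    ("S0", "unstack red"): "S1",      # a0
    ("S1", "stack on yellow"): "S2",  # a1
}


def rollout(start, actions):
    """Follow a sequence of actions and return the list of visited states."""
    state = start
    path = [state]
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            # The diagram does not show where other actions lead.
            raise KeyError(f"no transition shown for {action!r} from {state}")
        state = TRANSITIONS[key]
        path.append(state)
    return path


print(rollout("S0", ["unstack red", "stack on yellow"]))
# prints ['S0', 'S1', 'S2']
```

Any action other than the two checkmarked ones raises an error here, because the diagram gives no successor state for it.
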
### Key Observations
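1. In both states, the expert assigns a higher logit to the checkmarked action than the amateur does (8 vs. 6 in S₀, 7 vs. 3 in S₁).
2. In both states shown, the action checkmarked under CD logits is also the expert's highest-scoring action.
3. In S₀, both policies rank unstack red first (expert 8, amateur 6).
4. In S₁, the expert strongly favors stack on yellow (7), whereas the amateur ties it with put-down red (3 each).
5. Amateur logits are flatter across actions: the gap between the top action and the runner-up is 4 in S₀ and 0 in S₁, compared with 7 and 6 for the expert.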
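
The difference in spread between expert and amateur logits can be checked numerically. The sketch below assumes the bar values are logits that would be passed through a softmax, which the diagram does not state explicitly; the names `LOGITS`, `softmax`, and `entropy` are illustrative.

```python
import math

# Bar values as read from the diagram, grouped by state and source.
LOGITS = {
    "S0": {
        "expert": {"unstack red": 8, "pick-up blue": 1, "pick-up yellow": 1},
        "amateur": {"unstack red": 6, "pick-up blue": 2, "pick-up yellow": 2},
    },
    "S1": {
        "expert": {"stack on yellow": 7, "stack on blue": 1,
                   "stack on green": 1, "put-down red": 1},
        "amateur": {"stack on yellow": 3, "stack on blue": 2,
                    "stack on green": 2, "put-down red": 3},
    },
}


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {a: math.exp(v - peak) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


for state, sources in LOGITS.items():
    for name, logits in sources.items():
        probs = softmax(logits)
        top = max(probs, key=probs.get)
        print(f"{state} {name}: top={top!r} "
              f"p={probs[top]:.3f} H={entropy(probs):.3f} nats")
```

Under this assumption the amateur distribution has higher entropy than the expert distribution in both states, and in S₁ the amateur splits its top probability evenly between stack on yellow and put-down red.
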
### Interpretation
This diagram demonstrates how different logit sources guide decision-making in a block-stacking task:
- **Expert logits** reflect optimal policy knowledge, with one clearly preferred, goal-directed action in each state
- **Amateur logits** are flatter, spreading preference across actions (higher entropy), and in S₁ they tie stack on yellow with put-down red
- **CD logits** represent idealized credit assignment: in both states the checkmarked action is the one that advances toward the goal
- The state transitions reveal a two-step process: first unstacking the red block, then stacking it on yellow
- The layout makes the sequential structure of the task visible, with each state one step closer to the final configuration
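
The diagram does not say how the CD logits are computed from the expert and amateur logits. One simple possibility, shown here purely as an assumption, is to score each action by the difference between its expert and amateur logits and select the highest-scoring action. The function names `difference_scores` and `select_action` are hypothetical.

```python
# Bar values as read from the diagram.
EXPERT = {
    "S0": {"unstack red": 8, "pick-up blue": 1, "pick-up yellow": 1},
    "S1": {"stack on yellow": 7, "stack on blue": 1,
           "stack on green": 1, "put-down red": 1},
}
AMATEUR = {
    "S0": {"unstack red": 6, "pick-up blue": 2, "pick-up yellow": 2},
    "S1": {"stack on yellow": 3, "stack on blue": 2,
           "stack on green": 2, "put-down red": 3},
}


def difference_scores(expert, amateur):
    """ASSUMED combination rule: expert logit minus amateur logit per action."""
    return {action: expert[action] - amateur[action] for action in expert}


def select_action(state):
    """Return the action with the highest difference score in a state."""
    scores = difference_scores(EXPERT[state], AMATEUR[state])
    return max(scores, key=scores.get)


for state in ("S0", "S1"):
    print(state, difference_scores(EXPERT[state], AMATEUR[state]),
          "->", select_action(state))
```

With the values from the diagram, this rule selects unstack red in S₀ and stack on yellow in S₁, matching the checkmarked actions. That agreement does not confirm the rule is the one used to produce the figure.
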
The diagram suggests that while the amateur policy can rank a reasonable action first, as it does in S₀, expert guidance and proper credit assignment are needed to select the goal-directed action reliably in multi-step tasks, as the amateur's tie in S₁ shows.