Define Version 0.1 Protocol for GUI Interaction State and Action Sequences #25

abrichr · 2025-04-05T21:12:44Z

🧩 Description

We need to define and implement a minimal but extensible protocol for representing GUI interaction sequences. This protocol will unify the visual state, action metadata, and interaction history into a single structured format—enabling consistent logging, dataset creation, LLM training, planning, and replay.

This format serves as the foundation for downstream systems including the Action Graph (#10), ModelDrivenVisualState, and planner/LLM interfaces.

🧠 Background

OmniMCP currently:

Captures visual state via OmniParser
Plans actions using an LLM
Executes actions via InputController

But there is no standardized, reusable format for representing:

What was seen
What was done
Why it was done (optional)

This protocol fills that gap—similar to what OpenAI Operator, Adept’s AWL, and WebArena’s annotated programs use.

📦 Proposed Data Model (v0.1)

Using pydantic for type safety and validation.

from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int

class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True

class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float

class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None  # e.g. before typing

class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction
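
A minimal sketch of how these models could be built and round-tripped through JSON, assuming pydantic v2 (`model_dump_json` and `model_validate_json` are v2 method names). The field values are illustrative only and do not come from a real trace.

```python
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


# Illustrative values only; they are not taken from a real trace.
box = BoundingBox(x1=120, y1=80, x2=800, y2=120)
step = InteractionStep(
    timestamp=4.1,
    visual_state=VisualState(
        screenshot_path="frames/frame_002.png",
        screen_resolution=(1920, 1080),
        elements=[
            GUIElement(element_id="url_bar", text="Search or type URL", bbox=box)
        ],
        timestamp=4.1,
    ),
    action=GUIAction(type="click", target_id="url_bar", bbox=box),
)

# Serialize to JSON text, then parse it back into a validated model.
payload = step.model_dump_json()
restored = InteractionStep.model_validate_json(payload)
```

Comparing `restored.model_dump()` with `step.model_dump()` is one way a round-trip unit test could be phrased.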

🧪 Examples

{
  "timestamp": 4.1,
  "visual_state": {
    "screenshot_path": "frames/frame_002.png",
    "screen_resolution": [1920, 1080],
    "elements": [
      {
        "element_id": "url_bar",
        "text": "Search or type URL",
        "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
        "visible": true
      }
    ],
    "timestamp": 4.1
  },
  "action": {
    "type": "click",
    "target_id": "url_bar",
    "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}
  }
}
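
One detail worth checking in example payloads: `bbox` is declared as a `BoundingBox` model with `x1`, `y1`, `x2`, `y2` fields, so under pydantic v2 (assumed here) a flat `[x1, y1, x2, y2]` list should be rejected while the object form validates. The sketch below illustrates this on `GUIAction` alone; `is_valid_action` is a hypothetical helper, not part of the proposal.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


def is_valid_action(payload: dict) -> bool:
    """Hypothetical helper: True if the payload validates as a GUIAction."""
    try:
        GUIAction.model_validate(payload)
    except ValidationError:
        return False
    return True


# Flat list form: does not match the BoundingBox model.
flat = {"type": "click", "target_id": "url_bar", "bbox": [120, 80, 800, 120]}

# Object form: matches the BoundingBox field names.
nested = {
    "type": "click",
    "target_id": "url_bar",
    "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
}
```

If a flat list is the preferred wire format, the model would need a custom validator or a tuple-typed field instead; that is a design choice the spec would have to make explicitly.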

✅ Acceptance Criteria

Protocol spec exists as Python pydantic models with JSON schema export
Example logs (real or synthetic) stored in versioned protocol/ directory
Validator for loading, validating, and pretty-printing logs
Unit tests for schema validity and round-trip I/O
Integration into AgentExecutor logging pipeline (optional, stub OK)
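
A rough sketch of what the schema export and log validator could look like, again assuming pydantic v2. Storing logs as JSON Lines (one `InteractionStep` per line) is an assumption made for this example, and `load_steps` and `pretty` are hypothetical names.

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


def load_steps(jsonl_text: str) -> list[InteractionStep]:
    """Parse one step per non-empty line (assumed JSON Lines layout).

    Raises pydantic.ValidationError if any line fails validation.
    """
    return [
        InteractionStep.model_validate_json(line)
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def pretty(step: InteractionStep) -> str:
    """Render a step as indented JSON for inspection."""
    return step.model_dump_json(indent=2)


# JSON Schema export as a plain dict, ready for json.dump to a file.
schema = InteractionStep.model_json_schema()

# Synthetic log line for demonstration; values are illustrative.
sample_line = json.dumps(
    {
        "timestamp": 4.1,
        "visual_state": {
            "screenshot_path": "frames/frame_002.png",
            "screen_resolution": [1920, 1080],
            "elements": [{"element_id": "url_bar"}],
            "timestamp": 4.1,
        },
        "action": {"type": "click", "target_id": "url_bar"},
    }
)
steps = load_steps(sample_line + "\n\n" + sample_line + "\n")
```

Calling `print(pretty(steps[0]))` would show the validated step with defaults filled in, which may be useful when debugging logs by hand.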

📚 References

📌 Priority

High. This is foundational to planning, replay, dataset creation, and eventual fine-tuning. Enables reuse of traces across components and simplifies future evaluation and debugging.
