Define Version 0.1 Protocol for GUI Interaction State and Action Sequences #25

Open
5 tasks
abrichr opened this issue Apr 5, 2025 · 0 comments

abrichr commented Apr 5, 2025

🧩 Description

We need to define and implement a minimal but extensible protocol for representing GUI interaction sequences. This protocol will unify the visual state, action metadata, and interaction history into a single structured format—enabling consistent logging, dataset creation, LLM training, planning, and replay.

This format serves as the foundation for downstream systems including the Action Graph (#10), ModelDrivenVisualState, and planner/LLM interfaces.


🧠 Background

OmniMCP currently:

  • Captures visual state via OmniParser
  • Plans actions using an LLM
  • Executes actions via InputController

But there is no standardized, reusable format for representing:

  • What was seen
  • What was done
  • Why it was done (optional)

This protocol fills that gap—similar to what OpenAI Operator, Adept’s AWL, and WebArena’s annotated programs use.


📦 Proposed Data Model (v0.1)

The models below use pydantic for type safety and validation.

from typing import Literal, Optional

from pydantic import BaseModel

class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int

class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True

class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float

class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None  # e.g. before typing

class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction
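
The Background section lists "why it was done" as an optional part of a step, but the v0.1 models have no field for it. The sketch below shows one possible way to add it without changing the base protocol: a subclass with an optional field. `AnnotatedStep` and `rationale` are hypothetical names used only for illustration, and the rationale text is invented sample data.

```python
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


# Hypothetical extension: NOT part of the v0.1 proposal.
class AnnotatedStep(InteractionStep):
    rationale: Optional[str] = None  # "why it was done"; stays optional


url_bar = GUIElement(
    element_id="url_bar",
    text="Search or type URL",
    bbox=BoundingBox(x1=120, y1=80, x2=800, y2=120),
)

step = AnnotatedStep(
    timestamp=4.1,
    visual_state=VisualState(
        screenshot_path="frames/frame_002.png",
        screen_resolution=(1920, 1080),
        elements=[url_bar],
        timestamp=4.1,
    ),
    action=GUIAction(type="click", target_id="url_bar", bbox=url_bar.bbox),
    rationale="Focus the URL bar before typing an address",  # sample text
)
```

Because `AnnotatedStep` is still an `InteractionStep`, code written against the base model can consume annotated steps unchanged. Whether the rationale belongs on the step or on the action is left open here.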

🧪 Examples

{
  "timestamp": 4.1,
  "visual_state": {
    "screenshot_path": "frames/frame_002.png",
    "screen_resolution": [1920, 1080],
    "elements": [
      {
        "element_id": "url_bar",
        "text": "Search or type URL",
        "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
        "visible": true
      }
    ],
    "timestamp": 4.1
  },
  "action": {
    "type": "click",
    "target_id": "url_bar",
    "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}
  }
}
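
As a rough check that a logged step matches the proposed models, the sketch below parses a JSON step with pydantic and then shows that a list-shaped `bbox` is rejected, since `BoundingBox` is a model with named `x1`/`y1`/`x2`/`y2` fields. It assumes the pydantic v2 API (`model_validate_json`, `model_validate`) and Python 3.9+; `RAW_STEP` is an illustrative payload, not a real trace.

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


# Illustrative payload; bbox is an object keyed by the BoundingBox field names.
RAW_STEP = """
{
  "timestamp": 4.1,
  "visual_state": {
    "screenshot_path": "frames/frame_002.png",
    "screen_resolution": [1920, 1080],
    "elements": [
      {"element_id": "url_bar", "text": "Search or type URL",
       "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}, "visible": true}
    ],
    "timestamp": 4.1
  },
  "action": {"type": "click", "target_id": "url_bar",
             "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}}
}
"""

# Parse and validate in one call; the JSON array becomes a (width, height) tuple.
step = InteractionStep.model_validate_json(RAW_STEP)

# A bbox written as a plain list does not match the BoundingBox model.
payload = json.loads(RAW_STEP)
payload["action"]["bbox"] = [120, 80, 800, 120]
try:
    InteractionStep.model_validate(payload)
    rejected = False
except ValidationError:
    rejected = True
```

If a compact list form for `bbox` is preferred in logs, the model would need to change (for example to a 4-tuple) or gain a custom validator; that choice is outside what v0.1 specifies.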

✅ Acceptance Criteria

  • Protocol spec exists as Python pydantic models with JSON schema export
  • Example logs (real or synthetic) stored in versioned protocol/ directory
  • Validator for loading, validating, and pretty-printing logs
  • Unit tests for schema validity and round-trip I/O
  • Integration into AgentExecutor logging pipeline (optional, stub OK)
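
A minimal sketch of how the first, third, and fourth criteria might be met, assuming pydantic v2 (`model_json_schema`, `TypeAdapter`) and Python 3.9+. The helper names `export_schema`, `load_log`, and `pretty_print` are hypothetical, and storing a log as a JSON array of steps is an assumption: v0.1 does not say how steps are grouped on disk.

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel, TypeAdapter


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


# Assumed log shape: a JSON array of InteractionStep objects.
LOG_ADAPTER = TypeAdapter(list[InteractionStep])


def export_schema() -> str:
    """Return the JSON schema for a single step as indented text."""
    return json.dumps(InteractionStep.model_json_schema(), indent=2)


def load_log(text: str) -> list[InteractionStep]:
    """Parse and validate a log; raises pydantic.ValidationError on bad input."""
    return LOG_ADAPTER.validate_json(text)


def pretty_print(steps: list[InteractionStep]) -> str:
    """Serialize validated steps back to indented JSON."""
    return LOG_ADAPTER.dump_json(steps, indent=2).decode("utf-8")


# Round trip: model -> JSON text -> model should give back an equal object.
original = InteractionStep(
    timestamp=4.1,
    visual_state=VisualState(
        screenshot_path="frames/frame_002.png",
        screen_resolution=(1920, 1080),
        elements=[GUIElement(element_id="url_bar", text="Search or type URL")],
        timestamp=4.1,
    ),
    action=GUIAction(type="click", target_id="url_bar"),
)
log_text = pretty_print([original])
reloaded = load_log(log_text)
schema = json.loads(export_schema())
```

A unit test for round-trip I/O can then assert `reloaded == [original]`, and the exported schema can be checked into the `protocol/` directory next to the example logs.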

📌 Priority

High. This is foundational to planning, replay, dataset creation, and eventual fine-tuning. Enables reuse of traces across components and simplifies future evaluation and debugging.
