Most AI agents still "see" a UI by flattening it into text. That works until it doesn't. Dense toolbars, tiny icons, modal overlays, and custom desktop apps are where the text-first trick starts falling apart.
Vision-driven tool use means the agent treats the UI as a visual environment first and a text artifact second. It looks at screenshots, icons, spacing, and state changes to decide the next action, which is much closer to how humans operate software than OCR-first or DOM-first pipelines. [1][2]
That framing matters because a GUI is not just text on a page. A disabled button looks different from an active one. An "X" icon next to a search field means something very different from an "X" in a window chrome. In ToolTok, the authors argue that the dominant one-step grounding setup reduces GUI use to coordinate prediction, which is fragile across resolutions and aspect ratios. They show that when screen settings drift from training conditions, performance drops sharply. [1]
That's the core reason the "see first, act second" story around GLM-4.6V is interesting, even if the broader pattern is bigger than one model release. The direction of travel is clear: agents are moving away from "convert screen to text, then guess" and toward native visual grounding.
Text-first conversion breaks because it strips away the exact information agents need for reliable action: spatial relationships, icon semantics, local context, and visual state. Once the screen is flattened into text, the agent often knows what is present but not what is actionable, adjacent, highlighted, or safe to click. [1][3]
You can see this in two adjacent research threads. ZoomUI shows that many hard grounding problems are not really "language" problems. They are visual focus problems. The model needs to refine the instruction into visible features, then zoom into the right region iteratively. [2] Meanwhile, IVG calls this broader issue the Pixel-Only Bottleneck in chart agents: if a model only stares at a static image without access to richer interaction or state, it hallucinates, misses exact values, and confuses overlapping elements. [3]
For desktop and web agents, the equivalent failure is obvious. OCR may tell you there are three "Share" labels on screen. It won't reliably tell you which one is attached to the active panel you actually need.
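To make the zooming thread concrete, the general shape of iterative visual focus looks roughly like the sketch below. This is the idea in miniature, not ZoomUI's actual algorithm: `model.locate` is a hypothetical call that returns a coarse bounding box for the instruction within whatever image it's given, and the loop keeps cropping until the region is small enough to ground a precise action.

```python
from PIL import Image

def iterative_focus(model, screenshot: Image.Image, instruction: str, steps: int = 3):
    """Narrow a full screenshot down to a small region by repeated cropping."""
    region = screenshot
    offset_x, offset_y = 0, 0  # where the current crop sits inside the full screenshot

    for _ in range(steps):
        # Hypothetical: a coarse (left, top, right, bottom) box in region coordinates.
        left, top, right, bottom = model.locate(region, instruction)
        offset_x, offset_y = offset_x + left, offset_y + top
        region = region.crop((left, top, right, bottom))
        if region.width < 64 or region.height < 64:
            break  # small enough for precise grounding

    return region, (offset_x, offset_y)
```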
Vision-first agents increasingly decide actions through grounded, multi-step control instead of one-shot coordinate guesses. The newer pattern is: identify relevant visual cues, narrow the region, choose a tool or action token, verify the state, then continue. [1][2]
That's where the comparison gets useful:
| Approach | What it sees | Typical output | Main weakness |
|---|---|---|---|
| OCR/DOM-first | Extracted text or page structure | Text reasoning, then click | Loses visual state and layout nuance |
| Coordinate grounding | Screenshot | Bounding box or x,y click | Brittle across screen sizes and dense UIs [1] |
| Vision-first pathfinding | Screenshot plus action history | Tool/action tokens over multiple steps | More steps, but better robustness [1] |
| Vision + zoom/interaction | Screenshot plus iterative focus | Refined region and grounded action | Higher inference cost, but stronger on hard cases [2] |
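To make the pathfinding rows concrete, here's a minimal sketch of a see-act-verify loop. Everything named here (`capture_screenshot`, `execute`, the `model.next_action` and `model.verify` methods) is hypothetical scaffolding passed in by the caller, not any specific model's API. The point is the shape: one grounded action at a time, then re-observe before continuing.

```python
def run_task(model, capture_screenshot, execute, goal: str, max_steps: int = 15) -> bool:
    """Drive a UI task as small, verified steps instead of a one-shot coordinate guess."""
    history = []  # prior actions plus whether each one was verified on screen

    for _ in range(max_steps):
        screen = capture_screenshot()            # current UI as an image
        action = model.next_action(              # hypothetical: returns a dict like
            goal=goal,                           # {"tool": "click", "target": "...",
            screenshot=screen,                   #  "expected_state": "..."}
            history=history,
        )
        if action["tool"] == "done":
            return True

        execute(action)                          # click, type, scroll, open_menu, ...
        after = capture_screenshot()
        verified = model.verify(                 # hypothetical: did the screen change as expected?
            screenshot=after,
            expectation=action.get("expected_state", ""),
        )
        history.append({"action": action, "verified": verified})

    return False  # ran out of steps without the model declaring success
```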
What I noticed across the sources is that the strongest direction is not "vision replaces all structure." It's "vision becomes the primary grounding layer," with structured tools still available when the agent needs them. That's a big difference.
You should prompt these agents around goals, visible anchors, and action constraints rather than treating the UI like a text document. The best prompts tell the model what success looks like on screen, what visual cues matter, and how cautiously it should act before committing. [1][2]
Here's the common weak prompt:
```
Open settings and turn on dark mode.
```
That sounds fine, but it leaves too much implicit. A better version is:
```
Goal: enable dark mode in the current app.
Use the screenshot as the primary source of truth.
Look for visible settings entry points such as a gear icon, profile menu, or sidebar item labeled Settings or Preferences.
Prefer the option closest to the active app window, not browser chrome or OS menus.
Before clicking a toggle, verify that it belongs to appearance/theme settings.
Return the next action only.
```
That rewrite does three things. It anchors the agent in visible UI. It constrains ambiguity. And it reduces false positives from similarly named elements elsewhere on screen.
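If you assemble these prompts in code rather than by hand, a tiny helper keeps the goal / visible-anchors / constraints structure consistent across tasks. This is a sketch; the field names are mine and nothing here is tied to a particular model API.

```python
def build_ui_prompt(goal: str, anchors: list[str], constraints: list[str]) -> str:
    """Assemble a goal / visible-anchors / constraints prompt for a vision-first agent."""
    lines = [
        f"Goal: {goal}",
        "Use the screenshot as the primary source of truth.",
        "Look for these visible anchors: " + "; ".join(anchors) + ".",
        *[f"Constraint: {c}" for c in constraints],
        "Return the next action only.",
    ]
    return "\n".join(lines)

prompt = build_ui_prompt(
    goal="enable dark mode in the current app",
    anchors=["gear icon", "profile menu", "sidebar item labeled Settings or Preferences"],
    constraints=[
        "prefer options inside the active app window, not browser chrome or OS menus",
        "verify a toggle belongs to appearance/theme settings before clicking it",
    ],
)
```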
Here's a before-and-after example for a dense interface:
| Before | After |
|---|---|
| "Click share." | "Click the Share control associated with the currently open document pane. Ignore global nav and window toolbar icons unless the document pane has no local Share action." |
| "Delete the item." | "Find the selected item in the active content area. Verify it is selected before taking any destructive action. If delete is ambiguous, open the item actions menu first instead of clicking the first trash icon you see." |
If you write prompts for coding agents, design tools, or ops workflows, you'll find more examples on the Rephrase blog. The pattern carries over surprisingly well.
Product teams should design around visual uncertainty, not pretend it doesn't exist. That means allowing multi-step actions, asking the model to verify state, and treating screenshots as first-class context instead of lossy text sources. [1][2][3]
In practice, I'd use four rules.
First, avoid "one prompt, one click" expectations on dense UIs. Research like ToolTok strongly suggests that multi-step pathfinding is more robust than direct coordinate regression. [1]
Second, require local verification before risky actions. This is especially true for destructive actions, payments, publishing, or permission changes.
Third, expose interaction history. Zooming, previous clicks, and current focus help the model stay grounded. [2][3]
Fourth, standardize the action format. If every step must return something like Action: click(target) or a tool token, debugging gets much easier.
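As a sketch of what a standardized action format can look like: one dataclass plus a strict parser, so any off-format step fails loudly instead of slipping into the logs. The `Action: tool(target)` grammar below mirrors the example above; adapt it to whatever your agent actually emits.

```python
import re
from dataclasses import dataclass

# Expected shape: Action: click(Share control in the active document pane)
ACTION_PATTERN = re.compile(r"^Action:\s*(?P<tool>\w+)\((?P<target>.*)\)\s*$")

@dataclass
class Action:
    tool: str     # click, type, scroll, open_menu, ...
    target: str   # a visual description of the target, not raw coordinates

def parse_action(raw: str) -> Action:
    """Parse one model step; raise immediately on anything off-format so debugging stays easy."""
    match = ACTION_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"Malformed action step: {raw!r}")
    return Action(tool=match.group("tool"), target=match.group("target").strip())

step = parse_action("Action: click(Share control in the active document pane)")
```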
If your team is constantly rewriting rough instructions into cleaner prompts for ChatGPT, Claude, or a GUI agent, Rephrase is useful here because it can rewrite the intent into a more constrained, skill-specific prompt without breaking your flow.
The big shift is not just that GLM-4.6V can "see." Lots of models can see. The real shift is that agents are being asked to use vision as the primary grounding mechanism for tool use instead of treating screenshots as temporary input for text extraction.
That's a better fit for how software actually looks and behaves. And it changes how we should prompt. Less "summarize this screen." More "use visible evidence, narrow ambiguity, verify, then act."
Documentation & Research
What is vision-driven tool use?
Vision-driven tool use means an agent reads the screen as an image and decides what action to take from visual context. Instead of converting the UI into OCR text or a DOM first, it grounds actions directly in pixels and layout.
How do GUI agents ground their actions today?
Most GUI agents either predict coordinates or bounding boxes from screenshots, or they generate structured action tokens that gradually navigate toward a target. Newer approaches also use zooming, interaction history, and tool-based grounding instead of one-shot clicks.
How should you prompt a vision-first agent?
Give the agent a clear goal, the current UI context, constraints, and an action format. It also helps to ask for short reasoning tied to visible elements and to require verification before risky clicks.