Free your hands for what matters

An AI agent operator MVP design

Timeline

Device

Team

Responsibilities

Jun. - Aug. 2024
Desktop
PM: Yu Shi
AI Researcher: Jay Huang
Engineer Lead: Chaoyun Zhang
Wireframe, Design iteration, Hi-fi mockup, Prototyping, Usability testing, Literature review

Project Overview

The Windows team is exploring how AI technology can enhance work efficiency and free users' hands for more meaningful tasks.

In the summer of 2024, I interned at Microsoft and worked on an innovative AI project that explored the potential of AI technology beyond conversational AI. Collaborating with the AI research team, we investigated various approaches to productize AI with screenshot-based methods, ultimately developing a task operating AI agent MVP.

Adopting new technology often presents challenges in building trust between users and the system. Following Microsoft's Human-AI Interaction principles, I explored various methods to reduce learning curves and foster trust in a system that could potentially take control of users' devices.

Project context

⚙️ What if there were an AI agent that could help users complete all types of tasks on their personal devices?

As a leader in AI technology, Microsoft continues to push the boundaries of what's possible. After launching Copilot, the company is now striving to transcend the limitations of conversational AI and develop an AI agent designed to truly assist users in completing tasks.

Design scope

This is an engineer-led project aimed at exploring the possibilities of screenshot-based AI technology. As the only product designer involved, my role was to explore usage scenarios, understand users' expectations, design an MVP product prototype for exploration, and design and conduct evaluative research.

Thus, the design task I was handed was to design an intuitive AgentOS MVP experience that lets users run tasks automatically.

I received this requirement from the engineers:

Microsoft wants to build an AI operator that helps users operate all kinds of tasks, leveraging the advantages of Microsoft's ecosystem.

What I designed

Input tasks in a tray application

Help users quickly understand the capabilities of AgentOS and seamlessly input tasks for the agent to execute.

Monitor tasks on a screenshot in windows of different sizes

Since the AI agent relies on image recognition to complete tasks on the desktop, I adopted the picture-in-picture (PiP) approach to enable users to monitor the agent's work in real time. The PiP window is adjustable, allowing users to resize it to suit their needs while seamlessly multitasking.

Pause tasks at sensitive steps

While a task is running, the agent pauses at sensitive steps so users can either authorize the agent's action or review content before any damage occurs. When more instruction is needed, the AI agent also asks users to provide clarification before continuing.
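
To illustrate this pause-and-authorize logic, here is a minimal Python sketch. Every name in it (Step, StepRisk, request_user_authorization) is hypothetical and stands in for the flow described above, not the actual AgentOS implementation; in the MVP, the authorization prompt surfaces as a pause in the monitor window rather than a console prompt.

```python
# Minimal sketch of the pause-at-sensitive-steps flow. All names are
# hypothetical illustrations, not the actual AgentOS implementation.
from dataclasses import dataclass
from enum import Enum, auto


class StepRisk(Enum):
    SAFE = auto()       # e.g., opening an app, scrolling, reading
    SENSITIVE = auto()  # e.g., sending an email, deleting files, paying


@dataclass
class Step:
    description: str
    risk: StepRisk


def request_user_authorization(step: Step) -> bool:
    # The real UI would surface a pause card in the monitor window;
    # a console prompt stands in for it here.
    answer = input(f"Authorize '{step.description}'? [y/n] ")
    return answer.strip().lower() == "y"


def execute_plan(steps: list[Step]) -> None:
    for step in steps:
        if step.risk is StepRisk.SENSITIVE and not request_user_authorization(step):
            print("Task paused: waiting for user clarification or take-over.")
            return
        print(f"Agent executed: {step.description}")


execute_plan([
    Step("Open the mail client", StepRisk.SAFE),
    Step("Draft a reply", StepRisk.SAFE),
    Step("Send the email", StepRisk.SENSITIVE),  # agent pauses here for review
])
```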

Quality check before completing tasks

After a task is completed, users can check the outcome and ask the agent to revise it with natural-language input.

Easily redirect or correct task operation by taking over the task

Users can easily redirect or correct task operation by clicking the take-over button and operating in the mirrored desktop window.

Redirect task operation by intuitive chatting

Users can intuitively redirect or correct task operation by chatting with the AI agent.

First diamond

Approaches to automating browser and desktop applications

There are three candidate methods through which a multimodal LLM (MLLM) could execute tasks for this product. I listed the pros and cons from the user-experience perspective and discussed with the PM and engineers which approach was feasible. Due to the limitations of the Windows operating system and the outdated APIs of existing Windows applications, we ultimately chose the picture-in-picture (PiP) approach, cloning the user's interface so the AI agent can complete tasks based on screenshots. (A minimal code sketch of this control loop follows the comparison below.)

Taking Control of the Cursor (screenshot-based method)

Pros
1. Commonly used: The agent can operate most tasks by directly controlling the cursor on the desktop
2. Transparency: Easy for users to track the task operation

Cons
1. Unproductivity: The desktop is occupied by the agent's operation
2. Fragility: Easily interrupted by the user's mouse movements

Operating within Picture-in-Picture (screenshot-based method)

Pros
1. Parallelism: The agent can perform tasks while users continue operating the device
2. Transparency: Users can monitor the agent's operation in the visualized working process to prevent damage or errors

Cons
1. High resource demand: Requires significant compute resources to process visual data and translate it into actionable insights
2. Fragility: Easily disrupted by image noise

Running Tasks in the Background via API (textual parsing method)

Pros
1. Non-distraction: The agent can operate tasks with minimal distraction to users
2. Efficiency: The operation process is faster

Cons
1. Opacity: The operation process cannot be visualized, making it hard for users to identify potential damage, errors, and operation quality
2. Error-prone: May lose critical context when simplifying complex DOM structures
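
To make the chosen approach more concrete, below is a minimal sketch of a screenshot-based control loop, assuming a hypothetical plan_next_action call that stands in for the multimodal model; pyautogui is one real library that can capture the screen and drive input. In the PiP design, this loop would act on the cloned desktop rather than the user's live session.

```python
# A minimal sketch of a screenshot-based control loop, for illustration only.
# `plan_next_action` is a hypothetical stand-in for the multimodal model.
import pyautogui


def plan_next_action(screenshot) -> dict:
    # Hypothetical: send the screenshot to a multimodal LLM and receive an
    # action such as {"type": "click", "x": 120, "y": 340},
    # {"type": "type", "text": "hello"}, or {"type": "done"}.
    # Stubbed here so the sketch runs end to end.
    return {"type": "done"}


def run_task(max_steps: int = 50) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()    # capture the current frame
        action = plan_next_action(screenshot)  # model decides the next step
        if action["type"] == "done":
            break                              # task finished
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"])


if __name__ == "__main__":
    run_task()  # with the stub above, this exits immediately
```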

Users' concerns

Before diving into the interfaces, I conducted five internal interviews with people in different roles at Microsoft to gain insight into users' opinions on AI technologies. I identified three main concerns users have about adopting an AI operator.

🔏

Users generally have concerns about privacy risks within the task operating system

🛡️

Users are concerned about the agent's operational errors and the potential damage these errors could cause during execution

🙅

Users don’t know the AI agent’s capacity to handle complex tasks with comprehensive context

Use scenario

AgentOS is designed to streamline users' daily work and enhance efficiency. However, the tasks users handle vary widely in scope and complexity. To better understand their pain points, I conducted an internal survey to gather user feedback on their workflows. The findings revealed key concerns: low error tolerance, high privacy risks, and complex task flows.

Design problem

The insights from the interviews and survey data all pointed to two main issues:

Capacity
They don't know what tasks can be operated by the AI Agent

Controllability
They are concerned about errors the agent may cause and how they could take over control

Thus, I asked the question

How might we help users feel confident in AI’s capabilities and their control over the AI agent?

Microsoft's Guidelines for Human-AI Interaction

In addition, the design has to follow Microsoft's Guidelines for Human-AI Interaction. I identified the most relevant principles for this project, which address the users' concerns above.

Problem 1: How might we help users feel confident in AgentOS’ capabilities?

Initially

Make clear what the system can do

During operation

Show contextually relevant information

Problem 2: How might we help users feel confident in their control over the AgentOS?

During operation

Time services based on context

When wrong

Support efficient correction

Over time

Convey the consequences of user actions

Flows & control of the task

After understanding the problem from the previous research process, I created a user workflow to explore the potential flow for the MVP task-operating feature.

Second diamond

Early exploration

Based on the user flow, I explored several possible main-page layouts with mid-fi prototypes. I first explored the design of the input window:

❌ Looks like a conversational AI chatbot. Users might be confused by this appearance

✅ The interface consists of two primary sections: task suggestions and an input field, ensuring clarity for users
✅ Task list creation as an advanced feature



For the task monitor window, I tried several options for window size and layout format, and eventually arrived at an interface layout:

✅ Intuitive chat interaction with the operational AI
❌ Displaying all of the agent's background running data may overwhelm users

✅ Displays only essential data, helping users understand the operation process
✅ Easy to monitor task operation in the mirrored desktop



Finally, I developed the monitor window in three sizes, each with a different level of operation detail:

Size S: Small monitor screen floating on the desktop

Size M: Monitor screen with a side panel

Size L: Full-screen task monitor

Design highlight

Problem 1: How might we help users quickly build confidence in using the AI agent?

Before diving into the interfaces, I reviewed the existing new user journey and online age verification process. Based on this, I designed a user flow seamlessly integrated into the current experience.

Iteration: The currently operating sub-task and the sub-task plan list are related pieces of information that can be combined. Users also want to easily see where the current sub-task sits in the overall sub-task flow.

Problem 2: How might we help users quickly build trust in using the AI agent?

Takeaway

1. Navigating Constraints & Capabilities in a Tech-Driven Environment

This project was developed in an engineer-led environment, where technology takes precedence, and designers collaborate with cross-functional teams to explore its potential. Unlike traditional design processes, where problems are clearly defined before solutions are proposed, technical solutions often emerge first in this setting. To thrive in such an environment, designers must work closely with engineers, proactively understanding technological constraints and capabilities while adapting their design approach to align with evolving technical solutions.

2. UX methods in human-AI collaboration design

Designing for human-AI collaboration requires a nuanced approach that balances user needs with AI capabilities. While traditional UX methods remain valuable, they must be adapted to accommodate AI’s unpredictability, evolving behavior, and decision-making processes. First, designers must develop a deep understanding of both user needs and AI capabilities to create meaningful interactions. Additionally, incorporating AI into the prototyping and iteration process allows for continuous improvement, ensuring a seamless and adaptive user experience.