OVERVIEW

There are plenty of metrics we can use to gauge the quality of AI agent interactions. But which ones really move the needle for real-world satisfaction?


The Conversational AI Design team at Oracle, in its pursuit of high-quality conversational experiences, had been measuring user satisfaction with a standard set of quantitative AI Agent quality metrics. I designed a research experiment to identify which of those metrics actually mattered to users.

SCOPE

UX Research
Conversational Design
Data Visualization

TOOLS

Figma
APEX Surveys
RAG AI
Pandas/Python










RESEARCH QUESTIONS


1. Which quality metrics are the most important for producing satisfactory AI interactions for users?

REASONING
To identify a select few metrics to recommend that my stakeholders prioritize.



2. In what ways do those quality metrics correlate to real world user satisfaction?

REASONING
Aligning the initial metric definitions with how users perceived them.



3. How are users of Oracle AI defining quality compared to our quality metrics?

REASONING
Zooming out to clarify what a high quality experience means for users.






METHODOLOGY

I designed an APEX Survey in which 44 participants assessed real-world scenarios of an employee interacting with an AI agent.


In the survey, participants read six conversational experiences in which a fictional employee approaches a Benefits AI Agent with a query that mirrors real-world scenarios, like trying to understand your 401(k) contribution options or how to take a leave of absence.

For each scenario, participants rated the interaction on four quality metrics that were top of mind for the Conversational AI Design team: Thorough, Clear, Concise, and Inanimate.



OUTCOME: QUESTION ONE

Of the quality metrics predefined by our Conversational AI team, which ones are the most important for producing satisfactory AI interactions for users?



To answer this question, I conducted a statistical analysis correlating average overall quality scores with each of the individual metrics. Of our original four metrics, the one that rose to the top as most strongly correlated with satisfaction was thorough.

As seen in the following scatterplot, thorough was highly correlated with overall satisfaction: with every one-point increase in thoroughness, overall satisfaction increases by 0.53 points. Looking at the impact of this correlation, we see that on average, thoroughness contributes 2.6 points of our 7-point perceived overall satisfaction scale.






The second metric with a significant correlation was concise. Looking at the following scatterplot, we see that with every one-point increase in conciseness, overall satisfaction increases by 0.30 points. Again, looking at the impact of this correlation, we see that on average, conciseness contributes 1.4 points of our 7-point perceived overall satisfaction scale.
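To make the per-metric analysis concrete, here is a minimal sketch of how this kind of slope could be computed with pandas. The file name, column names, use of scipy, and the "contribution" calculation are all illustrative assumptions, not the exact script used in the study.

```python
# Illustrative sketch only: estimating how much a one-point increase in each
# quality metric moves overall satisfaction. File and column names are
# hypothetical; the study's actual analysis may have differed.
import pandas as pd
from scipy.stats import linregress

# One row per participant-scenario pair; all ratings on a 7-point scale.
ratings = pd.read_csv("survey_ratings.csv")

for metric in ["thorough", "clear", "concise", "inanimate"]:
    fit = linregress(ratings[metric], ratings["overall_satisfaction"])
    # fit.slope = points of satisfaction gained per one-point increase in the metric.
    # slope * mean rating is one way to express the metric's average contribution
    # on the 7-point scale (an assumption about how "contribution" was derived).
    contribution = fit.slope * ratings[metric].mean()
    print(f"{metric}: slope={fit.slope:.2f}, r={fit.rvalue:.2f}, "
          f"avg contribution={contribution:.1f} points")
```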





So what does this mean for the Conversational AI Design team?


It is important to have an integrated understanding of both thoroughness AND conciseness when optimizing our AI system prompts for high user satisfaction. In cases where we do have to lean in one direction, the choice should be thorough.

That being said, in some cases thoroughness and conciseness can actually be seen as opposites. So what does it look like to balance these two in practice?




OUTCOME: QUESTION TWO

In what ways do those quality metrics correlate to real world user satisfaction?

Through coding my qualitative data, I was able to add nuance to these correlations and address my more granular questions.



According to the study participants, the goal of thoroughness is to ensure that the user gets their problem solved to completion. A thorough answer shouldn’t mean everything the model knows about a topic, but rather everything the user needs to accomplish their task: focusing on the how, as opposed to the what, and acting as an aid in focused problem solving rather than a summary of all the information it knows.






This sentiment is illustrated by the following user quote. When an AI agent interaction focuses on fully resolving the query by giving actionable steps, tasks, and information, it increases the user’s confidence in what to do next and builds their trust that the agent will help them get there.






This next user expands on this expectation, going beyond “comprehension” or “completion” by providing information the user wouldn’t think to ask for, or even asking the user clarifying questions to address informational gaps. Doing this helps refine the AI response so it truly answers the question to completion, yielding higher user satisfaction.






Participants articulated the goal of conciseness as getting their answer in a way that is scannable and approachable. Here, conciseness was less about the content or length of the answer and more about its efficiency, with the emphasis placed on visual form.






For example, the following participant’s desire to visually skip around an AI response points to the importance of including things like line breaks, bullet points, and overall organization in improving scannability.







OUTCOME: QUESTION THREE

How are users of Oracle AI measuring quality compared to our quality metrics?


This final question allowed me to zoom out and reexamine why we use these quality metrics in the first place. It’s not enough for AI interactions to be algorithmically deemed “high quality”; they must provide a verifiably satisfying experience. Thus, the way we define quality must be aligned with how users define it.

Beyond any of our initial metrics, what participants want from a high-quality interaction with an AI agent is simple: to know exactly what actions to take next to address their query, just by scanning the agent’s response. Once they can do that, they’re able to build a clearer understanding overall.



It is with this guiding goal, understanding next actions at a glance, that we can align our development of Agentic AI with real-world user satisfaction.






FUTURE OPPORTUNITIES

After presenting my research to three Conversational AI Design teams across the company, further research opportunities and use cases arose.


If I had more time to refine my research, my primary goal would be to make this framework actionable for teams and AI Agent use cases across Oracle beyond the Benefits Agent. To accomplish this, I’d develop more focused studies around other domains.

Additionally, I selected only the four metrics that were top of mind for the Conversational AI Design team to test in this study. However, I would have loved to develop variations of this study with new sets of metrics, helping to refine them at a larger scale.