ORACLE


Aligning and scoping quantitative agentic AI quality metrics with real world user satisfaction.

TIMELINE Summer 2025

ROLE Solo Design Researcher
TOOLS Figma, APEX Surveys, RAG AI, Pandas/Python







CONTEXT
The Conversational AI Design team at Oracle, in its pursuit of high quality conversational experiences, has been working on measuring user satisfaction using a standard set of AI Agent quality metrics. But with the sheer number of metrics being calculated, how do we determine which ones actually move the needle for real world interactions? With this question in mind, I explored the following need:






I broke this need down into three primary research questions:

ONE Which of our predefined quality metrics are the most vital for yielding satisfactory AI interactions?
TWO In what ways do those quality metrics correlate to real world user satisfaction?
THREE How are users of Oracle AI defining “quality” compared to the way our quality metrics define it?





METHODOLOGY
To explore these three questions, I designed an evaluative research experiment in which participants assessed an Oracle AI Agent’s responses against a subset of the quality metrics the Conversational AI Design team had been working with, chosen based on study design requirements, time constraints, and relevance to user experience. They were defined as follows:






Using these four metrics, I built an APEX survey in which 44 participants read through 6 example scenarios where a fictional employee approached an HCM Benefits AI Agent with a query mirroring real world situations, like trying to understand your 401(k) contribution options or how to go about taking a leave of absence.

For each scenario, I generated the agent’s response using a RAG LLM pipeline so that it simulated a real world AI agent interaction. I gathered three types of data:

  1. Quantitative real-world overall satisfaction scores
  2. Quantitative ratings on each of the 4 metrics (thorough, clear, concise, inanimate)
  3. Qualitative free response answers describing the elements that informed their ratings
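As a concrete illustration, here is a minimal sketch of how this survey data could be organized for analysis in Pandas. The file name and column names are assumptions for illustration, not the actual APEX export.

    import pandas as pd

    # Hypothetical layout of the survey export: one row per participant per
    # scenario, rated on a 7-point scale. A real APEX export will differ.
    columns = [
        "participant_id", "scenario_id",
        "overall_satisfaction",                        # overall quality, 1-7
        "thorough", "clear", "concise", "inanimate",   # per-metric ratings, 1-7
        "free_response",                               # qualitative rationale
    ]

    df = pd.read_csv("apex_survey_export.csv", usecols=columns)

    # Quick sanity check: 44 participants rating 6 scenarios each
    print(df["participant_id"].nunique(), "participants")
    print(df.groupby("participant_id")["scenario_id"].nunique().describe())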




OUTCOMES

QUESTION ONE

Of the quality metrics predefined by our Conversational AI team, which ones are the most important for producing satisfactory AI interactions for users?



To answer this question, I conducted a statistical analysis correlating average overall quality scores with each of the individual metrics. Of our original four metrics, the one that rose to the top as most correlated with satisfaction was thorough. As seen in the following scatterplot, thoroughness was highly correlated with overall satisfaction: with every one-point increase in thoroughness, overall satisfaction increases by 0.53 points. Looking at the impact of this correlation, we see that on average, thoroughness contributes 2.6 points of our 7-point scale of perceived overall satisfaction.
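As a rough sketch of the kind of per-metric analysis behind these numbers, assuming the hypothetical dataframe from the earlier sketch, one could regress overall satisfaction on a single metric and express its average contribution as the slope multiplied by the mean metric rating. This is an illustrative reconstruction, not the exact analysis code used in the study.

    from scipy.stats import linregress

    def metric_contribution(df, metric, target="overall_satisfaction"):
        # Simple linear regression of overall satisfaction on one metric.
        result = linregress(df[metric], df[target])
        # Average contribution on the 7-point scale: slope x mean rating.
        contribution = result.slope * df[metric].mean()
        return result.slope, contribution, result.pvalue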






The second metric with a significant correlation was concise. Looking at the following scatterplot, we see that with every one-point increase in conciseness, overall satisfaction increases by 0.30 points. Again, looking at the impact of this correlation, we see that on average, conciseness contributes 1.4 points of our 7-point scale of perceived overall satisfaction.
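Applying the same sketch to both significant metrics would produce the kind of summary reported above (the actual values depend on the real survey data):

    for metric in ["thorough", "concise"]:
        slope, contribution, p = metric_contribution(df, metric)
        print(f"{metric}: +{slope:.2f} satisfaction points per rating point, "
              f"~{contribution:.1f} of 7 points on average (p={p:.3f})")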




So what does this mean for the Conversational AI Design team?


It is important to have an integrated understanding of both thoroughness AND conciseness when optimizing our AI system prompts for high user satisfaction. In cases where we do have to lean in one direction, the choice should be thoroughness. That being said, in some cases thoroughness and conciseness can be seen as opposites, so what does it look like to balance the two in practice?




QUESTION TWO

In what ways do those quality metrics correlate to real world user satisfaction?


Analyzing the qualitative data helped address my more granular questions about correlation by adding further nuance. I analyzed the data using inductive coding and affinity mapping to start drawing new boundaries and definitions around our quality metrics.




According to the study participants, the goal of thoroughness is to ensure that the user gets their problem solved to completion. A thorough answer shouldn’t include everything the model knows about a topic, but rather everything the user needs to accomplish their task: focusing on the how as opposed to the what, not just summarizing all of the information it knows, but acting as an aid in focused problem solving.






This sentiment is illustrated by the following user quote. When an AI agent interaction focuses on reaching a complete resolution of the query by giving actionable steps, tasks, and information, it increases the user’s confidence in what to do next and builds their trust that this AI agent will help them get there.






This next user further expands on this expectation, going beyond “comprehension” or “completion” by providing information that a user wouldn’t think to ask for, or even asking the user clarifying questions to address informational gaps. Doing this can help refine the AI response to truly answer the question to completion, yielding higher user satisfaction.






Moving on to the second most correlated metric, participants articulated the goal of conciseness as getting their answer in a way that is scannable and approachable. Here, conciseness was less about the content of the answer or its length and more about its efficiency, with the emphasis placed on visual form.






For example, the following participant’s desire to visually skip around an AI response alludes to the importance of including elements like line breaks, bullet points, and overall organization in improving scannability.







QUESTION THREE

How are users of Oracle AI measuring quality compared to our quality metrics?


This final question allowed me to zoom out and examine the use of these quality metrics to begin with. It’s not enough for AI interactions to be algorithmically deemed “high quality”; they must provide a verifiably satisfying experience. Thus, the way that we define quality must be aligned with users. So how did they define it?

Beyond any of our initial metrics, for a high quality interaction with an AI agent, participants simply want to know exactly what actions they need to take next to address their query just by scanning through the agent's response. Once they can do that, they’re able to build a clearer understanding overall. 



It is with this guiding goal, understanding next actions at a glance, that we can align our development of Agentic AI with real world user satisfaction.




FUTURE OPPORTUNITIES

After presenting this research to the company-wide design department, I presented it to the Conversational AI Design team as well as another internal AI development team that was designing around its own separate set of quality metrics. Through the conversations I had with AI professionals across the company, three research opportunities emerged that could yield wider and more focused applications:


How demographics impact conversational experiences


Looking at the demographics of the 44 participants, I found that measured elements like age group, English fluency, and AI expertise yielded slightly different outcomes. Developing a more focused study on how different use cases of AI Agents yield different definitions and experiences of quality might illuminate more insights.
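A minimal sketch of the kind of subgroup comparison that surfaced these differences, assuming hypothetical demographic columns (age_group, english_fluency, ai_expertise) joined onto the same dataframe:

    # Hypothetical demographic fields; the actual survey columns may differ.
    for segment in ["age_group", "english_fluency", "ai_expertise"]:
        summary = (
            df.groupby(segment)["overall_satisfaction"]
              .agg(["mean", "count"])
              .round(2)
        )
        print(summary)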


Testing with a new batch of quality metrics


As previously mentioned, I selected only the four metrics that were top of mind for the Conversational AI Design team for this study. However, with more time, I would have loved to develop variations of this study with new sets of metrics, helping to refine them at a larger scale.



