Summary
LLMs are concept-transformation engines, not knowledge generators. They excel at restructuring and categorizing information you provide but struggle with novel insights and deep contextual understanding. Research shows AI can identify 77% of usability issues experts agree with, but misses 60% of unique human-identified problems. The researcher's value shifts toward strategic framing, critical validation, and influential communication.
The biggest mistake I see teams make with AI is treating it like a magic black box. They throw unstructured data in and expect coherent, reliable insights to come out. This is dangerous. We are currently spending more time figuring out how to define reliable workflows than we are saving time by using them.
Large Language Models like ChatGPT, Claude, and Gemini have become ubiquitous, and with them comes a torrent of hype, fear, and misunderstanding. It is crucial to curb the hype train with a dose of reality.
The Current State
For many agencies, vendors, and individual researchers, we are currently in an investment phase. This is not a failure of the technology; it is a natural stage in its adoption. We are not just using a tool; we are inventing the processes for that tool.
On one side, you have evangelists who claim AI will automate the entire research process, making human researchers obsolete. On the other, you have skeptics who dismiss the technology as a flawed, biased, and unreliable gimmick.
As is often the case, the reality is more nuanced.
What LLMs Actually Are
The "T" in GPT stands for Transformer [2]. This is not just a technical term; it is the most useful description of its core function.
An LLM is not a knowledge-generation machine; it is a concept-transformation engine. It is exceptionally good at taking information in one format and structuring or rephrasing it into another. It is less a generator of new facts and more a manipulator of existing concepts.
How They Work
At its core, an LLM is an advanced prediction machine. It analyzes vast amounts of text to learn statistical relationships between words. When you give it a prompt, it does not "think" or "know" the answer like a human does; it calculates the most probable next word (or "token") based on patterns it has learned.
This predictive nature is why models can "hallucinate". They are designed to generate plausible-sounding text, not to state verified facts.
Understanding this is the key to using an LLM effectively: treat it as a powerful autocomplete, search, and connector for concepts, not as a perfectly factual database.
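The prediction principle can be illustrated with a deliberately tiny sketch: count which word follows which in a corpus, then "predict" the most frequent successor. Real LLMs use neural networks over subword tokens and billions of parameters, not word counts; this toy bigram model only demonstrates the statistical idea of next-token prediction.

```python
from collections import Counter, defaultdict

# Toy bigram model: for each word, count which words follow it.
corpus = "the user clicked the button and the user left".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed word after `word`, or None."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

# "user" follows "the" twice, "button" once, so "user" wins.
print(predict_next("the"))
```

Note that the model confidently outputs the most *probable* continuation whether or not it is *true* for any particular context, which is exactly the mechanism behind hallucination.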
What the Research Shows
One study [1] provides a clear picture of the current state:
When experienced UX professionals evaluated usability issues suggested by an LLM:
- They agreed with 77% of the issues the AI found
- But the AI missed around 60% of the unique problems that human experts identified
This is not a failure of the technology. It is a clear signal of its proper role: assistant, not replacement.
AI excels at identifying common, pattern-based issues because it has been trained on vast amounts of data reflecting these known problems. It is less effective at:
- Uncovering novel issues
- Understanding deep contextual nuance
- Identifying subtle emotional reactions that a human observer would catch
The Evolving Role of the Researcher
These limitations highlight the evolution of the researcher's role. The value of the human researcher shifts away from the tedious work of raw analysis and toward the strategic work that AI cannot do:
Strategic Framing
Asking the right questions and designing sound research in the first place. AI cannot tell you what questions matter; that requires understanding the business context and user landscape.
Critical Validation
Questioning the AI's output, spotting its biases, and separating signal from noise. The AI produces drafts; you produce judgment.
Foundational models are often trained to be helpful and agreeable, a trait known as "sycophancy". To get objective results, you must turn an agreeable assistant into a critical sparring partner.
Influential Communication
Translating findings into clear, actionable recommendations that drive business decisions. The political and organizational skill of getting insights implemented remains distinctly human.
Best Use Cases for LLMs in Research
Based on experience, here are the tasks where LLMs provide the most reliable value:
| Task | Why It Works |
|---|---|
| Tagging and Thematic Analysis | Systematically categorizing qualitative data based on a taxonomy you provide |
| Generative Ideation | Exploring ideas for target groups, segments, or research questions based on a brief |
| Instrument Stress-Testing | Reviewing interview guides or survey questions for structural issues |
| Code Generation | Writing Python or R scripts for quantitative analysis |
| Translation and Localization | Initial translations for cross-cultural research (with human review) |
| Communication Polish | Feedback on reports and clearer ways to present findings |
| Efficiency Gains | Reducing time on repetitive tasks (see ROI of UX Research) |
Practical Workflow: Thematic Analysis with an LLM
Here is a concrete, step-by-step workflow for one of the most common AI-assisted research tasks: thematic analysis of qualitative data.
Step 1: Prepare Tidy Data
The biggest mistake is feeding unstructured transcripts into an LLM. Instead, use "Tidy Data" principles. Create a simple table where every row is a participant quote and columns represent metadata (participant ID, task context, timestamp). Anonymize all PII (Personally Identifiable Information) before upload.
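One way to sketch this preparation step: build the quote table in code and scrub PII before anything leaves your machine. The column names, sample quotes, and the email regex below are illustrative assumptions, not a standard; a real pipeline would also handle names, phone numbers, and account IDs.

```python
import csv
import io
import re

# Illustrative PII scrubber: replaces email addresses with a placeholder.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(text):
    """Mask email addresses. Extend with rules for names, phones, etc."""
    return EMAIL.sub("[EMAIL]", text)

# One row per participant quote; columns are metadata (Tidy Data).
quotes = [
    ("P01", "checkout", "00:04:12", "I gave up and emailed help@shop.example instead."),
    ("P02", "checkout", "00:02:55", "The coupon field was impossible to find."),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["participant_id", "task", "timestamp", "quote"])
for pid, task, ts, quote in quotes:
    writer.writerow([pid, task, ts, scrub(quote)])

tidy_csv = buffer.getvalue()
print(tidy_csv)
```

The resulting CSV is what you paste or upload, never the raw transcript.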
Step 2: Engineer a Structured Prompt
Do not ask the AI to "find insights." Give it a mechanical task with explicit constraints:
- Role: "You are a meticulous UX Researcher."
- Task: "Categorize each user quote based on the taxonomy provided below."
- Taxonomy: Provide strict definitions (e.g., "Usability," "Feature Request," "Trust/Security").
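The three parts above can be assembled into a single structured prompt. The taxonomy definitions and the tab-separated output format here are illustrative assumptions; the point is that role, task, taxonomy, and data are explicit and machine-checkable, not "find insights."

```python
# Hypothetical taxonomy with strict, mutually exclusive definitions.
TAXONOMY = {
    "Usability": "Friction completing a task (confusion, errors, dead ends).",
    "Feature Request": "An explicit ask for new or changed functionality.",
    "Trust/Security": "Concerns about data, payment, or credibility.",
}

def build_prompt(rows):
    """Assemble a role + task + taxonomy + data prompt for tagging."""
    taxonomy_lines = "\n".join(f"- {k}: {v}" for k, v in TAXONOMY.items())
    data_lines = "\n".join(f"{pid}\t{quote}" for pid, quote in rows)
    return (
        "You are a meticulous UX Researcher.\n"
        "Categorize each user quote using ONLY the taxonomy below. "
        "Return one line per quote: participant_id<TAB>category.\n\n"
        f"Taxonomy:\n{taxonomy_lines}\n\n"
        f"Quotes:\n{data_lines}"
    )

prompt = build_prompt([("P01", "I couldn't find the coupon field.")])
print(prompt)
```

Constraining the output format also makes the response trivially parseable, which matters for the next step.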
Step 3: The Committee of Raters
To increase reliability, use multiple models (e.g., GPT-4 and Claude) as a "Committee of Raters." Feed them the same data and prompt.
- Where they agree, you have high confidence.
- Where they disagree, you have a signal for nuance that requires human review.
This approach mirrors traditional inter-rater reliability practices in qualitative research, using AI disagreement as a flag for human attention rather than a failure.
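The comparison itself is mechanical once both models have returned tags. In this sketch the model outputs are hard-coded stand-ins for real API responses; agreements become high-confidence labels, and disagreements go on a review list for a human.

```python
# Stand-in outputs from two models tagging the same three quotes.
tags_model_a = {"P01": "Usability", "P02": "Feature Request", "P03": "Usability"}
tags_model_b = {"P01": "Usability", "P02": "Trust/Security", "P03": "Usability"}

agreed, review = {}, []
for quote_id in sorted(tags_model_a):
    a, b = tags_model_a[quote_id], tags_model_b[quote_id]
    if a == b:
        agreed[quote_id] = a              # high confidence: keep the label
    else:
        review.append((quote_id, a, b))   # disagreement: human review

print("agreed:", agreed)
print("needs review:", review)
```

Disagreement here is not an error condition; it is the signal you were looking for.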
Step 4: Human Validation (The Nuance Check)
The AI sees text; you saw the session. Perform a "Nuance Check" on the output:
- Sarcasm: Did the user say "Great job" with an eye-roll? AI will tag that as "Positive Sentiment." You must correct it.
- Silence: Did the user hesitate before clicking? AI cannot see silence.
- Context: Did the user's frustration stem from the interface or from an unrelated interruption during the session?
What This Means for Practice
The goal is not to replace your judgment with AI but to use AI to amplify your judgment. The most effective researchers will be those who:
- Understand what LLMs are actually good at (transformation, not generation)
- Provide structured inputs that play to those strengths
- Maintain rigorous human oversight of all outputs
- Focus their own energy on the strategic work AI cannot do
This is not about learning a specific tool; tools will change. It is about learning a way of thinking about human-AI partnership that will outlast any particular model or platform.
References
- [1]
- [2] Vaswani, A., et al. (2017). "Attention Is All You Need". Advances in Neural Information Processing Systems.