Modere Chatbot

The web is evolving beyond clicks and text boxes — users can now speak to websites and see lifelike avatars respond in real time.

By combining the SpeechRecognition API, an LLM-powered chatbot, and D-ID’s virtual avatars, we built a natural, human-like conversation layer directly inside the browser. The experience feels more personal, more intuitive, and far closer to how people actually communicate.


Listening with the SpeechRecognition API

Modern browsers support speech input through the SpeechRecognition interface — part of the Web Speech API. It converts your voice to text on the fly and makes it easy to feed spoken prompts directly into your chatbot.

JavaScript
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.continuous = true;
recognition.interimResults = false;
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log('User said:', transcript);
  sendToChatbot(transcript);
};

recognition.start();

Once initialized, your website starts listening for voice input and passing the transcribed text to your chatbot API route.


Adding Context Before Sending to the LLM

The next step is to connect that text to your chatbot backend — but not as a blind prompt.

For web applications (like e-commerce), it’s powerful to include contextual information from the page: product SKU, variant, description, or even pricing. That context lets your chatbot respond intelligently to the environment the user is already in.

This gives the model live awareness of the page — so if the user asks, “Does this come in black?” or “Is it available in medium?” the AI can tailor the answer using the data already in the DOM.
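For example, the sendToChatbot helper referenced earlier might scrape that context from data attributes before posting to the chat route. Here is a minimal sketch, assuming a [data-product] element and an /api/chat route that are stand-ins rather than the actual implementation:

JavaScript
// Hypothetical helper: read product context from data attributes on the page,
// then send it to the chatbot route alongside the spoken prompt.
async function sendToChatbot(prompt) {
  const productEl = document.querySelector('[data-product]'); // assumed markup
  const context = {
    sku: productEl?.dataset.sku,
    variant: productEl?.dataset.variant,
    price: productEl?.dataset.price
  };

  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, context })
  });

  return res; // the streamed reply is consumed downstream by the UI layer
}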


The Chatbot Backend

Here’s how the backend might look using the Vercel AI SDK with OpenAI or Ollama:

JavaScript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req) {
  const { prompt, context } = await req.json();

  const response = await streamText({
    model: openai('gpt-4o-mini'),
    messages: [
      {
        role: 'system',
        content: `You are a helpful virtual assistant on a shopping site.
        Use product context when relevant:
        SKU: ${context.sku}, Variant: ${context.variant}, Price: ${context.price}.
        Always keep responses concise and conversational.`,
      },
      { role: 'user', content: prompt },
    ],
  });

  return response.toAIStreamResponse();
}

This approach gives the chatbot real situational awareness — something static prompts can’t replicate.

It’s what makes the interaction feel personalized, not generic.

For Modere, we took a slightly different approach. I wrote a function that handles the call to our chatbot API, which we set up to accept additional JSON context for retrieval-augmented generation (RAG). The payload also includes cart data, so if the user asks to add an item to the cart, the AI knows which product they're looking at and can add it for them.

JavaScript
const dataToSend = {
  messages: messages,
  response_type: 'history',
  user_context: context,
  session_id: session_id,
  locale: locale,
  cart_id: cart_id,
  object_id: object_id,
  product_id: product_id,
  list_price: list_price,
  currency_code: currency_code,
  merchandise_id: merchandise_id
};

Giving It a Face with D-ID

Once the chatbot replies, we hand that text off to D-ID, which generates a short, animated video of a lifelike avatar speaking the response. The first step is fetching the agent itself, which also gives us the idle loop that plays while the assistant isn't talking:

JavaScript
async function agentFetch(AgentID: string) {
  const myHeaders = new Headers();
  myHeaders.append('accept', 'application/json');
  myHeaders.append('authorization', 'Basic yourkeyhere');

  const response = await fetch(`https://api.d-id.com/agents/${AgentID}`, {
    method: 'GET',
    headers: myHeaders,
    redirect: 'follow'
  });
  const responseJson = await response.json();

  // Store the presenter's idle loop so it can play while the assistant is silent.
  setAgentIdle(responseJson.presenter.idle_video);
}

Within seconds, the chatbot’s text response turns into a video of a virtual assistant speaking naturally — a human-like interface right inside the browser.
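
Our production setup uses the Agents API shown above, but if you just want to see the text-to-video step in isolation, D-ID's Talks endpoint is the simplest illustration. This is a sketch only; the presenter image URL, polling interval, and attempt limit are assumptions:

JavaScript
// Illustrative only: create a talk from the reply text, then poll until the clip is ready.
async function speakWithAvatar(replyText) {
  const headers = {
    accept: 'application/json',
    'content-type': 'application/json',
    authorization: 'Basic yourkeyhere'
  };

  // Ask D-ID to render the reply as a spoken clip
  const createRes = await fetch('https://api.d-id.com/talks', {
    method: 'POST',
    headers,
    body: JSON.stringify({
      source_url: 'https://example.com/presenter.png', // placeholder presenter image
      script: { type: 'text', input: replyText }
    })
  });
  const { id } = await createRes.json();

  // Poll for the finished video URL
  for (let attempt = 0; attempt < 30; attempt++) {
    const talkRes = await fetch(`https://api.d-id.com/talks/${id}`, { headers });
    const talk = await talkRes.json();
    if (talk.status === 'done') return talk.result_url;
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  throw new Error('Timed out waiting for the avatar clip');
}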

Just remember to switch the video back to idle after the response.
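
With a plain video element, that switch can hang off the ended event. The playAvatarClip name, clipUrl, and agentIdleUrl below are placeholders for whatever your player and state look like:

JavaScript
// Play the rendered reply, then fall back to the idle loop once it finishes.
function playAvatarClip(videoEl, clipUrl, agentIdleUrl) {
  videoEl.loop = false;
  videoEl.src = clipUrl;
  videoEl.play();

  videoEl.addEventListener(
    'ended',
    () => {
      videoEl.src = agentIdleUrl; // back to the idle video fetched from the agent
      videoEl.loop = true;
      videoEl.play();
    },
    { once: true }
  );
}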


How It All Connects

User speaks →
SpeechRecognition API →
Chatbot API (with page context) →
LLM generates reply →
D-ID renders avatar →
Browser plays response

All handled inside the web stack — no native app required.
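
Put together, the glue is only a few calls deep. The names below come from the sketches earlier in this post, so treat them as placeholders rather than the production wiring:

JavaScript
// Hypothetical glue: speech in, context-aware reply out, avatar playback at the end.
recognition.onresult = async (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;

  const res = await sendToChatbot(transcript);
  const replyText = await res.text(); // assumes a plain-text stream; adapt for the AI SDK data stream format

  const clipUrl = await speakWithAvatar(replyText);
  playAvatarClip(avatarVideoEl, clipUrl, agentIdleUrl); // avatarVideoEl: your <video> element
};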


Why This Matters

This isn’t just a technical novelty.
It’s a glimpse of how conversation and commerce will merge on the web:

  • Shoppers can ask about pricing, flavors, or alternatives hands-free.
  • Brands can create on-page digital hosts that guide users through decisions.
  • Accessibility improves dramatically — no typing or reading required.
  • Businesses get human-like interactivity without needing human staff on every page.

The combination of voice input, AI understanding, and visual feedback transforms a static product page into a living experience.


The Takeaway

With just a few modern browser APIs and a well-structured AI backend, Modere was able to:

  • Listen to users with SpeechRecognition,
  • Think contextually with an LLM, and
  • Speak back visually with D-ID avatars.

This is the next evolution of web interaction — one where your website isn’t just a destination but a digital conversation.