How KidStory Works

Navigation: README | Architecture & Diagrams | How It Works | Database Schema | Hackathon Requirements | Deployment

KidStory is an AI-powered application built for the Gemini Live Agent Challenge that transforms a child's simple idea into a fully illustrated, narrated, and interactive digital book.

This document explains the internal logic, the Agentic Workflow design, and how we leverage Google's Gemini models to create a seamless experience.

For Parents

Register an account, then create stories for your children to read and enjoy together.

For Kids

Let your child speak their own story idea and watch the AI bring it to life with pictures and narration.

The "Agentic Orchestrator" Concept

Unlike simple chatbots, this application uses an Agentic Workflow. The backend acts as a "Director," coordinating multiple specialized AI models to build a complex multimedia product (a book) in real-time.

The Specialized Gemini Models We Use (via Vertex AI)

We utilize three specialized Gemini models deployed through Google Cloud Vertex AI for their unique strengths, demonstrating sophisticated model orchestration and deep Google Cloud integration:

  1. Gemini 2.5 Flash Image (gemini-2.5-flash-image) - The Creative Director: This is the heart of the app. We use it for true interleaved output (responseModalities: ["TEXT", "IMAGE"]) in Story Generation. It generates JSON text alongside page illustrations in a single multimodal stream.

  2. Gemini 2.5 Flash TTS (gemini-2.5-flash-preview-tts) - The Voice: Provides warm, expressive, and human-like narration with multiple voice personalities (Luna, Stella, and Kiko).

  3. Gemini 2.5 Flash (gemini-2.5-flash) - The Quiz Master: Used for quiz generation and feedback, optimized for text-only tasks to reduce latency and cost.

Key Technical Detail: All models are accessed through Google Cloud Vertex AI rather than the direct Gemini API, providing enterprise-grade scalability, monitoring, and integration with other GCP services. We use the modern @google/genai SDK with the vertexai: true configuration, demonstrating a clean, single-SDK architecture.

The SDK & Vertex AI Integration

  • @google/genai with Vertex AI: We use the modern Google AI SDK configured for Vertex AI (vertexai: true), which provides interleaved output support (responseModalities) and high-performance streaming.
  • Single-SDK Architecture: @google/genai is the only AI SDK in the codebase, keeping the architecture clean and modern.
  • Google Cloud Integration: All AI models are accessed through Vertex AI endpoints, demonstrating deep integration with Google Cloud Platform.
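
As a concrete sketch, the client options for Vertex AI mode look like the following. The project and location values are placeholders, and the helper function is ours, not the app's actual code; in practice the options object is passed to the SDK's GoogleGenAI constructor.

```typescript
// Sketch only: assembles the options object for the @google/genai client
// in Vertex AI mode. The project/location values are placeholders.
interface VertexOptions {
  vertexai: true;
  project: string;
  location: string;
}

function vertexClientOptions(project: string, location: string): VertexOptions {
  return { vertexai: true, project, location };
}

const options = vertexClientOptions("my-gcp-project", "us-central1");
// Real initialization (requires GCP credentials):
// const ai = new GoogleGenAI(options);
```

The key point is the single vertexai: true flag: the same SDK surface then routes all calls through Vertex AI endpoints instead of the direct Gemini API.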

Workflow 1: Creating a Story

When a child enters a prompt (e.g., "A dragon who loves ice cream"), the following happens:

1. Interleaved Multimodal Generation (The Magic Step)

  • The Orchestrator sends the prompt to Gemini 2.5 Flash Image.
  • System Prompt: "You are Stella, a magical children's storyteller. Generate a story JSON followed by exactly N beautiful watercolor illustrations (one for each page)."
  • Streaming Response: Gemini streams back the story JSON (title, pages, image prompts). As it "writes" the story, it also "paints" every page illustration in the same fluid stream.
  • Real-Time UI: The frontend captures these interleaved image parts. While the text is still generating, the child sees a live preview of the story being built: "Page 1 is Ready! ✨" with the actual illustration, then Page 2, and so on. This makes the generation feel active and magical.
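
The interleaved stream can be pictured with a small demultiplexer. This is an illustrative sketch rather than the app's actual code; the Part shape loosely mirrors the SDK's distinction between text parts and inlineData image parts.

```typescript
// Illustrative: split an interleaved Gemini response into story JSON text
// and per-page images. Text parts concatenate into the story JSON; each
// inlineData part carries one page illustration as base64 bytes.
interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string }; // base64 image bytes
}

function demuxInterleaved(parts: Part[]): { storyJson: string; images: string[] } {
  let storyJson = "";
  const images: string[] = [];
  for (const part of parts) {
    if (part.text) storyJson += part.text; // JSON chunks concatenate in order
    else if (part.inlineData) images.push(part.inlineData.data); // one per page
  }
  return { storyJson, images };
}
```

In streaming mode each image part can be rendered the moment it arrives ("Page 1 is Ready! ✨") while the remaining text is still being generated.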

2. Parallel Audio Assembly

As the story JSON chunks arrive, the app starts generating narration in parallel:

  • Narration: Sends each page text to Gemini 2.5 Flash TTS to generate audio with the selected narrator personality.
  • Efficiency: Any illustrations already provided by the interleaved stream are used directly, skipping redundant image generation calls.
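
The parallel narration step can be sketched as below. The synthesize function stands in for a wrapper around the Gemini TTS call; its name and signature are hypothetical.

```typescript
// Sketch: narrate all pages concurrently instead of sequentially.
// `Synthesize` is a hypothetical wrapper around the Gemini 2.5 Flash TTS
// call; in the real app it would return encoded audio bytes.
type Synthesize = (text: string, voice: string) => Promise<Uint8Array>;

async function narratePages(
  pages: string[],
  voice: string,
  synthesize: Synthesize,
): Promise<Uint8Array[]> {
  // Promise.all preserves page order while the TTS requests run in parallel.
  return Promise.all(pages.map((page) => synthesize(page, voice)));
}
```

Because the requests are independent, total narration time approaches the duration of the slowest single page rather than the sum of all pages.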

3. Assembly & Delivery

  • Media assets are stored in Google Cloud Storage with signed URLs.
  • The complete story book is persisted in Firebase Firestore.
  • The child can immediately start reading, listening, and interacting.

4. Character Consistency

  • Children can upload character photos (face references) during story creation.
  • Photos are compressed client-side (max 512px, JPEG, 70% quality) before being sent as base64.
  • The image generation model receives these references to maintain character appearance across all pages.
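
The downscale math behind the client-side compression can be sketched as a pure helper (the actual re-encode to JPEG at 70% quality would use an HTMLCanvasElement; this helper only computes the target dimensions).

```typescript
// Sketch: compute target dimensions so the longest edge is at most 512px
// while preserving aspect ratio. Images already within bounds are untouched.
function fitWithin(
  width: number,
  height: number,
  maxEdge = 512,
): { width: number; height: number } {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}
```

A 1024×768 photo would be resized to 512×384 before being base64-encoded, keeping the reference payload small.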

Workflow 2: The "Magic Quiz" (Optimized Multimodal Experience)

This is where we demonstrate sophisticated model orchestration — using the right model for the right task to create an engaging multimodal experience.

The Goal

After reading a story page, the child takes an interactive quiz. The AI generates questions, speaks them aloud, and the child can answer by voice or touch.

How It Works: Optimized Model Orchestration

Step 1: Question Generation (Optimized Text-Only via Vertex AI)

```typescript
// Optimized: text-only model for fast, cost-effective quiz generation
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash", // text-only model via Vertex AI
  contents: quizPrompt,
  config: {
    responseMimeType: "application/json", // structured JSON output
  },
});

// Parse the JSON response
const quizData = JSON.parse(response.text);
```

The model returns:

  • Text: A JSON object with the question, 3 options, correct answer, encouragement, and correction text.
  • Optimization: Using gemini-2.5-flash (text-only) reduces latency and cost while maintaining quality for quiz generation.

Step 2: Voice Narration & SFX

  • The question text is sent to Gemini 2.5 Flash TTS to generate spoken audio.
  • The child hears the question read aloud with options ("Is it A, B, or C?").
  • Multimodal Feedback: In the feedback loop, the model suggests a bracketed sound effect (e.g., [cheer]). The app's sound engine then plays the corresponding audio (e.g., correct.mp3) synchronized with the visual feedback.

Step 3: Voice Interaction

  • The child speaks their answer ("I think it's option A!").
  • The browser's Web Speech API converts speech to text.
  • Fuzzy matching identifies which option was selected (supports "A", "option A", or the full option text).
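
The matching step can be sketched as below. This is an illustrative matcher, not the app's exact algorithm: it accepts a bare letter, an "option X" phrase, or the option's full text, all case-insensitively.

```typescript
// Illustrative fuzzy matcher for spoken quiz answers. Returns the index
// of the matched option, or -1 if nothing matches.
function matchAnswer(transcript: string, options: string[]): number {
  const t = transcript.trim().toLowerCase();
  for (let i = 0; i < options.length; i++) {
    const letter = String.fromCharCode(97 + i); // "a", "b", "c", ...
    if (
      t === letter || // "A"
      t.includes(`option ${letter}`) || // "I think it's option A!"
      t.includes(options[i].toLowerCase()) // the full option text
    ) {
      return i;
    }
  }
  return -1; // no recognizable answer
}
```

A real implementation would likely also strip punctuation and handle homophones, but the core idea is matching against several spoken forms of the same choice.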

Step 4: Instant Feedback & Audio Celebration

  • The quiz provides immediate feedback using the pre-generated encouragement or correction text.
  • Audio Feedback: The feedback text is sent to Gemini 2.5 Flash TTS for spoken feedback.
  • SFX Integration: Gemini suggests a bracketed sound effect (e.g., [sparkle]) in the feedback text. The app maps these descriptions (like "sparkle", "drumroll", or "oops") to specific audio assets to play alongside the feedback.
  • Result: Sub-second response time with high-quality audio reinforcement.
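
The SFX mapping described above can be sketched as follows. The cue-to-file table and the helper name are illustrative; the real app's asset names may differ.

```typescript
// Illustrative: map bracketed SFX cues in Gemini's feedback text
// (e.g. "[sparkle]") to local audio assets, and strip the cue from
// the text before it is sent to TTS or shown on screen.
const SFX_MAP: Record<string, string> = {
  cheer: "correct.mp3",
  sparkle: "sparkle.mp3",
  drumroll: "drumroll.mp3",
  oops: "incorrect.mp3",
};

function extractSfx(feedback: string): { text: string; asset?: string } {
  const match = feedback.match(/\[(\w+)\]/); // first bracketed cue, if any
  const asset = match ? SFX_MAP[match[1].toLowerCase()] : undefined;
  return { text: feedback.replace(/\[\w+\]/g, "").trim(), asset };
}
```

Because the model only emits a short textual cue, new sound effects can be added by extending the map without retraining or re-prompting.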

Step 5: Duplicate Prevention

  • All previously asked questions are tracked and sent with each new request.
  • The prompt explicitly instructs Gemini to avoid repeating or rephrasing any previous questions.
  • This ensures all 5 quiz questions are unique and cover different story details.
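
The history-threading step can be sketched as a prompt builder. The exact wording is illustrative; the point is that previously asked questions travel with every new request.

```typescript
// Illustrative: fold previously asked questions into the quiz prompt so
// the model is explicitly told not to repeat or rephrase them.
function buildQuizPrompt(pageText: string, asked: string[]): string {
  const history = asked.length
    ? `\nDo NOT repeat or rephrase any of these previous questions:\n- ${asked.join("\n- ")}`
    : "";
  return `Generate one multiple-choice question about this story page:\n${pageText}${history}`;
}
```

Tracking history in the prompt (rather than relying on chat memory) keeps each quiz request stateless on the server side.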

Technology Stack

  • Frontend: Next.js 16 (App Router), React 19, TypeScript 5, Tailwind CSS 4, Framer Motion
  • Backend: Next.js API Routes (server-side)
  • AI Platform: Google Cloud Vertex AI (enterprise AI platform)
  • AI SDK: @google/genai v1.44+ (Vertex AI integration)
  • Database: Firebase Firestore (NoSQL)
  • Authentication: Firebase Auth (Google OAuth 2.0)
  • Storage: Google Cloud Storage (signed URLs, 7-day expiry)
  • Voice Input: Web Speech API (Chrome/Edge)
  • Hosting: Google Cloud Run (standalone Next.js)
  • AI Models: Gemini 2.5 Flash Image, Gemini 2.5 Flash TTS, Gemini 2.5 Flash (via Vertex AI)

Released under the MIT License.