How to Use Gemini Flash Lite API: Setup Guide (2026)

Hey fellow API tinkerers — if you've ever spent 20 minutes debugging a "model not found" error just because you had the ID slightly wrong, this guide is for you.

I'm Hanks. I test AI tools inside real workflows, not demos. Gemini 3.1 Flash-Lite dropped on March 3, 2026, and I've been running it inside actual pipelines since day one. This guide skips the "what is Flash-Lite" explanation — you already know what it's for. What you need is: the exact setup, working code for both Python and Node.js, the safe defaults you should set before you go anywhere near production, and the errors you'll probably hit in the first two hours.

Let's go.


Get Access — Gemini API vs Vertex AI

Gemini API (AI Studio) — fastest path to first call

For most developers, Google AI Studio is the right starting point. Two steps:

  1. Go to aistudio.google.com/apikey and generate an API key — no billing required to start

  2. Set it as an environment variable: export GEMINI_API_KEY="your_key_here"

That's the whole setup for the free tier. You get access to Flash-Lite immediately, both input and output tokens are free during preview, and you don't need a credit card.

The catch: free tier rate limits for preview models are stricter than what you'd expect. Based on December 2025 quota adjustments, free tier users land in the range of 10–15 RPM and 250K TPM, with a daily request ceiling. You'll hit those limits quickly if you're testing anything at volume. Enable billing in Google Cloud to upgrade to Tier 1 (150–300 RPM, 1M TPM) — for Flash-Lite's price tier, the monthly cost at typical dev volumes is measured in cents, not dollars.

One thing to know before you start: rate limits are applied per project, not per API key. If you have multiple keys under the same project, they share the same quota bucket.

Vertex AI — when to choose it (GCP billing, enterprise, compliance)

If you're in a Google Cloud environment already, or if your company has compliance requirements around data residency and audit logging, go straight to Vertex AI. The model is the same — same performance, same pricing — but you authenticate using Application Default Credentials (ADC) instead of an API key, and the data handling falls under your Cloud contract.

The Vertex AI setup uses the same @google/genai / google-genai SDK. The only difference is initialization:

# Vertex AI initialization
import os
from google import genai
# Set these environment variables:
# GOOGLE_CLOUD_PROJECT=your_project_id
# GOOGLE_CLOUD_LOCATION=global
# GOOGLE_GENAI_USE_VERTEXAI=True
client = genai.Client()  # picks up env vars automatically

For the Vertex AI path, you also need to enable the Vertex AI API in your GCP project and run gcloud auth application-default login to set up credentials. Full setup is in the Vertex AI Gemini API quickstart.
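The steps above, collected into one copy-pasteable block (your_project_id is a placeholder to replace with your actual GCP project ID):

```shell
# Enable the Vertex AI API on your project (one-time setup)
gcloud services enable aiplatform.googleapis.com --project=your_project_id

# Set up Application Default Credentials for local development
gcloud auth application-default login

# Environment variables the google-genai SDK reads for the Vertex AI path
export GOOGLE_CLOUD_PROJECT=your_project_id
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
```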

Preview access notes — what's restricted before GA

Flash-Lite is in preview as of March 2026. Preview status has two practical consequences worth knowing before you write any infrastructure code:

  • Rate limits are tighter. Preview models have more restrictive quota limits than their GA equivalents. The limits you see today will likely increase when the model goes GA.
  • The model ID includes -preview. gemini-3.1-flash-lite-preview is the correct string today. When GA ships, the ID will change. Any code that hardcodes the preview ID will need a one-line update. Treat model IDs as config values, not string literals buried in your code.

There's no public timeline for GA. Monitor the Gemini API changelog for updates.


Your First Working Request

The exact model ID — gemini-3.1-flash-lite-preview (common wrong variants listed)

The correct model ID is: gemini-3.1-flash-lite-preview

This is what I see people get wrong most often in the first hour:

Wrong IDs you might try, and why they fail:

  • gemini-3.1-flash-lite — missing -preview suffix (model not found)
  • gemini-flash-lite — incomplete ID (model not found)
  • gemini-3-flash-lite-preview — wrong generation number (different model)
  • gemini-2.5-flash-lite — previous generation (works, but it's not Flash-Lite 3.1)
  • gemini-3.1-flash-preview — that's Gemini 3.1 Flash, not Flash-Lite

Save yourself the confusion: copy the ID from the table above and store it in a config constant.
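One concrete pattern for that: read the ID from an environment variable with the preview ID as the fallback, so the eventual GA rename becomes a deployment change rather than a code change. The GEMINI_MODEL_ID variable name here is my own convention, not anything the SDK looks for:

```python
import os

# My own convention: override the model ID at deploy time via an env var,
# falling back to the current preview ID. The SDK does not read this itself.
MODEL_ID = os.environ.get("GEMINI_MODEL_ID", "gemini-3.1-flash-lite-preview")
```

Then pass model=MODEL_ID in every request instead of repeating the literal string.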

Minimal Python example — 10 lines, no extras

Install the SDK first:

pip install google-genai

Then:

import os
from google import genai
# API key is read from GEMINI_API_KEY env var automatically
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Classify this review as positive, negative, or neutral: 'Fast shipping, great quality.'"
)
print(response.text)

That's a working request. No extra config, no imports beyond the basics. Run it and you'll get a response in under two seconds on standard tier.

Minimal Node.js example

Install the SDK:

npm install @google/genai

Note: The current package is @google/genai. The older @google/generative-ai is deprecated and does not receive Gemini 2.0+ features. Don't use it for this.

import { GoogleGenAI } from "@google/genai";
// API key is read from GEMINI_API_KEY env var automatically
const ai = new GoogleGenAI({});
async function main() {
  const response = await ai.models.generateContent({
    model: "gemini-3.1-flash-lite-preview",
    contents: "Classify this review as positive, negative, or neutral: 'Fast shipping, great quality.'"
  });
  console.log(response.text);
}
main();

Requires Node.js v18 or later. The example uses ES module syntax, so add "type": "module" to your package.json, or rename the file to .mjs, if your project isn't already set up for ES modules.

Streaming vs non-streaming — one-line decision rule

Use streaming if: the user is waiting and you want to show output as it arrives. Latency feels faster even if total time is the same.

Use non-streaming if: you're processing the output programmatically (parsing JSON, routing to next step, storing to DB). Streaming requires reassembling chunks before you can safely parse them, which adds complexity with no benefit.

Here's streaming in Python — it's literally a one-method swap:

# Non-streaming (default)
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Summarize this article..."
)
print(response.text)
# Streaming — swap to generate_content_stream
for chunk in client.models.generate_content_stream(
    model="gemini-3.1-flash-lite-preview",
    contents="Summarize this article..."
):
    print(chunk.text, end="", flush=True)

For Flash-Lite's typical use cases — classification, extraction, translation — you're almost always in non-streaming territory. Streaming is more relevant for user-facing chat interfaces.


Best Defaults for Production

Max output tokens — set a cap, never leave open

Flash-Lite defaults to its maximum output of 64K tokens if you don't constrain it. The model is also on the verbose side — benchmarks show it generating roughly 2.5× more output tokens than comparable models at the same price tier. An uncapped max_output_tokens will inflate your output costs and slow down your pipeline.

Set it based on your task:

from google import genai
from google.genai import types
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Classify sentiment: 'Loved it!'",
    config=types.GenerateContentConfig(
        max_output_tokens=20,    # Classification: label only
    )
)

Rough guidelines:

  • Classification / sentiment: 10–50
  • Named entity extraction: 100–200
  • Short translation (< 500 words input): 2× input token estimate
  • JSON data extraction: size of expected JSON + 20% buffer
  • Summarization (short): 200–500
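For the "2× input token estimate" guideline you need a rough input count before making the call. A crude heuristic that works tolerably for English text is about 4 characters per token; that ratio is my assumption for a sketch, and the API's token-counting endpoint is the accurate option when an extra round trip is acceptable:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using ~4 characters per token (English text)."""
    return max(1, len(text) // 4)

def translation_output_cap(source_text: str) -> int:
    """max_output_tokens for a short translation: 2x the input estimate."""
    return 2 * estimate_tokens(source_text)
```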

Thinking level config — how to set minimal / low / medium / high

Gemini 3.1 Flash-Lite supports four thinking levels: minimal, low, medium, and high. If no thinking level is specified, the model defaults to high — which is the most expensive and slowest option. For Flash-Lite's primary use cases, you almost always want minimal.

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Translate to French: 'The weather is great today.'",
    config=types.GenerateContentConfig(
        max_output_tokens=100,
        thinking_config=types.ThinkingConfig(
            thinking_level="minimal"   # Fast, cheap — right for translation/classification
        )
    )
)

Migration note for Gemini 2.5 users: thinking_budget is still supported for backward compatibility, but the recommended parameter for Gemini 3 models is thinking_level. If you previously set thinking_budget=0, the equivalent is thinking_level="minimal". Don't use both parameters in the same request.
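If a 2.5 codebase sets thinking_budget in several places, a small shim makes the migration mechanical. Only the budget-zero case maps to a documented equivalence; the other cutoffs below are illustrative guesses of mine, not an official table:

```python
def budget_to_level(thinking_budget: int) -> str:
    """Map a Gemini 2.5-style thinking_budget to a Gemini 3 thinking_level.

    Only thinking_budget == 0 -> "minimal" is documented; the remaining
    thresholds are rough placeholders to tune for your own workload.
    """
    if thinking_budget == 0:
        return "minimal"
    if thinking_budget <= 1024:
        return "low"
    if thinking_budget <= 8192:
        return "medium"
    return "high"
```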

When to go above minimal:

  • minimal — translation, classification, simple extraction; anything with a clear right answer
  • low — extraction with some ambiguity, structured output with optional fields
  • medium — multi-step instructions, moderate conditional logic
  • high — complex instruction following, judgment calls; if you're here consistently, consider Flash instead

JSON schema output pattern (system prompt template)

For extraction and classification tasks, structured JSON output is more reliable than asking the model to format things in prose. Use response_mime_type with a Pydantic schema for clean, parseable output:

from pydantic import BaseModel, Field
from google import genai
from google.genai import types
client = genai.Client()
class SentimentResult(BaseModel):
    label: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(description="0.0 to 1.0")
    reasoning: str = Field(description="One sentence explanation")
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="Classify: 'Delivery was late but customer service was helpful.'",
    config=types.GenerateContentConfig(
        max_output_tokens=150,
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
        response_mime_type="application/json",
        response_json_schema=SentimentResult.model_json_schema(),
    )
)
import json
result = json.loads(response.text)
print(result["label"])  # "neutral"

This pattern eliminates JSON parsing errors caused by the model wrapping its output in markdown code fences or adding explanatory text. The response_json_schema parameter enforces the structure at the API level.
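It still costs little to parse defensively, for the day you point the same code at a model or mode that ignores the schema. A stdlib sketch that tolerates markdown fences before parsing:

```python
import json
import re

def parse_json_response(text: str) -> dict:
    """Parse model output as JSON, stripping markdown code fences if present."""
    cleaned = text.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading ``` or ```json
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing ```
    return json.loads(cleaned)
```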

Timeout and retry defaults

Preview models can occasionally take longer to respond, and you will hit rate limits. Build retry logic from the start:

import time
from google import genai
from google.genai import types
client = genai.Client()
def generate_with_retry(prompt: str, max_retries: int = 3) -> str:
    """Generate content with exponential backoff on 429s."""
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model="gemini-3.1-flash-lite-preview",
                contents=prompt,
                config=types.GenerateContentConfig(
                    max_output_tokens=200,
                    thinking_config=types.ThinkingConfig(thinking_level="minimal"),
                )
            )
            return response.text
        except Exception as e:
            if "429" in str(e) or "RESOURCE_EXHAUSTED" in str(e):
                wait = 2 ** attempt        # 1s, 2s, 4s
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            raise  # Re-raise non-rate-limit errors immediately

    raise RuntimeError(f"Failed after {max_retries} retries")

Set a request timeout in production. For Flash-Lite, a 30-second timeout is generous — if a request takes longer than that, something is wrong and you want to fail fast rather than hold open connections.
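If the SDK version you're on doesn't expose a request timeout option, you can enforce one client-side with the standard library. In this sketch, fn stands for any zero-argument callable that performs the request (a placeholder, not an SDK function):

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s: float = 30.0):
    """Run fn() in a worker thread; raise TimeoutError past timeout_s."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        return future.result(timeout=timeout_s)
```

Note that this bounds how long your caller waits; the underlying request thread isn't cancelled, so pair it with connection limits if you use it at volume.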


Troubleshooting — Common First-Day Failures

⚠️ "Model not found" — wrong ID or endpoint

Symptoms: 404 Not Found, message contains "models/gemini-3.1-flash-lite" or similar

Cause: Either the model ID is wrong, or you're using a legacy SDK that doesn't support Gemini 3 models.

Fixes:

  • Confirm the exact model ID: gemini-3.1-flash-lite-preview (copy it, don't type it)
  • Check your SDK version: pip show google-genai or npm list @google/genai
  • If you see @google/generative-ai in your code instead of @google/genai — that's the deprecated package. Install the current one: npm install @google/genai
  • If using Vertex AI, confirm the model is available in your selected region — not all preview models are in all regions at launch

⚠️ 429 on first call — preview quota is stricter

Symptoms: 429 RESOURCE_EXHAUSTED immediately or within the first few requests

Cause: Preview models have tighter rate limits than standard models. Free tier compounds this. The December 2025 quota adjustments tightened free tier limits significantly across all Gemini models.

Fixes:

  • Enable billing to upgrade to Tier 1. The upgrade is instant — you don't need to do anything else, it kicks in as soon as billing is enabled on your Google Cloud project
  • Add exponential backoff retry logic (see the pattern in the previous section)
  • Check that you're not running multiple test scripts simultaneously under the same project — limits are per-project, not per-key
  • If you legitimately need higher limits than Tier 1, submit a quota increase request via the Google Cloud Console quotas page

One thing that trips people up: daily request limits (RPD) reset at midnight Pacific Time. If you burned through your daily quota in testing, you may need to wait until the next day rather than debugging a code issue.
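If you'd rather have your tooling report the wait than guess at it, the reset boundary is computable. A sketch using the stdlib zoneinfo (assumes the host has timezone data for America/Los_Angeles, which is standard on Linux):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def seconds_until_quota_reset() -> float:
    """Seconds until the next midnight Pacific Time, when RPD quotas reset."""
    pacific = ZoneInfo("America/Los_Angeles")
    now = datetime.now(pacific)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()
```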


⚠️ Safety refusal — quick prompt rewrite tips

Symptoms: Response contains safety disclaimer or refuses to answer. response.text may be empty, or response.candidates[0].finish_reason returns SAFETY.

Cause: Flash-Lite, like all Gemini models, applies safety filters. The default thresholds are calibrated for general use. Some legitimate professional or classification tasks can trip them, particularly anything involving descriptions of harmful content (even in a moderation context) or certain sensitive topics.

Quick diagnosis:

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=your_prompt
)
# Check finish reason before accessing text
if response.candidates[0].finish_reason.name == "SAFETY":
    print("Safety filter triggered")
    print(response.candidates[0].safety_ratings)
else:
    print(response.text)

Rewrite tips that usually help:

  • Add a system instruction that establishes your context: "You are a content moderation classifier for a professional platform. Classify the following content as safe or unsafe."
  • Rephrase the task from "describe the harmful thing" to "classify whether this content contains [category]"
  • If you're processing user-generated content for moderation, frame it as: "The following text was submitted by a user. Determine if it violates our policies on [X]."
  • For legitimate use cases where default safety settings are too conservative, the Gemini safety settings documentation covers how to adjust category thresholds — though this requires careful evaluation before doing it in production
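The framing tips above can be packaged as a small prompt builder. Everything in this sketch is a hypothetical template of my own wording, not an API feature:

```python
def moderation_prompt(user_content: str, policy_category: str) -> str:
    """Wrap user content in a classification framing that is less likely to
    trip safety filters than asking the model to describe the content."""
    return (
        "You are a content moderation classifier for a professional platform.\n"
        "The following text was submitted by a user. Determine if it violates "
        f"our policies on {policy_category}. Respond with 'violates' or 'safe'.\n\n"
        f"User content: {user_content}"
    )
```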


At Macaron, we route tasks like classification, translation, and extraction through exactly this kind of lightweight model layer — turning the structured outputs straight into workflow steps without extra context switching. If you're building something similar and want to see how a real task runs end-to-end, test your own workflow at macaron.im and judge the output yourself.


FAQ

Do I need billing enabled to use the free tier?

No. The Gemini API free tier requires no credit card and no billing setup. You get an API key from Google AI Studio and start making requests immediately. The free tier for Flash-Lite includes both input and output tokens at no charge during preview. Enabling billing upgrades you to Tier 1 with higher rate limits and access to features like Context Caching and Batch API — but it's not required for initial testing.

Can I use the same code for Gemini Flash (non-Lite)?

Almost. The only required change is the model ID:

# Flash-Lite
model="gemini-3.1-flash-lite-preview"
# Flash (full)
model="gemini-3.1-flash-preview"

Everything else — SDK initialization, GenerateContentConfig, ThinkingConfig, streaming — works identically across both models. The thinking_level parameter applies to both. One behavioral note: Flash defaults to high thinking level, same as Flash-Lite, so set your thinking level explicitly in either case rather than relying on the default.

When will Flash-Lite move from preview to GA?

Google hasn't published a timeline. The API changelog is the right place to track this. When GA ships, the model ID will change (the -preview suffix will drop), and rate limits will increase. Both of those are good things, but both require a code update if you've hardcoded the preview ID. Store the model ID in a config constant, not inline.

Does Flash-Lite support function calling?

Yes. Function calling, structured outputs, code execution, URL context, file search, and Grounding with Google Search are all supported. Live API and image/audio generation are not supported — Flash-Lite outputs text only. The full capability table is in the official model documentation.

What's the knowledge cutoff for Flash-Lite?

January 2025. Anything after that requires Grounding with Google Search to be accurate — or you provide the relevant context in the prompt directly.


Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”
