
Hey fellow API tinkerers — if you've ever spent 20 minutes debugging a "model not found" error just because you had the ID slightly wrong, this guide is for you.
I'm Hanks. I test AI tools inside real workflows, not demos. Gemini 3.1 Flash-Lite dropped on March 3, 2026, and I've been running it inside actual pipelines since day one. This guide skips the "what is Flash-Lite" explanation — you already know what it's for. What you need is: the exact setup, working code for both Python and Node.js, the safe defaults you should set before you go anywhere near production, and the errors you'll probably hit in the first two hours.
Let's go.
For most developers, Google AI Studio is the right starting point. Two steps: grab a free API key from Google AI Studio, then export it as an environment variable:

export GEMINI_API_KEY="your_key_here"

That's the whole setup for the free tier. You get access to Flash-Lite immediately, both input and output tokens are free during preview, and you don't need a credit card.
The catch: free tier rate limits for preview models are stricter than what you'd expect. Based on December 2025 quota adjustments, free tier users land in the range of 10–15 RPM and 250K TPM, with a daily request ceiling. You'll hit those limits quickly if you're testing anything at volume. Enable billing in Google Cloud to upgrade to Tier 1 (150–300 RPM, 1M TPM) — for Flash-Lite's price tier, the monthly cost at typical dev volumes is measured in cents, not dollars.
One thing to know before you start: rate limits are applied per project, not per API key. If you have multiple keys under the same project, they share the same quota bucket.
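Because every key draws from the same project-level bucket, it's worth throttling client-side before the API does it for you. Here's a minimal sketch of a shared requests-per-minute limiter — the `RpmLimiter` class is my own illustration, not part of the SDK, and you'd set `rpm` to your tier's published limit:

```python
import time
from collections import deque

class RpmLimiter:
    """Minimal client-side requests-per-minute throttle (illustrative sketch).

    Share one instance across every worker that uses the same project,
    since the API's quota bucket is per project, not per key.
    """

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.calls = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps older than the 60-second window
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) >= self.rpm:
            # Sleep until the oldest request ages out of the window
            time.sleep(60 - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RpmLimiter(rpm=10)  # match your tier's limit
# Call limiter.acquire() immediately before each generate_content request
```

Crude, but it keeps a multi-worker test run from burning your daily quota in the first five minutes.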

If you're in a Google Cloud environment already, or if your company has compliance requirements around data residency and audit logging, go straight to Vertex AI. The model is the same — same performance, same pricing — but you authenticate using Application Default Credentials (ADC) instead of an API key, and the data handling falls under your Cloud contract.
The Vertex AI setup uses the same @google/genai / google-genai SDK. The only difference is initialization:
# Vertex AI initialization
import os
from google import genai
# Set these environment variables:
# GOOGLE_CLOUD_PROJECT=your_project_id
# GOOGLE_CLOUD_LOCATION=global
# GOOGLE_GENAI_USE_VERTEXAI=True
client = genai.Client() # picks up env vars automatically
For the Vertex AI path, you also need to enable the Vertex AI API in your GCP project and run gcloud auth application-default login to set up credentials. Full setup is in the Vertex AI Gemini API quickstart.
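Those two setup steps look like this in practice — this assumes the gcloud CLI is installed and you're authenticated against the right project (replace your_project_id with your own):

```shell
# Enable the Vertex AI API for your project
gcloud services enable aiplatform.googleapis.com --project=your_project_id

# Set up Application Default Credentials locally
gcloud auth application-default login

# Point the SDK at Vertex AI
export GOOGLE_CLOUD_PROJECT=your_project_id
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True
```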
Flash-Lite is in preview as of March 2026. Preview status has two practical consequences worth knowing before you write any infrastructure code:
First, the model ID carries a -preview suffix: gemini-3.1-flash-lite-preview is the correct string today. When GA ships, the ID will change, so any code that hardcodes the preview ID will need a one-line update. Treat model IDs as config values, not string literals buried in your code. Second, there's no public timeline for GA. Monitor the Gemini API changelog for updates.

The correct model ID is: gemini-3.1-flash-lite-preview
This is what I see people get wrong most often in the first hour:
Save yourself the confusion: copy the ID above and store it in a config constant.
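One way to do that is an environment-variable override with the preview ID as the fallback, so the GA rename becomes a deploy-time change rather than a code change. The GEMINI_MODEL_ID variable name is my convention, not the SDK's:

```python
import os

# Preview ID today; override via env var when GA drops the -preview suffix
MODEL_ID = os.environ.get("GEMINI_MODEL_ID", "gemini-3.1-flash-lite-preview")

# Usage:
# response = client.models.generate_content(model=MODEL_ID, contents=...)
```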
Install the SDK first:
pip install google-genai
Then:
import os
from google import genai
# API key is read from GEMINI_API_KEY env var automatically
client = genai.Client()
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents="Classify this review as positive, negative, or neutral: 'Fast shipping, great quality.'"
)
print(response.text)
That's a working request. No extra config, no imports beyond the basics. Run it and you'll get a response in under two seconds on standard tier.
Install the SDK:
npm install @google/genai
Note: the current package is @google/genai. The older @google/generative-ai is deprecated and does not receive Gemini 2.0+ features. Don't use it for this.
import { GoogleGenAI } from "@google/genai";
// API key is read from GEMINI_API_KEY env var automatically
const ai = new GoogleGenAI({});
async function main() {
const response = await ai.models.generateContent({
model: "gemini-3.1-flash-lite-preview",
contents: "Classify this review as positive, negative, or neutral: 'Fast shipping, great quality.'"
});
console.log(response.text);
}
main();
Requires Node.js v18 or later. If your project doesn't handle ES modules by default, add "type": "module" to your package.json or save the file with a .mjs extension.
Use streaming if: the user is waiting and you want to show output as it arrives. Latency feels faster even if total time is the same.
Use non-streaming if: you're processing the output programmatically (parsing JSON, routing to next step, storing to DB). Streaming requires reassembling chunks before you can safely parse them, which adds complexity with no benefit.
Here's streaming in Python — it's literally a one-method swap:
# Non-streaming (default)
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents="Summarize this article..."
)
print(response.text)
# Streaming — swap to generate_content_stream
for chunk in client.models.generate_content_stream(
model="gemini-3.1-flash-lite-preview",
contents="Summarize this article..."
):
print(chunk.text, end="", flush=True)
For Flash-Lite's typical use cases — classification, extraction, translation — you're almost always in non-streaming territory. Streaming is more relevant for user-facing chat interfaces.
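If you ever do need to parse streamed output, buffer the chunks and parse once at the end. A sketch of the reassembly pattern — collect_stream is my helper name, not an SDK function:

```python
import json

def collect_stream(chunks) -> str:
    """Reassemble streamed text chunks into one string before parsing."""
    parts = []
    for chunk in chunks:
        # SDK chunks carry a .text fragment; plain strings work too
        text = getattr(chunk, "text", chunk)
        if text:
            parts.append(text)
    return "".join(parts)

# Parsing mid-stream would fail on truncated JSON; parse only after joining
raw = collect_stream(['{"label": ', '"positive"', "}"])
result = json.loads(raw)
```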
Flash-Lite defaults to its maximum output of 64K tokens if you don't constrain it. The model is also on the verbose side — benchmarks show it generating roughly 2.5× more output tokens than comparable models at the same price tier. An unconstrained max_output_tokens setting will inflate your output costs and slow down your pipeline.
Set it based on your task:
from google import genai
from google.genai import types
client = genai.Client()
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents="Classify sentiment: 'Loved it!'",
config=types.GenerateContentConfig(
max_output_tokens=20, # Classification: label only
)
)
Rough guidelines: 10–20 tokens for classification labels, around 100 for short translations, 150–300 for structured extraction, and size summaries to your target output length.
Gemini 3.1 Flash-Lite supports four thinking levels: minimal, low, medium, and high. If no thinking level is specified, the model defaults to high — which is the most expensive and slowest option. For Flash-Lite's primary use cases, you almost always want minimal.
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents="Translate to French: 'The weather is great today.'",
config=types.GenerateContentConfig(
max_output_tokens=100,
thinking_config=types.ThinkingConfig(
thinking_level="minimal" # Fast, cheap — right for translation/classification
)
)
)
Migration note for Gemini 2.5 users:
thinking_budget is still supported for backward compatibility, but the recommended parameter for Gemini 3 models is thinking_level. If you previously set thinking_budget=0, the equivalent is thinking_level="minimal". Don't use both parameters in the same request.
When to go above minimal: reserve low or medium for tasks that involve genuine multi-step reasoning or ambiguous edge cases, and measure whether the accuracy gain justifies the extra latency and cost before shipping it.
For extraction and classification tasks, structured JSON output is more reliable than asking the model to format things in prose. Use response_mime_type with a Pydantic schema for clean, parseable output:
from pydantic import BaseModel, Field
from google import genai
from google.genai import types
client = genai.Client()
class SentimentResult(BaseModel):
label: str = Field(description="positive, negative, or neutral")
confidence: float = Field(description="0.0 to 1.0")
reasoning: str = Field(description="One sentence explanation")
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents="Classify: 'Delivery was late but customer service was helpful.'",
config=types.GenerateContentConfig(
max_output_tokens=150,
thinking_config=types.ThinkingConfig(thinking_level="minimal"),
response_mime_type="application/json",
response_json_schema=SentimentResult.model_json_schema(),
)
)
import json
result = json.loads(response.text)
print(result["label"]) # "neutral"
This pattern eliminates JSON parsing errors caused by the model wrapping its output in markdown code fences or adding explanatory text. The response_json_schema parameter enforces the structure at the API level.
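Even with schema enforcement, a defensive parser is cheap insurance for the day you hit a model or mode that doesn't support it. A minimal stdlib-only sketch — parse_model_json is my helper, not part of the SDK:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating markdown code fences."""
    text = raw.strip()
    # Strip ```json ... ``` wrappers the model sometimes adds
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

# Works on both clean and fence-wrapped responses
label = parse_model_json('```json\n{"label": "neutral"}\n```')["label"]
```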
Preview models can occasionally take longer to respond, and you will hit rate limits. Build retry logic from the start:
import time
from google import genai
from google.genai import types

client = genai.Client()

def generate_with_retry(prompt: str, max_retries: int = 3) -> str:
    """Generate content with exponential backoff on 429s."""
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model="gemini-3.1-flash-lite-preview",
                contents=prompt,
                config=types.GenerateContentConfig(
                    max_output_tokens=200,
                    thinking_config=types.ThinkingConfig(thinking_level="minimal"),
                ),
            )
            return response.text
        except Exception as e:
            if "429" in str(e) or "RESOURCE_EXHAUSTED" in str(e):
                wait = 2 ** attempt  # 1s, 2s, 4s
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            raise  # Re-raise non-rate-limit errors immediately
    raise RuntimeError(f"Failed after {max_retries} retries")
Set a request timeout in production. For Flash-Lite, a 30-second timeout is generous — if a request takes longer than that, something is wrong and you want to fail fast rather than hold open connections.
⚠️ "Model not found" — wrong ID or endpoint
Symptoms: 404 Not Found, message contains "models/gemini-3.1-flash-lite" or similar
Cause: Either the model ID is wrong, or you're using a legacy SDK that doesn't support Gemini 3 models.
Fixes:
- Verify the ID is exactly gemini-3.1-flash-lite-preview (copy it, don't type it)
- Check your installed SDK version with pip show google-genai or npm list @google/genai
- If you see @google/generative-ai in your code instead of @google/genai — that's the deprecated package. Install the current one: npm install @google/genai
⚠️ 429 on first call — preview quota is stricter
Symptoms: 429 RESOURCE_EXHAUSTED immediately or within the first few requests
Cause: Preview models have tighter rate limits than standard models. Free tier compounds this. The December 2025 quota adjustments tightened free tier limits significantly across all Gemini models.
Fixes: slow your request rate, add exponential backoff (see the retry helper above), and enable billing to move to Tier 1 if you're testing at any volume.
One thing that trips people up: daily request limits (RPD) reset at midnight Pacific Time. If you burned through your daily quota in testing, you may need to wait until the next day rather than debugging a code issue.
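If you suspect you've hit the daily ceiling rather than a bug, you can compute how long you're stuck. A quick stdlib sketch — seconds_until_rpd_reset is my helper name, and it assumes your system has tzdata available for zoneinfo:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def seconds_until_rpd_reset() -> float:
    """Seconds until the next midnight Pacific Time, when daily (RPD) quotas reset."""
    pacific = ZoneInfo("America/Los_Angeles")
    now = datetime.now(pacific)
    next_day = (now + timedelta(days=1)).date()
    reset = datetime(next_day.year, next_day.month, next_day.day, tzinfo=pacific)
    return (reset - now).total_seconds()

print(f"Daily quota resets in {seconds_until_rpd_reset() / 3600:.1f} hours")
```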
⚠️ Safety refusal — quick prompt rewrite tips
Symptoms: Response contains safety disclaimer or refuses to answer. response.text may be empty, or response.candidates[0].finish_reason returns SAFETY.
Cause: Flash-Lite, like all Gemini models, applies safety filters. The default thresholds are calibrated for general use. Some legitimate professional or classification tasks can trip them, particularly anything involving descriptions of harmful content (even in a moderation context) or certain sensitive topics.
Quick diagnosis:
response = client.models.generate_content(
model="gemini-3.1-flash-lite-preview",
contents=your_prompt
)
# Check finish reason before accessing text
if response.candidates[0].finish_reason.name == "SAFETY":
print("Safety filter triggered")
print(response.candidates[0].safety_ratings)
else:
print(response.text)
Rewrite tips that usually help:
"You are a content moderation classifier for a professional platform. Classify the following content as safe or unsafe."
At Macaron, we route tasks like classification, translation, and extraction through exactly this kind of lightweight model layer — turning the structured outputs straight into workflow steps without extra context switching. If you're building something similar and want to see how a real task runs end-to-end, test your own workflow at macaron.im and judge the output yourself.
Do I need a credit card or billing account to get started?
No. The Gemini API free tier requires no credit card and no billing setup. You get an API key from Google AI Studio and start making requests immediately. The free tier for Flash-Lite includes both input and output tokens at no charge during preview. Enabling billing upgrades you to Tier 1 with higher rate limits and access to features like Context Caching and Batch API — but it's not required for initial testing.
Is the code the same for the full Flash model?
Almost. The only required change is the model ID:
# Flash-Lite
model="gemini-3.1-flash-lite-preview"
# Flash (full)
model="gemini-3-flash-preview"
Everything else — SDK initialization, GenerateContentConfig, ThinkingConfig, streaming — works identically across both models. The thinking_level parameter applies to both. One behavioral note: Flash defaults to high thinking level, same as Flash-Lite, so set your thinking level explicitly in either case rather than relying on the default.
When does Flash-Lite go GA?
Google hasn't published a timeline. The API changelog is the right place to track this. When GA ships, the model ID will change (the -preview suffix will drop), and rate limits will increase. Both of those are good things, but both require a code update if you've hardcoded the preview ID. Store the model ID in a config constant, not inline.
Does Flash-Lite support function calling and other tools?
Yes. Function calling, structured outputs, code execution, URL context, file search, and Grounding with Google Search are all supported. Live API and image/audio generation are not supported — Flash-Lite outputs text only. The full capability table is in the official model documentation.
What's the knowledge cutoff?
January 2025. Anything after that requires Grounding with Google Search to be accurate — or you provide the relevant context in the prompt directly.