How to Use Gemini 3.1 Pro API: Setup Guide with Code Examples (2026)

What's up, API tinkerers — if you've been watching Gemini 3.1 Pro drop on February 19 and immediately opened a new tab to check the docs, this guide is the one I wish I'd had.

I spent the first couple of hours after launch hitting walls. The thinking_budget parameter I had muscle-memorized from 2.5 Pro? Doesn't work here — returns a 400. The old model ID? Obviously wrong. The caching setup? Slightly different from what the 2.5 docs say. None of it is hard, but it's all just different enough to waste time if you're copying from old guides.

I'm Hanks. I test AI tools in real workflows. Here's the complete, actually-working setup as of February 2026.


Prerequisites — What You Need Before Starting

Google Account, AI Studio Access, Billing

Three things before you write a single line of code:

  1. A Google account with AI Studio access. Go to aistudio.google.com and sign in. AI Studio gives you a free-tier playground with rate limits — good enough for prototyping and verifying your setup before you touch production. Gemini 3.1 Pro Preview is available in the model dropdown from day one of its launch.

  2. A billing account attached. AI Studio's free tier uses the Gemini API at no charge, but with tight rate limits. For production work, you'll need a Google Cloud billing account linked. The Gemini API bills separately from other Google Cloud services — check the Gemini API pricing page to confirm current rates before you start running large jobs.

  3. SDK version awareness. This matters for Gemini 3 specifically: the Gemini 3 API features require Gen AI SDK for Python version 1.51.0 or later. If you're running an older version, thinking level parameters will fail silently or throw unexpected errors. Pin your version explicitly.

Install the SDK (Python or Node.js)

Python:

bash

pip install google-genai --upgrade
# Verify you're on 1.51.0+
python -c "import google.genai; print(google.genai.__version__)"

Node.js:

bash

npm install @google/genai
# For Interactions API (multi-agent) you need 1.33.0+
npm list @google/genai

Vertex AI (enterprise): If you're deploying through Vertex AI rather than the Gemini API directly, Gemini 3.1 Pro Preview is only available on global endpoints. Set your location explicitly:

bash

export GOOGLE_CLOUD_PROJECT=your-project-id
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

Get Your API Key

Where to Find It in AI Studio

  1. Open Google AI Studio
  2. Click Get API Key in the left sidebar
  3. Select Create API key — choose an existing Google Cloud project or create a new one
  4. Copy the key immediately. The full key only displays once.

One thing that tripped me up early on: AI Studio keys and Vertex AI keys are separate credential systems. An AI Studio API key won't work with the Vertex AI SDK, and vice versa. Pick one path and stay on it for a given project.
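If you need both paths in different projects, the split shows up at client construction. A minimal sketch, assuming the google-genai SDK's vertexai, project, and location constructor arguments; "your-project-id" is a placeholder, and the Vertex path assumes you've authenticated with gcloud application-default credentials:

```python
from google import genai

# Path A: Gemini API with an AI Studio key (reads GEMINI_API_KEY from the env)
studio_client = genai.Client()

# Path B: Vertex AI with Google Cloud credentials -- no AI Studio key involved.
# Replace "your-project-id" with your real project; location must be "global"
# for Gemini 3.1 Pro Preview (see the Vertex note above).
vertex_client = genai.Client(
    vertexai=True,
    project="your-project-id",
    location="global",
)
```

Pick one client per project and keep it; mixing the two credential systems in one codebase is how keys end up in the wrong place.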

Setting It as an Environment Variable

Never hardcode your API key. The SDK reads GEMINI_API_KEY automatically when it's set in your environment.

Mac/Linux:

bash

export GEMINI_API_KEY="your-api-key-here"
# Make it permanent by adding to ~/.bashrc or ~/.zshrc
echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.zshrc

Windows (PowerShell):

powershell

$env:GEMINI_API_KEY = "your-api-key-here"
# Permanent:
[System.Environment]::SetEnvironmentVariable("GEMINI_API_KEY", "your-api-key-here", "User")

Python — verify it's loading:

python

import os
from google import genai
# SDK picks up GEMINI_API_KEY automatically if set
client = genai.Client()
# Or pass explicitly (not recommended for production):
# client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

Your First Request — Code Walkthrough

Basic Text Generation Example

The model ID for Gemini 3.1 Pro is gemini-3.1-pro-preview. Don't drop the -preview — there is no GA model ID yet as of February 2026.

Python:

python

from google import genai
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Explain the difference between a race condition and a deadlock."
)
print(response.text)
# Check token usage
print(f"Input tokens: {response.usage_metadata.prompt_token_count}")
print(f"Output tokens: {response.usage_metadata.candidates_token_count}")

Node.js:

javascript

import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
    model: "gemini-3.1-pro-preview",
    contents: "Explain the difference between a race condition and a deadlock."
});
console.log(response.text);
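
If you'd rather show output as it arrives than block on the full response, the same request works through the SDK's streaming method. A minimal sketch using the Python SDK's generate_content_stream:

```python
from google import genai

client = genai.Client()

# Stream partial chunks as they're generated instead of waiting for the
# full response -- lower perceived latency for long outputs
for chunk in client.models.generate_content_stream(
    model="gemini-3.1-pro-preview",
    contents="Explain the difference between a race condition and a deadlock.",
):
    if chunk.text:  # some chunks carry only metadata, so guard before printing
        print(chunk.text, end="", flush=True)
print()
```

Usage metadata for a streamed response arrives on the final chunk, so log token counts after the loop finishes, not inside it.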

One default to be aware of: if you omit thinking_level, the model defaults to HIGH. That means full Deep Think Mini reasoning on every request — which is great for quality but increases output token count and latency. For simple queries, set it explicitly to LOW or MEDIUM (covered in the next section).

Multimodal Request (Image + Text)

Gemini 3.1 Pro accepts images as inline base64 or via the Files API for larger files. Inline base64 works for images under ~20MB; use the Files API for anything larger or for video.

Python — inline image:

python

import base64
from pathlib import Path
from google import genai
from google.genai import types
client = genai.Client()
# Read and encode image
image_bytes = Path("architecture_diagram.png").read_bytes()
image_b64 = base64.b64encode(image_bytes).decode()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        types.Part(
            inline_data=types.Blob(
                mime_type="image/png",
                data=image_b64
            )
        ),
        types.Part(text="Identify any architectural issues in this system diagram.")
    ]
)
print(response.text)

Python — Files API (for large files, video, audio):

python

import time
from pathlib import Path
from google import genai
client = genai.Client()
# Upload the file
uploaded_file = client.files.upload(file=Path("system_recording.mp4"))
# Wait for processing to complete
while uploaded_file.state.name == "PROCESSING":
    time.sleep(2.5)
    uploaded_file = client.files.get(name=uploaded_file.name)
print(f"File ready: {uploaded_file.uri}")
# Use the uploaded file in a request
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[uploaded_file, "Summarize the key decisions made in this meeting."]
)
print(response.text)

Supported multimodal inputs per the Gemini 3 developer guide: text, images (JPEG, PNG, GIF, WebP, HEIC/HEIF), audio (MP3, WAV, FLAC, AAC, OGG, OPUS), video (MP4, MPEG, MOV, AVI, WebM), and PDF. The 1M token context window translates to approximately 900 images or 1 hour of video in a single prompt.
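
Before you load a prompt up with media, it's worth checking how many tokens it will actually consume. The SDK's count_tokens call returns the total without running generation (a sketch; it accepts the same contents you'd pass to generate_content, including uploaded files):

```python
from google import genai

client = genai.Client()

# Count tokens without generating -- a cheap sanity check before sending
# a large multimodal prompt against the 1M-token window
count = client.models.count_tokens(
    model="gemini-3.1-pro-preview",
    contents="Explain the difference between a race condition and a deadlock.",
)
print(count.total_tokens)
```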

Setting Thinking Level (Low / Medium / High)

This is the new parameter for Gemini 3.1 Pro, and it's different enough from 2.x models that it deserves its own section.

The thinking_level parameter accepts "LOW", "MEDIUM", or "HIGH". Default when omitted is HIGH.

Critical: you cannot mix thinking_level with the legacy thinking_budget parameter. Doing so returns a 400 error. If you're migrating from Gemini 2.5, remove thinking_budget entirely before adding thinking_level.

python

from google import genai
from google.genai import types
client = genai.Client()
# LOW — fast, minimal reasoning. Good for classification, simple Q&A, autocomplete.
response_low = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="What is the capital of Japan?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="LOW")
    )
)
# MEDIUM — balanced. Recommended default for most production tasks.
# Equivalent in quality to the old Gemini 3 Pro HIGH setting.
response_medium = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Review this pull request for logic errors and edge cases: [paste diff]",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="MEDIUM"),
        max_output_tokens=65536
    )
)
# HIGH — Deep Think Mini. Use for complex multi-step problems.
# Higher latency and output token cost. Set max_output_tokens explicitly.
response_high = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Find the root cause of this intermittent race condition: [paste code]",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="HIGH"),
        max_output_tokens=65536
    )
)
print(response_medium.text)

One gotcha: the default max_output_tokens is 8,192 even though the model supports up to 65,536. If you're generating long code, detailed analysis, or extended documents, set it explicitly or you'll hit silent truncation.

Here's a quick reference for thinking level selection:

| Task Type | Recommended Level | Why |
| --- | --- | --- |
| Simple Q&A, keyword extraction | LOW | Fast, cheap, sufficient |
| Code review, document summary | MEDIUM | Old 3 Pro HIGH quality, lower cost than HIGH |
| Complex debugging, multi-step planning | HIGH | Deep Think Mini reasoning |
| Migrating from Gemini 3 Pro HIGH | MEDIUM | Quality-equivalent, don't overpay |

Working with the 1M Token Context Window

Uploading Files via the API

The Files API is the right path for any file over ~20MB or for repeated use across multiple requests. Uploaded files persist for 48 hours on Google's servers.

python

import time
from pathlib import Path
from google import genai
client = genai.Client()
# Upload a large PDF (e.g., a full codebase documentation or legal contract)
uploaded_pdf = client.files.upload(
    file=Path("full_codebase_docs.pdf"),
    config={"display_name": "codebase-docs-v2"}
)
# Wait for processing
while uploaded_pdf.state.name == "PROCESSING":
    time.sleep(2)
    uploaded_pdf = client.files.get(name=uploaded_pdf.name)
print(f"Uploaded: {uploaded_pdf.name}, URI: {uploaded_pdf.uri}")
# Now query it multiple times without re-uploading
questions = [
    "What authentication patterns are used in this codebase?",
    "List all external API dependencies.",
    "Identify any deprecated functions still in active use."
]
for question in questions:
    response = client.models.generate_content(
        model="gemini-3.1-pro-preview",
        contents=[uploaded_pdf, question],
        config={"thinking_config": {"thinking_level": "MEDIUM"}}
    )
    print(f"\nQ: {question}\nA: {response.text}")

One prompt engineering tip from the official long context guide: when working with large context, put your specific question at the end of the prompt, after all the data. Start your question with "Based on the information above..." to anchor the model's reasoning to the provided context. This consistently improves retrieval accuracy across long documents.
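
That ordering is easy to enforce mechanically. Here's a small pure-Python helper (my own convenience function, not part of the SDK) that puts the documents first and the anchored question last, producing a contents list you can pass straight to generate_content:

```python
def build_long_context_prompt(documents, question):
    """Order prompt parts for long-context retrieval: data first, question last."""
    # Anchor the question to the provided context, per the long context guide
    anchored = f"Based on the information above, {question}"
    return list(documents) + [anchored]

contents = build_long_context_prompt(
    ["<full contract text>", "<amendment text>"],
    "which clauses changed between the two versions?",
)
# Pass `contents` to client.models.generate_content(...) as usual
```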

Using Context Caching to Save Cost

Context caching is where the economics of large-context work really change. The Gemini API caching documentation confirms two modes:

  • Implicit caching — automatically enabled. If your request hits a cached prefix, Google passes on cost savings. Zero configuration required, no guarantee it triggers.
  • Explicit caching — you define the cache, guaranteed savings when it hits, additional developer work.

Use explicit caching when you have a large, repeated context (a system prompt + reference corpus) that you'll query many times. Here's a working example:

python

from google import genai
from google.genai import types
client = genai.Client()
# Step 1: Create the cache with your large, repeated context
# Minimum: 32,768 tokens for Gemini 3.1 Pro (check current limits at ai.google.dev)
system_instruction = """
You are a senior code reviewer for a Python-based financial services platform.
You enforce PEP 8, type safety, and OWASP top-10 security guidelines.
Flag any hardcoded credentials, SQL injection vectors, or unsafe deserialization.
"""
# Upload your large reference document first
uploaded_docs = client.files.upload(file="coding_standards_v4.pdf")
cache = client.caches.create(
    model="models/gemini-3.1-pro-preview",
    config=types.CreateCachedContentConfig(
        display_name="code-review-context",
        system_instruction=system_instruction,
        contents=[uploaded_docs],
        ttl="3600s"  # Cache lives for 1 hour; adjust based on your request frequency
    )
)
print(f"Cache created: {cache.name}")
# Step 2: Reference the cache in each review request
# Only the new code diff is sent as fresh tokens — the system context is cached
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Review this diff for security issues: [paste diff here]",
    config=types.GenerateContentConfig(
        cached_content=cache.name
    )
)
print(response.text)
# Check cache hit in usage metadata
print(f"Cached tokens used: {response.usage_metadata.cached_content_token_count}")

The cost math when caching works: cache writes at $0.50/M tokens, cache reads at $0.20/M tokens (vs $2.00/M standard input). For a 50K-token system prompt queried 1,000 times/month, you pay $0.025 to write the cache once and $10 for reads — vs $100 without caching. At higher volumes, this compounds significantly.
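
To sanity-check that math for your own workload, here's a small pure-Python calculator. The default rates are hardcoded from the figures above; verify them against the current pricing page before relying on the numbers:

```python
def caching_cost(context_tokens: int, queries_per_month: int,
                 write_per_m: float = 0.50,
                 read_per_m: float = 0.20,
                 standard_per_m: float = 2.00) -> tuple[float, float]:
    """Monthly input-token cost (with caching, without) for a repeated context."""
    m = 1_000_000
    # One cache write, then every query reads the cached prefix at the read rate
    with_cache = (context_tokens / m) * write_per_m \
        + queries_per_month * (context_tokens / m) * read_per_m
    # Without caching, every query pays the standard input rate on full context
    without_cache = queries_per_month * (context_tokens / m) * standard_per_m
    return with_cache, without_cache

cached, uncached = caching_cost(context_tokens=50_000, queries_per_month=1_000)
print(f"with caching: ${cached:.2f}/mo, without: ${uncached:.2f}/mo")
```

Note this only covers input tokens on the repeated context; output tokens and the non-cached portion of each request bill the same either way.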


Common Errors and Fixes

These are the ones I hit during setup and the ones that come up repeatedly in the Gemini API developer forum:

400 Bad Request: thinking_level and thinking_budget cannot be used together
You're mixing old and new parameter styles. Remove thinking_budget entirely. The new API only accepts thinking_level. If you're migrating from a Gemini 2.5 prompt, search your config for thinking_budget and delete it.

404: models/gemini-3.1-pro-preview is not found
Two possible causes. First: you're using Vertex AI with a regional endpoint — Gemini 3.1 Pro Preview on Vertex requires GOOGLE_CLOUD_LOCATION=global. Second: your SDK version is too old. Run pip install google-genai --upgrade and verify you're on 1.51.0+.

Model ignores custom tools and keeps using bash commands
Switch from gemini-3.1-pro-preview to gemini-3.1-pro-preview-customtools. As documented on the official model page, this variant is optimized for agentic workflows using custom function calling. Note the caveat: it may show quality differences on tasks that don't benefit from tool use.

Output silently truncates around 8K tokens
You forgot to set max_output_tokens. The default is 8,192 even though the model supports 65,536. Add max_output_tokens=65536 to your config for long-output tasks.

400 error on multi-turn function calling
Missing thought signatures. Gemini 3 models require thoughtSignature to be returned in subsequent turns when using function calling. If you're using the official Python or Node.js SDK with standard chat history, this is handled automatically. If you're building a custom REST client, you need to manually pass signatures back. Use the official SDK unless you have a specific reason not to.

503 Service Unavailable on image generation endpoint
This is a server-side overload issue, not an API key problem. The error message confirms you won't be charged for failed requests. Add retry logic with exponential backoff:

python

import time
import random
from google.genai import errors
def generate_with_retry(client, model, contents, config, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(
                model=model,
                contents=contents,
                config=config
            )
        except errors.ServerError:
            # The google-genai SDK raises errors.ServerError for 5xx
            # responses, including 503 overload
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Service unavailable. Retrying in {wait:.1f}s...")
            time.sleep(wait)

Frequently Asked Questions

What's the correct model ID for Gemini 3.1 Pro? gemini-3.1-pro-preview for the Gemini API and Vertex AI. Use gemini-3.1-pro-preview-customtools if you need agentic workflows with custom function calling and the model keeps defaulting to bash. There is no GA (non-preview) model ID as of February 2026.

What SDK version do I need? Python: google-genai 1.51.0 or later. Node.js: @google/genai 1.33.0 or later for Interactions API support. Run pip install google-genai --upgrade to get the latest.

Is there a free tier for Gemini 3.1 Pro API? No free API tier — Gemini 3.1 Pro Preview does not have a free quota in the Gemini API. AI Studio gives you access to the playground UI for free with rate limits, but programmatic API calls require a billing account. Confirmed in the Gemini 3 developer guide.

Can I use Gemini 3.1 Pro through OpenAI-compatible endpoints? Yes, via OpenRouter (google/gemini-3.1-pro-preview) or third-party proxy APIs that offer OpenAI-compatible endpoints. The advantage: you can drop in the model ID without installing Google's SDK. The tradeoff: you lose direct access to Gemini-specific features like thinking_level config through the native SDK.

What happens if I send a prompt longer than 1M tokens? The API returns a 400 context_length_exceeded error. The 1,048,576 input token limit is hard. For workflows approaching this limit, use the Files API to upload documents (which handles tokenization more efficiently than raw text) and implement chunking for document sets that may exceed the limit.

How do I monitor token usage and costs? Every response includes usage_metadata with prompt_token_count, candidates_token_count, and cached_content_token_count. Log these per request. For production budgeting: multiply output tokens by $0.000012 (standard rate) and input tokens by $0.000002. Set up alerting at your monthly spend threshold in Google Cloud billing.

Does thinking level affect response quality on simple tasks? Yes, but not always in the direction you'd expect. On simple factual queries, HIGH thinking can actually introduce unnecessary reasoning steps and slightly longer, less direct answers. LOW is genuinely better for simple classification and factual lookup tasks — faster, cheaper, and often cleaner output.


You're Set — Now the Interesting Part Starts

The setup itself is fifteen minutes. The interesting questions start once you're running actual tasks and watching where the model surprises you — and where it doesn't.

Quick checklist before you move to production:

  • SDK version confirmed: Python 1.51.0+ / Node.js 1.33.0+
  • API key set as environment variable, not hardcoded
  • Model ID: gemini-3.1-pro-preview (or -customtools for agentic workflows)
  • thinking_budget removed if migrating from Gemini 2.5
  • max_output_tokens=65536 set for long-output tasks
  • Vertex AI users: GOOGLE_CLOUD_LOCATION=global configured
  • Retry logic in place for 503 errors
  • Token usage logging enabled for cost tracking

We built Macaron to turn model experiments into structured, repeatable workflows with tracked configs and token usage. Start testing your setup inside a system that shows you what works — try it free and compare results side by side.

Hey, I’m Hanks — a workflow tinkerer and AI tool obsessive with over a decade of hands-on experience in automation, SaaS, and content creation. I spend my days testing tools so you don’t have to, breaking down complex processes into simple, actionable steps, and digging into the numbers behind “what actually works.”
