AI Cost Tracking for Multi-Model Workflows

The Hidden Cost Problem in Multi-Model Pipelines

A typical production AI agent running customer research tasks spends roughly 60% of its token budget on a single step: the initial reasoning pass. The other 40% is scattered across embedding lookups, image analysis, summarization, and tool calls. If you're billing customers per task or paying per resolution like OneShot does with its agent tools, that 60/40 split determines whether you're profitable or not. Most teams don't know their split because they've never instrumented it.

The problem gets worse when you use multiple models: Claude for reasoning, Gemini for document vision, GPT-4o-mini for fast classification, and text-embedding-3-small for embeddings. Each model has a different price per million tokens, different context window economics, and different latency characteristics. Without a unified cost attribution layer, you end up with a billing line that says "OpenAI: $847 this month" and no idea which tasks or tools drove that number.

Let's walk through building a cost tracker that attributes every LLM call to a task, tool, and model, then aggregates that into metrics that actually help you optimize: cost per successful task, cost per tool invocation, and model spend concentration.

Cost Attribution: The Data Model

The core idea is tagging every LLM call at the point of invocation. You need four fields at minimum: a task ID that groups all calls belonging to one unit of work, a tool name (the function or capability that triggered the call), the model identifier, and the token counts. Everything else is derived.

// cost-tracker.ts

// One record per LLM invocation, tagged at the call site.
export interface LLMCall {
  taskId: string;            // groups all calls for one unit of work
  tool: string;              // function or capability that triggered the call
  model: string;             // provider model identifier
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  success: boolean;
  timestamp: number;         // epoch milliseconds at call start
}

export interface ModelPricing {
  inputPerMillion: number;   // USD
  outputPerMillion: number;  // USD
}

export const MODEL_PRICING: Record<string, ModelPricing> = {
  "claude-3-5-sonnet-20241022": { inputPerMillion: 3.00, outputPerMillion: 15.00 },
  "claude-3-haiku-20240307":    { inputPerMillion: 0.25, outputPerMillion: 1.25 },
  "gpt-4o":                     { inputPerMillion: 2.50, outputPerMillion: 10.00 },
  "gpt-4o-mini":                { inputPerMillion: 0.15, outputPerMillion: 0.60 },
  "gemini-1.5-pro":             { inputPerMillion: 1.25, outputPerMillion: 5.00 },
  "text-embedding-3-small":     { inputPerMillion: 0.02, outputPerMillion: 0.00 },
};

// Returns the USD cost of a single call. Throws on unknown models so a
// missing pricing entry fails loudly instead of silently counting as $0.
export function computeCost(call: LLMCall): number {
  const pricing = MODEL_PRICING[call.model];
  if (!pricing) throw new Error(`No pricing for model: ${call.model}`);
  return (
    (call.promptTokens / 1_000_000) * pricing.inputPerMillion +
    (call.completionTokens / 1_000_000) * pricing.outputPerMillion
  );
}

Keep this pricing table in one place and update it when providers change rates. OpenRouter's model listing is a useful reference for current prices across providers, though you should verify against each provider's billing page directly since prices shift without notice.
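
As a quick sanity check on the formula, here is a worked example. The token counts are made up for illustration, the helper name costUsd is hypothetical, and the prices are the Sonnet rates from the table above.

```typescript
// Worked example: one Sonnet call with hypothetical token counts.
interface Pricing {
  inputPerMillion: number;  // USD
  outputPerMillion: number; // USD
}

function costUsd(promptTokens: number, completionTokens: number, p: Pricing): number {
  return (
    (promptTokens / 1_000_000) * p.inputPerMillion +
    (completionTokens / 1_000_000) * p.outputPerMillion
  );
}

const sonnet: Pricing = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

// 12,000 prompt tokens  -> 0.012  * $3.00  = $0.036
// 800 completion tokens -> 0.0008 * $15.00 = $0.012
const exampleCost = costUsd(12_000, 800, sonnet);
console.log(exampleCost.toFixed(3)); // "0.048"
```

Note that the prompt side dominates here even though output tokens cost 5x more per token, which is typical for calls with long context and short answers.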

Wrapping Your LLM Clients

The cleanest way to instrument calls is a thin wrapper around your LLM client. This keeps the tracking logic in one place rather than scattered across every tool function.

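// tracked-llm.ts
import Anthropic from "@anthropic-ai/sdk";
import { CostStore } from "./cost-store";
// Assumes cost-tracker.ts exports the LLMCall interface.
import type { LLMCall } from "./cost-tracker";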

export class TrackedAnthropic {
  private client: Anthropic;
  private store: CostStore;

  constructor(store: CostStore) {
    this.client = new Anthropic();
    this.store = store;
  }

  async complete(
    taskId: string,
    tool: string,
    params: Anthropic.MessageCreateParamsNonStreaming
  ): Promise<Anthropic.Message> {
    const start = Date.now();
    let success = false;

    try {
      const response = await this.client.messages.create(params);
      success = true;

      const call: LLMCall = {
        taskId,
        tool,
        model: params.model,
        promptTokens: response.usage.input_tokens,
        completionTokens: response.usage.output_tokens,
        latencyMs: Date.now() - start,
        success,
        timestamp: start,
      };

      await this.store.record(call);
      return response;
    } catch (err) {
      await this.store.record({
        taskId,
        tool,
        model: params.model,
        promptTokens: 0,
        completionTokens: 0,
        latencyMs: Date.now() - start,
        success: false,
        timestamp: start,
      });
      throw err;
    }
  }
}

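The wrapper records failed calls too, with zero tokens. This matters because failed calls still consume latency, and the retries they trigger inflate your cost-per-successful-task metric in ways that reveal broken prompts or flaky tool integrations. Treat the zero as a floor rather than the truth: a failed call can sometimes consume partial tokens that the wrapper never sees, so reconcile your totals against the provider invoice from time to time.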
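
The wrapper depends on a CostStore that is never shown. A minimal in-memory sketch might look like the following. The method names record and getByTaskId match how the code in this article calls the store; the interface shape and the InMemoryCostStore class are assumptions, and a production store would write to a database instead.

```typescript
// Same shape as the LLMCall interface defined earlier, repeated so this
// sketch is self-contained.
interface LLMCall {
  taskId: string;
  tool: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  success: boolean;
  timestamp: number;
}

// Assumed store contract: the two methods the article's code relies on.
interface CostStore {
  record(call: LLMCall): Promise<void>;
  getByTaskId(taskId: string): Promise<LLMCall[]>;
}

class InMemoryCostStore implements CostStore {
  // Calls grouped by task ID so per-task lookups stay cheap.
  private byTask = new Map<string, LLMCall[]>();

  async record(call: LLMCall): Promise<void> {
    const existing = this.byTask.get(call.taskId) ?? [];
    existing.push(call);
    this.byTask.set(call.taskId, existing);
  }

  async getByTaskId(taskId: string): Promise<LLMCall[]> {
    // Return a copy so callers cannot mutate stored records.
    return [...(this.byTask.get(taskId) ?? [])];
  }
}
```

Keeping the interface async even for the in-memory version means you can swap in a database-backed store later without touching the wrapper.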

Where the Money Actually Goes

Once you have a few days of data, the breakdown is usually surprising. Here's a representative cost profile for an agent doing customer research tasks (estimates based on typical production workloads):

Reasoning pass (Claude Sonnet): 58% of spend, 12% of call volume
Document analysis (Gemini Pro): 22% of spend, 8% of call volume
Classification and routing (GPT-4o-mini): 4% of spend, 45% of call volume
Embeddings (text-embedding-3-small): 1% of spend, 30% of call volume
Summarization (Claude Haiku): 15% of spend, 5% of call volume

The classification calls are cheap per call but high volume. The reasoning calls are expensive per call but low volume. The document analysis is low-volume and mid-priced per call but contributes a disproportionate share because Gemini Pro's output pricing for long-form responses adds up quickly.

This breakdown tells you where to focus optimization effort. Cutting 20% off your classification prompt saves at most 0.8% of total spend (20% of a 4% slice). Cutting 20% off your reasoning prompt can save up to about 12% (20% of a 58% slice). But if you're running an agent like OneShot's tool suite where voice calls and email sends are also line items, your LLM cost might be secondary to your per-tool costs. You need to see both in the same view.
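
A breakdown like the one above can be produced from stored call records with a small aggregation. This is a sketch, assuming each record already carries a precomputed cost (for example from computeCost); the CostedCall and ToolShare types and the sharesByTool name are hypothetical.

```typescript
// Minimal record shape for this sketch: cost is assumed to be precomputed.
interface CostedCall {
  tool: string;
  costUsd: number;
}

interface ToolShare {
  tool: string;
  spendShare: number;  // fraction of total spend, 0..1
  volumeShare: number; // fraction of total calls, 0..1
}

function sharesByTool(calls: CostedCall[]): ToolShare[] {
  const totalCost = calls.reduce((sum, c) => sum + c.costUsd, 0);
  const grouped = new Map<string, { cost: number; count: number }>();

  for (const c of calls) {
    const g = grouped.get(c.tool) ?? { cost: 0, count: 0 };
    g.cost += c.costUsd;
    g.count += 1;
    grouped.set(c.tool, g);
  }

  return Array.from(grouped.entries())
    .map(([tool, g]) => ({
      tool,
      spendShare: totalCost > 0 ? g.cost / totalCost : 0,
      volumeShare: calls.length > 0 ? g.count / calls.length : 0,
    }))
    // Highest spend first: that is where optimization effort pays off.
    .sort((a, b) => b.spendShare - a.spendShare);
}
```

A tool whose spendShare is far above its volumeShare is expensive per call; one with the opposite pattern is cheap per call and rarely worth optimizing first.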

The Metric That Actually Matters: Cost Per Successful Task

Cost per token is a provider metric. Cost per successful task is a business metric. The gap between them is your optimization opportunity.

Here's how to compute it from your stored call records:

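// analytics.ts
// Assumes cost-tracker.ts exports computeCost.
import { computeCost } from "./cost-tracker";
import { CostStore } from "./cost-store";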
interface TaskSummary {
  taskId: string;
  totalCostUsd: number;
  totalLatencyMs: number;
  callCount: number;
  successful: boolean;
  costByTool: Record<string, number>;
  costByModel: Record<string, number>;
}

async function summarizeTask(
  taskId: string,
  store: CostStore
): Promise<TaskSummary> {
  const calls = await store.getByTaskId(taskId);

  const summary: TaskSummary = {
    taskId,
    totalCostUsd: 0,
    totalLatencyMs: 0,
    callCount: calls.length,
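    // Heuristic: any successful call marks the task as successful. If your
    // orchestrator records a real task outcome, prefer that instead.
    successful: calls.some((c) => c.success),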
    costByTool: {},
    costByModel: {},
  };

  for (const call of calls) {
    const cost = computeCost(call);
    summary.totalCostUsd += cost;
    summary.totalLatencyMs += call.latencyMs;
    summary.costByTool[call.tool] = (summary.costByTool[call.tool] ?? 0) + cost;
    summary.costByModel[call.model] = (summary.costByModel[call.model] ?? 0) + cost;
  }

  return summary;
}

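function costPerSuccessfulTask(summaries: TaskSummary[]): number {
  const successfulCount = summaries.filter((s) => s.successful).length;
  // Avoid dividing by zero when no task has succeeded yet.
  if (successfulCount === 0) return NaN;
  // The numerator deliberately covers every task, failed ones included.
  const totalCost = summaries.reduce((sum, s) => sum + s.totalCostUsd, 0);
  return totalCost / successfulCount;
}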

The numerator includes failed task costs. This is intentional. If your agent fails 30% of tasks and retries them, those retry costs are real costs that belong to the successful outcomes that eventually happened. Excluding them makes your unit economics look better than they are.

A well-optimized research agent should run under $0.08 per successful task at current model prices. If you're at $0.25, you have a 3x optimization gap. If you're at $0.04, you're probably skimping on reasoning quality and should check your success rate.

Optimization Levers, Ranked by Impact

Model Routing

The highest-impact lever is routing tasks to cheaper models when quality requirements allow it. A classification task like "is this email a complaint or a sales inquiry?" does not need Claude Sonnet. GPT-4o-mini at $0.15/M input tokens typically handles it with accuracy comparable to Sonnet's for this use case, at 20x lower cost per input token ($0.15/M vs $3.00/M).

Build a routing layer that maps task types to model tiers. Use a fast, cheap model as the default and escalate to expensive models only when the task requires deep reasoning, long context, or multimodal input. Track escalation rates per tool, because a high escalation rate on a supposedly simple tool is a signal that your prompt or task decomposition is wrong.
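
One way to sketch such a routing layer is shown below. The task types, the tier assignments, and the ModelRouter class are all hypothetical, and the decision to escalate is left to the caller (for example after a low-confidence result).

```typescript
// Hypothetical task types and tier assignments; adjust to your own tools.
type TaskType = "classification" | "summarization" | "reasoning";

interface Route {
  defaultModel: string;
  escalationModel: string;
}

const ROUTES: Record<TaskType, Route> = {
  classification: { defaultModel: "gpt-4o-mini", escalationModel: "claude-3-5-sonnet-20241022" },
  summarization: { defaultModel: "claude-3-haiku-20240307", escalationModel: "claude-3-5-sonnet-20241022" },
  reasoning: { defaultModel: "claude-3-5-sonnet-20241022", escalationModel: "claude-3-5-sonnet-20241022" },
};

class ModelRouter {
  private total = new Map<string, number>();
  private escalated = new Map<string, number>();

  // Picks a model for one call and counts it toward the tool's statistics.
  pickModel(tool: string, taskType: TaskType, escalate: boolean): string {
    this.total.set(tool, (this.total.get(tool) ?? 0) + 1);
    if (escalate) {
      this.escalated.set(tool, (this.escalated.get(tool) ?? 0) + 1);
    }
    const route = ROUTES[taskType];
    return escalate ? route.escalationModel : route.defaultModel;
  }

  // Fraction of this tool's calls that were escalated (0 if never called).
  escalationRate(tool: string): number {
    const n = this.total.get(tool) ?? 0;
    return n === 0 ? 0 : (this.escalated.get(tool) ?? 0) / n;
  }
}
```

Keeping the escalation counters inside the router means the escalation rate per tool is available without a separate query against the cost store.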

Prompt Compression

Long system prompts are a tax on every call. A 2,000-token system prompt on a tool that runs 500 times per day costs you 1,000,000 tokens per day in system prompt alone, before any user content. At Claude Sonnet rates, that's $3.00/day or roughly $90/month for one verbose system prompt.

Audit your system prompts with a token counter. Anything over 500 tokens should be reviewed for redundancy. Instructions like "always be helpful and professional" consume tokens and change nothing. Cut them.
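
The system prompt arithmetic generalizes to a one-line function. This sketch reuses the numbers from the example above (2,000 tokens, 500 calls per day, Sonnet's $3.00/M input rate); the function name and the 30-day month are assumptions.

```typescript
// Cost in USD of resending a system prompt on every call over `days` days.
function systemPromptCostUsd(
  promptTokens: number,
  callsPerDay: number,
  inputPerMillion: number,
  days: number
): number {
  const tokens = promptTokens * callsPerDay * days;
  return (tokens / 1_000_000) * inputPerMillion;
}

// 2,000 tokens x 500 calls = 1,000,000 tokens per day -> $3.00
const daily = systemPromptCostUsd(2_000, 500, 3.0, 1);
// Same prompt over a 30-day month -> $90.00
const monthly = systemPromptCostUsd(2_000, 500, 3.0, 30);
// Trimmed to 500 tokens: 7,500,000 tokens per month -> $22.50
const trimmedMonthly = systemPromptCostUsd(500, 500, 3.0, 30);
```

Run this per tool against your real call volumes to decide which prompts are worth the editing time.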

Caching

Anthropic's prompt caching reduces costs on repeated large context blocks by roughly 90% for cached portions (cache read is $0.30/M vs $3.00/M for fresh input on Sonnet). Writing to the cache is billed at a premium over fresh input, so a block that is cached and never reused costs slightly more than not caching at all. If your agent repeatedly loads the same reference documents or tool schemas into context, caching pays for itself from the first cache hit.

The catch is that the default cache TTL is 5 minutes on Anthropic's implementation (refreshed each time the cached content is read), so caching only helps if you're making multiple calls within that window. Batch your calls or structure your pipeline to reuse cached context within a session.
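
To see when caching pays off, compare the cost of resending a shared prefix with and without it. In this sketch the base input and cache-read prices are the Sonnet figures quoted above; the cache-write price of $3.75/M (1.25x base input) is an assumption you should check against Anthropic's pricing page, and prefixCost is a hypothetical helper.

```typescript
// Prices in USD per million tokens.
const BASE_INPUT = 3.0;
const CACHE_WRITE = 3.75; // assumption: 1.25x base input, verify before relying on it
const CACHE_READ = 0.3;

// Cost of sending the same `prefixTokens` block on `calls` consecutive calls
// that all land inside the cache TTL.
function prefixCost(prefixTokens: number, calls: number, cached: boolean): number {
  const millions = prefixTokens / 1_000_000;
  if (!cached || calls === 0) return millions * BASE_INPUT * calls;
  // The first call writes the cache; the remaining calls read from it.
  return millions * CACHE_WRITE + millions * CACHE_READ * (calls - 1);
}

const uncached = prefixCost(10_000, 10, false);  // about $0.30
const withCache = prefixCost(10_000, 10, true);  // about $0.0645
const singleCall = prefixCost(10_000, 1, true);  // about $0.0375, vs $0.03 uncached
```

Under these assumed prices a 10,000-token prefix reused across 10 calls costs roughly a fifth of the uncached total, while a prefix used only once costs more with caching than without.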

Batch Processing

OpenAI and Anthropic both offer batch APIs at a 50% discount for non-real-time workloads. If your agent runs overnight research jobs or processes queued emails, batch mode halves your model costs with no quality tradeoff. The latency hit is up to 24 hours, which is acceptable for async workflows.

The implementation pattern is straightforward: instead of calling the API directly, serialize your requests to a JSONL file, submit the batch, poll for completion, and process results. The OneShot SDK handles the tool execution layer while you control which model tier your reasoning calls use.
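
The serialization step can be sketched as follows. The line format (custom_id, method, url, body) follows OpenAI's Batch API as I understand it, so verify the field names against the current documentation; the QueuedRequest type and both function names are hypothetical. Encoding the task ID and tool name into custom_id is one way to keep cost attribution intact across the batch.

```typescript
// One queued request. `taskId` and `tool` are the attribution tags from the
// LLMCall model; the rest is the chat request itself.
interface QueuedRequest {
  taskId: string;
  tool: string;
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Serialize requests to JSONL, one JSON object per line.
// Assumes task IDs and tool names contain no ":" character.
function toBatchJsonl(requests: QueuedRequest[]): string {
  return requests
    .map((r, i) =>
      JSON.stringify({
        // custom_id carries the tags so results can be mapped back to a
        // task and tool when the batch completes.
        custom_id: `${r.taskId}:${r.tool}:${i}`,
        method: "POST",
        url: "/v1/chat/completions",
        body: { model: r.model, messages: r.messages },
      })
    )
    .join("\n");
}

// Recover the attribution tags from a result line's custom_id.
function parseCustomId(customId: string): { taskId: string; tool: string } {
  const parts = customId.split(":");
  return { taskId: parts[0] ?? "", tool: parts[1] ?? "" };
}
```

When you record usage from batch results, remember to apply the discounted rate in your cost computation, otherwise batch traffic will look twice as expensive as it was.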

Dashboard Design: What to Actually Display

Most observability tools like LangSmith give you per-trace token counts and latency. That's necessary but not sufficient. The metrics that drive decisions are:

Cost per successful task, trended daily. A rising trend with stable success rate means your prompts are getting longer or you're escalating to expensive models more often.
Cost by tool, sorted descending. This shows where to focus compression and routing effort.
Success rate by tool. A tool with high cost and low success rate is your biggest problem.
Model spend concentration. If one model accounts for over 70% of spend, you have a routing opportunity.
P95 cost per task. The average hides expensive outliers. Long-tail expensive tasks often indicate runaway retry loops or context window bloat.

Store your aggregated metrics in a time-series format so you can detect regressions. A new prompt deployment that increases average task cost by 15% should trigger an alert before it runs for a week.
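
Three of these metrics reduce to a few lines each. The sketch below uses a nearest-rank percentile for P95, takes the top model's share of spend as the concentration measure, and flags a regression when the average rises by more than a threshold. All three function names are hypothetical, and the 15% default mirrors the example above.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of the
// samples at or below it. Returns NaN for an empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of total spend held by the single most expensive model (0..1).
function spendConcentration(costByModel: Record<string, number>): number {
  const costs = Object.keys(costByModel).map((k) => costByModel[k]);
  const total = costs.reduce((sum, c) => sum + c, 0);
  return total > 0 ? Math.max(...costs) / total : 0;
}

// True when the current average exceeds the baseline by more than
// `threshold` (0.15 means 15%).
function isCostRegression(baselineAvg: number, currentAvg: number, threshold = 0.15): boolean {
  return baselineAvg > 0 && (currentAvg - baselineAvg) / baselineAvg > threshold;
}
```

For example, percentile(taskCosts, 95) over a day's task summaries gives the P95 figure, and a spendConcentration result above 0.7 is the routing signal described in the list.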

Connecting Tool Costs to LLM Costs

If your agent uses external tools alongside LLM calls, you need those costs in the same ledger. An agent that makes a voice call, sends an email, and runs three LLM reasoning steps has a blended cost that includes both. OneShot's pricing page documents per-tool costs for voice, email, SMS, and research calls, which you can record alongside your LLMCall records using the same taskId attribution pattern.

Extend your data model with a ToolCall type alongside LLMCall, both tagged with the same taskId. Your summarizeTask function then aggregates across both types to give you the true cost of a task, not just the model portion of it.
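
A sketch of that extension might look like this. The ToolCall shape, the CostedLLMCall helper type, and the blendedTaskCost function are hypothetical; the tool cost is assumed to be a flat per-invocation price taken from the provider's pricing page.

```typescript
// Hypothetical shape for a non-LLM tool invocation (voice call, email, SMS).
interface ToolCall {
  taskId: string;
  tool: string;
  costUsd: number; // flat per-invocation price from the tool provider
  success: boolean;
  timestamp: number;
}

// LLM cost is assumed to be precomputed per call (e.g. via computeCost).
interface CostedLLMCall {
  taskId: string;
  tool: string;
  costUsd: number;
}

interface BlendedCost {
  llmCostUsd: number;
  toolCostUsd: number;
  totalCostUsd: number;
  llmShare: number; // fraction of the task's cost spent on models, 0..1
}

function blendedTaskCost(
  taskId: string,
  llmCalls: CostedLLMCall[],
  toolCalls: ToolCall[]
): BlendedCost {
  // Sum the cost of every record that belongs to this task.
  const sumForTask = (records: { taskId: string; costUsd: number }[]): number =>
    records.filter((r) => r.taskId === taskId).reduce((s, r) => s + r.costUsd, 0);

  const llmCostUsd = sumForTask(llmCalls);
  const toolCostUsd = sumForTask(toolCalls);
  const totalCostUsd = llmCostUsd + toolCostUsd;
  const llmShare = totalCostUsd > 0 ? llmCostUsd / totalCostUsd : 0;
  return { llmCostUsd, toolCostUsd, totalCostUsd, llmShare };
}
```

A low llmShare tells you that prompt optimization will barely move the task's total cost and that tool usage is the lever to look at.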

For agents running in the Soul.Markets ecosystem where agents pay for tools with USDC via x402, this matters even more: your LLM spend and your tool spend are both real money flows, and the ratio between them determines your margin on each task type.

What to Build Next

Once you have cost per successful task instrumented and stable, the next useful thing is a cost budget per task type. Set a ceiling, say $0.12 for a research task, and have your orchestrator abort or downgrade the model tier if a task is tracking over budget at the midpoint. This prevents runaway costs from edge cases where the agent gets stuck in a reasoning loop.
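
A budget check along these lines could be as small as the sketch below. It assumes the orchestrator can estimate task progress as a fraction (for example completed steps over planned steps) and uses a simple linear projection; the checkBudget name and the three-way action are hypothetical.

```typescript
type BudgetAction = "continue" | "downgrade" | "abort";

// `progress` is the caller's estimate of how far along the task is, in (0, 1].
function checkBudget(spentUsd: number, progress: number, ceilingUsd: number): BudgetAction {
  // The ceiling is already reached: stop spending.
  if (spentUsd >= ceilingUsd) return "abort";
  // No progress yet, so there is nothing to project from.
  if (progress <= 0) return "continue";
  // Linear extrapolation of the final cost from spend so far.
  const projectedUsd = spentUsd / progress;
  return projectedUsd > ceilingUsd ? "downgrade" : "continue";
}
```

With a $0.12 ceiling, a task that has spent $0.08 at the midpoint projects to $0.16 and gets downgraded, while one that has spent $0.05 projects to $0.10 and continues.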

After that, build a model routing experiment framework. Route 10% of classification calls to a cheaper model, measure success rate and cost, and promote the cheaper model if success rate doesn't drop more than 2 percentage points. This is how you systematically reduce model spend without sacrificing quality. The goal is not the cheapest possible setup but the cheapest setup that meets your quality threshold, measured per tool and per task type.
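
The assignment and promotion rules can be sketched as below. Hashing the task ID (here with an FNV-1a style hash over UTF-16 code units) keeps a task in the same arm across retries. The function names, the ArmStats shape, and the extra requirement that the treatment arm is actually cheaper per task are assumptions.

```typescript
// FNV-1a style hash so a given task ID always lands in the same bucket.
function bucket(taskId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < taskId.length; i++) {
    h ^= taskId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % 100; // 0..99
}

function assignArm(taskId: string, treatmentPercent = 10): "control" | "treatment" {
  return bucket(taskId) < treatmentPercent ? "treatment" : "control";
}

interface ArmStats {
  tasks: number;
  successes: number;
  totalCostUsd: number;
}

// Promote the cheaper model if it costs less per task and its success rate
// is at most `maxDropPoints` percentage points below control.
function shouldPromote(control: ArmStats, treatment: ArmStats, maxDropPoints = 2): boolean {
  if (control.tasks === 0 || treatment.tasks === 0) return false;
  // Integer arithmetic first, so an exact 2-point drop compares cleanly.
  const dropPoints =
    ((control.successes * treatment.tasks - treatment.successes * control.tasks) * 100) /
    (control.tasks * treatment.tasks);
  const cheaper =
    treatment.totalCostUsd / treatment.tasks < control.totalCostUsd / control.tasks;
  return cheaper && dropPoints <= maxDropPoints;
}
```

This sketch does no significance testing; in practice you would also want enough tasks in the treatment arm for a 2-point difference to be distinguishable from noise.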

By Q3 2026, expect most production teams running multi-model pipelines to have cost-per-task as a first-class metric in their deployment dashboards. Teams that instrument this now will have 6-12 months of baseline data when they need it for capacity planning and pricing decisions.