Anthropic Parallel API Processor


Overview

This tool is designed to generate datasets faster and more cost-effectively by calling the Anthropic API in parallel, with optional prompt caching for Claude. It is well suited to processing large volumes of requests while respecting rate limits.

A key feature of this processor is its approach to token estimation. Since there's no open-source tokenizer model available for the new Claude models, we use an older tokenizer to make initial estimates. Once we receive a response from the API, we update these estimates with the actual token usage. This adaptive approach allows us to maintain efficient processing while adhering to Anthropic's rate limits.
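
Since the exact tokenizer logic lives in the script itself, here is only an illustrative sketch of the adaptive idea: make a rough first guess, then learn a correction factor from the actual usage the API reports back. The repo uses an older Claude tokenizer for the first guess; this sketch substitutes a simple characters-per-token heuristic, and names like estimate_tokens and AdaptiveEstimator are hypothetical, not the script's actual API:

def estimate_tokens(request: dict, chars_per_token: float = 3.5) -> int:
    """Rough pre-flight estimate: character count over an assumed chars-per-token ratio."""
    text = "".join(
        m["content"] for m in request.get("messages", []) if isinstance(m["content"], str)
    )
    return int(len(text) / chars_per_token) + request.get("max_tokens", 0)

class AdaptiveEstimator:
    """Learns how far the estimates are off and applies that correction to new requests."""

    def __init__(self) -> None:
        self.correction = 1.0  # multiplier learned from real responses

    def estimate(self, request: dict) -> int:
        return int(estimate_tokens(request) * self.correction)

    def update(self, estimated: int, usage: dict) -> None:
        # The Messages API reports actual usage as usage.input_tokens / usage.output_tokens.
        actual = usage["input_tokens"] + usage["output_tokens"]
        if estimated > 0:
            # Smooth the update so a single response does not swing the correction too far.
            self.correction = 0.9 * self.correction + 0.1 * (actual / estimated)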

Key Features

  • Parallel Processing: Maximize throughput with concurrent API requests.
  • Rate Limiting: Stay within Anthropic's API limits for requests and tokens.
  • Adaptive Token Estimation: Use initial estimates and update with actual usage.
  • Caching Support: Optional caching for efficient processing of repeated content.
  • Error Handling: Retry failed requests and log issues for easy debugging.
  • Memory Efficient: Stream requests from file to handle large datasets (see the streaming sketch after this list).
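
The memory-efficiency point comes from reading the input JSONL file lazily instead of loading it all at once. Here is a minimal sketch of that pattern (the function name is illustrative, not the script's actual API):

import json
from typing import Iterator

def stream_requests(filepath: str) -> Iterator[dict]:
    """Yield one request at a time so the whole JSONL file is never held in memory."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)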

Quick Start

Installation

This project was developed using Python 3.10.13. For optimal compatibility and performance, it is strongly recommended that you use the same version.

  1. Clone the repository:

    git clone https://github.com/milistu/anthropic-parallel-calling.git
    cd anthropic-parallel-calling
  2. Create a virtual environment:

    python -m venv .venv

    For macOS/Linux:

    source .venv/bin/activate

    For Windows:

    .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Set up your Anthropic API key:

    You have two options to set up your API key:

    a. Create a .env file in the root of the project and add your API key:

    ANTHROPIC_API_KEY=your-api-key-here
    

    b. Or, set it as an environment variable in your terminal:

    export ANTHROPIC_API_KEY=your-api-key-here

    📎 Note: Replace 'your-api-key-here' with your actual Anthropic API key.
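
If you go with the .env option and want the key available inside your own Python scripts, a common pattern is python-dotenv (an assumption here, not necessarily what this repo itself relies on):

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the current working directory
api_key = os.environ["ANTHROPIC_API_KEY"]  # raises KeyError if the key is missing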

Usage

Without Caching

python api_request_parallel_processor.py \
--requests_filepath examples/test_requests_to_parallel_process.jsonl \
--save_filepath examples/data/test_requests_to_parallel_process_results.jsonl \
--request_url https://api.anthropic.com/v1/messages \
--max_requests_per_minute 40 \
--max_tokens_per_minute 16000 \
--max_attempts 5 \
--logging_level INFO

With Caching

python api_request_parallel_processor.py \
--requests_filepath examples/test_caching_requests_to_parallel_process.jsonl \
--save_filepath examples/data/test_caching_requests_to_parallel_process_results.jsonl \
--request_url https://api.anthropic.com/v1/messages \
--use_caching True \
--max_requests_per_minute 40 \
--max_tokens_per_minute 16000 \
--max_attempts 5 \
--logging_level INFO

Important Notes ⚠️

Cache Limitations

The minimum cacheable prompt length is:

  • 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
  • 2048 tokens for Claude 3 Haiku

We suggest testing your prompt manually first to confirm that caching kicks in. The script makes an initial caching call and reports whether the cache was used. If it was not, you will receive a warning followed by a 10-second delay before the parallel API calls start, giving you time to kill the process if needed.
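
If you want to verify caching yourself before a large run, one option is to send the same request twice and inspect the usage block of each response; with prompt caching active, the API reports cache_creation_input_tokens on the first call and cache_read_input_tokens on subsequent ones. A minimal sketch of that check (the system text is placeholder content, and older API revisions required an extra beta header, so consult the docs for your setup):

import json
import os

import requests

request = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 64,
    "system": [
        {"type": "text", "text": "You are an AI assistant tasked with analyzing blogs."},
        {
            "type": "text",
            "text": "<blog content of at least ~1024 tokens here>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "Analyze the main themes of this blog."}],
}

headers = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
    # Older API revisions required e.g. "anthropic-beta": "prompt-caching-2024-07-31".
}

for attempt in range(2):
    response = requests.post(
        "https://api.anthropic.com/v1/messages", headers=headers, data=json.dumps(request)
    )
    print(f"call {attempt + 1}:", response.json().get("usage", {}))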

You can learn more about caching from the official Anthropic documentation: Prompt Caching

Rate Limits

Be aware of, and check, your specific rate limits. The default settings in this script (40 requests per minute and 16,000 tokens per minute) correspond to roughly 80% of the Tier 1 limits; your actual limits depend on your usage tier.

To check your Tier and rate limits:

  • Go to the Anthropic Console
  • Navigate to the Settings tab
  • Look for the Limits tab in the sidebar on the left side
  • Here you can see Rate limits for all models and your current tier

You can read more about rate limits and tiers in the official Anthropic documentation: Anthropic API Rate Limits

Make sure to adjust the max_requests_per_minute and max_tokens_per_minute configuration options according to your specific tier and needs to ensure optimal performance and avoid hitting rate limits.
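
For example, if your tier allowed 100 requests per minute and 80,000 tokens per minute (hypothetical numbers, not a published tier), leaving roughly 20% headroom would mean passing:

--max_requests_per_minute 80 \
--max_tokens_per_minute 64000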

Input File Format

The input file should be a JSONL file where each line is a JSON object representing a single API request. Here's an example structure:

{"model": "claude-3-5-sonnet-20240620", "max_tokens": 1024, "messages": [{"role": "user", "content": "Tell me a joke"}], "metadata": {"row_id": 1}}

For caching, use the following structure:

{
  "model": "claude-3-5-sonnet-20240620",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are an AI assistant tasked with analyzing blogs."
    },
    {
      "type": "text",
      "text": "<blog content here>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Analyze the main themes of this blog."
    }
  ]
}

Generating Request Files

You can generate JSONL files for API requests using Python. The following examples demonstrate one approach for both non-caching and caching scenarios, but keep in mind that there are many ways to create these files depending on your specific needs and data sources. These examples are meant to serve as a starting point:

Without Caching

To generate a JSONL file for standard requests:

import json

filename = "examples/test_requests_to_parallel_process.jsonl"
n_requests = 10
jobs = [
    {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": f"How much is 8 * {x}? Return only the result.\n Result:",
            }
        ],
    }
    for x in range(n_requests)
]
with open(filename, "w") as f:
    for job in jobs:
        json_string = json.dumps(job)
        f.write(json_string + "\n")

With Caching

For requests utilizing caching:

import json

filename = "examples/test_caching_requests_to_parallel_process.jsonl"
queries = [
    "<query/instruction_1>",
    "<query/instruction_2>",
    # ...
]
jobs = [
    {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "temperature": 0,
        "system": [
            {
                "type": "text",
                "text": "You are an AI assistant tasked with... Your goal is to provide insightful information and knowledge.\n",
            },
            {
                "type": "text",
                "text": "<Large repetitive prompt you want to cache.>",
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {
                "role": "user",
                "content": query,
            }
        ],
    }
    for query in queries
]
with open(filename, "w") as f:
    for job in jobs:
        json_string = json.dumps(job)
        f.write(json_string + "\n")

Remember to replace <Large repetitive prompt you want to cache.> and <query/instruction_X> with your actual data.

Note on Metadata 💡

You can add a metadata key to each request object to include any additional information you want to associate with the request. This can be particularly useful for tracking or mapping requests to your dataset. For example:

"metadata": {"row_id": 1, "source": "dataset_A", "category": "science"}

The metadata will be preserved in the output, allowing you to easily map the results back to your original data or include any other relevant information for post-processing.
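
As an illustration, here is a small post-processing sketch that maps results back to rows by row_id. It assumes each output line is a JSON array of [request, response, metadata], mirroring the OpenAI cookbook script this project is adapted from; adjust the indexing if your output layout differs:

import json

results = {}
with open("examples/data/test_requests_to_parallel_process_results.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        # Assumed layout per line: [request, response, metadata] -- verify against your output.
        request, response, metadata = entry[0], entry[1], entry[2]
        # For successful calls, the generated text lives in the first content block.
        results[metadata["row_id"]] = response["content"][0]["text"]

print(results.get(1))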

Configuration Options

  • requests_filepath: Path to the input JSONL file.
  • save_filepath: Path for the output JSONL file (optional).
  • request_url: Anthropic API endpoint (default: "https://api.anthropic.com/v1/messages").
  • api_key: Your Anthropic API key (can be set as an environment variable).
  • max_requests_per_minute: Target requests per minute (default: 40).
  • max_tokens_per_minute: Target tokens per minute (default: 16,000).
  • max_attempts: Number of retries for failed requests (default: 5).
  • logging_level: Logging verbosity (default: INFO).
    • ERROR or 40: Logs when requests fail after all retries
    • WARNING or 30: Logs when requests hit rate limits or other errors
    • SUCCESS or 25: Logs successful operations (Loguru-specific level)
    • INFO or 20: Logs when requests start and the status at finish
    • DEBUG or 10: Logs detailed events as the processing loop runs, so you can see when they occur
    • TRACE or 5: Logs very detailed information for debugging
  • use_caching: Enable caching for repeated content (optional).

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project Origin

This project was inspired by and adapted from the OpenAI API Request Parallel Processor, which can be found in the OpenAI Cookbook. I've modified and extended the original script to work with Anthropic's API.

We appreciate the work done by OpenAI in creating the original script, which served as a valuable starting point for this project.
