
[Bug]: Function calling with stream vs without stream, arguments=None when stream option is enabled #9693

ankush13r opened this issue Oct 25, 2024 · 5 comments


ankush13r commented Oct 25, 2024

Your current environment

Docker image: vllm/vllm-openai:v0.6.3

Parameters:
--enable-auto-tool-choice --tool-call-parser hermes

Model Input Dumps

No response

🐛 Describe the bug

I'm running vLLM in a Docker container as a REST API, specifically calling the /v1/chat/completions endpoint with the OpenAI client.

When I run chat completions without streaming, it returns tool_calls with the tool name and its arguments as expected. However, when I enable the streaming option, it only returns the tool name, with arguments set to None. I'm not sure why this is happening.

I've tried searching for related issues but haven’t found anything helpful.
I've also tried stream_options={"include_usage": True}, and it gives the same output.

The model generates this output:

<tool_call>
{"arguments": {"n1": 2, "n2": 2}, "name": "sum"}
</tool_call>
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    stream=True,
    max_tokens=2000,
    temperature=0.3,
    tools=tools,
    tool_choice="auto",
)
chunks = []
for chunk in chat_completion:
    chunks.append(chunk)
    if chunk.choices[0].delta.tool_calls:
        print(chunk.choices[0].delta.tool_calls[0])
    else:
        print(chunk.choices[0].delta)


chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    stream=False,
    max_tokens=2000,
    temperature=0.3,
    tools=tools,
    tool_choice="auto",
)
print(chat_completion.choices[0].message.tool_calls[0])

Output:

  • With streaming:
ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None)
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-ac7886c6cea04451b439d4e24b21ab7a', function=ChoiceDeltaToolCallFunction(arguments=None, name='sum'), type='function')
ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None)
  • Without streaming:
ChatCompletionMessageToolCall(id='chatcmpl-tool-736bd066f6744f9985817df30c73aad3', function=Function(arguments='{"n1": 2, "n2": 2}', name='sum'), type='function')

@DarkLight1337 (Member)

@K-Mistele can you take a look into this?

@ankush13r (Author)

I’ve been debugging the issue on my own and think I've identified the solution. After testing the API, I noticed that it currently generates tool_calls where the function name and arguments are in separate yield statements, which is causing issues. Here’s an example of the current output:

Current Output:

[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(None, name='sum'), type=None)]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='{"n1": 2, "n2": 2}', name=None), type=None)]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='tool_calls', index=0, logprobs=None, stop_reason=None)]

In this example, the function name is yielded separately from its arguments. However, for functionality like chatbot integration and API calls—where multiple frameworks expect the tool_call to be complete in a single field—it would be more efficient if both the name and arguments were generated in the same yield statement.

Expected Behavior: The API should generate tool_calls with the function name and arguments combined, so the function can be utilized directly without additional processing. Here’s an example of the ideal output:

[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='{"n1": 2, "n2": 2}', name='sum'), type=None)]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='tool_calls', index=0, logprobs=None, stop_reason=None)]

@K-Mistele (Contributor)

Hi @ankush13r! You are correct that the function name and function arguments are handled in separate yield statements. vLLM's OpenAI-compatible tool calling implementation follows OpenAI's standard for tool streaming, which works as follows.

Here's an example request you can make with Postman or something similar to illustrate what the streamed server-sent events will look like according to OpenAI's standard:

{
  "model": "gpt-4o",
   "messages": [
    {
      "role": "user",
      "content": "Can you tell me the weather in dallas in fahrenheit?"
    }

  ],
  "stream": true,
  "temperature": 0.7,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "The city to find the weather for, e.g. 'San Francisco'"
            },
            "state": {
              "type": "string",
              "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
            },
            "unit": {
              "type": "string",
              "description": "The unit to fetch the temperature in",
              "enum": [
                "celsius",
                "fahrenheit"
              ]
            }
          }
        }
      }
    }
  ]
}

Here is what this request generates from OpenAI using streaming:

Long list of Server-sent events from OpenAI
data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_671mMmDFiC5r38Myya1UQub8","type":"function","function":{"name":"get_current_weather","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"city"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"d"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"allas"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\",\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"unit"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"fahren"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"heit"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-AMH4lG0yiD21BtZzSfsQtoMo7zDRi","object":"chat.completion.chunk","created":1729872219,"model":"gpt-4o-2024-08-06","system_fingerprint":"fp_90354628f2","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

There are a couple of important things to observe here:

  • The first SSE event specifies the message role (assistant), sets up the tool call array, and includes the name of the called tool.
  • Subsequent SSE events include an arguments field for the function that's being called. Each of these is a diff of the arguments. To construct the entire arguments string, you would concatenate these and then try to parse the complete string as JSON, handling validation and errors (since the arguments are not guaranteed to be valid JSON in either OpenAI's API or vLLM's).
    • Note that argument diffs in vLLM may be empty strings in some cases; however, this should not break processing, since empty strings do not affect concatenation.

This is the OpenAI standard for server-sent events for tool streaming, and it is the standard that vLLM follows. A function's name is always streamed before argument deltas arrive, and argument deltas will never be streamed in the same event as the function's name. Multiple argument deltas will be received and must be concatenated; the complete arguments string should never arrive all at once.
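
For illustration, here's a minimal sketch (not vLLM code; the variable names are mine) of the concatenate-then-parse approach described above, written against the OpenAI Python client used in your reproduction snippet:

import json

# Accumulate streamed tool-call deltas; assumes chat_completion was created
# with stream=True as in the snippet above.
tool_calls = {}  # index -> {"id": ..., "name": ..., "arguments": ""}

for chunk in chat_completion:
    delta = chunk.choices[0].delta
    if not delta.tool_calls:
        continue
    for tc in delta.tool_calls:
        entry = tool_calls.setdefault(
            tc.index, {"id": None, "name": None, "arguments": ""})
        if tc.id:
            entry["id"] = tc.id
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments  # concatenate the diffs

# Only parse once the stream has finished; the concatenated string may still
# be invalid JSON, so handle that case explicitly.
for entry in tool_calls.values():
    try:
        args = json.loads(entry["arguments"])
    except json.JSONDecodeError:
        args = None
    print(entry["name"], args)

The same accumulation works against OpenAI's API, since the delta format is identical.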

When you're receiving deltas from vLLM, are these (below) the only deltas you receive before the stream ends, or are you receiving additional deltas with argument diffs like those shown above?

ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None)
ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-ac7886c6cea04451b439d4e24b21ab7a', function=ChoiceDeltaToolCallFunction(arguments=None, name='sum'), type='function')
ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None)

If these are the only deltas you receive, that probably indicates a bug, since you should receive argument deltas as well. If you do receive additional deltas, you just need to handle concatenating and parsing them as described above and in the docs example linked below.

Can you please share your entire vLLM start command and the entire request and all received deltas so that I can help you debug it?

You should be able to see an example of how this works, including delta processing for arguments, in this example from the vLLM docs.

I actually created that demo with Hermes, so it should work for your testing purposes.

@ankush13r (Author)

ankush13r commented Oct 26, 2024

Now I see that the arguments are being yielded separately. However, I found a bug in the Hermes parser during debugging, which causes it to return a response without arguments. Below is an example of the output received:

[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-eb20e37d0a2b449694953e3647e13603', function=ChoiceDeltaToolCallFunction(arguments=None, name='sum'), type='function')]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='tool_calls', index=0, logprobs=None, stop_reason=None)]
[]

Debug Findings:
After investigating, I found that the parser raises errors in the hermes_tool_parser.py file: a 'NoneType' attribute error and a ValueError. Specifically, the function extract_tool_calls_streaming attempts to locate delta_text within cur_arguments_json using .index(), which fails with "substring not found" when delta_text isn't present. Here's the relevant debugging output:

ERROR 10-26 16:44:41 hermes_tool_parser.py:337] Error trying to handle streaming tool call: 'NoneType' object has no attribute 'get'
INFO 10-26 16:44:41 metrics.py:345] Avg prompt throughput: 34.1 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
ERROR 10-26 16:44:41 hermes_tool_parser.py:337] Error trying to handle streaming tool call: substring not found
ERROR 10-26 16:44:41 hermes_tool_parser.py:337] Error trying to handle streaming tool call: cannot access local variable 'tool_call_portion' where it is not associated with a value
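
To illustrate the ValueError (this is not vLLM code; the values here are made up): str.index raises when the substring is missing, which matches the "substring not found" error above, whereas an in check lets the parser skip that path:

# Illustration only; made-up values, not vLLM code.
cur_arguments_json = '{"n1": 2, "n2": 2}'
delta_text = '</tool_call>'  # a delta that is not part of the arguments JSON

try:
    loc = cur_arguments_json.index(delta_text) + len(delta_text)
except ValueError:
    loc = None  # without a guard this surfaces as "substring not found"

if delta_text in cur_arguments_json:  # guarded lookup, as in the fix below
    loc = cur_arguments_json.index(delta_text) + len(delta_text)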

Proposed Solution:

A fix that mitigates these errors is to check that current_tool_call is not None before calling .get() on it, and to verify that delta_text exists within cur_arguments_json before looking up its index. Here's the current and the modified code:
Current:

function_name: Union[str, None] = current_tool_call.get("name")

cur_arguments = current_tool_call.get("arguments")

# get the location where previous args differ from current
args_delta_start_loc = cur_arguments_json.index(delta_text) \
                       + len(delta_text)

arguments_delta = cur_arguments_json[:args_delta_start_loc]

https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py#L227C51-L227C72
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py#L265
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py#L291

Updated Code:

function_name: Union[str, None] = current_tool_call.get("name") if current_tool_call else None

cur_arguments = current_tool_call.get("arguments") if current_tool_call else None


args_delta_start_loc = None
if delta_text in cur_arguments_json:
    args_delta_start_loc = cur_arguments_json.index(delta_text) \
                           + len(delta_text)

arguments_delta = cur_arguments_json[:args_delta_start_loc]

This fixes both bugs. However, it still produces responses with arguments='', name=None, as shown below, although I now receive the correct arguments.

[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-a35d521fb400478aa8d20371876adf0f', function=ChoiceDeltaToolCallFunction(arguments=None, name='sum'), type='function')]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='{"n1": 2, "n2": 2}', name=None), type=None)]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=[ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='', name=None), type=None)]), finish_reason=None, index=0, logprobs=None)]
[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='tool_calls', index=0, logprobs=None, stop_reason=None)]

To prevent these empty responses, the fix is to check that the arguments diff is not an empty string before yielding. The proposed change adds an if condition: if diff:  # diff can be an empty string ''.
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py#L193
Here’s the adjusted code:

if diff:
    diff = json.dumps(diff).replace(
        self.streamed_args_for_tool[self.current_tool_id], "")

    if diff:  # diff can be an empty string ''
        logger.debug(
            "Finishing tool and found diff that had not "
            "been streamed yet: %s", diff)
        self.streamed_args_for_tool[self.current_tool_id] \
            += diff
        return DeltaMessage(tool_calls=[
            DeltaToolCall(index=self.current_tool_id,
                          function=DeltaFunctionCall(
                              arguments=diff).model_dump(
                                  exclude_none=True))
        ])
    else:
        return None

Let me know whether you think this fixes the bug, or whether the issue lies with the model's response generation. I'm open to collaborating to resolve the bug and can open a pull request.

@K-Mistele (Contributor)

Can you please share the request you're using (messages, tools, vLLM config) so that I can try to reproduce the issue? It's not impossible that there's a bug in the Hermes tool parser, but it has been used and tested pretty robustly, so I'm curious what's different about this case, and I'd like to be able to step through the streaming parsing.
