Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigjipvxzoorsewslbdorsv7ziue5u2igdmxmr44y27tt4r4zvpjqy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkn2nk7e5db2"
  },
  "path": "/t/issue-with-qwen-model-in-huggingface/175637#post_2",
  "publishedAt": "2026-04-29T11:20:19.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "InferenceClient",
    "Inference Providers",
    "chat-completion task",
    "huggingface_hub v1 migrated from requests/aiohttp to httpx",
    "TGI streaming docs",
    "huggingface_hub v1 migration guide",
    "InferenceClient reference",
    "Hugging Face chat-completion task",
    "Default ToolCallingAgent InferenceClient examples failing",
    "v1 migration guide",
    "LlamaIndex HuggingFaceInferenceAPI reference",
    "Hugging Face conversational API returns 404",
    "404 Not Found when using Qwen models with HuggingFaceInferenceAPI",
    "Agents Course Unit 2.2 error 404",
    "Hugging Face Inference Providers",
    "Hugging Face chat completion",
    "LlamaIndex Hugging Face LLM integration",
    "huggingface_hub v1 changed the HTTP layer to httpx",
    "LlamaIndex has many separate llama-index-* packages",
    "Qwen function calling docs",
    "Qwen-Agent GitHub repo",
    "Qwen-Agent quickstart",
    "Qwen3 model card agentic-use note",
    "Hugging Face streaming docs",
    "smolagents provider/tool-calling issue",
    "LlamaIndex issue: Hugging Face conversational API returns 404",
    "LlamaIndex changelog"
  ],
  "textContent": "Because the API specifications change rapidly, version drift between packages is likely. If you’re lucky, updating `huggingface_hub` alone may resolve the issue, but the set of features supported through the API may also have changed:\n\n* * *\n\n# Qwen model errors in smolagents and LlamaIndex: likely causes and practical fixes\n\nYou are probably dealing with **two different integration problems**, not one “Qwen is broken” problem.\n\nThe two errors point to different layers:\n\n\n    # smolagents\n    Error in generating model output:\n    Client.post() got an unexpected keyword argument 'stream'\n\n\nThis points to a **streaming / HTTP-client / wrapper compatibility problem**.\n\n\n    # LlamaIndex\n    RuntimeError: Cannot send a request, as the client has been closed.\n\n\nThis points to an **async client lifecycle problem**, especially if you used async calls, streaming calls, reused the same LLM/query engine object, or ran this in a notebook/server where objects persist between calls.\n\nThe common pattern is:\n\n\n    Qwen model\n    + Hugging Face provider routing\n    + wrapper library\n    + stream/tool/chat/async behavior\n    + package versions\n    = failure surface\n\n\nSo the better diagnosis is not:\n\n\n    Qwen does not work.\n\n\nThe better diagnosis is:\n\n\n    The selected Qwen route works differently through different wrappers.\n    smolagents and LlamaIndex are hitting different edge cases in the inference stack.\n\n\n* * *\n\n## 1. 
Background: why this happens with Qwen + Hugging Face wrappers\n\nModern Hugging Face inference is no longer just:\n\n\n    model_id -> API call -> output\n\n\nIt is more like:\n\n\n    model_id\n    -> task type\n    -> provider selection\n    -> Hugging Face router / provider backend\n    -> Python client\n    -> framework wrapper\n    -> sync / async / streaming / tool-calling mode\n\n\nHugging Face’s InferenceClient is a unified client that can work with the free Inference API, self-hosted Inference Endpoints, and third-party Inference Providers.\n\nHugging Face Inference Providers also support provider-backed serverless inference, and the chat-completion task supports OpenAI-compatible calls, tools/constraints, and streaming.\n\nThat matters because Qwen models are often used for:\n\n  * chat\n  * coding\n  * agents\n  * tool calling\n  * streaming\n  * RAG response synthesis\n  * OpenAI-compatible routes\n  * Hugging Face provider routes\n\n\n\nThose features do **not** always have the same support across every provider/wrapper combination.\n\nA Qwen model might work here:\n\n\n    Direct Hugging Face InferenceClient\n    -> explicit provider\n    -> non-streaming chat\n\n\nbut fail here:\n\n\n    smolagents\n    -> auto provider\n    -> tool-calling agent\n    -> streaming\n\n\nor here:\n\n\n    LlamaIndex\n    -> HuggingFaceInferenceAPI\n    -> async streaming\n    -> reused client\n\n\nSame model name. Different route. Different behavior.\n\n* * *\n\n## 2. The smolagents error\n\n### Error\n\n\n    Error in generating model output:\n    Client.post() got an unexpected keyword argument 'stream'\n\n\n### What this error usually means\n\nThis is a Python client-level error. It usually means some code called something shaped like:\n\n\n    client.post(..., stream=True)\n\n\nbut that `Client.post()` method does **not** accept a `stream` keyword argument.\n\nThis is important because huggingface_hub v1 migrated from requests/aiohttp to httpx. 
That migration is a good thing overall, but it can expose old wrapper assumptions.\n\nOlder `requests`-style code commonly uses:\n\n\n    requests.post(url, stream=True)\n\n\nBut `httpx` streaming is shaped differently. If a wrapper forwards `stream=True` into the wrong layer, you can get:\n\n\n    Client.post() got an unexpected keyword argument 'stream'\n\n\n### Critical distinction: two meanings of `stream`\n\nThere are two different layers where “streaming” can appear.\n\n#### Valid: model/API-level streaming\n\nThis is normal:\n\n\n    client.chat.completions.create(\n        model=\"...\",\n        messages=[...],\n        stream=True,\n    )\n\n\nHugging Face’s TGI streaming docs show this style: pass `stream=True` to `InferenceClient.chat.completions.create(...)` and iterate over chunks.\n\n#### Risky: HTTP-client-level streaming\n\nThis is the suspicious shape:\n\n\n    client.post(..., stream=True)\n\n\nYour error suggests that `stream=True` is reaching a lower-level `.post()` method that does not accept it.\n\nSo the likely problem is **not** “streaming is impossible.” The likely problem is:\n\n\n    stream=True is being passed to the wrong abstraction layer.\n\n\n* * *\n\n## 3. 
Likely causes of the smolagents error\n\n### Cause 1 — version mismatch between `smolagents`, `huggingface_hub`, `httpx`, and possibly `openai`\n\nThis is the most likely cause.\n\nA broken environment may look like this:\n\n\n    smolagents: older\n    huggingface_hub: newer, HTTPX-based\n    httpx: newer\n    openai: newer\n\n\nor the reverse:\n\n\n    smolagents: newer\n    huggingface_hub: older\n    httpx: incompatible\n\n\nThe wrapper expects one API shape; the installed lower-level client exposes another.\n\nRelevant docs:\n\n  * huggingface_hub v1 migration guide\n  * InferenceClient reference\n  * Hugging Face chat-completion task\n\n\n\n### Cause 2 — streaming is enabled implicitly\n\nYou may not be explicitly writing `stream=True`, but a framework can enable streaming internally.\n\nFor example, an agent framework may stream model output to:\n\n  * show tokens incrementally\n  * collect intermediate tool calls\n  * support function/tool-call deltas\n  * stream logs into a UI\n  * support notebook display\n\n\n\nSo even if your own code does not contain `stream=True`, the wrapper path may still use streaming.\n\n### Cause 3 — provider auto-selection changed or selected a provider that does not support the needed mode\n\nThis is especially relevant for smolagents.\n\nThere is a relevant smolagents issue: Default ToolCallingAgent InferenceClient examples failing. 
In that case, the report says Hugging Face provider selection picked a provider that did not support tool calling, while another provider had worked before.\n\nThat issue is not the exact same `stream` error, but it is the same architectural class:\n\n\n    agent wrapper sends advanced parameters\n    -> selected provider does not support them\n    -> failure appears inside the wrapper\n\n\nIn that issue, the advanced parameters were `tools` / `tool_choice`.\n\nIn your case, the advanced parameter is probably `stream`.\n\n### Cause 4 — using an agent/tool-calling path before plain chat is verified\n\nAgent frameworks send more complex payloads than normal chat.\n\nPlain chat might send:\n\n\n    {\n      \"messages\": [...],\n      \"max_tokens\": 256\n    }\n\n\nAgent/tool-calling may send:\n\n\n    {\n      \"messages\": [...],\n      \"tools\": [...],\n      \"tool_choice\": \"auto\",\n      \"stream\": true,\n      \"max_tokens\": 256\n    }\n\n\nIf any part of that payload is unsupported by the provider or wrapper, you can get an error that looks unrelated to tools.\n\n* * *\n\n## 4. 
smolagents solutions\n\n### Solution A — upgrade related packages together\n\nDo not upgrade only one package.\n\nUse a clean environment if possible:\n\n\n    python -m venv .venv\n    source .venv/bin/activate\n\n    python -m pip install --upgrade pip\n    python -m pip install -U \\\n      \"smolagents[toolkit]\" \\\n      huggingface_hub \\\n      httpx \\\n      openai\n\n\nThen restart the Python process or notebook kernel.\n\nThis matters because if you install new packages in an already-running notebook, Python may still hold old imported modules in memory.\n\n### Solution B — print versions before debugging\n\nRun this in the failing environment:\n\n\n    import sys\n    import importlib.metadata as md\n\n    packages = [\n        \"smolagents\",\n        \"huggingface_hub\",\n        \"httpx\",\n        \"openai\",\n    ]\n\n    print(\"Python:\", sys.version)\n\n    for package in packages:\n        try:\n            print(f\"{package}: {md.version(package)}\")\n        except md.PackageNotFoundError:\n            print(f\"{package}: not installed\")\n\n\nLook for a mixed environment such as:\n\n\n    new huggingface_hub + old smolagents\n    new httpx + old wrapper\n    old notebook imports after pip upgrade\n\n\nAlso check your Python version. 
`huggingface_hub` v1 requires Python 3.9+ according to the v1 migration guide.\n\n### Solution C — test Hugging Face directly before smolagents\n\nBefore using smolagents, check whether the model/provider route works directly.\n\n\n    import os\n    from huggingface_hub import InferenceClient\n\n    client = InferenceClient(\n        provider=\"<provider>\",\n        api_key=os.environ[\"HF_TOKEN\"],\n    )\n\n    response = client.chat.completions.create(\n        model=\"<qwen-model-id>\",\n        messages=[\n            {\"role\": \"user\", \"content\": \"Return exactly: ok\"}\n        ],\n        max_tokens=10,\n        temperature=0,\n        stream=False,\n    )\n\n    print(response.choices[0].message.content)\n\n\nExample values:\n\n\n    <provider> = together\n    <qwen-model-id> = Qwen/Qwen3-Coder-30B-A3B-Instruct\n\n\nInterpretation:\n\nResult | Meaning\n---|---\nDirect call works | Qwen/provider/token/basic route are OK; smolagents wrapper is the likely issue\nDirect call fails with auth/quota/provider error | Fix token/provider/model availability first\nDirect call fails with stream/client error | Your lower-level client environment is suspect\nNon-streaming works but streaming fails | Streaming/provider/client path is the issue\n\n### Solution D — set provider explicitly\n\nAvoid provider auto-selection while debugging.\n\n\n    from smolagents import CodeAgent, InferenceClientModel\n    import os\n\n    model = InferenceClientModel(\n        model_id=\"<qwen-model-id>\",\n        provider=\"<provider>\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n    agent = CodeAgent(\n        tools=[],\n        model=model,\n    )\n\n    result = agent.run(\"Return exactly: ok\")\n    print(result)\n\n\nExample:\n\n\n    model = InferenceClientModel(\n        model_id=\"Qwen/Qwen3-Coder-30B-A3B-Instruct\",\n        provider=\"together\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        
max_tokens=512,\n        temperature=0.2,\n    )\n\n\nThe key parts are:\n\n\n    provider=\"<provider>\"\n    tools=[]\n\n\nThis removes two common sources of failure:\n\n  1. implicit provider selection\n  2. tool-calling payload complexity\n\n\n\nYou can later change `provider` to another provider that is currently supported for your exact Qwen model.\n\n### Solution E — add tools only after the basic model works\n\nUse this order:\n\n\n    1. Direct Hugging Face non-streaming call\n    2. Direct Hugging Face streaming call\n    3. smolagents with tools=[]\n    4. smolagents with one simple local tool\n    5. smolagents with multiple tools\n    6. smolagents with web/search/external tools\n    7. only then tool-calling + streaming together\n\n\nDo **not** start with:\n\n\n    Qwen + smolagents + auto provider + tools + streaming\n\n\nThat is too many moving parts.\n\n* * *\n\n## 5. The LlamaIndex error\n\n### Error\n\n\n    RuntimeError: Cannot send a request, as the client has been closed.\n\n\n### What this error usually means\n\nThis usually means a client object was closed and then reused.\n\nIn LlamaIndex, this is especially plausible with `HuggingFaceInferenceAPI`.\n\nThe LlamaIndex HuggingFaceInferenceAPI reference describes a wrapper around Hugging Face’s Inference API. It uses Hugging Face’s `InferenceClient` for sync calls and `AsyncInferenceClient` for async calls.\n\nThe same reference/source shows async streaming paths shaped like:\n\n\n    async for delta in await self._async_client.text_generation(\n        prompt, stream=True, **model_kwargs\n    ):\n        ...\n    await self._async_client.close()\n\n\nThat creates a likely failure sequence:\n\n\n    1. You create one HuggingFaceInferenceAPI object.\n    2. You use async streaming.\n    3. The stream finishes.\n    4. LlamaIndex closes its internal AsyncInferenceClient.\n    5. You reuse the same LLM object, query engine, chat engine, or app object.\n    6. 
The next request tries to use the closed client.\n    7. RuntimeError: Cannot send a request, as the client has been closed.\n\n\nThis fits your LlamaIndex error closely.\n\n* * *\n\n## 6. Additional LlamaIndex-specific issue: old conversational task path\n\nThere is another important problem in the LlamaIndex Hugging Face wrapper.\n\nThe LlamaIndex Hugging Face wrapper has historically separated chat/completion behavior around Hugging Face task types. A relevant issue, \"Hugging Face conversational API returns 404\", reports that Hugging Face’s old `conversational` task route returned 404 and that setting `task=\"text-generation\"` was the workaround.\n\nThere are also Hugging Face course/notebook discussions showing similar Qwen + `HuggingFaceInferenceAPI` breakage, such as:\n\n  * 404 Not Found when using Qwen models with HuggingFaceInferenceAPI\n  * Agents Course Unit 2.2 error 404\n\n\n\nSo with Qwen + LlamaIndex, you may have **two overlapping problems**:\n\n\n    Problem A: async client closed and reused\n    Problem B: chat/conversational task route is fragile or stale\n\n\nThat is why the safer first test is:\n\n\n    sync + non-streaming + text-generation\n\n\n* * *\n\n## 7. 
LlamaIndex solutions\n\n### Solution A — first test sync, non-streaming, text generation\n\nStart here:\n\n\n    import os\n    from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\n    llm = HuggingFaceInferenceAPI(\n        model_name=\"<qwen-model-id>\",\n        token=os.environ[\"HF_TOKEN\"],\n        provider=\"<provider>\",\n        task=\"text-generation\",\n        is_chat_model=False,\n        num_output=128,\n        temperature=0.2,\n    )\n\n    response = llm.complete(\"Return exactly: ok\")\n    print(response.text)\n\n\nExample:\n\n\n    llm = HuggingFaceInferenceAPI(\n        model_name=\"Qwen/Qwen3-Coder-30B-A3B-Instruct\",\n        token=os.environ[\"HF_TOKEN\"],\n        provider=\"together\",\n        task=\"text-generation\",\n        is_chat_model=False,\n        num_output=128,\n        temperature=0.2,\n    )\n\n\nWhy this is the right first test:\n\n  * avoids async\n  * avoids streaming\n  * avoids the fragile chat/conversational route\n  * avoids query-engine complexity\n  * uses text generation directly\n  * keeps provider explicit\n\n\n\nIf this works, then Qwen + Hugging Face + LlamaIndex basic completion is fine.\n\nIf your full app still fails, the problem is likely in:\n\n  * async\n  * streaming\n  * query engine reuse\n  * chat engine reuse\n  * chat/conversational route\n  * package mismatch\n\n\n\n### Solution B — avoid reusing the same LlamaIndex object after async streaming\n\nRisky pattern:\n\n\n    llm = HuggingFaceInferenceAPI(...)\n\n    stream = await llm.astream_complete(\"First question\")\n    async for chunk in stream:\n        print(chunk.delta, end=\"\")\n\n    # Later reuse the same llm\n    stream = await llm.astream_complete(\"Second question\")\n\n\nSafer diagnostic pattern:\n\n\n    import os\n    from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\n    def make_llm():\n        return HuggingFaceInferenceAPI(\n            model_name=\"<qwen-model-id>\",\n            
token=os.environ[\"HF_TOKEN\"],\n            provider=\"<provider>\",\n            task=\"text-generation\",\n            is_chat_model=False,\n            num_output=256,\n            temperature=0.2,\n        )\n\n    llm1 = make_llm()\n    # use llm1 once\n\n    llm2 = make_llm()\n    # use llm2 for the next async/streaming call\n\n\nThis is not the prettiest architecture, but it is a strong workaround if the issue is “closed client reused.”\n\n### Solution C — avoid async streaming until the basic path works\n\nFirst prove:\n\n\n    llm.complete(...)\n\n\nThen test:\n\n\n    await llm.acomplete(...)\n\n\nThen test:\n\n\n    llm.stream_complete(...)\n\n\nThen test:\n\n\n    await llm.astream_complete(...)\n\n\nDo not jump directly into a RAG query engine with async streaming.\n\n### Solution D — use LlamaIndex’s OpenAI-compatible route through the Hugging Face router\n\nBecause Hugging Face Inference Providers expose an OpenAI-compatible chat route, this can be cleaner than using the LlamaIndex Hugging Face-specific wrapper for chat-style Qwen calls.\n\nRelevant docs:\n\n  * Hugging Face Inference Providers\n  * Hugging Face chat completion\n  * LlamaIndex Hugging Face LLM integration\n  * LlamaIndex HuggingFaceInferenceAPI reference\n\n\n\nInstall:\n\n\n    python -m pip install -U \\\n      llama-index \\\n      llama-index-core \\\n      llama-index-llms-openai-like \\\n      openai\n\n\nThen:\n\n\n    import os\n    from llama_index.core.llms import ChatMessage\n    from llama_index.llms.openai_like import OpenAILike\n\n    llm = OpenAILike(\n        model=\"<qwen-model-id>:<provider>\",\n        api_base=\"https://router.huggingface.co/v1\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        is_chat_model=True,\n        is_function_calling_model=False,\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n    # chat() takes ChatMessage objects rather than raw dicts\n    response = llm.chat([\n        ChatMessage(role=\"user\", content=\"Return exactly: ok\")\n    ])\n\n    print(response.message.content)\n\n\nExample:\n\n\n    llm = OpenAILike(\n        
model=\"Qwen/Qwen3-Coder-30B-A3B-Instruct:together\",\n        api_base=\"https://router.huggingface.co/v1\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        is_chat_model=True,\n        is_function_calling_model=False,\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n\nThis path is often conceptually simpler:\n\n\n    LlamaIndex OpenAILike\n    -> Hugging Face router /v1\n    -> explicit provider\n    -> Qwen\n\n\ninstead of:\n\n\n    LlamaIndex HuggingFaceInferenceAPI\n    -> conversational/text-generation task logic\n    -> sync/async Hugging Face clients\n    -> provider/router\n    -> Qwen\n\n\nFor your exact error, I would strongly consider the OpenAI-compatible route.\n\n* * *\n\n## 8. Package alignment\n\nBoth errors can be worsened by partially upgraded packages.\n\nUse a coherent install:\n\n\n    python -m pip install --upgrade pip\n\n    python -m pip install -U \\\n      huggingface_hub \\\n      httpx \\\n      openai \\\n      \"smolagents[toolkit]\" \\\n      llama-index \\\n      llama-index-core \\\n      llama-index-llms-huggingface-api \\\n      llama-index-llms-openai-like\n\n\nThen restart the Python process.\n\nPrint versions:\n\n\n    import sys\n    import importlib.metadata as md\n\n    packages = [\n        \"smolagents\",\n        \"huggingface_hub\",\n        \"httpx\",\n        \"openai\",\n        \"llama-index\",\n        \"llama-index-core\",\n        \"llama-index-llms-huggingface-api\",\n        \"llama-index-llms-openai-like\",\n    ]\n\n    print(\"Python:\", sys.version)\n\n    for package in packages:\n        try:\n            print(f\"{package}: {md.version(package)}\")\n        except md.PackageNotFoundError:\n            print(f\"{package}: not installed\")\n\n\nWhy this matters:\n\n  * huggingface_hub v1 changed the HTTP layer to httpx.\n  * LlamaIndex has many separate llama-index-* packages, and the changelog has explicitly warned in past releases that updating one package can require updating the 
rest.\n  * Notebook kernels can keep old imports alive after `pip install -U`.\n\n\n\n* * *\n\n## 9. Recommended debugging order\n\nUse this exact order.\n\n### Step 1 — verify environment\n\n\n    import sys\n    import importlib.metadata as md\n\n    for package in [\n        \"smolagents\",\n        \"huggingface_hub\",\n        \"httpx\",\n        \"openai\",\n        \"llama-index\",\n        \"llama-index-core\",\n        \"llama-index-llms-huggingface-api\",\n    ]:\n        try:\n            print(package, md.version(package))\n        except md.PackageNotFoundError:\n            print(package, \"not installed\")\n\n    print(sys.version)\n\n\n### Step 2 — direct Hugging Face non-streaming\n\n\n    import os\n    from huggingface_hub import InferenceClient\n\n    client = InferenceClient(\n        provider=\"<provider>\",\n        api_key=os.environ[\"HF_TOKEN\"],\n    )\n\n    response = client.chat.completions.create(\n        model=\"<qwen-model-id>\",\n        messages=[{\"role\": \"user\", \"content\": \"Return exactly: ok\"}],\n        max_tokens=10,\n        stream=False,\n    )\n\n    print(response.choices[0].message.content)\n\n\n### Step 3 — direct Hugging Face streaming\n\n\n    stream = client.chat.completions.create(\n        model=\"<qwen-model-id>\",\n        messages=[{\"role\": \"user\", \"content\": \"Count to 3.\"}],\n        max_tokens=64,\n        stream=True,\n    )\n\n    for chunk in stream:\n        delta = chunk.choices[0].delta.content\n        if delta:\n            print(delta, end=\"\")\n\n\nIf Step 2 works but Step 3 fails, the issue is specifically streaming/provider/client related.\n\n### Step 4 — smolagents with no tools\n\n\n    import os\n    from smolagents import CodeAgent, InferenceClientModel\n\n    model = InferenceClientModel(\n        model_id=\"<qwen-model-id>\",\n        provider=\"<provider>\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n    agent = 
CodeAgent(\n        tools=[],\n        model=model,\n    )\n\n    print(agent.run(\"Return exactly: ok\"))\n\n\nIf this fails with the `stream` keyword error, focus on:\n\n  * package versions\n  * `smolagents` / `huggingface_hub` / `httpx` compatibility\n  * whether smolagents is internally streaming\n  * provider route\n\n\n\n### Step 5 — LlamaIndex sync text generation\n\n\n    import os\n    from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\n    llm = HuggingFaceInferenceAPI(\n        model_name=\"<qwen-model-id>\",\n        token=os.environ[\"HF_TOKEN\"],\n        provider=\"<provider>\",\n        task=\"text-generation\",\n        is_chat_model=False,\n        num_output=128,\n        temperature=0.2,\n    )\n\n    print(llm.complete(\"Return exactly: ok\").text)\n\n\nIf this works, but your app fails, the issue is not basic Qwen generation. It is likely one of:\n\n  * async\n  * streaming\n  * query engine reuse\n  * chat route\n  * object lifecycle\n\n\n\n### Step 6 — LlamaIndex OpenAI-compatible path\n\n\n    import os\n    from llama_index.core.llms import ChatMessage\n    from llama_index.llms.openai_like import OpenAILike\n\n    llm = OpenAILike(\n        model=\"<qwen-model-id>:<provider>\",\n        api_base=\"https://router.huggingface.co/v1\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        is_chat_model=True,\n        is_function_calling_model=False,\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n    # chat() takes ChatMessage objects rather than raw dicts\n    response = llm.chat([\n        ChatMessage(role=\"user\", content=\"Return exactly: ok\")\n    ])\n\n    print(response.message.content)\n\n\nIf this works, you can use it as the stable LlamaIndex route.\n\n* * *\n\n## 10. 
Decision table\n\nSymptom | Most likely cause | What to try\n---|---|---\n`Client.post() got an unexpected keyword argument 'stream'` in smolagents | `stream` flag forwarded to wrong HTTP/client layer | Upgrade `smolagents`, `huggingface_hub`, `httpx`, `openai` together\nSame smolagents code worked before, now fails | Provider auto-selection changed | Set provider explicitly\nsmolagents works without tools but fails with tools | Provider/model does not support tool calling | Use a tool-capable provider/model pair\nDirect HF non-streaming works, streaming fails | Streaming support mismatch | Disable streaming or change provider\nLlamaIndex fails with `client has been closed` | Async client closed then reused | Avoid async streaming or recreate LLM object per stream\nLlamaIndex chat path fails but completion works | Fragile conversational/chat task route | Use `task=\"text-generation\"`\nLlamaIndex HF wrapper remains unstable | Wrapper/task/lifecycle mismatch | Use `OpenAILike` with HF router\nErrors vary between notebook runs | Kernel has mixed old/new imports | Restart runtime after package changes\n\n* * *\n\n## 11. 
What I would do in your case\n\n### For smolagents\n\nUse this first:\n\n\n    from smolagents import CodeAgent, InferenceClientModel\n    import os\n\n    model = InferenceClientModel(\n        model_id=\"Qwen/Qwen3-Coder-30B-A3B-Instruct\",\n        provider=\"together\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n    agent = CodeAgent(\n        tools=[],\n        model=model,\n    )\n\n    print(agent.run(\"Return exactly: ok\"))\n\n\nThen add tools one by one.\n\nIf that still gives:\n\n\n    Client.post() got an unexpected keyword argument 'stream'\n\n\nthen assume package/API mismatch and reinstall cleanly:\n\n\n    python -m pip install -U \\\n      \"smolagents[toolkit]\" \\\n      huggingface_hub \\\n      httpx \\\n      openai\n\n\nThen restart the runtime.\n\n### For LlamaIndex\n\nUse this first:\n\n\n    from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n    import os\n\n    llm = HuggingFaceInferenceAPI(\n        model_name=\"Qwen/Qwen3-Coder-30B-A3B-Instruct\",\n        token=os.environ[\"HF_TOKEN\"],\n        provider=\"together\",\n        task=\"text-generation\",\n        is_chat_model=False,\n        num_output=512,\n        temperature=0.2,\n    )\n\n    print(llm.complete(\"Return exactly: ok\").text)\n\n\nIf your actual app needs chat-style LlamaIndex usage, move to:\n\n\n    from llama_index.llms.openai_like import OpenAILike\n    import os\n\n    llm = OpenAILike(\n        model=\"Qwen/Qwen3-Coder-30B-A3B-Instruct:together\",\n        api_base=\"https://router.huggingface.co/v1\",\n        api_key=os.environ[\"HF_TOKEN\"],\n        is_chat_model=True,\n        is_function_calling_model=False,\n        max_tokens=512,\n        temperature=0.2,\n    )\n\n\nIf your actual app needs async streaming, either:\n\n  * recreate the LLM object per streamed call\n  * avoid `astream_*` temporarily\n  * use sync calls\n  * switch to the OpenAI-compatible route\n\n\n\n* * 
*\n\n## 12. Why Qwen appears to be involved\n\nQwen is likely not the root cause, but Qwen makes the problem visible because Qwen models are often used with:\n\n  * coding agents\n  * tool calling\n  * chat completion\n  * OpenAI-compatible endpoints\n  * provider-backed inference\n  * streaming\n  * LlamaIndex RAG\n\n\n\nAll of those features stress integration layers.\n\nA plain Qwen call may work:\n\n\n    Qwen + direct HF client + non-streaming\n\n\nbut an agent/RAG call may fail:\n\n\n    Qwen + wrapper + provider auto-selection + streaming + tools + async reuse\n\n\nThat is why the same model can look broken in smolagents and LlamaIndex for different reasons.\n\nFor Qwen-native agent/tool work, also inspect:\n\n  * Qwen function calling docs\n  * Qwen-Agent GitHub repo\n  * Qwen-Agent quickstart\n  * Qwen3 model card agentic-use note\n\n\n\nThe deeper point:\n\n\n    Tool calling is not just a model capability.\n    It is a contract among:\n    model output format + serving-layer parser + client wrapper.\n\n\n* * *\n\n## 13. 
Most likely final explanation\n\n### smolagents\n\nThe smolagents wrapper is probably enabling or forwarding streaming, and the `stream` parameter is reaching a lower-level `Client.post()` method that does not accept it.\n\nMost likely causes:\n\n  * `smolagents` / `huggingface_hub` / `httpx` version mismatch\n  * wrapper forwarding `stream=True` to the wrong layer\n  * provider route not supporting the requested streaming/tool mode\n  * implicit provider selection choosing a different backend than expected\n\n\n\nRelevant links:\n\n  * huggingface_hub v1 migration guide\n  * InferenceClient reference\n  * Hugging Face streaming docs\n  * smolagents provider/tool-calling issue\n\n\n\n### LlamaIndex\n\nThe LlamaIndex Hugging Face wrapper likely closes its internal async client after an async streaming call, and then the same object is reused.\n\nAdditionally, LlamaIndex’s Hugging Face wrapper can hit fragile task-route assumptions around chat/conversational behavior. A Qwen-related issue reported that setting `task=\"text-generation\"` avoids a failing conversational route.\n\nRelevant links:\n\n  * LlamaIndex HuggingFaceInferenceAPI reference\n  * LlamaIndex Hugging Face LLM integration\n  * LlamaIndex issue: Hugging Face conversational API returns 404\n  * LlamaIndex changelog\n\n\n\n* * *\n\n## 14. 
Short summary\n\n  * Your smolagents error is probably a **streaming/client compatibility** problem.\n  * Your LlamaIndex error is probably an **async client reuse-after-close** problem.\n  * Qwen is likely not broken; the wrappers/routes around Qwen are the issue.\n  * Use an **explicit provider**, not provider auto-selection.\n  * Start with **non-streaming**, **no tools**, and the **direct Hugging Face client**.\n  * For LlamaIndex, start with `task=\"text-generation\"` and sync `.complete()`.\n  * For LlamaIndex chat, consider `OpenAILike` with the Hugging Face router.\n  * Upgrade related packages together and restart the runtime after upgrading.\n\n\n\nRecommended debugging order:\n\n\n    1. Print package versions.\n    2. Test direct Hugging Face non-streaming.\n    3. Test direct Hugging Face streaming.\n    4. Test smolagents with tools=[].\n    5. Test LlamaIndex sync text-generation.\n    6. Test LlamaIndex OpenAI-compatible route.\n    7. Add tools.\n    8. Add streaming.\n    9. Add async.\n",
  "title": "Issue with Qwen model in HuggingFace"
}