Unable to do LLM extraction with azure openai #174

MeghanaSrinath · 2024-10-17T12:50:45Z

Hi,
We are trying to run LLM extraction using the sample code provided here.
This is how we added the LLM details:

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            base_url="https://xxx.openai.azure.com/openai/deployments/xx/chat/completions?api-version=xx",
            api_token="xxxx",
            instruction="Extract only content related to technology"
        ),
        bypass_cache=True,
    )

These same credentials work in our other code for other use cases. However, when we run the sample code, we get the error below.

[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 8.29 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.34 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

[LOG] Call LLM for https://www.nbcnews.com/business - block index: 4
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 5
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 33.02 seconds.
Number of tech-related items extracted: 6
Traceback (most recent call last):
  File "C:\test.py", line 31, in <module>
    asyncio.run(extract_tech_content())
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete     
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\test.py", line 28, in extract_tech_content
    with open(".data/tech_content.json", "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.data/tech_content.json'
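
Note on the traceback above: the final FileNotFoundError is separate from the 404 responses. Opening a file in "w" mode creates the file but not its parent directory, so the write fails when the `.data` directory does not exist. A minimal sketch of one way to avoid that, assuming the extracted content is JSON-serializable (the helper name `save_json` is hypothetical and not part of Crawl4AI):

```python
import json
from pathlib import Path


def save_json(data, path):
    """Write data as JSON, creating the parent directory first if it is missing."""
    target = Path(path)
    # exist_ok=True makes this a no-op when the directory is already there.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return target
```

With this in place, something like `save_json(tech_content, ".data/tech_content.json")` would create `.data` on first run instead of raising.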


unclecode · 2024-10-18T11:00:40Z

@MeghanaSrinath Thanks for using Crawl4AI. The error message comes from the litellm library, which we use to communicate with the language model. It seems it cannot find the standard OpenAI interface at the base URL you passed. One thing to try is the standard OpenAI base URL (do not pass base_url at all) to confirm that works. If it does, the problem is with the base URL you are passing. In the worst case, you can create a temporary API token for me, and I'll test it on my end to figure out why it doesn't work and fix it for you. Also, please share the full code, including the part where you save the data into tech_content.json.
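
One possible reading of the 404, offered as an assumption rather than a confirmed diagnosis: the base_url in the original post is a complete chat-completions endpoint (deployment path plus api-version query), while LiteLLM's Azure provider takes the resource base, the deployment name, and the API version as separate values. The sketch below splits such a URL into those three parts using only the standard library. The function name `split_azure_endpoint` and the returned key names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def split_azure_endpoint(full_url):
    """Split a full Azure OpenAI endpoint URL into base, deployment, and API version."""
    parts = urlsplit(full_url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected path shape: openai/deployments/<deployment>/chat/completions
    if "deployments" not in segments:
        raise ValueError("URL path has no 'deployments' segment")
    idx = segments.index("deployments") + 1
    if idx >= len(segments):
        raise ValueError("no deployment name after 'deployments'")
    # parse_qs returns lists; take the first api-version value if present.
    api_version = parse_qs(parts.query).get("api-version", [None])[0]
    return {
        "api_base": f"{parts.scheme}://{parts.netloc}/",
        "deployment": segments[idx],
        "api_version": api_version,
    }
```

The resulting values map onto what the LiteLLM Azure documentation describes: a model name of the form `azure/<deployment>`, plus an API base and API version supplied separately.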

mobyds · 2024-10-21T10:42:14Z

I use a .env file with the following, and I don't pass base_url to LLMExtractionStrategy:
AZURE_API_BASE=https://xxxxx.openai.azure.com/
AZURE_DEPLOYMENT=gpt4o-mini
AZURE_API_VERSION="2024-06-01"
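
A minimal sketch of how the settings above might be turned into arguments for LLMExtractionStrategy. Several things here are assumptions: that the .env file has already been loaded into the process environment, that the API key is stored in AZURE_API_KEY (it is not shown in the snippet above), and that the provider string follows LiteLLM's `azure/<deployment>` convention. AZURE_DEPLOYMENT is this comment's own variable name, while LiteLLM itself reads AZURE_API_BASE and AZURE_API_VERSION from the environment. The helper name `azure_llm_settings` is hypothetical.

```python
import os


def azure_llm_settings(env=os.environ):
    """Build provider/api_token values from Azure-related environment variables."""
    deployment = env["AZURE_DEPLOYMENT"]  # e.g. "gpt4o-mini", as in the .env above
    return {
        # LiteLLM routes to Azure when the model name is prefixed with "azure/".
        "provider": f"azure/{deployment}",
        # Assumption: the key lives in AZURE_API_KEY; None if it is not set.
        "api_token": env.get("AZURE_API_KEY"),
    }


# Intended use (not executed here, requires crawl4ai):
# strategy = LLMExtractionStrategy(
#     **azure_llm_settings(),
#     instruction="Extract only content related to technology",
# )
```

No base_url is passed in this sketch, matching the setup described above.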

unclecode · 2024-10-21T10:59:08Z

@mobyds Follow the explanation in this link https://docs.litellm.ai/docs/providers/azure

unclecode self-assigned this Oct 18, 2024

unclecode added the question Further information is requested label Oct 18, 2024
