Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to do LLM extraction with azure openai #174

Open
MeghanaSrinath opened this issue Oct 17, 2024 · 3 comments
Open

Unable to do LLM extraction with azure openai #174

MeghanaSrinath opened this issue Oct 17, 2024 · 3 comments
Assignees
Labels
question Further information is requested

Comments

@MeghanaSrinath
Copy link

Hi,
We are trying to do the LLM extraction using the sample code provided here.
This is how we have added the LLM details

async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                base_url="https://xxx.openai.azure.com/openai/deployments/xx/chat/completions?api-version=xx",
                api_token="xxxx", 
                instruction="Extract only content related to technology"
            ),
            bypass_cache=True,
        )

These same credentials are working in other codes that we have for other use cases. However, when we try to run the sample code, we are getting the error as below.

[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 8.29 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.34 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

[LOG] Call LLM for https://www.nbcnews.com/business - block index: 4
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 5
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
Error in thread execution: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 33.02 seconds.
Number of tech-related items extracted: 6
Traceback (most recent call last):
  File "C:\test.py", line 31, in <module>
    asyncio.run(extract_tech_content())
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete     
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\test.py", line 28, in extract_tech_content
    with open(".data/tech_content.json", "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.data/tech_content.json'
@unclecode unclecode self-assigned this Oct 18, 2024
@unclecode unclecode added the question Further information is requested label Oct 18, 2024
@unclecode
Copy link
Owner

@MeghanaSrinath Thanks for using Crawl4AI. The error message is coming from the litellm library that we use to communicate with the language model. It seems that it cannot find the standard Open AI interface from the base URL that you passed. One thing we can do is try to use the standard Open AI base url (do not pass anything) and make sure that works. If that works, it means there must be something about the base URL that you are passing. In the worse scenario, you can create a temporary API token for me, and then I'll test it on my end to figure out why it doesn't work and I will fix it for you. Also please share with me the full code have you show me the full code, including the part where you are saving the data into tech_content.json.

@mobyds
Copy link

mobyds commented Oct 21, 2024

me I use the .env with this and I don't put base_url in the LLMExtractionStrategy:
AZURE_API_BASE=https://xxxxx.openai.azure.com/
AZURE_DEPLOYMENT=gpt4o-mini
AZURE_API_VERSION="2024-06-01"

@unclecode
Copy link
Owner

@mobyds Follow the explanation in this link https://docs.litellm.ai/docs/providers/azure

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants