cannot import name 'WebCrawler' from 'crawl4ai' #122

Open
gulnihalk opened this issue Oct 2, 2024 · 3 comments
@gulnihalk

gulnihalk commented Oct 2, 2024

Hi, when I try to run crawl4ai with Microsoft Edge on Windows, I get the error below (the same code works on Ubuntu with Chrome):

Traceback (most recent call last):
  File "d:\work\indexing\scrapper.py", line 1, in <module>
    from crawl4ai import WebCrawler
ImportError: cannot import name 'WebCrawler' from 'crawl4ai' (C:\Users\abc..\Local\Programs\Python\Python310\lib\site-packages\crawl4ai\__init__.py)

and here is my code below:

from crawl4ai import WebCrawler
import json

# Raw strings keep the Windows backslashes literal ('\x' would otherwise be read as an escape)
with open(r'D:\work\indexing\com\scrapped_urls.json', 'r') as file:
    json_data = json.load(file)
    print(type(json_data))

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

scrapped_file = r'D:\work\indexing\com\xyz.txt'

# Iterate through the JSON array
for item in json_data:
    # Run the crawler on a URL
    result = crawler.run(url=item["url"])
    # Append the scraped text to the output file
    with open(scrapped_file, "a") as f:
        f.write(result.markdown)

Do you have any idea?

@unclecode
Owner

Thanks for using our library. I do have a question: when you say you are running the library with Microsoft Edge on Windows, could you please clarify what you mean? Crawl4AI does not have any integration with Microsoft Edge or any other browser on your computer, so I'm guessing the error is related to Windows itself. If that's the case, I will run some additional tests on Windows to determine the root cause. I will also review the code you shared to see if I can identify the problem. Meanwhile, we are working on adding a scraping engine to the library, so please stay tuned for that update.

@unclecode unclecode self-assigned this Oct 3, 2024
@asumansaree

asumansaree commented Oct 4, 2024


Hi @unclecode, thanks for your interest in our problem (I work together with @gulnihalk).
I wrote this code on Ubuntu, where my browser is Chrome, and it scrapes all the URLs in the JSON file very well. The library is really good work! But when we run exactly the same code (except for the file paths) on a Windows machine that has only the Microsoft Edge browser, we get this error:
ImportError: cannot import name 'WebCrawler' from 'crawl4ai' (C:\Users\abc..\Local\Programs\Python\Python310\lib\site-packages\crawl4ai\__init__.py)
Even after installing all possible dependencies of crawl4ai, and even after changing the classes in the source code for Edge (for example, changing self.driver = webdriver.Chrome(service=self.service) to self.driver = webdriver.Edge(service=self.service) in crawler_strategy.py), it still doesn't work. Maybe that source code belongs to the Selenium part, which appears in the source code like this:

Screenshot from 2024-10-04 14-42-23
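
Since the same script behaves differently on the two machines, one quick check is whether both environments have the same crawl4ai release installed and whether the package's top level exposes the name being imported. The sketch below uses only the standard library; the helper name describe_package is made up for illustration, and the two class names in the example call are simply the ones mentioned in this thread.

```python
import importlib
from importlib import metadata


def describe_package(dist_name, module_name, attr_names):
    """Report a distribution's installed version and which names its module exposes."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # no distribution with this name in the current environment

    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module could not be imported at all, so none of the names are available.
        exported = {name: False for name in attr_names}
    else:
        exported = {name: hasattr(module, name) for name in attr_names}

    return {"version": version, "exports": exported}


if __name__ == "__main__":
    # Run this on both machines and compare the two reports.
    print(describe_package("crawl4ai", "crawl4ai", ["WebCrawler", "AsyncWebCrawler"]))
```

If the reports differ in version, the import error is more likely a difference between releases than between operating systems or browsers.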

@unclecode
Owner

@asumansaree Sorry for my late response, I've been on a short trip. I figured out why it behaves this way. You are still using the library the way the previous version worked, which was synchronous by default; now it's asynchronous. To use it in sync mode, you have to import the web crawler directly from its module: from crawl4ai.web_crawler import WebCrawler. I suggest you switch to async mode, which uses Playwright and is faster and more capable. Please refer to the documentation and examples; it's a significant improvement. Here is a code example for the async version:

import asyncio

from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    # The context manager starts the crawler and cleans it up on exit
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Print the first 500 characters of the extracted markdown
        print(result.markdown[:500])

async def main():
    await simple_crawl()

if __name__ == "__main__":
    asyncio.run(main())
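
For the original script, which loops over a JSON list of URLs and appends each page's markdown to a file, a port to the async API might look roughly like the sketch below. It assumes only what the example above shows: AsyncWebCrawler is an async context manager, arun(url=...) is awaitable, and the result has a markdown attribute. The helper names crawl_all and main, and the choice to skip empty results, are illustrative assumptions rather than anything from the library.

```python
import asyncio
import json


async def crawl_all(crawler, urls, out_path):
    """Crawl each URL in turn and append its markdown to out_path."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            result = await crawler.arun(url=url)
            # Assumption: a failed crawl may leave markdown empty or None, so skip it.
            if getattr(result, "markdown", None):
                out.write(result.markdown)
                out.write("\n")
                written += 1
    return written


async def main(urls_path, out_path):
    # Imported here so crawl_all can be exercised without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    # The input file is expected to be a JSON array of objects with a "url" key,
    # as in the script from the original question.
    with open(urls_path, "r", encoding="utf-8") as f:
        urls = [item["url"] for item in json.load(f)]

    async with AsyncWebCrawler(verbose=True) as crawler:
        count = await crawl_all(crawler, urls, out_path)
    print(f"wrote {count} pages to {out_path}")
```

It would be started with asyncio.run(main(urls_path, out_path)), passing the real input and output paths (as raw strings on Windows). The crawler is passed into crawl_all as a parameter, so the loop can be tried against a stand-in object before running a real crawl.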

@unclecode unclecode added the question Further information is requested label Oct 8, 2024