I am trying to crawl links from websites, but it either returns empty results or takes too long to retrieve the links. How can I make it run faster, skip redundant work to save time, and add a retry mechanism to make it more robust?
```python
try:
    result = await crawler.arun(
        url=url,
        bypass_cache=True,
        verbose=True,
        user_agent=random.choice(self.user_agents),
    )
    if hasattr(result, 'error_message') and result.error_message:
        print(f"Error encountered while crawling {url}: {result.error_message}")
        return []
    print(f"Successfully crawled: {result.url}")
    soup = BeautifulSoup(result.html, self.parser)
    links = set()
    base_netloc = urlparse(url).netloc
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Strip a trailing colon from the href if present
        if href.endswith(':'):
            href = href.rstrip(':')
        if href.startswith('/'):
            # Relative link: resolve it against the page URL
            full_url = urljoin(url, href)
            links.add(full_url)
        else:
            # Absolute link: keep it only if it stays on the same site
            href_netloc = urlparse(href).netloc
            if href_netloc == base_netloc or href.startswith(url):
                links.add(href)
    filtered_links = list(links)
    return filtered_links
except Exception as exc:
    # The original snippet had a `try:` with no handler; fail soft here
    print(f"Exception while crawling {url}: {exc}")
    return []
```
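For the retry part of the question, a small async wrapper with exponential backoff and jitter can absorb transient failures. This is a minimal sketch, not part of the crawler's API; `retry_async` is a hypothetical helper name, and you would pass `crawler.arun` (or your own fetch coroutine) as `fn`:

```python
import asyncio
import random

async def retry_async(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Run an async callable, retrying with exponential backoff plus jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # back off base, 2*base, 4*base, ... with jitter to spread retries
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```

Usage would look like `result = await retry_async(crawler.arun, url=url, bypass_cache=True)`, so a single flaky response no longer produces an empty result list.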
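To stop redundant work, you can normalize URLs before adding them to a visited set, so near-duplicates (different case in the host, trailing slashes, fragments) collapse to a single key and are crawled only once. A sketch, with `normalize_url` and `should_crawl` as hypothetical helper names:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL so near-duplicates map to the same visited-set key."""
    parts = urlsplit(url)
    path = parts.path.rstrip('/') or '/'
    # drop the fragment, lowercase the host, trim the trailing slash
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ''))

visited = set()

def should_crawl(url):
    """Return True the first time a normalized URL is seen; False on repeats."""
    key = normalize_url(url)
    if key in visited:
        return False
    visited.add(key)
    return True
```

Calling `should_crawl(url)` before `crawler.arun(...)` skips pages the crawler has already processed under a slightly different spelling of the same URL.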
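For speed, crawling URLs concurrently with a bounded semaphore usually helps far more than tuning a single request. A minimal sketch, assuming a `fetch` coroutine of your own (e.g. one that wraps `crawler.arun`); `crawl_many` is a hypothetical name:

```python
import asyncio

async def crawl_many(urls, fetch, max_concurrency=10):
    """Fetch many URLs concurrently, capped at max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # at most max_concurrency fetches run at once
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The cap keeps you from opening hundreds of simultaneous connections, which tends to trigger rate limiting and makes "taking too long" worse, not better.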