Include links in the markdown #173
Comments
@ManojBhamsagar-Draup Hi, thanks for using Crawl4AI. The markdown result should contain links. Would you be willing to share your code? I'd like to look at the snippet you are running, including the URL. Thank you. |
Yes, here I'm trying to scrape the Apple careers site: import asyncio async def main(): asyncio.run(main()) |
@ManojBhamsagar-Draup Oh, it's my bad - you're right. Well, you can get all the external and internal links in the |
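The comment above is cut off, so the exact attribute it refers to is not shown here. As a minimal sketch of the workaround it hints at, assume the crawl result exposes its links as a dict with `"internal"` and `"external"` lists whose entries carry `"href"` and `"text"` keys. That shape, and the helper name `append_links`, are assumptions for illustration, not a confirmed part of the library's API. Under those assumptions, the links could be appended to the markdown before it is passed to an LLM:

```python
def append_links(markdown: str, links: dict) -> str:
    """Append a 'Links' section to markdown text.

    `links` is assumed (hypothetically) to look like
    {"internal": [{"href": ..., "text": ...}], "external": [...]}.
    """
    lines = [markdown.rstrip(), "", "## Links"]
    for kind in ("internal", "external"):
        for link in links.get(kind, []):
            href = link.get("href")
            if not href:
                continue  # skip entries without a target URL
            # Fall back to the URL itself when the anchor text is empty.
            text = (link.get("text") or "").strip() or href
            lines.append(f"- [{text}]({href}) ({kind})")
    return "\n".join(lines)
```

This keeps the page text and the link targets in one document, so the model sees both without any change to how the markdown itself is generated.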
@unclecode sure Thanks ✌️ |
@ManojBhamsagar-Draup That’s right! In the new version, when you set the flag to true, the links will be wrapped within the markdown, exactly like the example we shared 😎 |
@unclecode that's great |
@ManojBhamsagar-Draup We have already added finer-grained control over how internal and external links are handled, for images as well as anchor tags. The following code demonstrates a better way of using this feature. I also wrote a lengthy explanation in issue #184, which I hope will be helpful.

import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Note: __data is the output directory; it is assumed to be defined elsewhere in the original script.

async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],  # Optional, default is None; gives more control over content extraction for markdown
            exclude_external_links=False,  # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,  # Default is False
            # social_media_domains=["facebook.com", "twitter.com", "instagram.com", ...]  # Add more domains here; the default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w") as f:
            f.write(result.markdown)
        print("Done")

asyncio.run(main())
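To make the internal, external, and social-media distinction behind these flags concrete, here is a rough sketch of one way such a classification could work. It is not the library's actual implementation: the `classify_link` helper and the `SOCIAL_DOMAINS` subset are hypothetical, and the rule used (compare hostnames after dropping a leading `www.`) is an assumption.

```python
from urllib.parse import urljoin, urlparse

# Illustrative subset only; the real default list is said to live in config.py.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(href: str, page_url: str) -> str:
    """Return 'internal', 'social', or 'external' for a link found on page_url."""
    host = _host(urljoin(page_url, href))  # resolve relative links first
    base = _host(page_url)
    if base and (host == base or host.endswith("." + base)):
        return "internal"
    if any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS):
        return "social"
    return "external"
```

With a rule like this, a relative link such as `/about` counts as internal, while a link to a different domain counts as external unless it matches one of the social-media domains.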
The result.markdown output doesn't include links. My use case is to pass the markdown to an LLM to identify product details, and I also need the URL of each product's detail page, which is missing because the links are not in the markdown.
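One quick way to check whether a given markdown string actually carries links, before handing it to an LLM, is to pull out the inline `[text](url)` pairs. The sketch below uses only the standard library; `extract_links` is a hypothetical helper, and the regular expression handles common inline links but not reference-style links or URLs containing parentheses.

```python
import re
from typing import List, Tuple

# Inline markdown links like [text](url "optional title").
# The negative lookbehind skips images, which start with "![".
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(\s*([^)\s]+)[^)]*\)")


def extract_links(markdown: str) -> List[Tuple[str, str]]:
    """Return (text, url) pairs for inline links found in markdown."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(markdown)]
```

An empty list for a page that clearly has anchors suggests the links were dropped during conversion rather than being absent from the page.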