Include links in the markdown #173

Closed
ManojBhamsagar-Draup opened this issue Oct 17, 2024 · 7 comments

@ManojBhamsagar-Draup

The result.markdown doesn't include links. My use case is to pass the markdown to an LLM to identify product details, and I also want the URL of each product's details page. But result.markdown doesn't contain any links.

@unclecode unclecode self-assigned this Oct 17, 2024
@unclecode unclecode added the question Further information is requested label Oct 17, 2024
@unclecode
Owner

@ManojBhamsagar-Draup Hi, thanks for using Crawl4AI. The markdown result should contain links. Could you share your code snippet, including the URL you are crawling, so I can take a look? Thank you.

@ManojBhamsagar-Draup
Author

Yes, here I'm trying to scrape Apple careers:

import asyncio
from crawl4ai import AsyncWebCrawler
import base64

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://jobs.apple.com/en-us/search?page=1")
        print(f"Basic crawl result: {result.markdown}")

asyncio.run(main())

@unclecode unclecode added the enhancement New feature or request label Oct 17, 2024
@unclecode
Owner

@ManojBhamsagar-Draup Oh, my bad, you're right. For now, you can get all the internal and external links from result.links, add them to your markdown, and pass the combined text to your large language model. In the next version, 0.3.7, which I plan to release within a day or two, I'm going to add a flag that lets you choose whether the links are included in the metadata. Thanks for reporting this; it's very helpful.
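
As a rough sketch of the workaround described above, the helper below appends a link list to the markdown before it is handed to an LLM. The function name append_links is hypothetical, and the shape of the links value is an assumption (a dict with "internal" and "external" lists whose items carry "href" and "text" keys); check the actual result.links in your Crawl4AI version before relying on it.

```python
def append_links(markdown: str, links: dict) -> str:
    """Append a 'Links' section to the markdown text.

    Assumed shape (not confirmed by this thread):
    {"internal": [{"href": ..., "text": ...}], "external": [...]}
    """
    lines = [markdown.rstrip(), "", "## Links"]
    for kind in ("internal", "external"):
        for link in links.get(kind, []):
            href = link.get("href")
            if not href:
                continue  # skip entries without a target URL
            # Fall back to the URL itself when the anchor text is empty
            text = (link.get("text") or "").strip() or href
            lines.append(f"- [{text}]({href}) ({kind})")
    return "\n".join(lines)
```

With a crawl result in hand, the call would look something like combined = append_links(result.markdown, result.links), and combined is then what gets passed to the model.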

@ManojBhamsagar-Draup
Author

ManojBhamsagar-Draup commented Oct 17, 2024

@unclecode Sure, thanks ✌️
That would be great, but it would be even more helpful if the links were embedded in the markdown itself. Something like this: https://r.jina.ai/https://jobs.apple.com/en-us/search?page=1

@unclecode
Owner

@ManojBhamsagar-Draup That’s right! In the new version, when you set the flag to true, the links will be wrapped within the markdown, exactly like the example we shared 😎

@ManojBhamsagar-Draup
Author

@unclecode that's great

@unclecode
Owner

@ManojBhamsagar-Draup We have now added more granular control over how internal and external links are handled, for images as well as anchor tags. The following code demonstrates a better way of using this feature. I also wrote a very lengthy explanation in issue #184, and I hope it will be helpful.

import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Assumption: the original snippet used __data without defining it; here it is
# taken to be an output directory next to this script.
__data = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
os.makedirs(__data, exist_ok=True)

async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],  # Optional - Default is None, this adds more control over the content extraction for markdown
            exclude_external_links=False,  # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,  # Default is False
            # social_media_domains = ["facebook.com", "twitter.com", "instagram.com", ...] Here you can add more domains, default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w", encoding="utf-8") as f:
            f.write(result.markdown)

    print("Done")

asyncio.run(main())
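
To make the link flags above more concrete, here is a small standalone sketch of what excluding external and social media links could mean. It is only an illustration and not Crawl4AI's implementation: the function filter_links, its parameters, and the three-domain list are all hypothetical, and the real default domain list lives in the library's config.py.

```python
from urllib.parse import urlparse

# Illustrative subset only; not the library's actual default list.
SOCIAL_MEDIA_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}

def _host(url: str) -> str:
    """Return the lowercase hostname without a leading 'www.', or '' if none."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def filter_links(page_url, hrefs, exclude_external=False, exclude_social=True):
    """Keep only the hrefs allowed by the two (hypothetical) flags."""
    base = _host(page_url)
    kept = []
    for href in hrefs:
        host = _host(href)
        # Relative links have no hostname and are treated as internal
        is_external = bool(host) and host != base
        is_social = any(host == d or host.endswith("." + d)
                        for d in SOCIAL_MEDIA_DOMAINS)
        if exclude_social and is_social:
            continue
        if exclude_external and is_external:
            continue
        kept.append(href)
    return kept
```

Calling filter_links with exclude_external=False and exclude_social=True mirrors the combination used in the snippet above: ordinary external links survive, while links to the listed social media domains are dropped.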
