Include links in the markdown #173

ManojBhamsagar-Draup · 2024-10-17T10:14:57Z

result.markdown doesn't include links. My use case is passing the markdown to an LLM to identify product details, and I also want the URL of each product's detail page, so I need the links to be present in the markdown.

unclecode · 2024-10-17T12:08:30Z

@ManojBhamsagar-Draup Hi, thanks for using Crawl4AI. The markdown result contains links. Could you share your code? I'd like to look at the snippet you're running, along with the URL and anything else relevant. Thank you.

ManojBhamsagar-Draup · 2024-10-17T12:26:41Z

Yes, here I'm trying to scrape the Apple careers site:

import asyncio
from crawl4ai import AsyncWebCrawler
import base64

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://jobs.apple.com/en-us/search?page=1")
        print(f"Basic crawl result: {result.markdown}")

asyncio.run(main())

unclecode · 2024-10-17T13:28:11Z

@ManojBhamsagar-Draup Oh, my bad, you're right. For now, you can get all the external and internal links from result.links, add them to your markdown, and pass the combined text to your large language model. In the next version, 0.3.7, which I plan to release within a day or two, I'm going to add a flag that controls whether the links are included. Thanks for reporting this; it's very helpful.
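A minimal sketch of that workaround, for illustration only. The helper name `append_links` is hypothetical, and the shape of `result.links` (a dict with "internal" and "external" lists of entries carrying "href" and "text") is an assumption you should check against the version you have installed.

```python
def append_links(markdown: str, links: dict) -> str:
    """Append a plain list of links to the end of a markdown string.

    Assumption: `links` looks like
    {"internal": [{"href": ..., "text": ...}], "external": [...]}.
    """
    lines = [markdown.rstrip(), "", "Links:"]
    seen = set()
    for group in ("internal", "external"):
        for link in links.get(group, []):
            href = link.get("href")
            if not href or href in seen:
                continue  # skip empty and duplicate URLs
            seen.add(href)
            # Fall back to the URL itself when the anchor text is empty.
            text = (link.get("text") or "").strip() or href
            lines.append(f"- [{text}]({href}) ({group})")
    return "\n".join(lines)
```

With a crawl result in hand, this would be called as `append_links(result.markdown, result.links)` before sending the text to the LLM.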

ManojBhamsagar-Draup · 2024-10-17T13:30:59Z

@unclecode Sure, thanks ✌️
That would be great, but it would be even more helpful if the links were embedded in the markdown itself. Something like this: https://r.jina.ai/https://jobs.apple.com/en-us/search?page=1

unclecode · 2024-10-17T14:31:56Z

@ManojBhamsagar-Draup That’s right! In the new version, when you set the flag to true, the links will be wrapped within the markdown, exactly like the example we shared 😎

ManojBhamsagar-Draup · 2024-10-17T15:38:29Z

@unclecode that's great

unclecode · 2024-10-20T11:31:34Z

@ManojBhamsagar-Draup We've already added more granular control over how internal and external links are handled, for images as well as anchor tags. The following code demonstrates a better way of using this feature. I also wrote a lengthy explanation in issue #184, which I hope will be helpful.

import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Assumption: the original snippet does not define __data; point it at an existing output directory.
__data = "."

async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=["form"],  # Optional, default is None; adds more control over content extraction for markdown
            exclude_external_links=False,  # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,  # Default is False
            # social_media_domains=["facebook.com", "twitter.com", "instagram.com", ...]  # Add more domains here; the default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w", encoding="utf-8") as f:
            f.write(result.markdown)

    print("Done")

asyncio.run(main())
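To make the internal/external distinction concrete, here is a rough sketch of one way such a filter could work. It is not Crawl4AI's implementation; `is_external` and `filter_links` are hypothetical helpers that use only the standard library and compare hostnames exactly, so a subdomain such as `www.` would count as external here.

```python
from urllib.parse import urljoin, urlparse

def is_external(base_url: str, href: str) -> bool:
    """Return True when href resolves to a different host than base_url."""
    base_host = urlparse(base_url).netloc.lower()
    # Resolve relative links against the page URL before comparing hosts.
    link_host = urlparse(urljoin(base_url, href)).netloc.lower()
    return link_host != base_host

def filter_links(base_url: str, hrefs: list, exclude_external: bool = True) -> list:
    """Keep every link, or only same-host links when exclude_external is True."""
    return [
        href for href in hrefs
        if not (exclude_external and is_external(base_url, href))
    ]
```

A social-media filter could follow the same pattern by checking the resolved host against a list of domains instead of against the page's own host.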

unclecode self-assigned this Oct 17, 2024

unclecode added the question Further information is requested label Oct 17, 2024

unclecode added the enhancement New feature or request label Oct 17, 2024

chanmathew mentioned this issue Oct 20, 2024

Returned markdown missing formatting and links #184

Open

unclecode closed this as completed Oct 20, 2024
