Returned markdown missing formatting and links #184

Open
chanmathew opened this issue Oct 20, 2024 · 5 comments

Comments

@chanmathew

chanmathew commented Oct 20, 2024

Hi @unclecode - First of all thanks for your amazing work on this lib! It's exciting to have an open source crawler that works quite well!

I was testing switching over from a paid provider, however I'm noticing some differences in the markdown output, where it seems to retain less of the actual formatting, which affects our downstream processing (feeding to an LLM for extraction). Due to the stripped formatting, we're finding that the LLM tends to yield worse results because it's losing those subtle cues.

For example take this section about Guadalajara, Mexico from this page: https://janineintheworld.com/places-to-visit-in-central-mexico

Here's the extracted MD from other provider:
### 1. Guadalajara, Jalisco\nGuadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere.\n![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)*Guadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city.*\nDive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like*tortas ahogadas*or*tacos de birria\*!\nNature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!\nFor a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here.\nGuadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The[**Plaza de los Mariachis**](https://www.tripadvisor.es/Attraction_Review-g150798-d155222-Reviews-Plaza_de_los_Mariachis-Guadalajara_Guadalajara_Metropolitan_Area.html)and[**Restaurante El Patio**](https://www.elpatio.com.mx/)(in Tlaquepaque) are great places to start!

Here's what's extracted from crawl4ai:

### 1\\. Guadalajara, Jalisco\n\nGuadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere. \n\n![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)\n\nGuadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city.\n\nDive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like tortas ahogadas or tacos de birria! \n\nNature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!\n\nFor a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here. \n\nGuadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The Plaza de los Mariachis and Restaurante El Patio (in Tlaquepaque) are great places to start!

You can see in the crawl4ai version, there are some differences / issues:

  1. The numbering always seems to be escaped with \ when it doesn't need to be
  2. The caption "Guadalajara’s Catedral de la Asunción ..." is missing the * asterisks around it
  3. Bold and italic formatting is stripped (see tortas ahogadas, or [Plaza de los Mariachis])
  4. Links are missing from the markdown; see the tripadvisor link and the elpatio link (there seems to be another issue open for this already: "Include links in the markdown" #173). It is important to keep the exact position of the links as found in the original content, because we rely on their placement to infer relationships in downstream processing

Would it be possible to address the above so that the markdown returned is as true to the actual HTML / original content as possible? Hope that's helpful, thank you!
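
To make the four differences above easy to check on other pages, here is a minimal sketch that counts the formatting cues in a markdown string. The helper name `markdown_cues` and its regular expressions are assumptions made for illustration; they are not part of crawl4ai and only approximate Markdown syntax.

```python
import re

# Hypothetical helper (not part of crawl4ai): counts the cues discussed above.
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")   # [text](url), images excluded
URL_PART_RE = re.compile(r"\]\([^)\s]+\)")                  # the "(url)" part of links/images
BOLD_RE = re.compile(r"\*\*[^*\n]+\*\*")                    # **bold**
ITALIC_RE = re.compile(r"(?<![\w*_])([*_])(?!\s)[^*_\n]+(?<!\s)\1(?![\w*_])")  # *it* or _it_
ESCAPED_NUM_RE = re.compile(r"\d+\\\.")                     # "1\." style escaped numbering


def markdown_cues(md: str) -> dict:
    """Return rough counts of links, bold spans, italic spans and escaped numbers."""
    links = len(LINK_RE.findall(md))
    # Drop URLs first so underscores or asterisks inside them are not counted as emphasis.
    text = URL_PART_RE.sub("]", md)
    bold = len(BOLD_RE.findall(text))
    # Remove bold spans before looking for single-marker italics.
    text = BOLD_RE.sub("", text)
    italic = len(ITALIC_RE.findall(text))
    return {
        "links": links,
        "bold": bold,
        "italic": italic,
        "escaped_numbers": len(ESCAPED_NUM_RE.findall(md)),
    }


if __name__ == "__main__":
    with_cues = "### 1. Title\nTry _tortas ahogadas_ at [**El Patio**](https://www.elpatio.com.mx/)."
    stripped = "### 1\\. Title\nTry tortas ahogadas at El Patio."
    print(markdown_cues(with_cues))
    print(markdown_cues(stripped))
```

Comparing the two dictionaries for the same page section gives a quick signal of which cues were lost, without reading the full output by eye.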

@unclecode
Owner

unclecode commented Oct 20, 2024

Thank you so much for using our library. I have to tell you that the issue you raised with me is one of the best issues I've received in the last few days. It took a couple of hours, but all good now.

The numbering is corrected, and formatting characters are no longer stripped. Also, I noticed that on the actual website the caption 'Guadalajara’s Catedral de la Asunción' is italic. Consequently, it shouldn't be wrapped in asterisks but in underscores, which is what Crawl4ai now does.

I expect to release an updated version by tonight. You can install it from PyPI, and then you will have access to all of these changes.

One very important point about links in markdown: the initial approach was to remove anything that does not contribute to the textual content of the page, such as links, unless the output is going to be used in an agent-based environment where the agent needs to refer back to the sources behind those links. The same goes for images and iframes. So we added several flags to make this controllable. In the following code, you can see how I control these attributes: one flag excludes (or keeps) all external links, and another does the same for external images. A third excludes only popular social media links, or any set of domains that contribute nothing to what you will do with the data afterwards. In any case, the extracted HTML is always available, and we also extract all internal and external links separately, which is additional information that can be helpful.
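
To illustrate how link-exclusion flags of this kind can interact, here is a minimal standard-library sketch. It is not crawl4ai's implementation: the function names `classify_link` and `keep_link` are hypothetical, and the domain set only repeats the three examples named in this thread (the library's real defaults are said to live in its config.py).

```python
from urllib.parse import urljoin, urlparse

# Assumed example set; not the library's actual default list.
SOCIAL_MEDIA_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify_link(page_url: str, link_url: str) -> str:
    """Label a link as 'internal', 'social' or 'external' relative to the page."""
    # Resolve relative links such as "/about" against the page URL first.
    link_host = _host(urljoin(page_url, link_url))
    page_host = _host(page_url)
    if any(link_host == d or link_host.endswith("." + d) for d in SOCIAL_MEDIA_DOMAINS):
        return "social"
    if link_host == page_host or link_host.endswith("." + page_host):
        return "internal"
    return "external"


def keep_link(page_url: str, link_url: str,
              exclude_external_links: bool = False,
              exclude_social_media_links: bool = True) -> bool:
    """Decide whether a link stays in the markdown under the two flags."""
    kind = classify_link(page_url, link_url)
    if kind == "social":
        # Social links are external too, so either flag removes them.
        return not (exclude_social_media_links or exclude_external_links)
    if kind == "external":
        return not exclude_external_links
    return True


if __name__ == "__main__":
    page = "https://janineintheworld.com/places-to-visit-in-central-mexico"
    for link in ("https://www.elpatio.com.mx/", "https://www.facebook.com/somepage", "/about"):
        print(link, classify_link(page, link), keep_link(page, link))
```

With the flag values used in this sketch, ordinary external links such as the elpatio one are kept in place while social media links are dropped.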

I should also say that, personally, my main motivation for starting this was to create something of high quality that is open source. I feel like I'm fighting against anyone who wants to charge people for this. If they provide cloud services and crawl at scale, that's fine. But the act of crawling itself should be free of charge and of high quality. So, I want to be like Robin Hood for crawlers :D.

By the way, I noticed that in another issue you shared an example of deploying with Modal. That is actually one of our areas of interest, and I am planning to create some content for it. I really like the issue you raised, and it's very helpful. If you're interested in helping us, you can join our Discord server and become one of the library's collaborators.

This is a sample of the output:

### 1. Guadalajara, Jalisco

Guadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere. 

![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)

 _Guadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city._

Dive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like _tortas ahogadas_ or _tacos de birria_! 

Nature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!

For a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here. 

Guadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The [**Plaza de los Mariachis**](https://www.tripadvisor.es/Attraction_Review-g150798-d155222-Reviews-Plaza_de_los_Mariachis-Guadalajara_Guadalajara_Metropolitan_Area.html) and [**Restaurante El Patio**](https://www.elpatio.com.mx/) (in Tlaquepaque) are great places to start!

This is how the code will look:

import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Output directory. The original snippet uses __data without showing its
# definition, so this path is an assumption.
__data = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")


async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],  # Optional - Default is None, this adds more control over the content extraction for markdown
            exclude_external_links=False,  # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,  # Default is False
            # social_media_domains = ["facebook.com", "twitter.com", "instagram.com", ...] Here you can add more domains, default supported domains are in config.py

            # Options passed through to the forked Html2Text converter
            html2text={
                "escape_dot": False,
                # Add more options here
            }
        )
        # Save markdown to file (UTF-8, since the page contains accented characters)
        with open(os.path.join(__data, "mexico_places.md"), "w", encoding="utf-8") as f:
            f.write(result.markdown)

    print("Done")


if __name__ == "__main__":
    asyncio.run(main())

Here are some of the Html2Text configuration options you can adjust. By the way, we have forked Html2Text and keep customizing it accordingly.

skip_internal_links = False
single_line_break = False
mark_code = False
include_sup_sub = False
body_width = 0
ignore_mailto_links = True
ignore_links = False
escape_backslash = False
escape_dot = False
escape_plus = False
escape_dash = False
escape_snob = False
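
For installs that do not yet have the escape_dot option, the escaped heading numbers could be undone in post-processing. The following is a minimal sketch of that stopgap; `unescape_heading_numbers` is a hypothetical helper, not a crawl4ai or Html2Text function, and it deliberately touches only headings, because in body text an escaped "1986\." at the start of a line may be what prevents the line from being parsed as a list item.

```python
import re

# Heading marker, a number, then one or more backslashes before the dot,
# e.g. "### 1\. Guadalajara" (or "### 1\\." in JSON-escaped output).
_ESCAPED_HEADING_NUMBER = re.compile(r"^(#{1,6}\s+\d+)\\+\.", re.MULTILINE)


def unescape_heading_numbers(md: str) -> str:
    """Turn '### 1\\. Title' into '### 1. Title'; leave everything else untouched."""
    return _ESCAPED_HEADING_NUMBER.sub(r"\1.", md)


if __name__ == "__main__":
    print(unescape_heading_numbers("### 1\\. Guadalajara, Jalisco"))
```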

Don't hesitate to ask for help if you need it. I would also be more than happy to see this library become part of your company's projects. And once you are in production, perhaps you could feature us and tell us more about your work, so we know how we can help.

@chanmathew
Author

@unclecode Wow thanks so much for your quick response, appreciate your kind words! The updates are super helpful! Would you mind confirming what version I should be looking out for? I just tried 0.3.71 but I don't think that has your changes yet. I will test and report back.

Also would love to join the Discord! I sent you a friend request on Discord as I didn't see a link posted on the site.

@unclecode
Owner

@chanmathew I am referring to the next version, 0.3.72 (there is already a branch for this version, which you can find). And you are most welcome on Discord; I will check the request soon.

@unclecode
Owner

unclecode commented Oct 21, 2024

@chanmathew I'd like to let you know that we've just added something I've been working on for a while: a heuristic step that produces a better-fitting version of the markdown, especially for pages without repetitive patterns, such as articles, blogs, and news. The early results are very nice, but it is still under test. It uses some heuristics to remove unnecessary parts of the page. For example, when you run it on a link, it extracts just the article and leaves out sidebar menus, headers, footers, and many other things. Please take a look, and perhaps when you join our Discord you can help us test it and challenge it on other URLs so we can improve it. This will be in version 0.3.72.

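import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Output directory; assumed here, since the original snippet does not define __data.
__data = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")


async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )
        # Save the heuristically trimmed markdown instead of result.markdown
        with open(os.path.join(__data, "mexico_places.md"), "w", encoding="utf-8") as f:
            f.write(result.fit_markdown)

    print("Done")


if __name__ == "__main__":
    asyncio.run(main())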

Everything else remains the same. You simply read the fit_markdown property of the crawl result. I have also attached the markdown here.

mexico_places.md

@chanmathew
Author

@unclecode That's amazing! I am reviewing the MD you attached, and the output looks very clean. However, I did notice that the fit_markdown seems to be missing some links within the actual content. Is that just because the exclude_external_links = False flag wasn't set?

For example in this section (crawl4ai version):

### 16. Huasteca Potosina, San Luis Potosí

If you love the outdoors, waterfalls, rafting, kayaking, and caving, La Huasteca Potosina is definitely going to be one of your favorite places in central Mexico! 

The Huasteca is a geographical region of the Huastec indigenous group. The Huastec covers parts of Tamaulipas and Northern Veracruz, but Huasteca Potosina refers to the area found in the state of San Luis Potosí.

This area is characterized by rivers, waterfalls, jungle, and caves, making way for incredible scenery and exciting adventures. The Huasteca is largely rural, but there are small towns throughout the region. 

to get an idea of what the area has to offer!

La Huasteca is best explored on a tour or with a rental car. Many travelers opt to base in Ciudad Valles and take day trips to points of interest from there. If you have time, you might bounce from town to town to see more of the region. 

**Ready to tour La Huasteca Potosina?** is led by an experienced local operator.

A number of links should be included here, and they are important to retain. Here's the original from other provider:
### 16. Huasteca Potosina, San Luis Potosí\nIf you love the outdoors, waterfalls, rafting, kayaking, and caving, La Huasteca Potosina is definitely going to be one of your favorite places in central Mexico!\nThe Huasteca is a geographical region of the Huastec indigenous group. The Huastec covers parts of Tamaulipas and Northern Veracruz, but Huasteca Potosina refers to the area found in the state of San Luis Potosí.\nThis area is characterized by rivers, waterfalls, jungle, and caves, making way for incredible scenery and exciting adventures. The Huasteca is largely rural, but there are small towns throughout the region.\n**[Check out this video by the Kinetic Kennons](https://www.youtube.com/watch?v=916AHUkSbSk&list=PL3NFOOuCkxFEmunupyLsi94EVpsWpODmd&t=619s)**to get an idea of what the area has to offer!\nLa Huasteca is best explored on a tour or with a rental car. Many travelers opt to base in Ciudad Valles and take day trips to points of interest from there. If you have time, you might bounce from town to town to see more of the region.\n**Ready to tour La Huasteca Potosina?**[**This three-day tour from Ciudad Valles**](https://www.getyourguide.com/ciudad-valles-l157201/ciudad-valles-3-day-nature-tour-in-huasteca-potosina-t480300/?partner_id=2CNBPIK&utm_medium=online_publisher&cmp=centralmexico)is led by an experienced local operator.

It looks like the anchor text, its formatting, and the link itself are all stripped:

First link:
**[Check out this video by the Kinetic Kennons](https://www.youtube.com/watch?v=916AHUkSbSk&list=PL3NFOOuCkxFEmunupyLsi94EVpsWpODmd&t=619s)**

Second link:
[**This three-day tour from Ciudad Valles**](https://www.getyourguide.com/ciudad-valles-l157201/ciudad-valles-3-day-nature-tour-in-huasteca-potosina-t480300/?partner_id=2CNBPIK&utm_medium=online_publisher&cmp=centralmexico)

Anyway I'm excited for the new release to test! There'll definitely be other URLs we can challenge it on, happy to share our learnings as we crawl more sites.
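
A comparison like the one above can be automated. The following is a rough sketch that lists the links present in a reference markdown but absent from another output; `extract_links` and `missing_links` are hypothetical helper names, the regular expression only covers simple inline links, and the example.com URL is a placeholder rather than a link from the page.

```python
import re

# Inline markdown links "[text](http...)"; image links "![alt](...)" are skipped.
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\((https?://[^)\s]+)\)")


def extract_links(md: str) -> list:
    """Return (anchor text, url) pairs, with emphasis markers stripped from the text."""
    return [(text.strip("*_ "), url) for text, url in LINK_RE.findall(md)]


def missing_links(reference_md: str, candidate_md: str) -> list:
    """Links found in reference_md whose URL never appears as a link in candidate_md."""
    present = {url for _, url in extract_links(candidate_md)}
    return [(text, url) for text, url in extract_links(reference_md) if url not in present]


if __name__ == "__main__":
    reference = ("**Ready to tour?** [**This three-day tour**](https://example.com/tour) "
                 "is led by a local operator.")
    candidate = "**Ready to tour?** is led by a local operator."
    for text, url in missing_links(reference, candidate):
        print(f"missing: {text} -> {url}")
```

Running this over each section of a page would show whether links are lost consistently or only around certain formatting, such as bold anchor text.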
