Returned markdown missing formatting and links #184
Comments
Thank you so much for using our library. I have to tell you that the issue you raised is one of the best I've received in the last few days. It took a couple of hours, but all is good now: numbering is corrected, and formatting characters are no longer stripped. I also noticed that the phrase 'Guadalajara’s Catedral de la Asunción' is italic on the actual website. Consequently, it shouldn't be wrapped in asterisks; it should be wrapped in underscores, and Crawl4ai now does that. I expect to release an updated version by tonight. You can install it from PyPI, and then you will have access to all of this.
There's one very important thing about links in markdown. The initial approach was to remove things that do not contribute to the textual content of the page, such as links, unless the output is going to be used in an agent-based environment, where the agent needs to refer back to the sources behind those links. The same goes for images and iframes. So, we added several flags to make this controllable. In the code below, you can see how I control these attributes: one flag excludes (or keeps) all external links, and another does the same for images. A third excludes only popular social media links, or any set of links that contribute nothing to what you will do with the data afterwards. In any case, the extracted HTML is always there, and we also extract all internal and external links separately. This is just additional information that can be helpful.
I also want you to know that, personally, my main motivation for starting this was to create something of high quality that's open source. So, I feel like I'm fighting against anyone who wants to charge people for this. If they provide cloud services and crawl at scale, that's fine. But the act of crawling should be free of charge and of high quality. So, I want to be like Robin Hood for crawlers :D.
By the way, I noticed that in another issue you shared an example of deploying using Modal. That's actually one of our areas of interest, and I am planning to create some content for it. I really like the issue you raised, and it's very helpful. If you're interested in helping us, you can join our Discord server and become one of the library's collaborators. This is a sample of the output:
### 1. Guadalajara, Jalisco
Guadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere.
![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)
_Guadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city._
Dive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like _tortas ahogadas_ or _tacos de birria_!
Nature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!
For a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here.
Guadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The [**Plaza de los Mariachis**](https://www.tripadvisor.es/Attraction_Review-g150798-d155222-Reviews-Plaza_de_los_Mariachis-Guadalajara_Guadalajara_Metropolitan_Area.html) and [**Restaurante El Patio**](https://www.elpatio.com.mx/) (in Tlaquepaque) are great places to start!
This is what the code will look like:
    import os
    from crawl4ai import AsyncWebCrawler

    # __data is the output directory, defined elsewhere in the script
    async def main():
        async with AsyncWebCrawler(headless = True, sleep_on_close = True) as crawler:
            url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
            result = await crawler.arun(
                url=url,
                # bypass_cache=True,
                word_count_threshold = 10,
                excluded_tags = ['form'], # Optional - Default is None, this adds more control over the content extraction for markdown
                exclude_external_links = False, # Default is True
                exclude_social_media_links = True, # Default is True
                exclude_external_images = True, # Default is False
                # social_media_domains = ["facebook.com", "twitter.com", "instagram.com", ...] Here you can add more domains, default supported domains are in config.py
                html2text = {
                    "escape_dot": False,
                    # Add more options here
                }
            )
            # Save markdown to file
            with open(os.path.join(__data, "mexico_places.md"), "w") as f:
                f.write(result.markdown)
            print("Done")
You can also adjust some of the Html2Text configuration. By the way, we have forked Html2Text and keep customizing it accordingly.
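To make the effect of the link flags above more concrete, here is a minimal, standalone sketch of that kind of filtering applied to a markdown string. It is not Crawl4AI's implementation: the function name `strip_links`, its parameters, and the `SOCIAL_DOMAINS` set are hypothetical and only illustrate the idea of dropping social media or external links while keeping the anchor text and its formatting.

```python
import re
from urllib.parse import urlparse

# Hypothetical list for illustration; not the library's actual defaults.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}

# Inline markdown links, skipping images (which start with "!").
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")


def domain(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def strip_links(markdown: str, base_url: str,
                drop_external: bool = False, drop_social: bool = True) -> str:
    """Replace unwanted links with their anchor text; keep all other links."""
    base = domain(base_url)

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        host = domain(url)
        is_social = any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)
        # Relative URLs have no host and are treated as internal.
        is_external = host != "" and host != base
        if (drop_social and is_social) or (drop_external and is_external):
            return text  # keep the anchor text (and its bold/italic markers)
        return match.group(0)

    return LINK_RE.sub(replace, markdown)
```

With `drop_external=False` and `drop_social=True`, an external link such as the TripAdvisor one in the sample output would be kept, while a link to a listed social media domain would be reduced to its anchor text.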
Don't hesitate to ask for help if you need it. Also, I would be more than happy to see this library become part of your company's projects. Moreover, when it goes to production, perhaps you could feature us and tell us more about your work, so we know how we can help.
@unclecode Wow, thanks so much for your quick response, and I appreciate your kind words! The updates are super helpful! Would you mind confirming which version I should be looking out for? I just tried 0.3.71, but I don't think that has your changes yet. I will test and report back. I would also love to join the Discord! I sent you a friend request on Discord, as I didn't see a link posted on the site.
@chanmathew I am referring to the next version, 0.3.72 (right now there is a branch for this version, which you can find). You are most welcome on Discord; I will check that soon.
@chanmathew I'd like to let you know that we've just added something I've been working on for a while: a few heuristics that produce a better-fitting version of the markdown, especially for pages without repetitive patterns, such as articles, blogs, and news. The early results are very nice, but it's still under test. What it does is use some heuristics to remove unnecessary parts of the page. For example, when you use it on a link like the one below, it extracts just the article and leaves out sidebar menus, headers, footers, and many more things. Please take a look at this, and perhaps when you join our Discord you can help us test it and challenge it on other URLs, and we'll improve it. This will be in version 0.3.72.
    import os
    from crawl4ai import AsyncWebCrawler

    # __data is the output directory, defined elsewhere in the script
    async def main():
        async with AsyncWebCrawler(verbose=True) as crawler:
            url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
            result = await crawler.arun(
                url=url,
                bypass_cache=True,
                word_count_threshold = 10,
            )
            # Save the heuristically trimmed markdown instead of result.markdown
            with open(os.path.join(__data, "mexico_places.md"), "w") as f:
                f.write(result.fit_markdown)
            print("Done")
Everything else remains the same. You simply read the fit_markdown property of the crawl result. I have also attached the markdown here.
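The comment above does not say which heuristics `fit_markdown` uses, so the following is only a sketch of one common approach, link density filtering, and should not be read as Crawl4AI's actual algorithm. The function names and the `min_words` and `max_link_density` thresholds are assumptions chosen for illustration. The idea is that navigation blocks tend to be short and made almost entirely of links, while article paragraphs are longer and mostly plain text.

```python
import re

# Inline markdown links: captures the anchor text only.
LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def link_density(block: str) -> float:
    """Fraction of a block's visible words that sit inside link anchors."""
    visible_words = LINK_RE.sub(r"\1", block).split()
    if not visible_words:
        return 1.0
    linked_words = sum(len(m.group(1).split()) for m in LINK_RE.finditer(block))
    return linked_words / len(visible_words)


def keep_content_blocks(markdown: str, min_words: int = 10,
                        max_link_density: float = 0.5) -> str:
    """Keep blocks that are long enough and not dominated by links."""
    kept = []
    for block in markdown.split("\n\n"):
        block = block.strip()
        visible_words = LINK_RE.sub(r"\1", block).split()
        if len(visible_words) >= min_words and link_density(block) <= max_link_density:
            kept.append(block)  # links inside kept blocks are left untouched
    return "\n\n".join(kept)
```

Note that a filter of this shape only drops whole blocks, so links inside the article body survive in whatever is kept.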
@unclecode That's amazing! I am reviewing the MD you attached, the output looks very clean, however I did notice that with the fit_markdown, it seems to be missing some links within the actual content, is that just because the For example in this section (crawl4ai version):
There should be a number of links included, which are important to retain. Here's the original from the other provider: It looks like the anchor text, its formatting, and the link itself are all stripped: First link: Second link: Anyway, I'm excited for the new release to test! There'll definitely be other URLs we can challenge it on, and I'm happy to share our learnings as we crawl more sites.
Hi @unclecode - First of all thanks for your amazing work on this lib! It's exciting to have an open source crawler that works quite well!
I was testing switching over from a paid provider, however I'm noticing some differences in the markdown output, where it seems to retain less of the actual formatting, which affects our downstream processing (feeding to an LLM for extraction). Due to the stripped formatting, we're finding that the LLM tends to yield worse results because it's losing those subtle cues.
For example take this section about Guadalajara, Mexico from this page: https://janineintheworld.com/places-to-visit-in-central-mexico
Here's the extracted MD from the other provider:
### 1. Guadalajara, Jalisco\nGuadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere.\n![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)*Guadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city.*\nDive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like*tortas ahogadas*or*tacos de birria\*!\nNature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!\nFor a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here.\nGuadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The[**Plaza de los Mariachis**](https://www.tripadvisor.es/Attraction_Review-g150798-d155222-Reviews-Plaza_de_los_Mariachis-Guadalajara_Guadalajara_Metropolitan_Area.html)and[**Restaurante El Patio**](https://www.elpatio.com.mx/)(in Tlaquepaque) are great places to start!
Here's what's extracted from crawl4ai:
### 1\\. Guadalajara, Jalisco\n\nGuadalajara is a vibrant city that beautifully blends tradition with modernity, making it a must-visit destination. As the second-largest city in Mexico and the epicenter of the country’s tech scene, it offers a dynamic and exciting atmosphere. \n\n![Guadalajara's Catedral de la Asunción de María Santísima, with a flock of pigeons taking flight in the foreground](https://www.janineintheworld.com/wp-content/uploads/2024/01/guadalajara_Catedral-de-la-Asuncion-de-Maria-Santisima.jpg)\n\nGuadalajara’s Catedral de la Asunción de María Santísima is an emblematic landmark of the city.\n\nDive into the cultural richness by visiting the Mercado San Juan de Dios, the largest market in Latin America. Even if you’re not looking to buy anything, it’s cool to wander through and take in the expanse of it all. The market is also a great place to try some traditional Jaliscense food, like tortas ahogadas or tacos de birria! \n\nNature lovers will adore the Bosque de Colomos, a vast nature park that provides a peaceful escape from the urban buzz. Wander through the Japanese garden, jog along the plentiful paths, or take a book and lay out on the grass for the afternoon!\n\nFor a shopping adventure, the colorful neighborhood of Tlaquepaque is the place to be, with its endless array of artisanías offering a true taste of Mexican craftsmanship. You can pick up all kinds of gorgeous pottery, textiles, and blown glass goods here. \n\nGuadalajara also happens to be the birthplace of Mariachi music and a hub for tequila production. There are lots of venues throughout the city to catch a performance. The Plaza de los Mariachis and Restaurante El Patio (in Tlaquepaque) are great places to start!
You can see in the crawl4ai version, there are some differences / issues:
Would it be possible to address the above so that the markdown returned is as true to the actual HTML / original content as possible? Hope that's helpful, thank you!
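One rough way to quantify this kind of difference between two markdown outputs is to count the formatting cues each one retains. The sketch below is an assumption-laden illustration, not part of crawl4ai or of either provider: `formatting_cues` and `lost_cues` are hypothetical helper names, and the regular expressions only approximate markdown syntax.

```python
import re


def formatting_cues(md: str) -> dict:
    """Count images, links, bold spans, and italic spans in a markdown string."""
    images = len(re.findall(r"!\[[^\]]*\]\([^)\s]+\)", md))
    links = len(re.findall(r"(?<!!)\[[^\]]*\]\([^)\s]+\)", md))
    # Drop link/image targets so underscores inside URLs are not counted.
    text = re.sub(r"\]\([^)\s]+\)", "]", md)
    bold = len(re.findall(r"\*\*[^*\n]+\*\*|__[^_\n]+__", text))
    # Remove bold markers before looking for single-marker emphasis.
    text = re.sub(r"\*\*|__", "", text)
    italic = len(re.findall(r"\*[^*\n]+\*|(?<!\w)_[^_\n]+_(?!\w)", text))
    return {"images": images, "links": links, "bold": bold, "italic": italic}


def lost_cues(reference: str, candidate: str) -> dict:
    """Return the cues that appear more often in reference than in candidate."""
    ref, cand = formatting_cues(reference), formatting_cues(candidate)
    return {key: ref[key] - cand[key] for key in ref if ref[key] > cand[key]}
```

Running `lost_cues` with the other provider's output as `reference` and the crawl4ai output as `candidate` would give a quick per-section summary of which cues went missing.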