
how can i extract text from the CrawlResult? #171

Open
deepak-hl opened this issue Oct 17, 2024 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@deepak-hl

deepak-hl commented Oct 17, 2024

import os

from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import SlidingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

crawler = WebCrawler()
crawler.warmup()

strategy = LLMExtractionStrategy(
    provider='openai',
    api_token=os.getenv('OPENAI_API_KEY')
)
loader = crawler.run(url=all_urls[0], extraction_strategy=strategy)
chunker = SlidingWindowChunking(window_size=2000, step=50)
texts = chunker.chunk(loader)  # this line raises the error below
print(texts)

I want the text from crawler.run in chunks so I can use it to store embeddings. How can I do that?
It's showing me the error: 'CrawlResult' object has no attribute 'split'
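For context, a minimal sketch of one way around this error, assuming the chunker expects a plain string: pass the text carried by the CrawlResult (for example its markdown attribute) to the chunker rather than the result object itself. The attribute choice and the chunker behaviour here are assumptions, not something confirmed by the maintainer.

```python
# Hedged workaround sketch: chunk the text of the result, not the CrawlResult object.
from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import SlidingWindowChunking

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(url=all_urls[0])          # all_urls as defined in the snippet above
text = result.markdown or result.cleaned_html  # assumption: the page text lives on these attributes

chunker = SlidingWindowChunking(window_size=2000, step=50)
chunks = chunker.chunk(text)                   # chunk() works on a plain string
print(len(chunks), "chunks ready for embedding")
```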

@deepak-hl
Author

@unclecode I am new to crawl4ai, please help me: I want the text from crawler.run in chunks so I can use it to store embeddings. How can I do that?

@unclecode
Owner

@deepak-hl Thanks for using Crawl4AI. I'll take a look at your code by tomorrow and will definitely update you soon 🤓

@deepak-hl
Author

@unclecode thank you!!

@deepak-hl
Author

deepak-hl commented Oct 18, 2024

@unclecode Can I crawl all the content from a site's sub-URLs by providing only its base URL in crawl4ai? If yes, how?

@unclecode unclecode self-assigned this Oct 18, 2024
@unclecode unclecode added the question Further information is requested label Oct 18, 2024
@unclecode
Owner

@deepak-hl Thank you for using Crawl4ai. Let me go through your questions one by one. First, you're using the old synchronous version, which I'm not going to support anymore because I've moved everything to the asynchronous version. Here is a code example showing how you can properly combine all of these pieces. In this example I'm building a knowledge graph from one of Paul Graham's essays.

import asyncio
import os
from typing import List

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Pydantic models describing the knowledge-graph schema the LLM should fill.
class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            chunking_strategy=OverlappingWindowChunking(window_size=2000, overlap=100),
            # magic=True
        )
        # print(result.markdown[:500])
        print(result.extracted_content)
        # __data__ is an output directory defined elsewhere in my script
        with open(os.path.join(__data__, "kb.json"), "w") as f:
            f.write(result.extracted_content)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())

Regarding your next question about passing one URL and getting all of its sub-URLs (scraping): the good news is that we are already working on it, and it is already under testing. Within a few weeks we will release the scraper alongside the crawler function. The scraper will handle a graph search: you give it a URL and define how many levels deep you want to go, or crawl all of them. Right now there is the function `arun_many([urls])`. After calling the crawl function, the response has a property `links` that contains all the internal and external links of the page. You can use a queue data structure: add all the internal links, crawl them, and keep adding any new internal links you discover. This is just a temporary way to do it, so please wait for our scraper to be ready.
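As an illustration only, here is a minimal sketch of that temporary queue-based approach, assuming `result.links` is a dict with an "internal" list whose entries carry an href; the function name `crawl_site` and the `max_pages` limit are hypothetical, not part of the library.

```python
import asyncio
from collections import deque
from urllib.parse import urljoin

from crawl4ai import AsyncWebCrawler

async def crawl_site(base_url: str, max_pages: int = 20):
    """BFS over internal links, as a stop-gap until the scraper is released."""
    seen = {base_url}
    queue = deque([base_url])
    results = []

    async with AsyncWebCrawler() as crawler:
        while queue and len(results) < max_pages:
            url = queue.popleft()
            result = await crawler.arun(url=url, bypass_cache=True)
            results.append(result)

            # Assumption: result.links is a dict whose "internal" list holds
            # link entries exposing an "href" value.
            for link in result.links.get("internal", []):
                href = link.get("href") if isinstance(link, dict) else link
                if not href:
                    continue
                absolute = urljoin(url, href)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    return results

# Example usage:
# asyncio.run(crawl_site("https://paulgraham.com"))
```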

I hope I answered your questions. Let me know if you have any other questions.
