feat(scrapers.update_from_text): new command #4520

grossir · 2024-10-01T16:10:06Z

Helps solve: freelawproject/juriscraper#858

- New command to re-run Site.extract_from_text over downloaded opinions
- Able to filter by Docket.court_id, OpinionCluster.date_filed, OpinionCluster.precedential_status
- Updates tasks.update_document_from_text to return information for logging purposes
- Updates test_opinion_scraper to get a Site.extract_from_text method

sentry-io · 2024-10-01T16:10:15Z

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/scrapers/tasks.py

Function	Unhandled Issue
`update_document_from_text`	IndexError: list index out of range cl.scrapers.t... `Event Count:` 2

cl/scrapers/management/commands/update_from_text.py

flooie · 2024-10-19T11:04:14Z

cl/scrapers/management/commands/update_from_text.py

+        stats = {"Docket": 0, "OpinionCluster": 0, "Opinion": 0, "Citation": 0}
+
+        if options["opinion_ids"]:
+            opinions = Opinion.objects.filter(id__in=options["opinion_ids"])
+            for op in opinions:
+                rerun_extract_from_text(op, juriscraper_module, stats)
+
+            logger.info("Modified objects counts: %s", stats)
+            return
+
+        if not (options["date_filed_gte"] and options["date_filed_lte"]):
+            raise ValueError(
+                "Both `date-filed-gte` and `date-filed-lte` arguments should have values"
+            )
+
+        court_id = juriscraper_module.split(".")[-1].split("_")[0]
+        gte_date = datetime.strptime(options["date_filed_gte"], "%Y/%m/%d")
+        lte_date = datetime.strptime(options["date_filed_lte"], "%Y/%m/%d")
+        query = {
+            "docket__court_id": court_id,
+            "date_filed__gte": gte_date,
+            "date_filed__lte": lte_date,
+        }
+
+        if options["cluster_status"]:
+            query["precedential_status"] = options["cluster_status"]
+
+        qs = OpinionCluster.objects.filter(**query).prefetch_related(
+            "sub_opinions"
+        )
+        for cluster in qs:
+            opinions = cluster.sub_opinions.all()
+            for op in opinions:
+                rerun_extract_from_text(op, juriscraper_module, stats)
+
+        logger.info("Modified objects counts: %s", stats)
+        self.stats = stats


Maybe it's just me, but I prefer a clean handle method that doesn't contain large portions of code and instead passes that work off. Do you think we could move this out of handle into its own method?

I think a Command's handle should have the code to manipulate the input arguments, otherwise it would be just boilerplate taking up space. I actually tried to abstract it out, and the handle would look like this, which I don't think is good:

def handle(self, *args, **options):
    super().handle(*args, **options)
    some_other_function(options)
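
One possible middle ground between the two positions, sketched here under assumptions: `handle` keeps the argument handling, while the filter construction moves into a small pure function that can be tested without a database. `build_cluster_query` is a hypothetical name and is not part of this PR; the option keys, the date format, and the court_id derivation are copied from the diff above.

```python
from datetime import datetime


def build_cluster_query(juriscraper_module: str, options: dict) -> dict:
    """Hypothetical helper: turn command options into OpinionCluster filters."""
    if not (options.get("date_filed_gte") and options.get("date_filed_lte")):
        raise ValueError(
            "Both `date-filed-gte` and `date-filed-lte` arguments should have values"
        )

    # Same derivation as the diff: last module segment, up to the first underscore
    court_id = juriscraper_module.split(".")[-1].split("_")[0]
    query = {
        "docket__court_id": court_id,
        "date_filed__gte": datetime.strptime(options["date_filed_gte"], "%Y/%m/%d"),
        "date_filed__lte": datetime.strptime(options["date_filed_lte"], "%Y/%m/%d"),
    }

    # The status filter is optional, as in the diff
    if options.get("cluster_status"):
        query["precedential_status"] = options["cluster_status"]

    return query
```

With a helper like this, `handle` would still parse and validate the options, and the resulting dict could be passed to `OpinionCluster.objects.filter(**query)` as in the diff.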

cl/scrapers/management/commands/update_from_text.py

flooie · 2024-10-19T12:52:10Z

cl/scrapers/management/commands/update_from_text.py

+        for cluster in qs:
+            opinions = cluster.sub_opinions.all()
+            for op in opinions:
+                rerun_extract_from_text(op, juriscraper_module, stats)


It's possible to have opinions that are merged with Harvard. If that's the case, this could bring back empty plain_text and html fields, which will crash.

See https://freelawproject.sentry.io/issues/6004622476/?project=5620212&query=is%3Aunresolved%20issue.priority%3A%5Bhigh%2C%20medium%5D&referrer=issue-stream&stream_index=0
for an example I crashed locally.
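
A minimal sketch of a guard for this case, assuming the text to re-extract comes from the `plain_text` or `html` fields mentioned above. `FakeOpinion` and `get_opinion_text` are hypothetical stand-ins used only for illustration; the fix actually applied in the PR may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeOpinion:
    """Hypothetical stand-in for the Opinion model; only the two fields named above."""

    plain_text: str = ""
    html: str = ""


def get_opinion_text(op: FakeOpinion) -> Optional[str]:
    """Return the first non-empty text field, or None when both are empty."""
    for text in (op.plain_text, op.html):
        if text and text.strip():
            return text
    return None


opinions = [FakeOpinion(plain_text="Some opinion text"), FakeOpinion()]

# Skip opinions with nothing to extract from instead of passing empty text along
to_process = [op for op in opinions if get_opinion_text(op) is not None]
```

Filtering before the call keeps the empty-text case out of `rerun_extract_from_text` entirely; an alternative would be an early return inside that function.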

- validate citation objects from `Site.extract_from_text`. Add tests for this
- abstract --courts required argument for scrapers into ScraperCommand class; also made it more flexible
- refactor cl_scrape_opinions; cl_scrape_oral_arguments to account for this
- delete cl.scrapers.utils.extract_recap_documents which was generating a circular import. This function was not used anywhere

Merge branch 'main' into scrapers_update_from_text_command

f516f13

grossir requested a review from flooie October 1, 2024 18:06

Merge branch 'main' into scrapers_update_from_text_command

79c8c0a

flooie reviewed Oct 3, 2024

cl/scrapers/management/commands/update_from_text.py (outdated)

grossir and others added 3 commits October 18, 2024 14:00

Merge branch 'main' into scrapers_update_from_text_command

143e6de

refactor(scrapers.update_from_text): change function name and docstring

5adce99

Merge branch 'main' into scrapers_update_from_text_command

3bc0f8e

flooie reviewed Oct 18, 2024

cl/scrapers/management/commands/update_from_text.py (outdated)