Commit

fix(coloctapp): update cleanup_content
Solves #1198

Make cleanup_content check if elements exist before trying to delete them, preventing an IndexError
grossir committed Oct 16, 2024
1 parent 1ee3bc7 commit 4f767c5
Showing 1 changed file with 8 additions and 3 deletions.
11 changes: 8 additions & 3 deletions juriscraper/opinions/united_states/state/coloctapp.py
@@ -20,7 +20,11 @@ class Site(colo.Site):
 
     @staticmethod
     def cleanup_content(content: str) -> str:
-        """Returned HTML needs 2 modifications:
+        """Returned HTML may need editing for proper ingestion
+        The HTML seems to change constantly, so some of these
+        steps may be outdated (Check juriscraper#1198 and courtlistener#4443)
         - delete style and img tags which hold tokens
         that make the hash change everytime
@@ -33,8 +37,9 @@ def cleanup_content(content: str) -> str:
         tree = html.fromstring(content)
         remove_xpaths = ["//style", "//img"]
         for xpath in remove_xpaths:
-            to_remove = tree.xpath(xpath)[0]
-            to_remove.getparent().remove(to_remove)
+            if tree.xpath(xpath):
+                to_remove = tree.xpath(xpath)[0]
+                to_remove.getparent().remove(to_remove)
 
         for tag in tree.xpath("//*[@class]"):
             tag.attrib.pop("class")
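
The guard added in this commit can be illustrated with a small standalone sketch. It is not the juriscraper implementation: it swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra dependencies, it assumes well-formed markup, and the sample strings and the `parent_of` helper map are hypothetical. Like the committed code, it removes only the first `style` and `img` match.

```python
import hashlib
import xml.etree.ElementTree as ET


def cleanup_content(content: str) -> str:
    """Drop the first <style> and <img>, if present, and all class attributes."""
    root = ET.fromstring(content)
    # ElementTree has no getparent(), so build a child -> parent map (illustrative helper)
    parent_of = {child: parent for parent in root.iter() for child in parent}

    for tag in ("style", "img"):
        matches = root.findall(f".//{tag}")
        # Guard: without this check, matches[0] raises IndexError on an empty list
        if matches:
            parent_of[matches[0]].remove(matches[0])

    for element in root.iter():
        # pop with a default never raises, even when "class" is absent
        element.attrib.pop("class", None)

    return ET.tostring(root, encoding="unicode")


# Hypothetical inputs: same opinion text, different per-request tokens
page_a = '<div><style>.a1{}</style><img src="t=1"/><p class="x">Opinion</p></div>'
page_b = '<div><style>.b2{}</style><img src="t=2"/><p class="y">Opinion</p></div>'
page_without_tags = "<div><p>Opinion</p></div>"

digest_a = hashlib.sha1(cleanup_content(page_a).encode()).hexdigest()
digest_b = hashlib.sha1(cleanup_content(page_b).encode()).hexdigest()
cleaned_plain = cleanup_content(page_without_tags)  # no IndexError here
```

With the volatile elements and class attributes stripped, both pages should hash to the same value, and a page that lacks `style` or `img` passes through unchanged instead of failing.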