fix(coloctapp): update cleanup_content #1216
Solves: #1215
We are getting duplicate hashes again because some documents contain multiple hash-altering elements. This PR generalizes cleanup_content to handle cases with more than one such element.
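To illustrate the problem described above, here is a minimal sketch. It assumes the hash is a SHA-1 digest of the downloaded content and that the volatile markup differs between downloads of the same document; the `content_hash` helper and both sample documents are hypothetical, not taken from the scraper.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Hash the raw content. SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(content).hexdigest()


# Two hypothetical downloads of the same opinion. The visible text is
# identical; only the class and data-data attributes differ.
first = b'<div><p class="c1" data-data=\'{"ctm": 1}\'>Opinion text</p></div>'
second = b'<div><p class="c9" data-data=\'{"ctm": 2}\'>Opinion text</p></div>'

# Without cleanup, the attribute noise alone is enough to change the hash.
hashes_differ = content_hash(first) != content_hash(second)
```

Removing a single kind of element is not enough here, since either attribute on its own would still change the digest.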
55bceb2 to aa5f1bd
for tag in tree.xpath("//*[@class]"):
    tag.attrib.pop("class")
remove_tags = ["//style", "//img"]
Instead of doing this, perhaps we should modify
from juriscraper.lib.html_utils import strip_bad_html_tags_insecure
or some similar function so that we can pass in the extra tags we want to remove using nh3?
remove_tags = ["//style", "//img"]
remove_attributes = [
    "//*[@class]",
    # contains json like data with "ctm" key
    "//*[@data-data]",
    # contains coordinate like data
    "//*[@data-dest-detail]",
]
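The two lists in this diff suggest a cleanup pass that drops every listed tag and every listed attribute before the content is hashed. The sketch below is one way that could look, not the PR's implementation: it swaps lxml for the standard library's `xml.etree.ElementTree` to stay dependency-free, so it only handles well-formed markup and matches by plain tag and attribute names rather than XPath strings. The constant names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Plain names standing in for the XPath expressions in the diff.
REMOVE_TAGS = ["style", "img"]
REMOVE_ATTRIBUTES = ["class", "data-data", "data-dest-detail"]


def cleanup_content(content: bytes) -> bytes:
    """Drop volatile tags and attributes from well-formed markup."""
    root = ET.fromstring(content)
    # ElementTree has no getparent(), so walk each parent and remove
    # matching children. Removing an element also drops its tail text.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in REMOVE_TAGS:
                parent.remove(child)
    # Strip every listed attribute wherever it appears.
    for element in root.iter():
        for name in REMOVE_ATTRIBUTES:
            element.attrib.pop(name, None)
    return ET.tostring(root)


# Hypothetical documents that differ only in the volatile markup.
first = (
    b'<div><style>.c1{}</style>'
    b'<p class="c1" data-data=\'{"ctm": 1}\'>Opinion text</p>'
    b'<img src="a.png"/></div>'
)
second = (
    b'<div><style>.c9{}</style>'
    b'<p class="c9" data-data=\'{"ctm": 2}\' data-dest-detail="[1, 2]">'
    b'Opinion text</p></div>'
)

same_after_cleanup = cleanup_content(first) == cleanup_content(second)
```

Keeping the tags and attributes in lists means a newly discovered hash-altering element can be handled by adding one entry instead of another loop.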
Is it possible we might want to leave some of these classes in, so we can use them to improve how the results display on CL?
I'm starting to think we should look at the few HTML scrapers and tailor some CSS for them.