Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode the HTML before loading static assets #60

Merged
merged 1 commit into from
Dec 14, 2023

Conversation

alecsmrekar
Copy link
Contributor

With this PR, goose-eggs will decode the HTML before trying to detect and load static assets from it.

Right now, it just supports the & -> & decoding.

Without this PR, goose might making requests with encoded ampersand symbols, which could cause servers to return a 4xx response. At least that's what happened in one of my loadtests; I was testing a Drupal 10.1 site that aggregates assets which then get included into the page with many query parameters.

You can apply the following diff if you wish to see the new failing test:

diff --git a/src/lib.rs b/src/lib.rs
index de87860..1053b7e 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -985,7 +985,7 @@ fn valid_local_uri(user: &mut GooseUser, uri: &str) -> bool {
 
 /// Decodes the HTML. Currently it just decodes the encoded ampersand character.
 fn decode_html(html: &str) -> String {
-    html.replace("&", "&")
+    html.to_string()
 }
 
 /// Extract all local static elements defined with a `src=` tag from the the provided html.

Copy link
Member

@jeremyandrews jeremyandrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if you see a big downside to doing a full decode, something like this:

diff --git a/Cargo.toml b/Cargo.toml
index dd779cf..7f371c8 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -13,6 +13,7 @@ license = "Apache-2.0"
 
 [dependencies]
 goose = { version = "0.17", default-features = false }
+html-escape = "0.2"
 http = "0.2"
 log = "0.4"
 rand = "0.8"
diff --git a/src/lib.rs b/src/lib.rs
index de87860..5d4c2f6 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -983,11 +983,6 @@ fn valid_local_uri(user: &mut GooseUser, uri: &str) -> bool {
     }
 }
 
-/// Decodes the HTML. Currently it just decodes the encoded ampersand character.
-fn decode_html(html: &str) -> String {
-    html.replace("&", "&")
-}
-
 /// Extract all local static elements defined with a `src=` tag from the the provided html.
 ///
 /// While you can invoke this function directly, it's generally preferred to invoke
@@ -997,10 +992,9 @@ pub async fn get_src_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
     // <foo> is the URL to local image and js assets.
     // @TODO: parse HTML5 srcset= also
 
-    let html = decode_html(html);
     let src_elements = Regex::new(r#"(?i)src="(.*?)""#).unwrap();
     let mut elements: Vec<String> = Vec::new();
-    for url in src_elements.captures_iter(html.as_str()) {
+    for url in src_elements.captures_iter(html_escape::decode_html_entities(html).as_ref()) {
         if valid_local_uri(user, &url[1]) {
             elements.push(url[1].to_string());
         }
@@ -1015,10 +1009,9 @@ pub async fn get_src_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
 pub async fn get_css_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
     // Use a case-insensitive regular expression to find all href=<foo> in the html, where
     // <foo> is the URL to local css assets.
-    let html = decode_html(html);
     let css = Regex::new(r#"(?i)href="(.*?\.css.*?)""#).unwrap();
     let mut elements: Vec<String> = Vec::new();
-    for url in css.captures_iter(html.as_str()) {
+    for url in css.captures_iter(html_escape::decode_html_entities(html).as_ref()) {
         if valid_local_uri(user, &url[1]) {
             elements.push(url[1].to_string());
         }

Either way, the test needs to be moved into its own file.

@@ -212,3 +215,48 @@ async fn test_invalid_header_value() {
}
assert!(goose_metrics.errors.len() == 1);
}

#[tokio::test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a in a new file. tests/validate is intended to provide test coverage for the Validate struct. (Likely a comment should be added to the top of tests/validate to make this more clear, as I see that's non-obvious.)

src/lib.rs Outdated
@@ -983,6 +983,11 @@ fn valid_local_uri(user: &mut GooseUser, uri: &str) -> bool {
}
}

/// Decodes the HTML. Currently it just decodes the encoded ampersand character.
fn decode_html(html: &str) -> String {
html.replace("&amp;", "&")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to add a dependency here and to offload this logic, and to be sure we properly decode everything. What about instead using this?
https://docs.rs/html-escape

@jeremyandrews
Copy link
Member

Fantastic, thanks for the improvement, and quickly reworking based on feedback.

@jeremyandrews jeremyandrews merged commit a749ad4 into tag1consulting:main Dec 14, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants