Fix open graph urls for RECAP pdfs #3094

Merged
merged 7 commits into from
Sep 7, 2023

nathreed
Contributor

@nathreed nathreed commented Sep 4, 2023

RECAP PDFs are a special case: we redirect Open Graph bots to /recap/og-lookup/?file_path=/recap/the/path so that they can get their Open Graph content. Previously, this page used the og:url from the base.html template, which doesn't include the query parameters. As a result, clients that use the og:url property as the click target of an Open Graph card (e.g. Mastodon) sent users to a URL that 404ed when they clicked the preview card, even though the link in the post that generated the card worked. That made for a confusing experience.

The og:url would always be courtlistener.com/recap/og-lookup/, which of course 404s because the query parameter it expects isn't present.

To fix this, override og:url in the recap_document.html template. Pass an og_file_path argument when we do the redirect and check for it in the template.

I decided against simply including the query parameters in the OG URL (either in the base.html template or in recap_document.html) so that the og:url reflects the "canonical" URL, not the url via the og-lookup redirect. Also because I'm not sure what all the possible query parameters might be and whether it would be harmful to include them in the URL or not.
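
A minimal sketch of the selection logic described above, written as plain Python rather than Django template code. `BASE_URL` and the helper name `og_url_for` are hypothetical and exist only for illustration; the actual change lives in the templates and may be structured differently.

```python
from typing import Optional

# Assumed site root, used only for this illustration.
BASE_URL = "https://www.courtlistener.com"


def og_url_for(request_path: str, og_file_path: Optional[str] = None) -> str:
    """Pick the og:url value for a page (hypothetical helper).

    request_path is the path without query parameters, mirroring what
    the base.html default is described as using. og_file_path is only
    set when the page is served through the og-lookup redirect.
    """
    if og_file_path:
        # Served via og-lookup: point the card at the canonical file path.
        return f"{BASE_URL}{og_file_path}"
    # Normal case: fall back to the default based on the request path.
    return f"{BASE_URL}{request_path}"


# Without the override, the card points at the bare lookup URL.
print(og_url_for("/recap/og-lookup/"))
# With the override, it points at the canonical path instead.
print(og_url_for("/recap/og-lookup/", "/recap/the/path"))
```

The override is keyed on a separate argument rather than on the request's query string, which matches the stated goal of keeping og:url canonical without echoing arbitrary query parameters.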

Can't reverse for view attachments with og_file_path param.
Need to look more closely at getting correct og:url for attachments.
This way it will work when we redirect to the first attachment as well as the regular case.
@nathreed
Contributor Author

nathreed commented Sep 4, 2023

I tweaked the approach a little to handle the case where we redirect to the first attachment if the docket entry itself wasn't found, after a test failure illuminated the fact that I was handling that wrong before.

I was not able to conclusively test this in my local dev setup because I don't have any real RECAP documents, nor permissions on the API to clone them using the clone_from_cl script. However, it appears that filepath_local is the correct property to be using here, as that's what the OG redirect looks documents up by. Let me know if that's wrong and we need to take a different approach here.

@mlissner
Member

mlissner commented Sep 5, 2023

Well, this is a confusing mess. When you post a PDF link on CL to Mastodon:

  1. The masto crawler tries to get the PDF.
  2. It's detected as a crawler.
  3. It gets redirected to www.courtlistener.com/recap/og-lookup/?some-filepath
  4. That loads the HTML for the document page.

Right now, that HTML includes og:url content for /og-lookup/, and Mastodon, unlike Twitter and all other clients, returns that to users.

So users click the link in Mastodon, and ploof, it fails with a 404. Great.

This fix makes it so that the og:url value for the document pages is the www.cl.com/recap/gov.us.courts.xxx URL, but that's not right either, because now it's a PDF link even for the non-PDF pages (and of course, it would require users to deal with the www to storage.cl.com redirect, though that's a small offense).


It seems like the view for the regular document pages needs to know whether it's responding to an og bot or not. If so, return the og:url value for storage.cl.com/recap/xxx, and if not, return the regular og:url value.

Or am I missing something? This is stupidly complex!

@nathreed
Contributor Author

nathreed commented Sep 5, 2023

> Or am I missing something? This is stupidly complex!

You have the same understanding of the flow as I do after debugging it. I agree that Mastodon's behavior seems a bit silly here...it makes no sense that the URL for the OG card should be different than the URL in the post that created that same card.

> but that's not right either, because now it's a PDF link even for the non-PDF pages (and of course, it would require users to deal with the www to storage.cl.com redirect).

Hmm, yes. That would happen here. I did not notice that.

> It seems like the view for the regular document pages needs to know whether it's responding to an og bot or not. If so, return the og:url value for storage.cl.com/recap/xxx, and if not, return the regular og:url value.

This is more along the lines of what I wrote originally in 09ba479: I had added an argument to view_recap_document that was only populated by the redirect_og_lookup function, which I was using to determine whether we are responding to an og bot or not. I couldn't find any code in this repo that sends visitors to the /recap/og-lookup based on their UA, so I assumed that's on the AWS side. This way, the argument for view_recap_document was None in the normal case but populated if invoked via redirect_og_lookup (and by my assumption, bots get redirected there by code on the AWS side).

The problem I had there is that there's an edge case (main document may have been converted to an attachment) where view_recap_document uses reverse() to construct a URL and redirect to that: https://github.com/freelawproject/courtlistener/blob/main/cl/opinion_page/views.py#L475-L482

In that one case, it's hard to pass the "are we responding to an OG bot" value through without adding it as part of the URL path or something. I suppose we could add a query parameter to the URL that gets constructed there after the reverse() and pass it through that way. Thoughts on that approach?

@mlissner
Member

mlissner commented Sep 6, 2023

> I couldn't find any code in this repo that sends visitors to the /recap/og-lookup based on their UA, so I assumed that's on the AWS side

Yes, good assumption. It's in an S3 lambda.

> there is that there's an edge case where view_recap_document uses reverse() to construct a URL and redirect to that

That won't happen to a PDF though. This happens because appellate dockets start out as docket entry 1, then you learn that actually there's no docket entry 1, only 1-1, and 1-2, etc, so you convert doc 1 to doc 1-1. All of this junk happens before you get the PDF, so if somebody is sharing a PDF, it would have already been figured out.

I think that's right. Anyway, even if it's not, I think we can ignore this edge case because worst case, somebody shares a PDF, lambda redirects to the og-lookup, og-lookup says, "Oh shoot, this item was converted", and then redirects to the document HTML page. Mastodon users see a link to a PDF, and get the HTML. That's...fine. At least it's not a 404.

nathreed and others added 2 commits September 5, 2023 20:24
is_og_bot is false by default and only true when view_recap_document is invoked via redirect_og_lookup.
If it is false, None is sent to the template, which then falls back to the og:url value from the base.html template.
@nathreed
Contributor Author

nathreed commented Sep 6, 2023

> I think that's right. Anyway, even if it's not, I think we can ignore this edge case because worst case, somebody shares a PDF, lambda redirects to the og-lookup, og-lookup says, "Oh shoot, this item was converted", and then redirects to the document HTML page. Mastodon users see a link to a PDF, and get the HTML. That's...fine. At least it's not a 404.

Excellent! So it should be straightforward to just pass an is_og_bot value through from redirect_og_lookup and only send a non-None value to the template if we're serving an og bot. I've pushed a change to that effect.
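
A rough sketch of the pass-through described here, with Django stripped out so it runs on its own. The function names `view_recap_document` and `redirect_og_lookup` and the `is_og_bot`, `og_file_path`, and `filepath_local` names come from this thread; the signatures, the `RecapDoc` stand-in class, and the returned context dict are assumptions made for illustration and do not match the real views.

```python
from dataclasses import dataclass


@dataclass
class RecapDoc:
    """Hypothetical stand-in for the real RECAP document model."""
    filepath_local: str


def view_recap_document(rd: RecapDoc, is_og_bot: bool = False) -> dict:
    """Simplified stand-in: build the template context for a document page.

    og_file_path is None in the normal case, so the template keeps the
    default og:url from base.html. It is only populated for og bots.
    """
    og_file_path = rd.filepath_local if is_og_bot else None
    return {"rd": rd, "og_file_path": og_file_path}


def redirect_og_lookup(rd: RecapDoc) -> dict:
    """Simplified stand-in: the only caller that sets is_og_bot=True."""
    return view_recap_document(rd, is_og_bot=True)


doc = RecapDoc("recap/gov.uscourts.nysd.524067/gov.uscourts.nysd.524067.27.0.pdf")
print(view_recap_document(doc)["og_file_path"])  # normal visitor
print(redirect_og_lookup(doc)["og_file_path"])   # og bot via og-lookup
```

Defaulting `is_og_bot` to false keeps every existing caller on the old behavior, so only the og-lookup path changes.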

@cweider cweider self-requested a review September 6, 2023 07:06
@mlissner mlissner merged commit 6c01de9 into freelawproject:main Sep 7, 2023
5 checks passed
@mlissner
Member

mlissner commented Sep 7, 2023

This looks about right. Merging, thank you.

This will deploy in about twenty minutes (you can watch in the GitHub Actions tab). It'd be great if you could just check that it's working once it's deployed.

Thank you again!

@nathreed
Contributor Author

nathreed commented Sep 7, 2023

Checked after the python deploy finished.

It looks like I missed a / in the path because rd.filepath_local doesn't have one prepended:

curl "https://www.courtlistener.com/recap/og-lookup/?file_path=recap/gov.uscourts.nysd.524067/gov.uscourts.nysd.524067.27.0.pdf" | head -n 50 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38864    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8"/>
  <meta http-equiv="Content-Language" content="en"/>
  <meta name="language" content="en_us"/>
  <meta name="viewport" content="width=device-width,initial-scale=1"/>

  <meta name="description" content=" OPINION &amp; ORDER  re:  13  MOTION to Dismiss First Amended Complaint:  The Amended Complaint does not state a claim of misrepresentation regarding the flavoring of Wegmans Vanilla Ice Cream and is dismissed.  (Signed by Judge Louis L. Stanton on 7/14/2020)   (ml) Transmission to Orders and Judgments Clerk for processing.
"/>

  <link rel="search"
        type="application/opensearchdescription+xml"
        title="CourtListener"
        href="https://storage.courtlistener.com/static/xml/opensearch.d73a7f6aa26e.xml" />

  <meta name="application-name" content="CourtListener"/>
  <meta name="msapplication-tooltip" content="Create alerts, search for and browse the latest court opinions."/>
  <meta name="msapplication-starturl" content="https://www.courtlistener.com"/>
  <meta name="msapplication-navbutton-color" content="#6683B7"/>

  <meta name="twitter:card" content="summary_large_image">
  <meta name="twitter:creator" content="@freelawproject">
  <meta name="twitter:site" content="@courtlistener">

  <meta property="og:type" content="website"/>
  <meta property="og:title" content="Memorandum & Opinion &ndash; #27 in Steele v. Wegmans Food Markets, Inc. (S.D.N.Y., 1:19-cv-09227) – CourtListener.com"/>
  <meta property="og:description"
        content=" OPINION &amp; ORDER  re:  13  MOTION to Dismiss First Amended Complaint:  The Amended Complaint does not state a claim of misrepresentation regarding the flavoring of Wegmans Vanilla Ice Cream and is dismissed.  (Signed by Judge Louis L. Stanton on 7/14/2020)   (ml) Transmission to Orders and Judgments Clerk for processing.
">
  <meta property="og:url" content="https://www.courtlistener.comrecap/gov.uscourts.nysd.524067/gov.uscourts.nysd.524067.27.0.pdf"/>
  <meta property="og:site_name" content="CourtListener"/>
  <meta property="og:image"
        content="https://storage.courtlistener.com/recap-thumbnails/gov.uscourts.nysd.524067/139327795.thumb.1068.png"/>
  <meta property="og:image:type" content="image/jpeg"/>
  <meta property="twitter:image:alt"
        content="The first page of the document in the linked PDF"/>
  <meta property="og:image:width" content="826"/>
  <meta property="og:image:height" content="1068"/>

  <link rel="icon" href="https://storage.courtlistener.com/static/ico/favicon.1ae736eba120.ico">

<meta property="og:url" content="https://www.courtlistener.comrecap/gov.uscourts.nysd.524067/gov.uscourts.nysd.524067.27.0.pdf"/>

I'll have a PR in a second to fix that one.
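
For illustration, the missing-slash problem can be reproduced and avoided in a few lines of plain Python. This is only a sketch under assumptions: `BASE_URL` and both helper names are hypothetical, and the follow-up PR may fix it differently (for example, in the template).

```python
# Assumed site root, used only for this illustration.
BASE_URL = "https://www.courtlistener.com"


def naive_og_url(filepath_local: str) -> str:
    """Reproduces the bug: plain concatenation with no separator."""
    return BASE_URL + filepath_local


def og_url_from_filepath(filepath_local: str) -> str:
    """Join with exactly one slash, whether or not the path has one."""
    return f"{BASE_URL}/{filepath_local.lstrip('/')}"


path = "recap/gov.uscourts.nysd.524067/gov.uscourts.nysd.524067.27.0.pdf"
print(naive_og_url(path))          # host and path run together
print(og_url_from_filepath(path))  # slash inserted between host and path
```

Stripping any leading slash before joining means the helper gives the same result for `recap/...` and `/recap/...`, so it does not depend on how the stored path happens to be formatted.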

@nathreed
Contributor Author

nathreed commented Sep 7, 2023

@mlissner #3113 fixes the slash
