
Sitemap XML support #69

Open · wants to merge 2 commits into master

Conversation

@buren (Contributor) commented Aug 25, 2018

Overview

  • Supports index files
  • Supports gzipped files
  • Tries common Sitemap XML locations
  • With robots: true it will try to fetch sitemap locations from /robots.txt
  • Each found URL will be added to the agent queue
  • ⚠️ If the server returns text/html for the sitemap, no URLs will be found; we could be more "liberal" in this situation and allow it.

Public API

  • Adds sitemap: true/false option to Agent
  • Adds Agent#sitemap_urls and #initialize_sitemap
  • Adds the following methods to Page (tried to follow the same pattern used in page/html.rb; see the usage sketch after this list):
    • gzip?
    • each_sitemap_link
    • each_sitemap_url
    • sitemap_links
    • sitemap_urls
    • each_sitemap_index_link
    • each_sitemap_index_url
    • sitemap_index_links
    • sitemap_index_urls
    • sitemap_index?
    • sitemap_urlset?
    • sitemap_doc
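
A hypothetical usage sketch of the Page methods above; Agent#get_page is only an assumed way to obtain the fetched sitemap page and is not part of this PR:

  require 'spidr'

  agent = Spidr::Agent.new
  page  = agent.get_page(URI('https://example.com/sitemap.xml'))

  if page.sitemap_index?
    # a sitemap index file: yields the URLs of the nested sitemaps
    page.each_sitemap_index_url { |url| puts "nested sitemap: #{url}" }
  elsif page.sitemap_urlset?
    # a regular <urlset> sitemap: yields the page URLs it lists
    page.each_sitemap_url { |url| puts url }
  end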

Usage

Spidr.site(url, sitemap: true)

Common sitemap locations will be tried (/sitemap.xml, etc.).

Spidr.site(url, sitemap: true, robots: true)

will first try to fetch sitemap locations from /robots.txt; if nothing is found there, it falls back to the common sitemap locations.

Common sitemap locations that will be tried (highest priority first; a lookup sketch follows the list):

sitemap.xml
sitemap.xml.gz
sitemap.gz
sitemap_index.xml
sitemap-index.xml
sitemap_index.xml.gz
sitemap-index.xml.gz
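
A rough sketch of trying these locations in priority order. The constant and method names below are illustrative assumptions; get_sitemap_urls is the helper used elsewhere in this PR:

  # Candidate paths, highest priority first (mirrors the list above).
  COMMON_SITEMAP_LOCATIONS = %w[
    sitemap.xml
    sitemap.xml.gz
    sitemap.gz
    sitemap_index.xml
    sitemap-index.xml
    sitemap_index.xml.gz
    sitemap-index.xml.gz
  ].freeze

  # Returns the URLs from the first candidate that yields any, or [] if none do.
  def first_common_sitemap_urls(base_url)
    COMMON_SITEMAP_LOCATIONS.each do |path|
      urls = get_sitemap_urls(url: URI.join(base_url.to_s, path))
      return urls unless urls.empty?
    end

    []
  end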

robots.txt support / interface

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented.
  3. Add another option to indicate that you wish to infer sitemap from /robots.txt.

#19 (comment)

The current implementation follows option 2. It would be easy to implement the other variants if that's desirable (Example for 3.).

Or a more "fancy" interface:

Spidr.site(url, sitemap: :robots) # check /robots.txt

To support non-default locations that aren't listed in /robots.txt (the Sitemap protocol allows sitemaps to be "scoped" under a path), we could allow for this:

Spidr.site(url, sitemap: '/catalog/sitemap.xml')

Here is a diff for a commit that adds support for it.
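
For illustration, a minimal sketch of how a String value for sitemap: could be dispatched; the linked diff is the authoritative version, first_common_sitemap_urls refers to the illustrative helper sketched earlier, and get_sitemap_urls is the PR's helper:

  # Resolve the sitemap: option into an initial list of sitemap URLs.
  def resolve_sitemap_option(sitemap, base_url)
    case sitemap
    when true   then first_common_sitemap_urls(base_url)
    when String then get_sitemap_urls(url: URI.join(base_url.to_s, sitemap))
    else             []
    end
  end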


@postmodern (Owner) left a comment

Looks good. Just a suggestion of how/when we trigger the sitemap pulling/parsing code, a suggestion about making a dedicated Sitemap class instead of adding to Page, and my OCD Ruby code styling suggestions.

  if urls = @robots.other_values(base_url)['Sitemap']
    return urls.flat_map { |u| get_sitemap_urls(url: u) }
  end
end

This is a clever way of populating the queue using the Sitemap listed in the robots.txt file. Although, I feel like we should not request any URL before run has been called. A way around that would be to add every_robots_txt and every_sitemap_xml callback hooks, and use those to automatically parse and enqueue URLs when those files are encountered.

Something like:

  agent.every_robots_txt do |robots| # RobotsTXT class
    # check robots for a `Sitemap` entry
  end
  
  agent.every_sitemap_xml do |sitemap| # SitemapXML class
    sitemap.urls.each { |url| agent.enqueue(url) }
  end
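
A minimal sketch of how such a hook could be wired up inside the Agent by piggybacking on the existing every_page callback; the predicate names come from this PR, while the hook itself is only an assumption (the suggestion above envisions dedicated RobotsTXT/SitemapXML classes, which this sketch glosses over by passing the Page):

  module Spidr
    class Agent
      # Runs the given block for every sitemap document encountered.
      def every_sitemap_xml(&block)
        every_page do |page|
          block.call(page) if page.sitemap_urlset? || page.sitemap_index?
        end
      end
    end
  end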

Hmm, every_robots_txt might not work currently, since I forgot that the Robots library automatically fetches /robots.txt files and caches them when you query it. I could possibly add an every_host callback to detect when the Agent visits a new hostname, and then we can use that to eagerly request /robots.txt and /sitemap.xml. This would also allow the sitemap detection logic to fire on every new host/domain we spider, not just the first one. Thoughts?

I could add an every_host callback hook in the 0.7.0 branch, if you think it's a good idea.
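
A rough sketch of how that could look from the caller's side; every_host is the hypothetical hook described above, and enqueue is Spidr's existing Agent#enqueue:

  agent.every_host do |host|
    # eagerly request the well-known files once per newly visited host
    agent.enqueue(URI::HTTP.build(host: host, path: '/robots.txt'))
    agent.enqueue(URI::HTTP.build(host: host, path: '/sitemap.xml'))
  end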

require 'zlib'

module Spidr
  class Page

I feel like there should be a Sitemap class that inherits from Page. Only the sitemap.xml file should have sitemap methods. This would also allow for these methods to not contain sitemap in them, which seems kind of redundant if we're already parsing a sitemap.xml file.
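
A minimal sketch of that idea, assuming the XML parsing moves into the subclass and that Page#doc returns the parsed Nokogiri document (illustrative only, not this PR's code):

  module Spidr
    # A fetched sitemap.xml or sitemap index document.
    class Sitemap < Page
      # true if the root element is <sitemapindex>
      def index?
        !doc.nil? && !doc.at_xpath('/xmlns:sitemapindex').nil?
      end

      # Enumerates every <loc> URL listed in the document.
      def each_url
        return enum_for(__method__) unless block_given?
        return if doc.nil?

        doc.xpath('//xmlns:loc').each { |loc| yield URI(loc.text.strip) }
      end

      # All listed URLs as an Array.
      def urls
        each_url.to_a
      end
    end
  end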

end

before do
  stub_request(:any, /#{Regexp.escape(host)}/).to_rack(app)

This is redundant code. Already defined in spec/example_app.rb:22.

# Return all URLs defined in Sitemap.
#
# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
# of URLs defined in Sitemap.

Sentence fragment.

def each_sitemap_index_link
  return enum_for(__method__) unless block_given?

  each_extracted_sitemap_links('sitemap') { |url| yield(url) }

Probably can be simplified by passing in a &block argument.
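
For example, something along these lines, assuming each_extracted_sitemap_links accepts a block:

  def each_sitemap_index_link(&block)
    return enum_for(__method__) unless block_given?

    each_extracted_sitemap_links('sitemap', &block)
  end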

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# Return all Sitemap index links defined in sitemap.
#
# @return [Array<String>]
# of links defined in Sitemap.

Add extra #.

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
# The URLs found.
#
# @see https://www.sitemaps.org/protocol.html

Add extra #.

@postmodern

I like the sitemap: :robots feature. Although maybe it should be a separate option, like robots_sitemap: true?
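
i.e., roughly (a hypothetical option name taken from the suggestion above):

  Spidr.site(url, robots_sitemap: true) # only look for Sitemap entries in /robots.txt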

@postmodern

Regardless of my suggestions, this is good work and a good feature idea!
