
Sitemap XML support #69

Open · wants to merge 2 commits into master

Conversation

@buren (Contributor) commented Aug 25, 2018

Overview

  • Supports index files
  • Supports gzipped files
  • Tries common Sitemap XML locations
  • With robots: true it will try to fetch sitemap locations from /robots.txt
  • Each found URL will be added to the agent queue
  • ⚠️ If the server returns text/html for the sitemap, no URLs will be found; we could be more "liberal" in this situation and allow it.

Public API

  • Adds sitemap: true/false option to Agent
  • Adds Agent#sitemap_urls and #initialize_sitemap
  • Adds the following methods to Page (tried to follow the same pattern used in page/html.rb; see the usage sketch after this list):
    • gzip?
    • each_sitemap_link
    • each_sitemap_url
    • sitemap_links
    • sitemap_urls
    • each_sitemap_index_link
    • each_sitemap_index_url
    • sitemap_index_links
    • sitemap_index_urls
    • sitemap_index?
    • sitemap_urlset?
    • sitemap_doc
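
A hypothetical usage sketch of the Page methods above; Agent#get_page is only an assumed way to obtain the fetched sitemap page and is not part of this PR:

  require 'spidr'

  agent = Spidr::Agent.new
  page  = agent.get_page(URI('https://example.com/sitemap.xml'))

  if page.sitemap_index?
    # a sitemap index file: yields the URLs of the nested sitemaps
    page.each_sitemap_index_url { |url| puts "nested sitemap: #{url}" }
  elsif page.sitemap_urlset?
    # a regular <urlset> sitemap: yields the page URLs it lists
    page.each_sitemap_url { |url| puts url }
  end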

Usage

Spidr.site(url, sitemap: true)

Common sitemap locations will be tried (/sitemap.xml, etc.).

Spidr.site(url, sitemap: true, robots: true)

will first try to fetch sitemap locations from /robots.txt; if nothing is found there, it falls back to the common sitemap locations.

Common sitemap locations that will be tried (highest priority first; a lookup sketch follows the list):

sitemap.xml
sitemap.xml.gz
sitemap.gz
sitemap_index.xml
sitemap-index.xml
sitemap_index.xml.gz
sitemap-index.xml.gz
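
A rough sketch of trying these locations in priority order. The constant and method names below are illustrative assumptions; get_sitemap_urls is the helper used elsewhere in this PR:

  # Candidate paths, highest priority first (mirrors the list above).
  COMMON_SITEMAP_LOCATIONS = %w[
    sitemap.xml
    sitemap.xml.gz
    sitemap.gz
    sitemap_index.xml
    sitemap-index.xml
    sitemap_index.xml.gz
    sitemap-index.xml.gz
  ].freeze

  # Returns the URLs from the first candidate that yields any, or [] if none do.
  def first_common_sitemap_urls(base_url)
    COMMON_SITEMAP_LOCATIONS.each do |path|
      urls = get_sitemap_urls(url: URI.join(base_url.to_s, path))
      return urls unless urls.empty?
    end

    []
  end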

robots.txt support / interface

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented.
  3. Add another option to indicate that you wish to infer sitemap from /robots.txt.

#19 (comment)

The current implementation follows option 2. It would be easy to implement the other variants if that's desirable (Example for 3.).

Or a more "fancy" interface:

Spidr.site(url, sitemap: :robots) # check /robots.txt

To support non-default locations that aren't listed in /robots.txt (the Sitemap protocol allows sitemaps to be "scoped" under a path), we could allow for this:

Spidr.site(url, sitemap: '/catalog/sitemap.xml')

Here is a diff for a commit that adds support for it.
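
For illustration, a minimal sketch of how a String value for sitemap: could be dispatched; the linked diff is the authoritative version, first_common_sitemap_urls refers to the illustrative helper sketched earlier, and get_sitemap_urls is the PR's helper:

  # Resolve the sitemap: option into an initial list of sitemap URLs.
  def resolve_sitemap_option(sitemap, base_url)
    case sitemap
    when true   then first_common_sitemap_urls(base_url)
    when String then get_sitemap_urls(url: URI.join(base_url.to_s, sitemap))
    else             []
    end
  end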


@postmodern (Owner) left a comment

Looks good. Just a suggestion of how/when we trigger the sitemap pulling/parsing code, a suggestion about making a dedicated Sitemap class instead of adding to Page, and my OCD Ruby code styling suggestions.

  if urls = @robots.other_values(base_url)['Sitemap']
    return urls.flat_map { |u| get_sitemap_urls(url: u) }
  end
end

This is a clever way of populating the queue using the Sitemap listed in the robots.txt file. Although, I feel like we should not request any URL before run has been called. A way around that would be to add every_robots_txt and every_sitemap_xml callback hooks, and use those to automatically parse and enqueue URLs when those files are encountered.

Something like:

  agent.every_robots_txt do |robots| # RobotsTXT class
    # check robots for a `Sitemap` entry
  end
  
  agent.every_sitemap_xml do |sitemap| # SitemapXML class
    sitemap.urls.each { |url| agent.enqueue(url) }
  end
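
A minimal sketch of how such a hook could be wired up inside the Agent by piggybacking on the existing every_page callback; the predicate names come from this PR, while the hook itself is only an assumption (the suggestion above envisions dedicated RobotsTXT/SitemapXML classes, which this sketch glosses over by passing the Page):

  module Spidr
    class Agent
      # Runs the given block for every sitemap document encountered.
      def every_sitemap_xml(&block)
        every_page do |page|
          block.call(page) if page.sitemap_urlset? || page.sitemap_index?
        end
      end
    end
  end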

Hmm, every_robots_txt might not work currently, since I forgot that the Robots library automatically fetches /robots.txt files and caches them when you query it. I could possibly add an every_host callback to detect when the Agent visits a new hostname, and then we can use that to eagerly request /robots.txt and /sitemap.xml. This would also allow the sitemap detection logic to fire on every new host/domain we spider, not just the first one. Thoughts?

I could add an every_host callback hook in the 0.7.0 branch, if you think it's a good idea.
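
A rough sketch of how that could look from the caller's side; every_host is the hypothetical hook described above, and enqueue is Spidr's existing Agent#enqueue:

  agent.every_host do |host|
    # eagerly request the well-known files once per newly visited host
    agent.enqueue(URI::HTTP.build(host: host, path: '/robots.txt'))
    agent.enqueue(URI::HTTP.build(host: host, path: '/sitemap.xml'))
  end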

require 'zlib'

module Spidr
  class Page

I feel like there should be a Sitemap class that inherits from Page. Only the sitemap.xml file should have sitemap methods. This would also allow for these methods to not contain sitemap in them, which seems kind of redundant if we're already parsing a sitemap.xml file.
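
A minimal sketch of that idea, assuming the XML parsing moves into the subclass and that Page#doc returns the parsed Nokogiri document (illustrative only, not this PR's code):

  module Spidr
    # A fetched sitemap.xml or sitemap index document.
    class Sitemap < Page
      # true if the root element is <sitemapindex>
      def index?
        !doc.nil? && !doc.at_xpath('/xmlns:sitemapindex').nil?
      end

      # Enumerates every <loc> URL listed in the document.
      def each_url
        return enum_for(__method__) unless block_given?
        return if doc.nil?

        doc.xpath('//xmlns:loc').each { |loc| yield URI(loc.text.strip) }
      end

      # All listed URLs as an Array.
      def urls
        each_url.to_a
      end
    end
  end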

end

before do
  stub_request(:any, /#{Regexp.escape(host)}/).to_rack(app)

This is redundant code. Already defined in spec/example_app.rb:22.

# Return all URLs defined in Sitemap.
#
# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
# of URLs defined in Sitemap.

Sentence fragment.

def each_sitemap_index_link
  return enum_for(__method__) unless block_given?

  each_extracted_sitemap_links('sitemap') { |url| yield(url) }

Probably can be simplified by passing in a &block argument.
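
For example, something along these lines, assuming each_extracted_sitemap_links accepts a block:

  def each_sitemap_index_link(&block)
    return enum_for(__method__) unless block_given?

    each_extracted_sitemap_links('sitemap', &block)
  end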

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# Return all Sitemap index links defined in sitemap.
#
# @return [Array<String>]
# of links defined in Sitemap.

Add extra #.

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# A URL from the sitemap page.
#
# @return [Enumerator]
# If no block is given, an enumerator object will be returned.

Add extra #.

# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
# The URLs found.
#
# @see https://www.sitemaps.org/protocol.html

Add extra #.

@postmodern

I like the sitemap: :robots feature. Although maybe it should be a separate option, like robots_sitemap: true?
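
i.e., roughly (a hypothetical option name taken from the suggestion above):

  Spidr.site(url, robots_sitemap: true) # only look for Sitemap entries in /robots.txt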

@postmodern

Regardless of my suggestions, this is good work and a good feature idea!
