Sitemap XML support #69
base: master
Conversation
Looks good. Just a suggestion of how/when we trigger the sitemap pulling/parsing code, a suggestion about making a dedicated `Sitemap` class instead of adding to `Page`, and my OCD Ruby code styling suggestions.
```ruby
  if urls = @robots.other_values(base_url)['Sitemap']
    return urls.flat_map { |u| get_sitemap_urls(url: u) }
  end
end
```
This is a clever way of populating the queue using the `Sitemap` entries listed in the `robots.txt` file. However, I feel like we should not request any URL before `run` has been called. A way around that would be to add `every_robots_txt` and `every_sitemap_xml` callback hooks, and use those to automatically parse and enqueue URLs when those files are encountered.
Something like:

```ruby
agent.every_robots_txt do |robots| # RobotsTXT class
  # check robots for a `Sitemap` entry
end

agent.every_sitemap_xml do |sitemap| # SitemapXML class
  sitemap.urls.each { |url| agent.enqueue(url) }
end
```
Hmm, `every_robots_txt` might not currently work, since I forgot that the Robots library automatically fetches `/robots.txt` files and caches them when you query it. I could possibly add an `every_host` callback to detect when the Agent visits a new hostname, and then we could use that to eagerly request `/robots.txt` and `/sitemap.xml`. This would also allow the sitemap detection logic to fire on every new host/domain we spider, not just the first one. Thoughts?

I could add an `every_host` callback hook in the 0.7.0 branch, if you think it's a good idea.
```ruby
require 'zlib'

module Spidr
  class Page
```
I feel like there should be a `Sitemap` class that inherits from `Page`. Only the `sitemap.xml` file should have sitemap methods. This would also allow these methods to not contain `sitemap` in their names, which seems kind of redundant if we're already parsing a `sitemap.xml` file.
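A minimal sketch of that structure, under the assumption that the sitemap parsing methods from this PR would simply move into the subclass (names shown are illustrative):

```ruby
require 'spidr/page'

module Spidr
  # Hypothetical subclass used only for sitemap.xml responses, so the
  # sitemap methods live here and can drop the redundant "sitemap"
  # prefix (e.g. `urls` instead of `sitemap_urls`, `index_links`
  # instead of `sitemap_index_links`).
  class Sitemap < Page
    # The parsing helpers added in this PR (each_sitemap_link,
    # sitemap_doc, sitemap_index?, ...) would be defined here.
  end
end
```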
```ruby
end

before do
  stub_request(:any, /#{Regexp.escape(host)}/).to_rack(app)
```
This is redundant code. Already defined in spec/example_app.rb:22.
```ruby
# Return all URLs defined in Sitemap.
#
# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
#   of URLs defined in Sitemap.
```
Sentence fragment.
```ruby
def each_sitemap_index_link
  return enum_for(__method__) unless block_given?

  each_extracted_sitemap_links('sitemap') { |url| yield(url) }
```
Probably can be simplified by passing in a `&block` argument.
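For example, a sketch of that simplification, assuming the `each_extracted_sitemap_links` helper from this PR accepts a block:

```ruby
def each_sitemap_index_link(&block)
  return enum_for(__method__) unless block_given?

  # Forward the caller's block directly instead of wrapping it
  # in another block that just re-yields.
  each_extracted_sitemap_links('sitemap', &block)
end
```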
```ruby
# A URL from the sitemap page.
#
# @return [Enumerator]
#   If no block is given, an enumerator object will be returned.
```
Add extra `#`.
```ruby
# Return all Sitemap index links defined in sitemap.
#
# @return [Array<String>]
#   of links defined in Sitemap.
```
Add extra `#`.
```ruby
# A URL from the sitemap page.
#
# @return [Enumerator]
#   If no block is given, an enumerator object will be returned.
```
Add extra `#`.
```ruby
# A URL from the sitemap page.
#
# @return [Enumerator]
#   If no block is given, an enumerator object will be returned.
```
Add extra `#`.
```ruby
# @return [Array<URI::HTTP>, Array<URI::HTTPS>]
#   The URLs found.
#
# @see https://www.sitemaps.org/protocol.html
```
Add extra `#`.
Regardless of my suggestions, this is good work and a good feature idea!
Overview
- If `robots: true`, will try to fetch sitemap locations from `/robots.txt`.
- If the server returns `text/html` for the sitemap, no URLs will be found; we could be more "liberal" in this situation and allow it.

Public API
- A new `sitemap: true/false` option on `Agent`.
- `Agent#sitemap_urls` and `#initialize_sitemap`.
- New methods on `Page` (tried to follow the same pattern used in `page/html.rb`; see the sketch after this list):
  - `gzip?`
  - `each_sitemap_link`
  - `each_sitemap_url`
  - `sitemap_links`
  - `sitemap_urls`
  - `each_sitemap_index_link`
  - `each_sitemap_index_url`
  - `sitemap_index_links`
  - `sitemap_index_urls`
  - `sitemap_index?`
  - `sitemap_urlset?`
  - `sitemap_doc`
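As a rough illustration of how the new `Page` methods listed above might be used from a spider (a sketch that assumes the option and method names proposed in this PR):

```ruby
require 'spidr'

Spidr.site('https://example.com/', sitemap: true) do |agent|
  agent.every_page do |page|
    if page.sitemap_index?
      # A sitemap index: these URLs point at nested sitemaps.
      p page.sitemap_index_urls
    elsif page.sitemap_urlset?
      # A regular urlset sitemap: these are the page URLs it lists.
      p page.sitemap_urls
    end
  end
end
```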
Usage
- With `sitemap: true`, common sitemap locations will be tried (`/sitemap.xml`, etc.).
- With `robots: true` and `sitemap: true`, sitemap locations will first be fetched from `/robots.txt`; if nothing is found there, the common sitemap locations are tried in a fixed priority order (highest priority first). Both modes are sketched below.
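A hedged sketch of the two modes described above (the `sitemap:` option is this PR's proposal, not released Spidr API):

```ruby
require 'spidr'

# Try the common sitemap locations (/sitemap.xml, etc.) directly.
Spidr.site('https://example.com/', sitemap: true)

# Check /robots.txt for Sitemap entries first, then fall back to the
# common locations if none are listed there.
Spidr.site('https://example.com/', robots: true, sitemap: true)
```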
robots.txt support / interface
The current implementation implements option 2. It would be easy to implement the other variants if that's desirable (example for option 3).
Or a more "fancy" interface
Support non-default locations that aren't listed in `/robots.txt`: the sitemap protocol allows sitemaps to be "scoped" under a path, and to support that we could allow for something like the hypothetical example sketched below. Here is a diff for a commit that adds support for it.
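For instance, a purely hypothetical form of that interface (the path-scoped option value is an assumption, not something this PR implements):

```ruby
require 'spidr'

# Hypothetical: point the agent at a sitemap scoped under a path that
# is not referenced from /robots.txt.
Spidr.site('https://example.com/blog/', sitemap: '/blog/sitemap.xml')
```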
Links