FAqT Service using Beautiful Soup
- Getting started for all of DataFAQs.
- FAqT Service introduces how to start implementing a FAqT Service.
- FAqT Services follow the design of the SADI Semantic Web Services framework.
We'll show how to include the Beautiful Soup API in your FAqT or SADI service so that you can scrape a web page from within the service. As an example, we'll create a SADI service that scrapes MediaWiki tables to find the RDFS/OWL vocabulary terms they list. We created this service to help the W3C prov-wg develop and document their OWL ontology.
Beautiful Soup is a Python library for parsing and walking HTML documents. ScraperWiki uses the API to ease scraping web pages, and that is a handy way to familiarize yourself with Beautiful Soup and try it out.
First, we'll follow the steps in FAqT Service to copy the faqt-template and make it our own. This commit shows the SADI service implementation before we do anything with Beautiful Soup. The commit also shows the example HTML that we're trying to parse and the RDF describing the web page, which is submitted to the wikitable-gspo
service. This provides the materials that we need to implement the scraping functionality.
Next, make sure you have Beautiful Soup installed (ref). We stuck to version 3 and avoided the beta.
```
bash-3.2$ easy_install BeautifulSoup
Searching for BeautifulSoup
Best match: BeautifulSoup 3.2.0
Adding BeautifulSoup 3.2.0 to easy-install.pth file
Using /Library/Python/2.6/site-packages
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup
```
Add the following import to your FAqT/SADI service (ref):
```python
from BeautifulSoup import BeautifulSoup   # For processing HTML
```
Next, we fetch the web page, create the soup object, and print it (just to see that we got it). Killing and rerunning `python wikitable-gspo.rpy` and calling it with `curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://localhost:9115/wikitable-gspo` will show the HTML on the service's console. This commit shows the two additions required.
```python
def process(self, input, output):
    print 'processing ' + input.subject
    page = urllib2.urlopen(input.subject)
    soup = BeautifulSoup(page)
    print soup.prettify()
```
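If you don't have Beautiful Soup handy, the table-walking idea can still be sketched with the standard library's HTMLParser. This is an illustrative stand-in, not the service's code; the class and names below are our own:

```python
# A rough stand-in for the Beautiful Soup step, using only the standard
# library, to illustrate what "walking the tables" means: visit every
# <td>/<th> cell and collect its text.
try:
    from HTMLParser import HTMLParser       # Python 2
except ImportError:
    from html.parser import HTMLParser      # Python 3

class TableCellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell in an HTML page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

collector = TableCellCollector()
collector.feed('<table><tr><th>Term</th></tr>'
               '<tr><td>prov:Entity</td></tr></table>')
print(collector.cells)  # ['Term', 'prov:Entity']
```

Beautiful Soup does the same walk with far less ceremony, which is why the service uses it instead.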
Finally, implement the scrape that we need. This commit shows the implementation we needed to add to find CURIEs mentioned in HTML tables. Thanks to Richard Cyganiak for http://prefix.cc/, which we use to look up namespaces for the CURIEs we find.
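The CURIE-spotting step can be sketched with a regular expression over the collected cell text. This is a hypothetical illustration; the function name and pattern below are ours, not the service's:

```python
import re

# A cell whose text looks like "prefix:LocalName" (e.g. "prov:Entity")
# is treated as a candidate CURIE whose prefix can then be resolved
# against http://prefix.cc/.
CURIE = re.compile(r'^([A-Za-z][\w.-]*):([A-Za-z_][\w.-]*)$')

def find_curies(cell_texts):
    """Return (prefix, local-name) pairs for cells that look like CURIEs."""
    found = []
    for text in cell_texts:
        match = CURIE.match(text.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found

print(find_curies(['prov:Entity', 'a plain phrase', 'dcterms:subject']))
# [('prov', 'Entity'), ('dcterms', 'subject')]
```

A real implementation would also need to skip absolute URIs and other colon-bearing tokens that aren't CURIEs; the pattern above already rejects most of them because `//` can't start a local name.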
The following commands show how to invoke the final service.
```
bash-3.2$ cat provrdf.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://www.w3.org/2011/prov/wiki/ProvRDF> a foaf:Document .
<http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html> a foaf:Document .
<http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html> a foaf:Document .

bash-3.2$ curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://localhost:9115/wikitable-gspo
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html>
   a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html>
   a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://www.w3.org/2011/prov/wiki/ProvRDF>
   a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
   dcterms:subject <http://www.w3.org/ns/prov-o/Entity> .
```
Meanwhile, the service prints:

```
processing http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html
processing http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html
processing http://www.w3.org/2011/prov/wiki/ProvRDF
document contained curie prov:Entity
FETCHED prefix.cc namespace for prov : http://www.w3.org/ns/prov-o/
```
Finally, we want to include the right metadata about the service. We gather up the:
- Source code: https://raw.github.com/timrdf/DataFAQs/master/services/sadi/util/wikitable-gspo.rpy
- Source code homepage: https://github.com/timrdf/DataFAQs/blob/master/services/sadi/util/wikitable-gspo.rpy
- Deployed service URI: http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo
and replace the template strings, respectively:
```
#TEMPLATE/path/to/public/source-code.rpy
#TEMPLATE/path/to/public/HOMEPAGE-FOR/source-code.rpy
#TEMPLATE/path/to/where/source-code.rpy/is/deployed/for/invocation
```
This commit shows how the metadata template was filled out. The wikitable-gspo.ttl file was created by running `tic.sh wikitable-gspo.rpy > wikitable-gspo.ttl` (see the Turtle in the comments).
You can skip all of the steps above and just run these two commands to find out what CURIEs http://www.w3.org/2011/prov/wiki/ProvRDF mentions in its tables.
```
curl -O https://raw.github.com/timrdf/DataFAQs/master/services/sadi/util/wikitable-gspo-materials/sample-inputs/provrdf.ttl
curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo
```
Last I checked, it gives:

```
<http://www.w3.org/2011/prov/wiki/ProvRDF> dcterms:subject <http://www.w3.org/ns/prov-o/Entity> .
```
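If curl isn't handy, the same POST can be built from Python with just the standard library. This is a sketch, not part of the service, and no request is actually sent here:

```python
# Build the equivalent of:
#   curl -H "Content-Type: text/turtle" -d @provrdf.ttl <service URI>
try:
    from urllib2 import Request, urlopen          # Python 2
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

# The same Turtle that provrdf.ttl starts with, inlined for illustration.
turtle = (b'@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n'
          b'<http://www.w3.org/2011/prov/wiki/ProvRDF> a foaf:Document .\n')

req = Request('http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo',
              data=turtle,
              headers={'Content-Type': 'text/turtle'})
print(req.get_header('Content-type'))  # text/turtle
# urlopen(req).read() would return the service's Turtle response.
```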
Or, if you want to check your own web page, replace http://www.w3.org/2011/prov/wiki/ProvRDF in provrdf.ttl and send it off again:
```
cp provrdf.ttl my-page.ttl
# replace http://www.w3.org/2011/prov/wiki/ProvRDF with, e.g.,
# http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings
curl -H "Content-Type: text/turtle" -d @my-page.ttl http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo
```
You'll get something like:

```
<http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings>
   a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
   dcterms:subject
      <http://inference-web.org/2.0/pml-justification.owl#InferenceStep>,
      <http://inference-web.org/2.0/pml-justification.owl#IsConsequentOf>,
      ...
      <http://xmlns.com/wot/0.1/PubKey>,
      <http://xmlns.com/wot/0.1/SigEvent>,
      <http://xmlns.com/wot/0.1/signer>;
      <http://code.google.com/p/surfrdf/7e6f38eb-5c2a-435a-ac0a-463b4be1df7c>,
      <http://code.google.com/p/surfrdf/93dcc518-7eb0-4990-bc04-58b13862ebb3>,
      <http://code.google.com/p/surfrdf/bf4d3286-05df-44f7-ba11-5357fc5c7f67>,
      <http://code.google.com/p/surfrdf/e676d982-1592-4926-8ce7-58587795f8e0> .

<http://code.google.com/p/surfrdf/7e6f38eb-5c2a-435a-ac0a-463b4be1df7c>
   a <http://purl.org/twc/vocab/datafaqs#Error>;
   vann:preferredNamespacePrefix "pav" .

<http://code.google.com/p/surfrdf/93dcc518-7eb0-4990-bc04-58b13862ebb3>
   a <http://purl.org/twc/vocab/datafaqs#Error>;
   vann:preferredNamespacePrefix "premis" .

<http://code.google.com/p/surfrdf/bf4d3286-05df-44f7-ba11-5357fc5c7f67>
   a <http://purl.org/twc/vocab/datafaqs#Error>;
   vann:preferredNamespacePrefix "Provenance" .

<http://code.google.com/p/surfrdf/e676d982-1592-4926-8ce7-58587795f8e0>
   a <http://purl.org/twc/vocab/datafaqs#Error>;
   vann:preferredNamespacePrefix "wp" .
```
which tells us that prefix.cc doesn't have namespaces defined for four prefixes that the page uses.
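That last step amounts to a simple partition: prefixes with a known namespace become dcterms:subject URIs, and the rest are reported as datafaqs:Error nodes. A hypothetical sketch, where the namespace map stands in for prefix.cc lookups rather than being a real response:

```python
# Illustrative namespace map standing in for prefix.cc lookups.
NAMESPACES = {'prov': 'http://www.w3.org/ns/prov-o/'}

def classify_prefixes(prefixes, namespaces):
    """Split prefixes into (resolved, unresolved) against a namespace map."""
    resolved = [(p, namespaces[p]) for p in prefixes if p in namespaces]
    unresolved = [p for p in prefixes if p not in namespaces]
    return resolved, unresolved

print(classify_prefixes(['prov', 'pav', 'premis'], NAMESPACES))
# ([('prov', 'http://www.w3.org/ns/prov-o/')], ['pav', 'premis'])
```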
- Other ways to implement a FAqT Service
- A different way to implement a FAqT Service: [Ripple](FAqT Service using Ripple)