
FAqT Service using Beautiful Soup


What is first

  • FAqT Service

What we'll cover

We'll show how to use the Beautiful Soup API in your FAqT or SADI service so that you can scrape a web page from within the service. We'll create an example SADI service that scrapes wiki tables to find the RDFS/OWL vocabulary terms they list. We created this service to help the W3C prov-wg develop and document their OWL ontology.

Let's get to it

Beautiful Soup is a Python library for parsing and walking HTML documents. ScraperWiki uses the API to ease scraping web pages, and it is a handy place to familiarize yourself with Beautiful Soup and try it out.
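
To get a feel for the API before wiring it into a service, here is a minimal, self-contained sketch; the HTML string is made up purely for illustration:

   from BeautifulSoup import BeautifulSoup

   html = '<html><body><table><tr><td>prov:Entity</td></tr></table></body></html>'
   soup = BeautifulSoup(html)

   # findAll walks the parse tree; here we visit every cell of every table.
   for table in soup.findAll('table'):
      for cell in table.findAll('td'):
         print ''.join(cell.findAll(text=True))   # prints: prov:Entity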

First, we'll follow the steps in FAqT Service to copy the faqt-template and make it our own. This commit shows the SADI service implementation before we do anything with Beautiful Soup. The commit also shows the example HTML that we're trying to parse and the RDF describing the web page, which is submitted to the wikitable-gspo service. Together, these provide the materials that we need to implement the scraping functionality.

Next, make sure you have Beautiful Soup installed (ref). We stuck with version 3 and avoided the version 4 beta.

bash-3.2$ easy_install BeautifulSoup
Searching for BeautifulSoup
Best match: BeautifulSoup 3.2.0
Adding BeautifulSoup 3.2.0 to easy-install.pth file

Using /Library/Python/2.6/site-packages
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup
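
To confirm which version easy_install picked, you can check the __version__ string that BeautifulSoup 3 exposes:

bash-3.2$ python -c "import BeautifulSoup; print BeautifulSoup.__version__"
3.2.0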

Add the following import to your FAqT/SADI service (ref):

from BeautifulSoup import BeautifulSoup          # For processing HTML

Next, we fetch the web page, create the soup object, and print it (just to see that we got it). Kill and rerun python wikitable-gspo.rpy, then call it with curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://localhost:9115/wikitable-gspo to see the HTML on the service's console. This commit shows the two additions required.

   # Note: urllib2 must be imported at the top of the module, next to BeautifulSoup.
   def process(self, input, output):

      print 'processing ' + input.subject

      page = urllib2.urlopen(input.subject)   # fetch the page named by the input URI
      soup = BeautifulSoup(page)              # parse the HTML into a navigable soup
      print soup.prettify()                   # print it, just to see that we got it

Finally, implement the scrape that we need. This commit shows the implementation we added to find CURIEs mentioned in HTML tables. Thanks to Richard Cyganiak for http://prefix.cc/, which we use to look up namespaces for the CURIEs we find.
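
The commit has the authoritative implementation; what follows is only a rough sketch of the approach. It assumes that a CURIE appears as the entire text of a table cell and that prefix.cc's tab-separated .file.txt lookup is used; the regex and function names are made up for illustration, not copied from the commit.

   import re
   import urllib2
   from BeautifulSoup import BeautifulSoup

   # A CURIE like "prov:Entity": a prefix, a colon, and a local name.
   CURIE = re.compile(r'^([A-Za-z][A-Za-z0-9_.-]*):([A-Za-z_][A-Za-z0-9_.-]*)$')

   def curies_in_tables(soup):
      '''Yield (prefix, localname) for each table cell whose text is a CURIE.'''
      for table in soup.findAll('table'):
         for cell in table.findAll(['td', 'th']):
            text = ''.join(cell.findAll(text=True)).strip()
            match = CURIE.match(text)
            if match:
               yield match.group(1), match.group(2)

   def namespace_for(prefix):
      '''Ask prefix.cc for a prefix's namespace; None if it has no entry.'''
      try:
         line = urllib2.urlopen('http://prefix.cc/%s.file.txt' % prefix).read()
         return line.strip().split('\t')[1]   # response is "prefix<TAB>namespace"
      except urllib2.HTTPError:
         return None

With these two pieces, process() can assert dcterms:subject triples for the CURIEs whose namespaces resolve and flag the rest as errors, which is the shape of the output shown below.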

The following commands show how to invoke the final service.

bash-3.2$ cat provrdf.ttl 

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.w3.org/2011/prov/wiki/ProvRDF>                       a foaf:Document .
<http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html> a foaf:Document .
<http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html>   a foaf:Document .
bash-3.2$ curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://localhost:9115/wikitable-gspo

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html> 
   a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html> 
   a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://www.w3.org/2011/prov/wiki/ProvRDF> 
   a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
   dcterms:subject <http://www.w3.org/ns/prov-o/Entity> .

Meanwhile, the service prints:

processing http://dvcs.w3.org/hg/prov/raw-file/default/primer/Primer.html
processing http://dvcs.w3.org/hg/prov/raw-file/default/paq/prov-aq.html
processing http://www.w3.org/2011/prov/wiki/ProvRDF
   document contained curie prov:Entity
      FETCHED prefix.cc namespace for prov : http://www.w3.org/ns/prov-o/

Finally, we want to include the right metadata about the service. We gather up:

  • the public URL of the service's source code,
  • the URL of a homepage describing the service, and
  • the URL where the service is deployed for invocation,

and substitute them for the template strings, respectively:

  • #TEMPLATE/path/to/public/source-code.rpy
  • #TEMPLATE/path/to/public/HOMEPAGE-FOR/source-code.rpy
  • #TEMPLATE/path/to/where/source-code.rpy/is/deployed/for/invocation

This commit shows how the metadata template was filled out. The wikitable-gspo.ttl file was created by running tic.sh wikitable-gspo.rpy > wikitable-gspo.ttl (see turtle in comments).
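
For this service, the substitutions would look roughly like the following. The deployment URL appears later on this page; the source and homepage URLs are inferred from the repository layout, so treat them as illustrative rather than quoted from the commit:

  • source code: https://raw.github.com/timrdf/DataFAQs/master/services/sadi/util/wikitable-gspo.rpy
  • homepage: https://github.com/timrdf/DataFAQs/tree/master/services/sadi/util
  • deployed at: http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo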

Try it out yourself

You can ignore all the stuff above and just run these two commands to find out what CURIEs http://www.w3.org/2011/prov/wiki/ProvRDF mentions in tables.

curl -O https://raw.github.com/timrdf/DataFAQs/master/services/sadi/util/wikitable-gspo-materials/sample-inputs/provrdf.ttl
curl -H "Content-Type: text/turtle" -d @provrdf.ttl http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo

Last I checked, it gives:

<http://www.w3.org/2011/prov/wiki/ProvRDF> dcterms:subject <http://www.w3.org/ns/prov-o/Entity> .

Check your own page

Or, if you want to check your own web page, replace http://www.w3.org/2011/prov/wiki/ProvRDF in provrdf.ttl and send it off again:

cp provrdf.ttl my-page.ttl
# replace http://www.w3.org/2011/prov/wiki/ProvRDF with, e.g. 
# http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings
curl -H "Content-Type: text/turtle" -d @my-page.ttl http://sparql.tw.rpi.edu/services/datafaqs/util/wikitable-gspo

You'll get something like:

<http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings> 
    a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
    dcterms:subject 
        <http://inference-web.org/2.0/pml-justification.owl#InferenceStep>,
        <http://inference-web.org/2.0/pml-justification.owl#IsConsequentOf>,
        ...
        <http://xmlns.com/wot/0.1/PubKey>,
        <http://xmlns.com/wot/0.1/SigEvent>,
        <http://xmlns.com/wot/0.1/signer>,
        <http://code.google.com/p/surfrdf/7e6f38eb-5c2a-435a-ac0a-463b4be1df7c>,
        <http://code.google.com/p/surfrdf/93dcc518-7eb0-4990-bc04-58b13862ebb3>,
        <http://code.google.com/p/surfrdf/bf4d3286-05df-44f7-ba11-5357fc5c7f67>,
        <http://code.google.com/p/surfrdf/e676d982-1592-4926-8ce7-58587795f8e0> .

<http://code.google.com/p/surfrdf/7e6f38eb-5c2a-435a-ac0a-463b4be1df7c> 
    a <http://purl.org/twc/vocab/datafaqs#Error>;
    vann:preferredNamespacePrefix "pav" .

<http://code.google.com/p/surfrdf/93dcc518-7eb0-4990-bc04-58b13862ebb3> 
    a <http://purl.org/twc/vocab/datafaqs#Error>;
    vann:preferredNamespacePrefix "premis" .

<http://code.google.com/p/surfrdf/bf4d3286-05df-44f7-ba11-5357fc5c7f67> 
    a <http://purl.org/twc/vocab/datafaqs#Error>;
    vann:preferredNamespacePrefix "Provenance" .

<http://code.google.com/p/surfrdf/e676d982-1592-4926-8ce7-58587795f8e0> 
    a <http://purl.org/twc/vocab/datafaqs#Error>;
    vann:preferredNamespacePrefix "wp" .

which tells us that prefix.cc doesn't have namespaces defined for four prefixes that the page uses.

What is next

  • Other ways to implement a FAqT Service
  • A different way to implement a FAqT Service: [Ripple](FAqT Service using Ripple)