pyscraper

pyscraper is a simple Python script that takes a mandatory URL (strictly speaking, a URI) as its only parameter. It issues a request to that URI, attempts to parse the response body as HTML, and counts the HTML elements present. It prints the total number of elements along with the five most frequently used tags.
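As a rough illustration of the approach described above, here is a minimal sketch that uses only the standard library. The names TagCounter, count_tags and fetch_html are hypothetical and are not taken from the pyscraper source; the actual implementation may differ.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCounter(HTMLParser):
    """Hypothetical helper: tallies every element found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> count as one element as well.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping each tag name to its number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def fetch_html(url):
    """Assumed fetch step: download the body and decode it as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Demonstration on an inline document, so no network access is needed.
counts = count_tags("<html><body><div><p>Hi</p><br/></div><div></div></body></html>")
total = sum(counts.values())   # total number of elements
top5 = counts.most_common(5)   # (tag, count) pairs, most frequent first
```

Counting opening tags is one reasonable definition of "HTML elements"; closing tags are ignored so that each element is counted once.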

Free software: Apache Software License 2.0
Documentation: https://github.com/ClintEsteMadera/pyscraper

Context

This script was written for a code challenge set by a company. These are the exact requirements:

Build a web page scraper in python that generates some basic metrics about a given page.

Requirements:

A "README" explaining usage and any dependencies/setup/caveats/etc

The exercise should be written in Python (2 or 3, as you prefer). Please, specify in the README the version that was used

No external python libraries are allowed. But built-in Python libraries are allowed

The target URL to scrape should be a parameter

In whatever output format you see fit, your code should hit the target domain and output the total number of HTML elements and the top 5 most frequently used tags, and their respective counts.

Must-have/Compulsory: Write tests to make sure your code is behaving correctly.

You must create a repo on Github and send us the link to it.

Please include any good practices you usually follow and present it as if it were production-ready code.

Requirements

Mandatory:

Python 3 or greater

Optional:

If you want to run code coverage (currently at 100%), linting, or API docs generation, you will need the following:

flake8 (pip install flake8)
coverage (pip install coverage)
sphinx (pip install sphinxcontrib-apidoc)

Virtual Environment

Although optional, using a virtual environment is highly recommended. To manage virtual environments, install pyenv if you haven't already done so:

brew install pyenv pyenv-virtualenv

You then need to add this to .bashrc:

if command -v pyenv 1>/dev/null 2>&1; then
    eval "$(pyenv init -)"
    eval "$(pyenv virtualenv-init -)"
fi

Then run pyenv versions to see which Python versions you already have installed. Choose any Python 3+ version as the base for the virtualenv you will create. For example:

pyenv virtualenv system-3.8.2 pyscraper

Now run pip install -r requirements_dev.txt and you should be good to go.

Note that the Makefile provides several targets for these tasks.

Script Usage

python main.py <url_to_scrape>

Example:

python main.py https://www.ordergroove.com/

Sample Output:

Number of HTML elements: 1104
Most frequently used elements (Top 5):
div : 452
script : 163
span : 88
a : 70
li : 60
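For reference, a summary in the shape shown above could be produced from a tag tally roughly as follows. This is only a sketch: format_summary is a hypothetical name, and the tally passed in is made-up illustrative data, not the result of scraping a real page.

```python
from collections import Counter


def format_summary(counts, top_n=5):
    """Hypothetical formatter: render a tag tally like the sample output."""
    lines = [
        f"Number of HTML elements: {sum(counts.values())}",
        f"Most frequently used elements (Top {top_n}):",
    ]
    # most_common returns (tag, count) pairs sorted by descending count.
    lines += [f"{tag} : {count}" for tag, count in counts.most_common(top_n)]
    return "\n".join(lines)


# Illustrative tally only.
print(format_summary(Counter({"div": 3, "a": 2, "p": 1})))
```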

Run tests

Just run the following:

make test

Note: pytest warnings are intentionally suppressed (pytest --disable-pytest-warnings) because there is one warning that I could not easily fix properly.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Many of the auto-generated pieces, especially in the Makefile, might strike people as unnecessary for the purpose of this code challenge. As a Python newbie, however, I found them a good excuse to at least scratch the surface of several tools commonly used in the Python ecosystem. That is why I decided to go this route, as opposed to the dead-simple setup that my IDE (PyCharm) generated for me in the first place.
