Kevin Chen (kevc528), Maxwell Du (maxdu), Edward Kim (kime022), Andrew Zhao (anzhao)
The crawler follows the Mercator model and distributes the tasks involved in the crawl. We added master and worker Spark Java servers, along with changes to the queue and other parts of the crawler, to increase efficiency.
The indexer takes documents and creates an inverted index, which maps terms to the documents that contain them, along with a second table that maps terms to their IDFs.
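As a minimal sketch of the idea (illustrative Python, not the project's actual Java/Spark implementation; all names here are ours):

```python
import math
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping doc_id -> list of terms in that document."""
    inverted = defaultdict(set)  # term -> set of doc_ids containing it
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].add(doc_id)
    n = len(docs)
    # IDF: log of (total docs / number of docs containing the term)
    idfs = {term: math.log(n / len(ids)) for term, ids in inverted.items()}
    return inverted, idfs
```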
The PageRank algorithm is written in Python in a Jupyter Notebook using PySpark SQL. It queries a Postgres AWS RDS instance for graph edge-list data and loads it into a Spark DataFrame for computation. When the algorithm converges, the results are written back to RDS.
The query function returns a sorted list of documents matching a user's query, where each document's score is a combination of TF-IDF, PageRank, and several custom bonuses we implemented. On the front end, there is a web server API for the query function and a React-based UI that displays search results.
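As a rough illustration, the combined score might be computed along these lines (a minimal sketch; the weights and bonus terms below are assumptions, not the backend's actual values):

```python
# Hypothetical combination of ranking signals; the real weights and
# bonus terms used by the search backend are assumptions here.
def score(tfidf, pagerank, bonuses, alpha=0.8, beta=0.2):
    # alpha and beta are illustrative weights, not the project's values
    return alpha * tfidf + beta * pagerank + sum(bonuses)

# Documents are then sorted by descending combined score, e.g.:
docs = [("a.com", score(0.9, 0.1, [0.05])), ("b.com", score(0.5, 0.4, []))]
docs.sort(key=lambda pair: pair[1], reverse=True)
```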
- Crawler: `555-crawler`
- Indexer: `555-indexer`
- Web UI: `cis555webui`
- PageRank: `pagerank`
- Search Backend: `search-server`
The crawler can be run using Maven. Within the `pom.xml` file, there are different executions for the master node and up to 3 worker nodes. To run the master, simply run `mvn exec:java@master`, and to run the workers, run `mvn exec:java@worker[number]`.
It is important to note that the crawler uses several environment variables: `RDS_USERNAME`, `RDS_PASSWORD`, and `RDS_HOSTNAME`. These must be filled in for the crawler to access and write to the database. Additionally, when using EC2, you will have to change the argument in `pom.xml` to match whatever location the master node is running on.
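For example, assuming the RDS variables are exported in your shell (the worker execution ID below is illustrative and should match the executions defined in your `pom.xml`):

```sh
export RDS_USERNAME=...   # database credentials
export RDS_PASSWORD=...
export RDS_HOSTNAME=...
mvn exec:java@master      # start the master node
mvn exec:java@worker1     # start worker 1 (likewise for the other workers)
```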
To run the indexer, first run `mvn package` within the 555-indexer directory, which will create a JAR file in the target folder. Upload this file to S3, and create an EMR cluster that runs this S3 file as a "Spark step." Make sure that the arguments for `spark-submit` contain `--class edu.upenn.cis.cis455.invertedindex.Indexer` and that you specify `--conf spark.yarn.appMasterEnv.RDS_USERNAME=___ --conf spark.yarn.appMasterEnv.RDS_PASSWORD=___ --conf spark.yarn.appMasterEnv.RDS_HOSTNAME=___`. The arguments to the main class itself are of the form `crawlerDocsTableName invertedIndexTableName idfsTablename numPartitions`, where `numPartitions` specifies the number of partitions Spark uses throughout the job.
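Putting the pieces together, a full invocation for the step might look like the following (the bucket, JAR name, table names, and partition count are placeholders):

```sh
spark-submit --class edu.upenn.cis.cis455.invertedindex.Indexer \
  --conf spark.yarn.appMasterEnv.RDS_USERNAME=___ \
  --conf spark.yarn.appMasterEnv.RDS_PASSWORD=___ \
  --conf spark.yarn.appMasterEnv.RDS_HOSTNAME=___ \
  s3://<your-bucket>/555-indexer.jar \
  crawler_docs inverted_index idfs 100
```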
Follow this tutorial to configure and launch your EC2 node: https://chrisalbon.com/aws/basics/run_project_jupyter_on_amazon_ec2/. Afterwards, upload the two notebooks, `pagerank.ipynb` and `clean_urls.ipynb`, using the Jupyter Notebook UI, and add the `postgresql-42.2.20.jar` file to the same directory. You will also need to create a file called `aws_credentials.json` in the same directory, with the following format:

```json
{
  "aws_access_key_id": "",
  "aws_secret_access_key": "",
  "password": "",
  "ENDPOINT": "",
  "PORT": "",
  "USR": "",
  "REGION": "",
  "DBNAME": ""
}
```
Now just run `clean_urls.ipynb`, then run `pagerank.ipynb`.
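For context, the notebooks' RDS access looks roughly like the following PySpark sketch (the table name and Spark configuration are assumptions; adjust to match the notebooks):

```python
# Sketch of loading the edge list from RDS into a Spark DataFrame.
# Assumes the aws_credentials.json format above; "edges" is an assumed table name.
import json
from pyspark.sql import SparkSession

with open("aws_credentials.json") as f:
    creds = json.load(f)

spark = (SparkSession.builder
         .config("spark.jars", "postgresql-42.2.20.jar")  # Postgres JDBC driver
         .getOrCreate())

jdbc_url = f"jdbc:postgresql://{creds['ENDPOINT']}:{creds['PORT']}/{creds['DBNAME']}"
edges = (spark.read.format("jdbc")
         .option("url", jdbc_url)
         .option("dbtable", "edges")
         .option("user", creds["USR"])
         .option("password", creds["password"])
         .option("driver", "org.postgresql.Driver")
         .load())
```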
Deploy on EMR by configuring a step and uploading the JAR with dependencies to Amazon S3.
Every other component can be deployed on EC2. In particular, we can deploy the web server and the UI on the same EC2 node, with the web server running on port 45555 and the UI running on port 80.