
TextRank

  • Based on Google's PageRank Algorithm
  • Graph-based Ranking Statistical Model
  • Each Node is a Sentence
  • Ranks Sentences under the Assumption that Summary-Worthy Sentences are Similar to Most Other Sentences
  • A Higher-Ranked Sentence is More Similar to the Other Sentences in the Text

Click here for more information on TextRank
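
As a rough illustration of the idea (not the code used in this repository), the sketch below builds a sentence graph with a naive word-overlap similarity and scores the sentences with networkx's PageRank; the similarity measure and top_n parameter are placeholders for illustration only.

```python
# Minimal TextRank-style extractive summarizer (illustrative sketch only;
# word-overlap is a stand-in for a proper sentence-similarity measure).
import networkx as nx

def textrank_summary(sentences, top_n=2):
    token_sets = [set(s.lower().split()) for s in sentences]

    # Build the sentence graph: one node per sentence, edges weighted by word overlap
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(token_sets[i] & token_sets[j])
            if overlap:
                graph.add_edge(i, j, weight=overlap)

    # PageRank: a sentence ranks higher when it is similar to many other sentences
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```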

pyTextRank

  • Python Implementation of TextRank
  • Slight Improvements to TextRank
    • Lemmatization instead of Stemming (see the sketch after this list)
  • NLP Combinations
    • All NLP
    • Stop words Removal Only
    • Lemmatization Only
    • No NLP
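
For a quick look at the lemmatization-versus-stemming difference, the sketch below compares NLTK's Porter stemmer with its WordNet lemmatizer; this is only an example, not the lemmatizer pyTextRank itself uses.

```python
# Stemming vs. lemmatization on the same words (NLTK used for illustration;
# requires nltk.download("wordnet") the first time it is run).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "ponies"]:
    # Stemming chops suffixes ("studi", "poni"); lemmatization returns
    # dictionary forms ("study", "pony").
    print(word, "stem:", stemmer.stem(word), "lemma:", lemmatizer.lemmatize(word))
```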

Running the Model

There are two pyTextRank notebooks: the first shows how a single article is processed along with its output, while the second processes multiple articles and stores the generated summaries in a separate folder.
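
As a rough sketch of what the second notebook does, the loop below reads every article in an input folder, summarizes it, and writes each summary to an output folder; the folder names and the summarize stand-in are placeholders, not the notebook's actual code.

```python
# Hypothetical batch loop mirroring the second notebook's behaviour.
from pathlib import Path

def summarize(text):
    # Stand-in for the notebook's pyTextRank call; here just the first two sentences
    sentences = text.split(". ")
    return ". ".join(sentences[:2])

input_dir = Path("articles")      # assumed input folder name
output_dir = Path("summaries")    # assumed output folder name
output_dir.mkdir(exist_ok=True)

for article in input_dir.glob("*.txt"):
    summary = summarize(article.read_text(encoding="utf-8"))
    (output_dir / article.name).write_text(summary, encoding="utf-8")
```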

Additional Steps to Run Different NLP Combinations


The original GitHub code for the pyTextRank model has been modified by the authors of this repository to run different NLP combinations. Below are detailed steps for running each combination.

1. All NLP Combination

No additional steps are required; the base model applies all NLP techniques (Stop words Removal + Lemmatization).

2. Stop words Removal Only Combination

Comment out this chunk of code from pyTextRank.py

line 227  if pos_family in POS_LEMMA:
...
line 229     word = word._replace(root=tok_lemma)

3. Lemmatization Only Combination

Remove all words from the stop word list

4. No NLP Combination

Step 1 - Remove all words from the stop word list

Step 2 - Comment out this chunk of code from pyTextRank.py

line 227  if pos_family in POS_LEMMA:
...
line 229     word = word._replace(root=tok_lemma)
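
Conceptually, the four combinations above amount to switching two preprocessing steps on or off. The sketch below illustrates this with NLTK as an example; it is not the modification applied to pyTextRank.py.

```python
# Illustrative preprocessing toggles for the four NLP combinations
# (requires nltk.download("stopwords") and nltk.download("wordnet")).
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(tokens, remove_stop_words=True, lemmatize=True):
    stop_words = set(stopwords.words("english")) if remove_stop_words else set()
    lemmatizer = WordNetLemmatizer()
    out = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # drop stop words when the toggle is on
        out.append(lemmatizer.lemmatize(tok) if lemmatize else tok)
    return out

tokens = "The cats are sitting on the mats".split()
print(preprocess(tokens, True, True))    # 1. All NLP
print(preprocess(tokens, True, False))   # 2. Stop words Removal Only
print(preprocess(tokens, False, True))   # 3. Lemmatization Only
print(preprocess(tokens, False, False))  # 4. No NLP
```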

TextRank4ZH

  • Chinese Python Implementation of TextRank
  • NLP Combinations
    • No NLP
    • Stop words Removal Only

Running the Model

Step 1 - Run run_TextRank4ZH

Step 2 - Run split_results. This splits the generated summaries into 2 parts, as the output would otherwise be too large for the ROUGE 2.0 package to display results in Excel. Note that this step is optional, depending on the size of the dataset.

Step 3 - Run segment_text. This segments the Chinese text with a word segmentation library before scoring it with the ROUGE 2.0 package.
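
Segmentation inserts spaces between Chinese words so that ROUGE's token-based matching can work. The sketch below uses jieba as an example library; segment_text may rely on a different one.

```python
# Example of Chinese word segmentation before ROUGE scoring
# (jieba shown for illustration only).
import jieba

sentence = "我们使用TextRank生成中文摘要"
segmented = " ".join(jieba.cut(sentence))
print(segmented)  # words separated by spaces, ready for token-based ROUGE matching
```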

Other Resources & Dependencies