This project is a Spark ML pipeline built with PySpark for NLP, using the annotators DocumentAssembler, Tokenizer, WordEmbeddingsModel, PerceptronModel, and NerCrfModel. It prints a transformed DataFrame showing the POS and NER columns and analyzes the relationship between the entities found and their POS attributes. It provides hands-on experience with Spark, PySpark, and Spark NLP.


Spark-nlp-Pyspark

This project is focused on building a Spark ML Pipeline using PySpark to perform natural language processing on a dataset. The pipeline uses the following annotators: DocumentAssembler, Tokenizer, WordEmbeddingsModel, PerceptronModel, and NerCrfModel.

Getting Started

To get started with the project, you will need Spark and PySpark installed on your machine. You will also need to import the necessary libraries and download the pretrained models for English.

Prerequisites

Installing

To install Spark and PySpark, follow the instructions provided on their respective websites. To install the Spark NLP library, run the following command (the leading ! is for notebook environments):

!pip install spark-nlp

Running the Application

The application is run by executing the script file containing the pipeline. The pipeline reads the input dataset and prints the transformed DataFrame, showing only the POS and NER columns. As a bonus, it shows only the result attribute of these annotations. The result attributes of POS and NER are collected, and the relationship between the found entities and their part-of-speech attributes is analyzed and explained.
