The objective of this pipeline is to enable automatic tagging of Stack Overflow questions and answers. The model is trained in batch on TF-IDF features using Spark MLlib. The trained model is then used to tag new questions as they arrive from Kafka (the expected latency is still an open question). Finally, the data is persisted in Cassandra to support the front end.
Use TF-IDF to form a vector for each question or answer:
- TF (term frequency) is the number of times a word appears in a document
- IDF (inverse document frequency) measures how common or rare a word is across the whole collection of documents
- Extract text features using TF-IDF
- Train a Naive Bayes classifier to do multiclass classification
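
The two steps above can be sketched in plain Python to show what the computation looks like. This is a minimal illustration, not the pipeline's Spark code: the function names (`fit_idf`, `tfidf_vector`, `train_naive_bayes`, `predict_tag`) and the four toy questions are hypothetical, and the tokenizer is a simple whitespace split. The IDF uses the smoothed form log((N + 1) / (df + 1)), which is the formula the Spark MLlib documentation gives. In the actual pipeline, MLlib's `HashingTF`, `IDF`, and `NaiveBayes` would play these roles.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real text needs markup and stop-word handling.
    return text.lower().split()


def fit_idf(token_docs):
    # Smoothed IDF: log((N + 1) / (df + 1)), df = number of documents containing the term.
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # TF is the raw count of a term in the document; terms unseen in training are dropped.
    return {term: count * idf[term] for term, count in Counter(tokens).items() if term in idf}


def train_naive_bayes(vectors, labels, alpha=1.0):
    # Multinomial Naive Bayes with additive smoothing (alpha) over summed TF-IDF weights.
    vocab = {term for vec in vectors for term in vec}
    class_weights = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            class_weights[label][term] += weight
    model = {}
    for label, n_label in Counter(labels).items():
        total = sum(class_weights[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_label / len(labels))
        log_likelihood = {
            term: math.log((class_weights[label].get(term, 0.0) + alpha) / total)
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_tag(model, idf, text):
    # Score each tag by log prior plus weighted log likelihoods; return the best tag.
    vec = tfidf_vector(tokenize(text), idf)

    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(w * log_likelihood[t] for t, w in vec.items() if t in log_likelihood)

    return max(model, key=score)


# Toy training data, invented purely for illustration.
training = [
    ("how to join two dataframes in spark", "spark"),
    ("spark rdd map versus flatmap", "spark"),
    ("kafka consumer group rebalance", "kafka"),
    ("kafka topic partition offset", "kafka"),
]
token_docs = [tokenize(text) for text, _ in training]
labels = [tag for _, tag in training]

idf = fit_idf(token_docs)
vectors = [tfidf_vector(doc, idf) for doc in token_docs]
model = train_naive_bayes(vectors, labels)

print(predict_tag(model, idf, "spark dataframe join"))
print(predict_tag(model, idf, "kafka partition offset"))
```

Because the smoothed IDF is never negative, the resulting vectors satisfy the non-negative feature requirement of a multinomial Naive Bayes model. This sketch predicts a single tag per question; questions that carry several tags would need a different setup, such as one classifier per tag.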