For this challenge, we used historical Hacker News comment data to segment users into meaningful groups.
The task: based on historical comment data, create meaningful segments of users. Can you propose statistical goodness-of-fit measures that describe the quality of the segmentation solution?
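As an illustration of such a measure (not necessarily the one used in our solution), the mean silhouette coefficient compares how close each user is to their own segment versus the nearest other segment; values near 1 suggest well-separated segments. The sketch below is plain Python rather than the R stack listed in this README, and it assumes each user has already been turned into a numeric feature vector. The function name `silhouette_score` and the toy data are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by segment label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two segments")

    def mean_dist(i, members):
        # Average distance from point i to the members (excluding i itself).
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton segment scores 0
            continue
        a = mean_dist(i, own)  # cohesion: distance to own segment
        b = min(  # separation: distance to the nearest other segment
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up feature vectors for four users (illustration only).
users = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(users, [0, 0, 1, 1])  # segments match the two clumps
bad = silhouette_score(users, [0, 1, 0, 1])   # segments cut across the clumps
print(round(good, 3), round(bad, 3))
```

A segmentation that follows the natural grouping of the data should score noticeably higher than one that cuts across it, which is what makes the measure usable for comparing candidate solutions.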
Below are some questions about characteristics that may help differentiate the segments:
- What is the sentiment of comments made by each group?
- What are the common word associations within each group, based on their comments?
- What are the keywords in the titles of the articles being commented on?
- How many comments does each segment make?
- What are the characteristics of each group?
- What articles do they comment about and what do they write about?
- What is the high-level summary of each group?
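
Several of the questions above come down to counting words per segment. The following is a minimal sketch of that idea in plain Python, given for illustration only; it is not the R pipeline used in our solution. The function name `top_words_by_segment`, the tiny stop-word list, and the sample comments are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# A deliberately tiny stop-word list (assumption; a real analysis would use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "or", "is", "it", "to", "of", "in", "i", "this", "that"}


def top_words_by_segment(comments, n=3):
    """Map each segment to its n most frequent non-stop-words.

    comments: iterable of (segment, text) pairs.
    """
    counts = defaultdict(Counter)
    for segment, text in comments:
        words = re.findall(r"[a-z']+", text.lower())
        counts[segment].update(w for w in words if w not in STOPWORDS)
    return {segment: counter.most_common(n) for segment, counter in counts.items()}


# Made-up comments, already assigned to hypothetical segments "A" and "B".
sample = [
    ("A", "Rust is great, rust is fast"),
    ("B", "The startup raised funding"),
    ("A", "I love Rust"),
]
result = top_words_by_segment(sample)
print(result["A"])
```

The same per-segment counts can be computed on article titles instead of comment text to answer the keyword question.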
Please keep in mind that a segmentation solution should have balanced segment sizes and a meaningful story behind each group of users.
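
One possible way to put a number on "balanced segment sizes" is the normalized entropy of the segment sizes: it is 1 when all segments are equally large and falls toward 0 as one segment dominates. This is a hedged sketch in plain Python, not a measure taken from the challenge or from our solution; the function name `segment_balance` and the example labels are hypothetical.

```python
import math
from collections import Counter


def segment_balance(labels):
    """Normalized Shannon entropy of segment sizes, in the range [0, 1]."""
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # convention (assumption): a single segment is not balanced
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)  # divide by the maximum entropy for k segments


even = segment_balance(["a", "a", "b", "b"])    # two equal segments
skewed = segment_balance(["a"] * 9 + ["b"])     # one segment holds 90% of users
print(round(even, 3), round(skewed, 3))
```

A balance score says nothing about whether the segments are meaningful, so it is best read alongside a separation measure and the qualitative story behind each group.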
More info: github.com/WhyR2020/hackathon
- R
- tidyverse
- tidytext: Text mining using tidy tools
- tm: Text Mining Package
- topicmodels: Topic Models
- textutils: Utilities for Handling Strings and Text
We submitted our solution as a video presentation that walks through each step we took to segment the users and the insights we discovered while exploring the data.
Video: https://youtu.be/B-cEalcxfb4
PowerPoint Presentation: Presentation.pptx