Frederick Lam: A Data Enthusiast's Blog

Metis Project 4 - Topic Modeling the COVID-19 feels

Topic: Unsupervised and Natural Language Processing (NLP)

About

This is the 4th project in the Metis Data Science bootcamp. The main focus of this project is to utilize unsupervised learning with text data, i.e. NLP. For this project, I'll be focusing on topic modeling, an unsupervised technique that returns clustered word groups/topics that define a set of documents.

I chose to build a topic model on First Generation Low Income (FGLI) students' blog posts related to COVID-19. I had wanted to do something on COVID for a while but thought the previous projects would be too sensitive for the topic. Text data is a different story because it's less abstract, so I felt more confident making this the project to do it.

Data

The data I used was scraped from an organization called RiseFirst. RiseFirst is a group whose mission is to empower FGLI students through online platforms. During this season of COVID, they’ve publicly shared a blog that lists students’ posts about their experience and feelings during COVID-19.

I used BeautifulSoup and Selenium to scrape each student’s post into a pandas dataframe. Each post ranged from about fifty to a couple hundred words.
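For reference, here's a minimal sketch of what that scraping step might look like. The URL and the CSS selector are placeholders; the real blog's markup determines the actual values:

```python
# Hypothetical sketch: Selenium renders the page, BeautifulSoup parses it.
# The URL and the "div.post-body" selector are placeholders.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/risefirst-blog")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Collect each student's post into a pandas dataframe
posts = [div.get_text(strip=True) for div in soup.select("div.post-body")]
df = pd.DataFrame({"post": posts})
```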

Cleaning the data involved a good deal of text preprocessing, such as generalizing words, removing punctuation, adding stop words, and creating multi-word expressions. A few documents contained foreign languages, which I was able to translate using the textblob python package. However, I ended up removing most of them because their large document size and the frequency of specific words in each made them outliers.
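A rough sketch of that preprocessing, assuming NLTK for stop words and multi-word expressions. The added stop words and expressions below are illustrative, not the project's actual lists, and TextBlob's translation API (available at the time) has since been removed in newer versions:

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.tokenize import MWETokenizer
from textblob import TextBlob

stop_words = set(stopwords.words("english")) | {"covid", "coronavirus"}  # example additions
mwe = MWETokenizer([("mental", "health"), ("financial", "aid")], separator="_")  # example MWEs

def clean(doc):
    blob = TextBlob(doc)
    if blob.detect_language() != "en":        # translate foreign-language posts
        doc = str(blob.translate(to="en"))    # (older textblob versions only)
    doc = re.sub(r"[^\w\s]", " ", doc.lower())  # lowercase, strip punctuation
    tokens = mwe.tokenize(doc.split())          # join multi-word expressions
    return " ".join(t for t in tokens if t not in stop_words)

df["clean"] = df["post"].apply(clean)
```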

Methodology

Once the cleaning was finished, I vectorized the data using sklearn's CountVectorizer, which transforms the corpus into a document x term matrix.

The document x term matrix essentially reports the frequency of detected words within each document. I did a quick matplotlib bar chart to display the frequency of the top 10 words in the data to get an idea of the topics being mentioned.
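Something like the following, assuming the cleaned posts live in a df["clean"] column:

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
doc_term = cv.fit_transform(df["clean"])   # documents x terms (sparse matrix)

# Sum each column for total word frequencies, then plot the top 10
freqs = doc_term.sum(axis=0).A1
top10 = sorted(zip(cv.get_feature_names_out(), freqs), key=lambda x: -x[1])[:10]
words, counts = zip(*top10)
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Top 10 Word Frequencies")
plt.show()
```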

Next I applied sklearn's Non-Negative Matrix Factorization (NMF) to the matrix to create a topic x term matrix. NMF is a dimensionality reduction algorithm that transforms the data from a high-dimensional space to a low-dimensional one to improve its interpretability. Compared to other dimensionality reduction algorithms, NMF is more restrictive in that it only accepts non-negative values. Since word counts are never negative, text data fits that criterion very well.
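In code, fitting the NMF and peeking at each topic's top terms looks roughly like this (n_components=4 matches the final topic count below):

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=4, random_state=42)
doc_topic = nmf.fit_transform(doc_term)   # W: documents x topics
topic_term = nmf.components_              # H: topics x terms

# Print the 8 highest-weighted terms per topic to help with labeling
terms = cv.get_feature_names_out()
for i, topic in enumerate(topic_term):
    top = topic.argsort()[::-1][:8]
    print(f"Topic {i}:", ", ".join(terms[j] for j in top))
```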

With the dimensions reduced, I can inspect the top terms in the topic x term matrix and get an idea of what each topic's label should be. After going back and doing additional cleaning, I settled on 4 topics and labeled them Personal Health, Money, Anxiety, and Fear.

Here are a few plots I made to help visualize the weights of the top words within each topic, which led me to the labels:

Personal Health

Money

Anxiety

Fear

Topic Modeling isn’t complete until this topic x term matrix is transformed into a document x term matrix, which was simply done by transforming it into an H-matrix. The document x term matrix reports the weights of each topic within each document, which can be interpreted to determine which documents belong to each topic.

The higher a document's weight for a topic, the more likely that document's text belongs to that topic. For example, one of the documents, “Anyone who knows me knows…”, has a weight of 1.91 in the Personal Health topic, significantly higher than its other topic weights. Looking back at the blog, the post belongs to Andrea Convey from Wellesley College. As you briefly read her post, you can tell it's a very emotional post, and she mentions her mental health quite a bit. This corresponds well with the Personal Health topic, which puts more confidence in my topic model.

Conclusion

To conclude, I’d say the model does a good job of displaying the topics within this blog. Obviously by reading through the posts, you can tell that the students are very concerned about the well-being of themselves and their friends and family. But if you wanted a brief overview of them all, the topic model provides that and can help categorize the general feel of the documents.

In addition, because this data can be quite relatable to most of us during this season, the topics re-emphasize how COVID has put us in tricky situations but also share a portion of what people have been feeling this year.

You can access the project Github here.

Metis Project 3 - Team Building a TFT Win/Loss Predictor

Topic: Classification

About

Just like project 2, this is a solo project. As a reminder, projects 2 to 5 in the Metis Data Science bootcamp are all solo. The goal of this project is to create a machine learning model using supervised learning to predict a classification target.

One of the requirements for this project is to create an interactive app or data visualization. There will be a video that showcases a flask app that I created for this project.

The topic I chose for this project is predicting a win or loss using matchmaking data from the popular video game, Teamfight Tactics <- (Don’t know what that is? Click on link!). I’m basing my prediction off the odds of a win or loss, which I include as an output in my flask app.

Data

The data I used was downloaded from this kaggle post, which was extracted from the API of Riot Games, the game developer. From the JSON objects, I chose to create multiple features off each object, which led me to these initial features in my datasets:

  • gameID (API key per user)
  • gameDuration (total match duration)
  • level (player match level)
  • lastRound (last round before player leaves the match/gets eliminated from the match)
  • ingameDuration (match duration per player)
  • dummified list of features for each class and origin
  • item count for each champion
  • star level for each champion

To expand on the last three features: each champion has a unique class and origin. These attributes form team combinations/synergies with other champions and are what set the strength of your team. Each champion placed on the board can equip up to 3 items dropped throughout the match and can be leveled up to 3 stars, which affects the growth of their base stats.
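As a hypothetical sketch of the dummifying step (the champions, classes, and origins shown are just examples; the Kaggle data's actual schema differs):

```python
import pandas as pd

# Toy champion table with one class and origin each
raw = pd.DataFrame({
    "champion": ["Ahri", "Jinx"],
    "class": ["Sorcerer", "Blaster"],
    "origin": ["Star Guardian", "Rebel"],
})

# One-hot encode class and origin into binary indicator columns
dummies = pd.get_dummies(raw[["class", "origin"]], prefix=["class", "origin"])
features = pd.concat([raw[["champion"]], dummies], axis=1)
```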

My predictions are based on part 1 of set 3 (i.e. season 3) of the game, but the datasets listed some data from the previous set that I had to remove.

To conclude my data cleaning, I removed the gameID feature: since it represents the API key for each player, a set of matches belonging to one player would create duplicate IDs in the dataset.

Once each dataset was cleaned, I uploaded them all onto an AWS EC2 instance running a PostgreSQL database. The other requirement for this project was to incorporate SQL, which I did using SQLAlchemy, a toolkit that lets pandas read SQL queries, to import each dataset from AWS. From there I concatenated all the datasets into one large dataset.
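A minimal sketch of that import step; the connection string and table names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and host for the Postgres database on EC2
engine = create_engine("postgresql://user:password@ec2-host:5432/tft")

tables = ["matches_1", "matches_2", "matches_3"]  # hypothetical table names
df = pd.concat(
    [pd.read_sql(f"SELECT * FROM {t};", engine) for t in tables],
    ignore_index=True,
)
```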

Through testing, I also removed every feature except gameDuration, player level, item counts, and star levels, because the rest showed strong multicollinearity. My finalized dataset consisted of 800,000 observations and 106 features.
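One common way to flag those multicollinear features is to scan the pairwise correlation matrix; the 0.8 cutoff here is illustrative, not the project's exact criterion:

```python
import numpy as np

corr = df.corr().abs()  # pairwise correlations of the numeric features

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)
```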

Methodology

Before testing, my classes were at about a 1:2 ratio, so I used SMOTE to balance them to 1:1. Then I did an 80/20 split for my train data and an additional split for 20% test and 20% validation data, both stratified.
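In code, the balancing and splits look roughly like this (SMOTE applied first, mirroring the order described above; X and y are assumed to hold the features and win/loss labels):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Oversample the minority class to a 1:1 ratio
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Hold out 20% for test, then 25% of the remainder (20% of the total) for validation
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```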

The algorithms I tested include KNearestNeighbors, Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Gaussian Naive Bayes, and LinearSVC, all imported from scikit-learn. Because my main goal is to predict the odds of a class, I chose the ROC AUC score as my main metric. Among all my models, Random Forest performed the best. As you can see, its scores differ by only 0.02 across the train, test, and validation data and are quite high, much higher than I expected. ROC AUC measures how well the model ranks its predictions, so the closer it is to 1, the better. The ROC curve on the right shows the performance of my model at all classification thresholds and, in combination with the AUC results, looks quite good.
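A sketch of that comparison loop, scoring each model on the validation set (LinearSVC lacks predict_proba, so its decision_function scores feed ROC AUC instead):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

models = {
    "KNearestNeighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "LinearSVC": LinearSVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_val)[:, 1]  # probability of a win
    else:
        scores = model.decision_function(X_val)    # signed margin for LinearSVC
    print(f"{name}: ROC AUC = {roc_auc_score(y_val, scores):.3f}")
```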

After finalizing my choice of model, I plotted the precision and recall curves to determine the threshold for my finalized predictions. To support the threshold calculation, I plotted an f1 curve as a baseline. f1, in layman's terms, represents the balance between precision and recall: it's their harmonic mean, calculated as two divided by the sum of the inverses of the precision and recall curves. To find my threshold, I took the index of the maximum value of the f1 curve and used that index to look up the corresponding value on the threshold curve, which gave me a threshold of about 0.41. If you draw a line straight down from the intersection of the recall and precision curves in the plot below, it matches up with the peak of the f1 curve, which is around our value of 0.41.
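A sketch of that threshold calculation with sklearn's precision_recall_curve (scaling f1 by a constant doesn't move the argmax, so the harmonic-mean form below lands on the same threshold):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rf = models["Random Forest"]              # the chosen model from above
probs = rf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# precision/recall have one more entry than thresholds; trim to align
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / (p + r)                  # harmonic mean of precision and recall
best = np.argmax(f1)
print(f"threshold ~ {thresholds[best]:.2f}")  # about 0.41 in the project
```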

So any prediction with odds greater than 0.41 is classified as a win, and anything less as a loss.

Flask App

FYI, I’ve published subtitles for this demo so click the icon in the YouTube player to enable them.

Conclusion

My model works surprisingly well, but it's obviously not perfect. I think this is because the random forest classifier handles such a large number of features well. But that also came as a double-edged sword, because it makes the features difficult to interpret.

However, the goal of this project was to predict a class based off the calculated odds, which is what I ended up doing. There is one problem though: the data comes from high-elo/ranked matchmaking. Unfortunately, no one in that pool performs horribly, which throws the model off a bit when you run it on inputs that are mostly zeros. It would help if the data included more variety across the entire playerbase so the predictions could better reflect the real world. I'm hoping that if I continue this project in the future, I can get approved for access to the game's API to update the data and improve the model.

You can access the project Github here.