Data Understanding and Preparation

Dataset for games: https://www.kaggle.com/datasets/nikatomashvili/steam-games-dataset

After handling the missing values, only 6% of the values in the Total_User_Reviews column are still missing.
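A minimal sketch of how the share of missing values per column can be checked with pandas. The toy frame and the `Title` column are illustrative assumptions; only `Total_User_Reviews` is a column name taken from the notes above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Steam dataset (values are made up).
games = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],  # hypothetical column
    "Total_User_Reviews": [120.0, np.nan, 15.0, 3400.0],
})

# isna() gives a boolean frame; its column mean is the missing fraction.
missing_pct = games.isna().mean().mul(100).round(1)
print(missing_pct)
```

On the real dataset the same expression would be run on the full frame after the missing-value handling step.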

After checking the URLs, the ones containing "sub" do tend to belong to popular games: they are bundle pages, which do not show the number of reviews for the game. The ones containing "app", on the other hand, are games that actually have no reviews at all.
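The "sub" versus "app" check could be sketched as below. It assumes the store URL sits in a column called `Link` (a hypothetical name) and that the check is made on rows where `Total_User_Reviews` is missing; the URLs themselves are made-up examples of the two store URL shapes.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Link" is an assumed column name.
games = pd.DataFrame({
    "Link": [
        "https://store.steampowered.com/app/111/Example_Game/",
        "https://store.steampowered.com/sub/222/",
        "https://store.steampowered.com/app/333/Another_Game/",
    ],
    "Total_User_Reviews": [np.nan, np.nan, 50.0],
})

# Pull out the path segment that marks the page type: "app" or "sub".
games["url_kind"] = games["Link"].str.extract(
    r"store\.steampowered\.com/(app|sub)/", expand=False
)

# Count page types among the rows that still lack a review count.
missing = games[games["Total_User_Reviews"].isna()]
kind_counts = missing["url_kind"].value_counts()
print(kind_counts)
```

Depending on the outcome, "app" rows could be filled with 0 reviews while "sub" rows are treated as genuinely unknown; that choice is not stated in the notes and is only a suggestion.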

Indie and non-indie games (based on the tag) do not seem to differ much with regard to the distribution of both. Hence, we choose not to create a separate category or to remove non-indie games.

Price of games

Analysis and removal of outliers for the price of games
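The notes do not say which outlier rule is used. As one common option, here is a sketch of the 1.5 × IQR rule on a made-up price series; both the rule and the `Price` name are assumptions.

```python
import pandas as pd

# Made-up prices with one obvious outlier.
prices = pd.Series(
    [0.0, 4.99, 9.99, 14.99, 19.99, 24.99, 29.99, 199.99], name="Price"
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only prices inside the fences (bounds inclusive).
kept = prices[prices.between(lower, upper)]
print(kept.tolist())
```

Because prices are bounded below by 0, only the upper fence usually matters in practice.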

Transformation (Game Features)

Analysis and transformation of Game Features into dummy variables
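A sketch of the dummy-variable step, assuming the game features are stored as one comma-separated string per game in a column named `Game_Features` (both the name and the storage format are assumptions; the feature values are made up).

```python
import pandas as pd

# Made-up rows; one comma-separated feature string per game.
games = pd.DataFrame({
    "Game_Features": [
        "Single-player, Steam Achievements",
        "Multi-player",
        "Single-player, Multi-player",
    ]
})

# One 0/1 column per distinct feature.
feature_dummies = games["Game_Features"].str.get_dummies(sep=", ")
print(feature_dummies)
```

If the features are stored as Python lists instead, `explode()` followed by `pd.get_dummies` and a group-wise `max()` would give the same kind of table.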

PCA Transformation (Popular Tags)

Analysis and transformation of popular tags and summarization through Principal Component Analysis

Because there are too many features, we use PCA for feature reduction. Source: https://mikulskibartosz.name/pca-how-to-choose-the-number-of-components
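One way to choose the number of components is to keep enough of them to reach a target share of explained variance. The sketch below uses scikit-learn's `PCA` with a fractional `n_components`; the 95% threshold and the random 0/1 tag matrix are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 0/1 matrix standing in for the tag dummies (200 games, 20 tags).
rng = np.random.default_rng(0)
tags = (rng.random((200, 20)) > 0.5).astype(float)

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance reaches that share.
pca = PCA(n_components=0.95)
components = pca.fit_transform(tags)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative[-1])
```

Plotting `cumulative` against the component index gives the usual elbow-style view of how many components are worth keeping.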

Finish cleaning up the dataset

Remove columns that won't be used, add new columns, etc.

Modelling and Evaluation

As shown in the previous analysis of F2P games, even though the relationship is mostly linear, games with price 0 (F2P) have a higher number of reviews. With this in mind, the additional F2P variable should mitigate this effect on the model.

Based on the regression results, and assuming linearity and no multicollinearity among the independent variables, the model with all the variables in the dataset is able to explain 27% of the variance of the dependent variable (from the adjusted R-squared).
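For reference, a sketch of how an F2P dummy enters a linear model and how the adjusted R-squared is derived from the plain R-squared. The data are synthetic and the coefficients arbitrary, so the numbers it prints have nothing to do with the 27% reported above; the fit uses plain NumPy least squares rather than whatever library produced the regression results.

```python
import numpy as np

# Synthetic data: review counts fall with price, with a jump for F2P games.
rng = np.random.default_rng(42)
n = 500
price = rng.uniform(0, 60, size=n).round(2)
price[:100] = 0.0                      # force some free-to-play games
f2p = (price == 0).astype(float)       # F2P dummy: 1 when price is 0
reviews = 200 - 2.0 * price + 300 * f2p + rng.normal(0, 150, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), price, f2p])
beta, *_ = np.linalg.lstsq(X, reviews, rcond=None)

# R-squared and adjusted R-squared (p = predictors excluding the intercept).
residuals = reviews - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((reviews - reviews.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
p = X.shape[1] - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(beta, r2, adj_r2)
```

The adjusted value penalizes each extra predictor, which matters here because the dummy and PCA columns add many variables to the model.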

Clustering for further analysis

Provide additional insights to the users related to the average percentage of positive reviews
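A sketch of the idea: cluster the games, then report the average percentage of positive reviews per cluster. K-means, the choice of `Price` as the only clustering feature, the number of clusters, and the `Positive_Pct` column name are all assumptions; the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic games: a cheap group and an expensive group.
rng = np.random.default_rng(1)
games = pd.DataFrame({
    "Price": np.concatenate([rng.normal(5, 1, 50), rng.normal(40, 3, 50)]),
    "Positive_Pct": np.concatenate([rng.normal(70, 5, 50), rng.normal(85, 5, 50)]),
})

# Assign each game to one of two clusters based on price.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
games["cluster"] = kmeans.fit_predict(games[["Price"]])

# Average percentage of positive reviews within each cluster.
summary = games.groupby("cluster")["Positive_Pct"].mean()
print(summary)
```

With more than one clustering feature, the columns should be scaled first (for example with `StandardScaler`) so that no single feature dominates the distances.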