Week 4: Data Visualisation (Udemy Course Data)

Concept

I found a dataset of Udemy courses on Kaggle, which stores details such as course rating, course duration, course price, and more. I focused on three of these variables. After browsing through the data, I decided to use it to create a scatter plot of three variables: one on the x-axis, another on the y-axis, and the third plotted as the size of the point. This essentially shows three dimensions of data in a 2D graph.

The scatter plot can then be used to look for correlations among the three variables, including each of the three pairwise combinations. Some of the observations that can be made from this data visualisation are discussed below.

Procedure

I followed a series of steps to reach my end result. Note that each blob is an object with its own attributes.

  1. Data Collection/Retrieval: I got the data set titled “Udemy Courses – Top 5000 Course 2022” from Kaggle. However, I only used about 2000 rows of data as it was faster to load; the code can be scaled to use all 5000 rows if necessary.
  2. Data Cleaning: In this phase, I looked at the columns of data that the plot is concerned with, namely reviews_avg (course rating), course_duration, and main_price. All of these columns were stored as strings with numbers embedded in them. For instance, the course rating was stored as “Rating: 4.6 out of 5.0”. I split each string and extracted the relevant number.
    function getRating(ratings) {
      // stores ratings in floating numbers
      let rating = [];
      max_rating = 0;
      
      // cleaning the data to extract numbers from strings
      for (let i = 0; i < len; i++) {
        let rating_array = ratings[i].split(" ");
        rating[i] = parseFloat(rating_array[1]);
        
        // extracting the maximum rating
        if (max_rating < rating[i]) {
          max_rating = rating[i];
        }
      }
    
      return rating;
    }
    
    function getDuration(durations) {
      // stores duration in floating numbers
      let duration = [];
      max_duration = 0;
      
      // cleaning the data to extract numbers from strings
      for (let i = 0; i < len; i++) {
        let duration_array = durations[i].split(" ");
        duration[i] = parseFloat(duration_array[0]);
        
        // extracting the maximum duration
        if (max_duration < duration[i]) {
          max_duration = duration[i];
        }
      }
      
      return duration;
    }
    
    function getPrice(prices) {
      // stores price in floating numbers
      let price = [];
      max_price = 0;
      
      // cleaning the data to extract numbers from strings
      for (let i = 0; i < len; i++) {
        let price_array = prices[i].split(" ");
        price[i] = price_array[2];
        
        // ignoring any values that is not present
        if (price[i] == undefined) {
          continue;
        }
        
        // the number in string had comma, e.g. 1,399.99. This portion removes the comma and converts the string into a floating number
        let temp = price[i].slice(2).split(",");
        if (temp.length == 2) {
          // the price was only in thousands, so there is only one comma in each price data
          price[i] = parseFloat(temp[0])*1000 + parseFloat(temp[1]);
        } else {
          price[i] = parseFloat(temp[0]);
        }
        
        // extracting the maximum price
        if (max_price < price[i]) {
          max_price = price[i];
        }
      }
    
      return price;
    }

    The above code shows how the data cleaning was done.

  3. Normalised Values: I normalised the values of each variable to fit the canvas. This was achieved by dividing each value by the maximum for its variable and multiplying by a factor of the width, height, or size. The portion of the code used to normalise the values is as follows:
    // normalizing the data to fit into the canvas
        
    // Price is used on the x-axis, while rating is used on the y-axis
    let xPos = this.price / max_price * (width/1.1);
    let yPos = this.rating / max_rating * (height*2) - height*1.1;
        
    // the duration of the course determines the diameter of the circle
    let diameter = this.duration / max_duration * 200;
  4. Displaying: Notice from the above code snippet that price is used as the x-coordinate, rating as the y-coordinate, and duration as the diameter of the blob/circle plotted on the canvas. I chose these axes by trial and error: I tried various combinations and kept the one that looked best.
  5. Interactivity: I added a hover function to each blob, so that the name, price, rating and duration of the course are displayed when you hover over any blob in the plot. I implemented this with a show_description() method in the class:
    show_description() {
      // displays the information about the course when hovered on the blob
      if ((mouseX <= this.xPos + this.diameter/2 && mouseX >= this.xPos - this.diameter/2) && (mouseY < this.yPos + this.diameter/2 && mouseY > this.yPos - this.diameter/2)) {
        if (mouseX > width/2) {
          textAlign(RIGHT, CENTER);
          fill(123);
          rect(mouseX - 520, mouseY - 10, 520, 80);
        } else {
          textAlign(LEFT, CENTER);
          fill(123);
          rect(mouseX - 20, mouseY - 10, 520, 80);
        }
    
        fill("#eeeeee");
        text(this.name, mouseX - 10, mouseY);
        text("Price: "+this.price+ " USD", mouseX - 10, mouseY + 20);
        text("Rating: "+this.rating + "/5.0", mouseX - 10, mouseY + 40);
        text("Duration: "+this.duration + " hours", mouseX - 10, mouseY + 60);
      }
    }
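
The cleaning, normalising and hover steps above can also be sketched as small standalone helpers. This is only an illustration, not the code used in the sketch: the names extractNumber, normalise and isInsideCircle are hypothetical, the regular expression assumes each string contains at most one relevant number with commas as thousands separators, and the circular hit test is a swapped-in alternative to the square bounding-box check in show_description().

```javascript
// Hypothetical helpers; none of these names appear in the original sketch.

// Pull the first number out of a string such as "Rating: 4.6 out of 5.0"
// or "1,399.99". Commas are assumed to be thousands separators.
// Returns NaN when the input is not a string or holds no number.
function extractNumber(text) {
  if (typeof text !== "string") {
    return NaN;
  }
  const match = text.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : NaN;
}

// Scale a value relative to the largest value in its column so that
// the maximum maps to `span` (for example a fraction of width or height).
function normalise(value, max, span) {
  return max > 0 ? (value / max) * span : 0;
}

// True when the point (px, py) lies inside the circle centred at (cx, cy).
// Unlike a bounding-box test, the corners of the square are excluded.
function isInsideCircle(px, py, cx, cy, diameter) {
  const dx = px - cx;
  const dy = py - cy;
  return dx * dx + dy * dy <= (diameter / 2) ** 2;
}

// Example usage with made-up values
const rating = extractNumber("Rating: 4.6 out of 5.0");
const price = extractNumber("1,399.99");
const xPos = normalise(price, 2000, 800);
console.log(rating, price, xPos, isInsideCircle(4, 4, 0, 0, 10));
```

One helper replaces the three nearly identical parsing loops, and the circular test only triggers when the mouse is actually over the blob rather than anywhere in the square around it.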

Observations that can be made

There are a few interesting observations we can make from the data:

  1. There appears to be no clear correlation between the price of a course and its rating. Interestingly, the data gives little support for the supposedly common belief that higher-rated courses cost more.
  2. Notice that the biggest blobs are mostly on the bottom right of the graph (ratings increase downwards in this plot, since the canvas y-coordinate grows towards the bottom), suggesting that courses with a longer duration also tend to be more costly and have a higher rating.
  3. However, most of the short courses have a lower price, and their ratings are spread out.
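
These observations are visual impressions. One way to check them numerically would be to compute the Pearson correlation coefficient for each pair of variables. The sketch below is a minimal version of that idea and was not part of the original project; the function name pearson is hypothetical, and it assumes the inputs are the cleaned numeric arrays, skipping any pair with a missing value.

```javascript
// Hypothetical helper, not part of the original sketch.
// Returns the Pearson correlation coefficient of two equally indexed arrays,
// ignoring positions where either value is missing or not a finite number.
function pearson(xs, ys) {
  const pairs = [];
  const n0 = Math.min(xs.length, ys.length);
  for (let i = 0; i < n0; i++) {
    if (Number.isFinite(xs[i]) && Number.isFinite(ys[i])) {
      pairs.push([xs[i], ys[i]]);
    }
  }

  const n = pairs.length;
  if (n < 2) {
    return NaN; // not enough data to measure a relationship
  }

  let sumX = 0;
  let sumY = 0;
  for (const [x, y] of pairs) {
    sumX += x;
    sumY += y;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (const [x, y] of pairs) {
    const dx = x - meanX;
    const dy = y - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  if (varX === 0 || varY === 0) {
    return NaN; // a constant column has no defined correlation
  }
  return cov / Math.sqrt(varX * varY);
}

// Example usage with made-up values
console.log(pearson([10, 20, 30], [4.1, 4.3, 4.5]));
```

A value near 0 for price against rating would support the first observation, while clearly positive values for duration against price and duration against rating would support the second.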

Future Improvements

It appears that the courses in the data set have a high average rating, which might bias the observations towards higher-rated courses and might not support definite conclusions about the entire Udemy course library. So, an improvement would be in the data collection: we could randomly sample a few thousand courses from the entire Udemy course library so that the sample better represents the whole population.
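
As a rough illustration of the sampling step, the sketch below draws a uniform random sample without replacement using a partial Fisher-Yates shuffle. It assumes the full course list is already available as an array, which is not the case for the Kaggle data set used here; the function name sampleRows and its parameters are hypothetical.

```javascript
// Hypothetical helper, assuming the full list of courses is available as an array.
// Returns k rows chosen uniformly at random without replacement.
// The input array is left untouched.
function sampleRows(rows, k, random = Math.random) {
  const pool = rows.slice(); // copy so the caller's array is not reordered
  const n = Math.max(0, Math.min(k, pool.length));

  // Partial Fisher-Yates shuffle: only the first n positions are settled
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }

  return pool.slice(0, n);
}

// Example usage with made-up course ids
const allCourses = Array.from({ length: 20 }, (_, i) => "course-" + i);
console.log(sampleRows(allCourses, 5));
```

Because every course has the same chance of being picked, the ratings in the sample should follow the distribution of the whole library rather than that of the top courses only.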
