Data Representation and Calculation

2022-07-14 · 4 min read

Working in Machine Learning means working with large sets of data. The sets of data that we use to train our model are the training examples. For supervised learning, we have a label for each training example, which is the value we want our algorithm to predict. Let's keep our feet on the land of supervised learning for now. We have lots of training examples, and let me denote the number of training examples by m. Now, each example can have lots of features. For example, if we are creating an ML model to predict the price of a house, then each training example is a vector of the features that represent a house; these features can be the number of rooms in the house, the area, the number of stories, and more or fewer depending on our model's performance and other factors. Let's say we have n features for each of the m training examples. So each training example would be an n-dimensional vector.

Vector


As Mr. Grant Sanderson said in the first video of his video series Essence of Linear Algebra:

A vector is the fundamental, root-of-it-all building block of linear algebra.

Further in the video, we get to know 3 different perspectives or ideas for understanding vectors. What we need now is one part of the mathematician's perspective, which combines the ideas of the physicist's perspective and the computer scientist's perspective. The part we are taking out of the mathematical one is the CS perspective, which defines a vector as an ordered list of numbers. Below, we have 2 vectors, a row vector and a column vector. Stacking row vectors on top of each other or laying column vectors side by side gives us a matrix.

[Figure: Column Vector]
[Figure: Row Vector]
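
As a quick sketch of the computer scientist's "ordered list of numbers" idea, here is how those two kinds of vectors, and the matrices they build, could look in NumPy (the example values are arbitrary):

```python
import numpy as np

# A column vector (3 x 1) and a row vector (1 x 3) as 2-D NumPy arrays
col = np.array([[3], [1], [2]])        # shape (3, 1)
row = np.array([[3, 1, 2]])            # shape (1, 3)

# Stacking row vectors on top of each other gives a matrix...
A = np.vstack([row, [[5, 0, 7]]])      # shape (2, 3)
# ...and laying column vectors side by side gives one too
B = np.hstack([col, [[5], [0], [7]]])  # shape (3, 2)

print(A.shape, B.shape)                # (2, 3) (3, 2)
```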

Making Sense of Data


The training examples can be represented as a matrix of dimension m x n, where each row corresponds to a training example and each column corresponds to a feature of the training examples.

[Figure: Matrix of Training Examples]
[Figure: Label Vector]

Here, X is a matrix of dimension 5 x 3 where each row is a training example and each column is a feature of the training example. So here we have 3 features: the 1st column represents the number of floors in the house, the 2nd column represents the number of rooms in the house, and finally the 3rd column represents the area of the house in some sq. unit. And y is the price of each house in our training examples, which is the value we want our model to predict. To predict it, we train our model to find and give us a vector of parameters theta (or weights) of dimension n x 1 such that the hypothesized or predicted value h is the weighted sum of the inputs, calculated as:

h = X * theta
% h - m x 1 as X * theta - (m x n) * (n x 1) => m x 1

The vector h is the result of the matrix multiplication between X and theta, so h would be an m-dimensional vector: we have m training examples, so we predict m results, which also follows from the rules of matrix multiplication. Working with matrices allows us to vectorize our code, which gives us big benefits in calculation time and code readability, which in turn makes it easier to debug our code.
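
To make the shapes concrete, here is a minimal NumPy sketch of the same calculation; the feature values and parameter values below are made up for illustration, not taken from the figures above:

```python
import numpy as np

# Hypothetical training matrix X (m = 5 examples, n = 3 features:
# floors, rooms, area) -- the numbers here are invented
X = np.array([
    [1, 3,  850],
    [2, 4, 1200],
    [1, 2,  600],
    [3, 5, 2000],
    [2, 3, 1000],
], dtype=float)                            # shape (5, 3)

theta = np.array([[10.0], [25.0], [0.5]])  # parameter vector, shape (3, 1)

h = X @ theta                              # (5, 3) @ (3, 1) -> (5, 1) predictions
print(h.shape)                             # (5, 1)
```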

Vectorization - Brief Overview


Vectorization is a more efficient alternative to using loops. The way we calculated h above is also an example of vectorization: instead of looping over each training example, we operate on the whole matrix at once. Vectorization is possible because the processors in our devices support it through SIMD (Single Instruction, Multiple Data) instructions. Vectorization deserves its own blog post to be understood in depth, but for now, just understand that working with matrices gives us this freedom of using vectorization. Many high-level languages like Matlab and Octave support this out of the box, but many do not and require external libraries, for example NumPy for Python.
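
As a rough illustration (not a benchmark), here is a NumPy sketch comparing a plain Python loop with the vectorized matrix product; the array sizes and random data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 100))
theta = rng.random((100, 1))

# Loop version: one dot product per training example
h_loop = np.empty((X.shape[0], 1))
for i in range(X.shape[0]):
    h_loop[i, 0] = sum(X[i, j] * theta[j, 0] for j in range(X.shape[1]))

# Vectorized version: a single matrix multiplication
h_vec = X @ theta

print(np.allclose(h_loop, h_vec))  # True, but the vectorized call is far faster
```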

Normalizing Feature - Brief Overview


If an ML practitioner is reading this article, then the practitioner is clearly wondering that using these raw features is not going to give us good results, or might make a model that uses Gradient Descent as the optimization algorithm to minimize the cost function take too long to converge, as different features take values that span different ranges.

So to address this problem we scale and normalize features to ensure that the values of the features are in the same range. Understanding why and how we do this is a topic that requires its own blog post. So, for now, have faith in me. The normalized values can look like this:

We see that the values in each column span a range which we can approximate for now to be about -1.3 to 1.3. So we can be confident that every feature contributes on a comparable scale to the output that we predict.
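
For the curious, here is one common way to get values like that: a mean/standard-deviation standardization in NumPy. The feature values are the same made-up ones as above, and the exact scaling method may differ from what the original figure used:

```python
import numpy as np

X = np.array([
    [1, 3,  850],
    [2, 4, 1200],
    [1, 2,  600],
    [3, 5, 2000],
    [2, 3, 1000],
], dtype=float)

# Standardize each column: subtract its mean and divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.round(2))  # every column now has mean 0 and a comparable spread
```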

Representing Neural Network Data


Let me give one more example of how working with matrices makes it really easy for us to represent and understand data and to perform calculations on it. Below we can see a 3-layer neural network.

Source: Here

If we were making an ML model using a neural network to predict which number is in a 28 x 28 pixel image taken from the MNIST dataset, the neural network model would have 784 (28 x 28) units in the input layer, each unit representing the brightness of the corresponding pixel in the image. A completely black pixel can be represented by an intensity value of 0, a perfectly white pixel by an intensity value of 255, and a gray pixel can have an intensity value from 1 to 254. At the bit level, each pixel has an 8-bit value, with a black pixel being 00000000, a white pixel being 11111111, and gray in between.

Now if we have some number of training examples, say m = 1000, and since each training example of 28 x 28 pixels is unrolled into a vector, our input matrix X will be of size 1000 x 784 (1000 x 785 including the bias unit), each row representing a training example - a vector of dimension 785.
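
As a sketch of that unrolling step, here is what it could look like in NumPy; the random pixel intensities below stand in for real MNIST images, which you would load instead:

```python
import numpy as np

m = 1000
# Stand-in for 1000 MNIST images: random 28 x 28 grids of pixel intensities
images = np.random.randint(0, 256, size=(m, 28, 28))

# Unroll each 28 x 28 image into a 784-dimensional row vector,
# then scale the intensities to [0, 1]
X = images.reshape(m, 784).astype(float) / 255.0  # shape (1000, 784)

# Prepend the bias unit (a column of ones) -> shape (1000, 785)
X = np.hstack([np.ones((m, 1)), X])
print(X.shape)  # (1000, 785)
```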

Info

The value of the bias unit is always 1, but its corresponding element in the parameter matrix theta has a different value, which the model learns. It acts like the constant term c in the equation y = mx + c. To understand it better, let's take an example: it's raining, you are stuck at your school, and your brain is the model that predicts whether you'll go home or not. The prediction can depend on many things, like whether you have an umbrella, which acts as a feature, and the weight or parameter matrix has a value for each feature denoting how much that feature affects the output. Now, while the features can have a huge impact on the prediction, the bias term can represent how much you are inclined to go home regardless: maybe you have something important to take care of at home, so you are strongly biased toward going home, or maybe you don't have anything to do and are not so biased.

Now each of these input units or neurons serves as an input to every unit of the hidden layer, each connection with a different weight or parameter. So, if we have 25 hidden layer units, we get a matrix of weights, let's call it theta, of dimension (25 x 785).

Warning

I am ignoring the use of the activation function here; those interested can Google it or wait for my blog post on the topic.

So, now the activations in the hidden units can be calculated in the same manner as above, we calculate the weighted sum:

h = X * theta' % X * theta_transpose
% h - 1000 x 25 as X * theta' - (1000 x 785) * (785 x 25) => 1000 x 25

Here, we have 1000 rows, one for each training example; each row is a vector of dimension 25 - each element being the activation of one of the 25 hidden units. Now we do the same for the hidden layer and output layer and get the activations in the output layer. The output layer has 10 units for predicting the numbers from 0-9. So the output matrix will be of dimension (1000 x 10), each row denoting the vector of activations in the output layer for each of the 1000 training examples. Each unit in the output layer is associated with a number between 0-9, so we predict the number to be the one associated with the unit having the highest activation. We'd have the same type of representation and calculation for recommender systems using collaborative filtering, taking the calculation of any 2 layers of a neural network. In fact, these representations and calculations are the way we represent data and do calculations for any ML system.
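
Putting the whole forward pass together, here is a minimal NumPy sketch under the same simplifications (random stand-in weights, bias columns of ones, and no activation functions, as noted in the warning above):

```python
import numpy as np

m = 1000
X = np.random.rand(m, 785)        # inputs with the bias column, as above
theta1 = np.random.rand(25, 785)  # input  -> hidden weights
theta2 = np.random.rand(10, 26)   # hidden -> output weights (25 units + bias)

hidden = X @ theta1.T                          # (1000, 785) @ (785, 25) -> (1000, 25)
hidden = np.hstack([np.ones((m, 1)), hidden])  # add the bias unit -> (1000, 26)
output = hidden @ theta2.T                     # (1000, 26) @ (26, 10) -> (1000, 10)

# Predict the digit whose output unit has the highest activation
predictions = output.argmax(axis=1)            # shape (1000,), values in 0-9
print(predictions[:10])
```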


References and Resources



Hey, I assume you finished reading. I would love to know your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.