Data Representation and Calculation

2022-07-14 · 4 min read

Working in Machine Learning means working with large sets of data. The sets of data that we use to train our model are the training examples. For supervised learning, we have a label for each training example, which is the value we want our algorithm to predict. Let's keep our feet on the land of supervised learning for now. We have lots of training examples, and let me denote the number of training examples by m. Now, each example can have lots of features. For example, if we are creating an ML model to predict the price of a house, then each training example is a vector of the features that represent a house; these features can be the number of rooms in the house, the area, the number of stories, and more or fewer depending on our model's performance and other factors. Let's say we have n features for each of the m training examples. So each training example would be an n-dimensional vector.

Vector


As Mr. Grant Sanderson said in the first video of his video series Essence of Linear Algebra:

A vector is the fundamental, root-of-it-all building block of linear algebra.

Further in the video, we get to know 3 different perspectives or ideas for understanding vectors. What we need now is one part of the mathematician's perspective, which combines the ideas of the physicist's perspective and the computer scientist's perspective. The part we are taking out of the mathematical one is the CS perspective, which defines a vector as an ordered list of numbers. Below, we have 2 vectors, a row vector and a column vector. Stacking row vectors on top of each other or laying column vectors side by side gives us a matrix.

[Figure: Column Vector]
[Figure: Row Vector]
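
As a quick sketch of the computer scientist's "ordered list of numbers" idea, here is how those two kinds of vectors, and the matrices they build, could look in NumPy (the example values are arbitrary):

```python
import numpy as np

# A column vector (3 x 1) and a row vector (1 x 3) as 2-D NumPy arrays
col = np.array([[3], [1], [2]])        # shape (3, 1)
row = np.array([[3, 1, 2]])            # shape (1, 3)

# Stacking row vectors on top of each other gives a matrix...
A = np.vstack([row, [[5, 0, 7]]])      # shape (2, 3)
# ...and laying column vectors side by side gives one too
B = np.hstack([col, [[5], [0], [7]]])  # shape (3, 2)

print(A.shape, B.shape)                # (2, 3) (3, 2)
```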

Making Sense of Data


The training examples can be represented as a matrix of dimension m x n, where each row corresponds to a training example and each column corresponds to a feature of the training examples.

[Figure: Matrix of Training Examples]
[Figure: Label Vector]

Here, X is a matrix of dimension 5 x 3 where each row is a training example and each column is a feature of the training example. So here we have 3 features: the 1st column represents the number of floors in the house, the 2nd column represents the number of rooms in the house, and finally the 3rd column represents the area of the house in some sq. unit. And y is the price of each house in our training examples, which is the value we want our model to predict. To predict it, we train our model to find and give us a vector of parameters theta (or weights) of dimension n x 1 such that the hypothesized or predicted value h is the weighted sum of the inputs, calculated as:

h = X * theta
% h - m x 1 as X * theta - (m x n) * (n x 1) => m x 1

The vector h is the result of the matrix multiplication between X and theta, so h would be an m-dimensional vector: we have m training examples, so we predict m results, which also follows from the rules of matrix multiplication. Working with matrices allows us to vectorize our code, which gives us big benefits in calculation time and code readability, which in turn makes it easier to debug our code.
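
To make the shapes concrete, here is a minimal NumPy sketch of the same calculation; the feature values and parameter values below are made up for illustration, not taken from the figures above:

```python
import numpy as np

# Hypothetical training matrix X (m = 5 examples, n = 3 features:
# floors, rooms, area) -- the numbers here are invented
X = np.array([
    [1, 3,  850],
    [2, 4, 1200],
    [1, 2,  600],
    [3, 5, 2000],
    [2, 3, 1000],
], dtype=float)                            # shape (5, 3)

theta = np.array([[10.0], [25.0], [0.5]])  # parameter vector, shape (3, 1)

h = X @ theta                              # (5, 3) @ (3, 1) -> (5, 1) predictions
print(h.shape)                             # (5, 1)
```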

Vectorization - Brief Overview


Vectorization is a more efficient alternative to using loops. The way we calculated h above is also an example of vectorization: instead of looping over each training example, we operate on the whole matrix at once. Vectorization is possible because the processors in our devices support it through SIMD (Single Instruction, Multiple Data) instructions. Vectorization deserves its own blog post to be understood in depth, but for now, just understand that working with matrices gives us this freedom of using vectorization. Many high-level languages like Matlab and Octave support this out of the box, but many do not and require external libraries, for example NumPy for Python.
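
As a rough illustration (not a benchmark), here is a NumPy sketch comparing a plain Python loop with the vectorized matrix product; the array sizes and random data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 100))
theta = rng.random((100, 1))

# Loop version: one dot product per training example
h_loop = np.empty((X.shape[0], 1))
for i in range(X.shape[0]):
    h_loop[i, 0] = sum(X[i, j] * theta[j, 0] for j in range(X.shape[1]))

# Vectorized version: a single matrix multiplication
h_vec = X @ theta

print(np.allclose(h_loop, h_vec))  # True, but the vectorized call is far faster
```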

Normalizing Feature - Brief Overview


If an ML practitioner is reading this article, then the practitioner is clearly wondering that using these raw features is not going to give us good results, or might make a model that uses Gradient Descent as the optimization algorithm to minimize the cost function take too long to converge, as different features take values that span different ranges.

So to address this problem we scale and normalize features to ensure that the values of the features are in the same range. Understanding why and how we do this is a topic that requires its own blog post. So, for now, have faith in me. The normalized values can look like this:

We see that the values in each column span a range which we can approximate for now to be about -1.3 to 1.3. So we can be confident that every feature contributes on a comparable scale to the output that we predict.
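
For the curious, here is one common way to get values like that: a mean/standard-deviation standardization in NumPy. The feature values are the same made-up ones as above, and the exact scaling method may differ from what the original figure used:

```python
import numpy as np

X = np.array([
    [1, 3,  850],
    [2, 4, 1200],
    [1, 2,  600],
    [3, 5, 2000],
    [2, 3, 1000],
], dtype=float)

# Standardize each column: subtract its mean and divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.round(2))  # every column now has mean 0 and a comparable spread
```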

Representing Neural Network Data


Let me give one more example of how working with matrices makes it really easy for us to represent and understand data and to perform calculations on it. Below we can see a 3-layer neural network.

Source: Here

If we were making an ML model using a neural network to predict which number is in a 28 x 28 pixel image taken from the MNIST dataset, the neural network model would have 784 (28 x 28) units in the input layer, each unit representing the brightness of the corresponding pixel in the image. A completely black pixel can be represented by an intensity value of 0, a perfectly white pixel by an intensity value of 255, and a gray pixel can have an intensity value from 1 to 254. At the bit level, each pixel has an 8-bit value, with a black pixel being 00000000, a white pixel being 11111111, and gray in between.

Now if we have some number of training examples, say m = 1000, and since each training example of 28 x 28 pixels is unrolled into a vector, our input matrix X will be of size 1000 x 784 (1000 x 785 including the bias unit), each row representing a training example - a vector of dimension 785.
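
As a sketch of that unrolling step, here is what it could look like in NumPy; the random pixel intensities below stand in for real MNIST images, which you would load instead:

```python
import numpy as np

m = 1000
# Stand-in for 1000 MNIST images: random 28 x 28 grids of pixel intensities
images = np.random.randint(0, 256, size=(m, 28, 28))

# Unroll each 28 x 28 image into a 784-dimensional row vector,
# then scale the intensities to [0, 1]
X = images.reshape(m, 784).astype(float) / 255.0  # shape (1000, 784)

# Prepend the bias unit (a column of ones) -> shape (1000, 785)
X = np.hstack([np.ones((m, 1)), X])
print(X.shape)  # (1000, 785)
```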

Info

The value of the bias unit is always 1, but its corresponding element in the parameter matrix theta has a different value, which the model learns. It acts like the constant term c in the equation y = mx + c. To understand it better, let's take an example: it's raining, you are stuck at your school, and your brain is the model that predicts whether you'll go home or not. The prediction can depend on many things, like whether you have an umbrella, which acts as a feature, and the weight or parameter matrix has a value for each feature denoting how much that feature affects the output. Now, while the features can have a huge impact on the prediction, the bias term can represent how much you are inclined to go home regardless: maybe you have something important to take care of at home, so you are strongly biased toward going home, or maybe you don't have anything to do and are not so biased.

Now each of these input units or neurons serves as an input to every unit of the hidden layer, each connection with a different weight or parameter. So, if we have 25 hidden layer units, we get a matrix of weights, let's call it theta, of dimension (25 x 785).

Warning

I am ignoring the use of the activation function here; those interested can Google it or wait for my blog post on the topic.

So, now the activations in the hidden units can be calculated in the same manner as above, we calculate the weighted sum:

h = X * theta' % X * theta_transpose
% h - 1000 x 25 as X * theta' - (1000 x 785) * (785 x 25) => 1000 x 25

Here, we have 1000 rows, one for each training example; each row is a vector of dimension 25 - each element being the activation of one of the 25 hidden units. Now we do the same for the hidden layer and output layer and get the activations in the output layer. The output layer has 10 units for predicting the numbers from 0-9. So the output matrix will be of dimension (1000 x 10), each row denoting the vector of activations in the output layer for each of the 1000 training examples. Each unit in the output layer is associated with a number between 0-9, so we predict the number to be the one associated with the unit having the highest activation. We'd have the same type of representation and calculation for recommender systems using collaborative filtering, taking the calculation of any 2 layers of a neural network. In fact, these representations and calculations are the way we represent data and do calculations for any ML system.
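
Putting the whole forward pass together, here is a minimal NumPy sketch under the same simplifications (random stand-in weights, bias columns of ones, and no activation functions, as noted in the warning above):

```python
import numpy as np

m = 1000
X = np.random.rand(m, 785)        # inputs with the bias column, as above
theta1 = np.random.rand(25, 785)  # input  -> hidden weights
theta2 = np.random.rand(10, 26)   # hidden -> output weights (25 units + bias)

hidden = X @ theta1.T                          # (1000, 785) @ (785, 25) -> (1000, 25)
hidden = np.hstack([np.ones((m, 1)), hidden])  # add the bias unit -> (1000, 26)
output = hidden @ theta2.T                     # (1000, 26) @ (26, 10) -> (1000, 10)

# Predict the digit whose output unit has the highest activation
predictions = output.argmax(axis=1)            # shape (1000,), values in 0-9
print(predictions[:10])
```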


References and Resources



Hey, I assume you finished reading. I would love to know your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.