Tutorial on using a Convolutional Neural Network to detect and classify images of two different persons.

Mohamed Fawas
11 min read · Jan 14, 2021

Prepared by Mohamed Fawas, Harikrishnan R, and Kevin Varghese.

An introduction to Convolutional Neural Networks (CNNs)

In this article we will try to intuitively understand how a CNN works, and then build a CNN capable of classifying images.

How do we identify objects?

When we look at an image or an object, we usually seem to identify it effortlessly. But is it really that simple?

Credit: Google image and Mr Bean

Have a look at the image below. Feeling dizzy or uneasy?

Oh, don’t worry we all felt that, but why?

The simple answer is that our brain is always analyzing the world around us. When we look at an object, the brain tries to extract patterns and unconsciously learns by labelling the objects in our surroundings. With this image, we struggle to relate it to anything we have learned over our lifetime. The purpose of adding this image was to prove this exact point: humans need features to make sense of a visual input.

And this is what we will try to achieve using a CNN: make the computer identify features/patterns and train it with numerous labelled data points, so that it can make correct predictions.

What is CNN?

A convolutional neural network (CNN) is a specific type of artificial neural network designed for processing structured arrays of data such as images.

To implement a CNN we will now look at the steps involved:

  1. Convolution operation
  2. Rectified Linear unit (ReLU)
  3. Max Pooling
  4. Flattening
  5. Full Connection

Let’s look at each of the steps:

Convolution and convolution operation

Convolution is a mathematical operation on two functions that expresses how the shape of one is modified by the other.
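
Formally, the convolution of two functions f and g is written f * g and is defined by an integral:

(f * g)(t) = ∫ f(τ) g(t - τ) dτ

In a CNN we use the discrete, two-dimensional analogue of this idea: each output value is a sum of element-wise products between a small kernel and the patch of the image under it.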

We have three elements that enter into the convolution operation:

  • Input - We will use images as our input. Images are made up of pixels, and each pixel is represented by a number between 0 and 255 (per colour channel for RGB images).
  • Feature detector (kernel) - The feature detector can be thought of as a filter used to detect the main features of an input. Applying it also reduces the size of the image, making processing faster and easier.
  • Feature map - The feature maps of a CNN capture the result of applying the filters to an input image. In other words, the matrix representation of the input image is multiplied element-wise with the feature detector, and the products are summed, to produce each cell of the feature map.

Explanation of the convolution operation:

  1. We begin by placing the feature detector (a 3x3 matrix) on the input image, starting from the top-left corner.
  2. We multiply the selected portion of the input image element-wise with the feature detector and take the sum of the products; this sum is then inserted into the top-left cell of the feature map.
  3. We then move the feature detector by a stride of one pixel and repeat the process, inserting each sum into the corresponding cell of the feature map.

Note: Although the feature map we end up with has fewer cells and less information than the original input image, the main features of the image that are important for image detection are retained.

We create many feature maps to obtain our first convolution layer.
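
To make the steps above concrete, here is a minimal NumPy sketch of a single convolution pass (the 5x5 input and the 3x3 vertical-edge kernel are made-up examples):

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; each output cell is the sum of
    # element-wise products between the kernel and the patch under it.
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.random.randint(0, 256, size=(5, 5))   # toy 5x5 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                  # vertical-edge detector
print(convolve2d(image, kernel))                 # 3x3 feature map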

Rectified Linear Unit (ReLU)

The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.

Here the rectifier function is used to increase the non-linearity in our images. The reason we want to use ReLU is that images contain many non-linear features, such as transitions between pixels, borders, and colours.
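
In code, ReLU is a one-liner; here is a minimal NumPy version of the definition above:

import numpy as np

def relu(x):
    # Output the input directly if it is positive, otherwise output zero
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 2.5])))  # [0.  0.  2.5]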

Max Pooling

In practical terms, an image can have various spatial orientations or visual effects that could affect the output of the model. To overcome this we use the concept of pooling. There are different types of pooling at our disposal, but here we will use max pooling. Max pooling enables the CNN to detect features in various images irrespective of differences in lighting or in the angle from which the pictures were taken. It works by placing a 2x2 window on the feature map and picking the largest value in that window. The window is moved from left to right (and top to bottom) across the entire feature map, creating a pooled feature map. This preserves the main features while also reducing the size of the image, which helps to reduce overfitting.
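
Here is a minimal NumPy sketch of max pooling as described above, assuming a square feature map and a 2x2 window moved with a stride of 2:

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Slide a size x size window over the feature map, keeping the largest value
    out = (feature_map.shape[0] - size) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [1, 2, 8, 7],
               [3, 4, 5, 6]])
print(max_pool2d(fm))  # [[6. 4.] [4. 8.]]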

Flattening

This step involves transforming the entire pooled feature map matrix into a single column, which is then used as the input for the neural network for processing.
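
For example, flattening a toy 2x2 pooled feature map into a vector of four inputs:

import numpy as np

pooled = np.array([[6., 4.],
                   [4., 8.]])   # a toy 2x2 pooled feature map
flat = pooled.flatten()         # 1-D vector: [6. 4. 4. 8.]
print(flat.shape)               # (4,)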

Full Connection

The flattened feature map is fed to the neural network as the input layer. The fully connected layers are similar to the hidden layers in an Artificial Neural Network (ANN), and the output layer gives the predicted class. The information is passed through the network and the error of the prediction is calculated. The error is then backpropagated through the hidden layers, improving the weights and with them the output values.

The final figures produced by the neural network don't usually add up to one. To address this we use the softmax function, which brings the sum of the final figures down to one so that they represent the probability of each class. Let's look at the mathematical representation of softmax:
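
For K classes with raw scores z_1, ..., z_K, softmax converts each score into a probability:

softmax(z_i) = e^(z_i) / (e^(z_1) + e^(z_2) + ... + e^(z_K))

A minimal NumPy version (subtracting the maximum score first is a standard trick for numerical stability):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities that sum to 1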

Python program to implement the CNN

Here we are going to work on a dataset that contains images of two famous footballers, Lionel Messi and Cristiano Ronaldo. The training data contains 100 images of each of them, and the test data contains 25 images of each.

Lionel Messi (on left) and Cristiano Ronaldo (on right)

Here we have created a custom dataset containing images of Cristiano Ronaldo and Lionel Messi and performed this project on that data. Click here to download the dataset.

First of all we have to install the tensorflow and keras libraries, for example through the Anaconda Navigator. Then we proceed to a Jupyter notebook.

Then we import both the tensorflow and keras libraries into our Jupyter notebook.

import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator

Here the ImageDataGenerator class, imported from the keras library, is used to generate batches of tensor image data with real-time data augmentation. The data will be looped over in batches.

Then, just as a verification, we can check which version of TensorFlow we are using by executing the following code:

tf.__version__

Here we have to preprocess both the training set and the test set of images.

Let's start with the training set, where we apply some transformations to all of the images. We execute the following code:

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

With this code we perform some image augmentation: geometrical transformations such as shearing, plus zooming and horizontal flips. This helps prevent overfitting and helps the model generalize better; otherwise, while training our convolutional neural network, we would get a huge difference between the accuracy on the training set and on the test set.

Here the rescale parameter performs feature scaling: it scales all of our pixels by dividing them by 255. Each pixel takes a value between 0 and 255, so dividing by 255 gives a value between 0 and 1, just like normalization. Feature scaling is compulsory when training neural networks. For shear_range, zoom_range, and horizontal_flip we use commonly chosen values.

Now we have to input our training data into train_datagen. This is done by executing the following code:

training_set = train_datagen.flow_from_directory('machine learning workouts/messironaldo/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')

Here we create a training_set variable and connect it to train_datagen with the help of the flow_from_directory method, which takes the path to a directory and generates batches of augmented data. The first parameter is the path to the training dataset. target_size is the size of the images when they are fed into the convolutional neural network; here we use 64 x 64 pixels, because this makes training faster while still giving good results. batch_size is the number of images you want to include in each batch, for which the default value is 32.

The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm. Batch size controls the accuracy of the estimate of the error gradient when training neural networks. A batch size of 32 means that 32 samples from the training dataset will be used to estimate the error gradient before the model weights are updated.

Then we have to specify the class mode, which is either binary or categorical. Since here we choose between Messi and Ronaldo, it's binary.

Now we have to preprocess the test set of images. Here we won't apply the augmentation transformations, because we have to keep the test images intact, like the originals; we only rescale their pixels. This is done to avoid information leakage from the test set. Information leakage refers to a mistake made by the creator of a machine learning model in which information is accidentally shared between the test and training datasets.

This is done by executing the following code:

test_datagen = ImageDataGenerator(rescale = 1./255)

Now, as we did for the training dataset, we connect our test data to the test_datagen object.

test_set = test_datagen.flow_from_directory('machine learning workouts/messironaldo/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')

So we have completed the data preprocessing steps. Now we move on to building the convolutional neural network. First of all we have to initialize the CNN by executing the following code:

cnn = tf.keras.models.Sequential()

Here the cnn variable represents the convolutional neural network. It's created as an instance of the Sequential class, which allows us to build an artificial neural network as a sequence of layers: we first call tensorflow, then the keras library, then the models module, and finally the Sequential class.

Now we have to add our first convolutional layer.

cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu', input_shape=[64, 64, 3]))

We do this using the add method, with the Conv2D class from the layers module. Here filters is the number of feature detectors you want to apply to your images; 32 is a commonly used value for this parameter.

kernel_size is the size of the feature detector we are using. As long as we haven't reached the output layer, we use 'relu', which represents the rectifier function, as the activation function.

When you add your very first layer, you have to specify the shape of your input. At the beginning we resized the images to 64 x 64 pixels, so here we use the same size as input. Since we are working with coloured images, which have three channels corresponding to the RGB colour code, we give 3 as the last dimension.

Then we have to add our first pooling layer.

cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))

Here we do max pooling, using the MaxPool2D class from the layers module. pool_size is the size of the window we slide over the feature map, and strides is the number of pixels the window is shifted each step.

Now we create the second convolutional layer and pooling layer by repeating the two lines of code above.

cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))

But here we don't use the input_shape parameter, because the input shape is needed only when we add the very first layer.

Now we do flattening by executing the following code.

cnn.add(tf.keras.layers.Flatten())

The flattened layer is the result of all these convolutions and poolings: a 1-D vector that will become the input of a fully connected neural network. Here we don't have to give any parameters.

Now we add a fully connected layer to that flattened layer. This is done by executing the following code.

cnn.add(tf.keras.layers.Dense(units=128, activation='relu'))

The flattened layer now becomes the input of a fully connected neural network. Here we use the Dense class from the layers module of keras. The units parameter represents the number of neurons you want to include in this hidden layer. As long as we haven't reached the output layer, we use 'relu' as the activation function.

Now we add the output layer.

cnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

Here again we make use of the Dense class from the layers module of the keras library. We give 1 as the value of the units parameter because we need only one neuron to represent the binary output, and we use the sigmoid function as the activation function.

Now we train our CNN. For this we first have to compile it, which is done by executing the following code.

cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Here we choose the adam optimizer to perform stochastic gradient descent, updating the weights in order to reduce the loss between the predictions and the targets. We choose binary_crossentropy because we are doing binary classification, and we use the accuracy metric to measure the performance of the classification model.

Now we train the CNN on the training set and evaluate it on the test set by executing the following code.

cnn.fit(x = training_set, validation_data = test_set, epochs = 25)

The fit method trains our CNN on the training set, and we use our test_set as validation data. Here an epoch value of 25 gives us good accuracy.

Now we can make a single prediction using an image. For this we execute the following code.

import numpy as np
from keras.preprocessing import image

test_image = image.load_img('dataset/single_prediction/messi.jpeg', target_size = (64, 64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)

Here we use the numpy module and the image module from keras.preprocessing. We load our image with the load_img method, giving 64 x 64 pixels as the target_size, because the image fed into the predict method must have the same size as the images used during training.

Then we convert the PIL image to an array and add an extra dimension corresponding to the batch, so that the image sits inside a batch. axis is where we want to add that extra dimension; since the batch dimension is always the first one, we give 0 as the value of the axis parameter.

Now we create a result variable which will hold the prediction of the CNN.

result = cnn.predict(test_image)
training_set.class_indices

Now we map the numeric output back to a class label so that we get a readable result; training_set.class_indices shows which index corresponds to which class.

if result[0][0] == 1:
    prediction = 'messi'
else:
    prediction = 'ronaldo'

With that, the mapping step is complete. Now we can print the predicted result.

print(prediction)

Here is the full Python code used in this tutorial, assembled from the snippets above:
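
import numpy as np
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image

# Part 1 - Preprocessing the training and test sets
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
training_set = train_datagen.flow_from_directory('machine learning workouts/messironaldo/training_set',
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')
test_datagen = ImageDataGenerator(rescale = 1./255)
test_set = test_datagen.flow_from_directory('machine learning workouts/messironaldo/test_set',
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')

# Part 2 - Building the CNN
cnn = tf.keras.models.Sequential()
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu', input_shape=[64, 64, 3]))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
cnn.add(tf.keras.layers.Flatten())
cnn.add(tf.keras.layers.Dense(units=128, activation='relu'))
cnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# Part 3 - Training the CNN
cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
cnn.fit(x = training_set, validation_data = test_set, epochs = 25)

# Part 4 - Making a single prediction
test_image = image.load_img('dataset/single_prediction/messi.jpeg', target_size = (64, 64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis = 0)
result = cnn.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
    prediction = 'messi'
else:
    prediction = 'ronaldo'
print(prediction)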
