Machine Learning Concepts
Chapter 2
TensorFlow.js enables traditional web developers to build and run machine learning solutions using JavaScript in the web browser. Even if you have never created a machine learning solution before, this chapter introduces the concepts that will enable you to use TensorFlow.js to build machine and deep learning solutions for the browser.
Machine learning and deep learning form the bulk of the topics that comprise artificial intelligence. The illustration in Figure 2-1a is nothing new, but it is a good segue into the conversation.
If you are confused by the various terms you come across when reading about artificial intelligence (AI), refer to Figure 2-1b. I displayed 'Robotics' and 'Expert Systems' in italics because they fall outside the scope of our discussion.
Given a dataset for training a machine learning model, the data scientist does not arbitrarily decide which algorithm to use to solve the problem at hand. This is analogous to a programmer picking a sorting algorithm to sort an array of numbers. She (the programmer) does not blindly choose the sorting mechanism; instead, she looks at the size of the data first and, weighing complexity and efficiency, picks an algorithm that orders the data in the array. Similarly, the data scientist picks an algorithm based on its efficiency and accuracy (we will look at how to calculate the efficiency of a machine learning algorithm later in the chapter).
Deep learning is simply a subset of machine learning, and unless explicitly stated otherwise, the term Machine Learning encompasses Deep Learning, as shown in Figure 2-1. For the same reason, this book is titled 'Machine Learning with TensorFlow.js' (and not 'Deep Learning with TensorFlow.js'), even though the TensorFlow.js library uses deep learning to develop solutions.
The following sub-sections cover the building blocks of a machine/deep learning solution. These building blocks are captured in the end-to-end solution involving learning from data as shown in Figure 2-2.
In Figure 2-2, you start with a Problem that depends on the data you have. Sometimes it is the available data that dictates the problem; in other cases, the problem at hand dictates the data to be collected. The data is collected from public or private sources, or both.
Once all the data is collected, it is explored, cleansed, and wrangled to produce a Dataset. While a discussion of Exploratory Data Analysis, or EDA (data exploration, cleansing, and wrangling), is beyond the scope of this chapter and indeed this book, both client-side (math.js) and server-side (Pandas, NumPy, SciPy, etc.) libraries exist for this task. Figure 2-3 shows the various sources from which data is collected to train a machine learning algorithm.
Although not included in Figure 2-2, a Feature is an attribute or field of data that can be passed through a machine learning algorithm to generate an outcome. Features are not synonymous with columns, since not all columns need to be submitted to an ML algorithm (e.g., Last Name, Gender, etc.). Features typically comprise atomic values that can be classified as either continuous or categorical in nature.
Continuous values are attributes or features made up of uninterrupted numeric quantities, such as Price, Weight, Age, etc.
Categorical values are attributes or features that contain categories or classes applying to the rows, records, or observations, such as Account Type, Gender, City, etc.
The concept of continuous and categorical values is captured in Figure 2-4 below.
This can be further demonstrated using some example data shown in Figure 2-5.
In the data presented in Figure 2-5 above, _id, latitude_reg, longitude_reg, elevation, and score comprise Continuous values, whereas type, iso_region, municipality, and scheduled_service are Categorical values or features.
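As a quick, hedged illustration of turning a categorical feature into numbers (ahead of the fuller EDA discussion later in the chapter), the sketch below one-hot encodes a hypothetical type column using TensorFlow.js; the category names and index mapping are assumptions for the example, not values from Figure 2-5:

```javascript
import * as tf from '@tensorflow/tfjs';

// Hypothetical categories for the 'type' feature.
const types = ['small_airport', 'heliport', 'closed'];

// Map each category to an integer index, then one-hot encode it.
const indices = tf.tensor1d([0, 2, 1, 0], 'int32'); // four sample rows
const encoded = tf.oneHot(indices, types.length);

encoded.print();
// [[1, 0, 0],  -> small_airport
//  [0, 0, 1],  -> closed
//  [0, 1, 0],  -> heliport
//  [1, 0, 0]]  -> small_airport
```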
To come up with an optimal solution, data scientists and/or machine learning engineers apply the dataset to different ML algorithms and compare the outcomes; this process is known as a machine learning Experiment. The outcome produced by the most accurate and efficient algorithm is chosen from the host of outcomes, and that algorithm is the one used in the final ML model.
Note A machine learning experiment involves splitting the data into two sets, training and test. The test set is what determines the accuracy of an algorithm.
The Outcome of a machine learning experiment is the output with the highest accuracy and efficiency, selected from the results of the different algorithms that were chosen based on the training data and the value to be predicted from that data.
Note The data required to produce an outcome can come in the following formats, of which only the first, i.e., Tabular, is an example of structured data; the others are examples of unstructured data:
Tabular (Non-sequential): Structured data organized into rows and columns.
Text (Sequential): Unstructured data for natural language processing (NLP) tasks.
Sound (Sequential): Audio data for classification, speech synthesis, voice recognition, etc.
Video (Sequential): Data to analyze video in real-time (streaming) or static formats.
Image (Non-sequential): Visual data for object detection or computer vision tasks.
Numeric (Sequential and Non-sequential): Data that includes decimal and non-decimal values.
Timeseries (Sequential): Numeric data that must appear in a specific sequence to make sense.
Tip We will primarily focus on numeric values in our discussion. However, for a machine learning algorithm to act on the information provided to it, the data must first be converted to a numeric value (either continuous or categorical). The conversion from its original form is part of Exploratory Data Analysis (EDA), which also involves data cleaning and munging to bring together data from multiple sources and producers, as shown in Figure 2-6a.
If you have dabbled in machine learning before, you may already have some idea of the types of algorithms that exist for the data problems encountered by an enterprise. If you know absolutely nothing about machine learning, Figures 2-7a and 2-7b are good places to start. You can learn more about these problem types and algorithms in the following sections.
Note The data to train a machine learning model can be categorized as follows:
Labeled: Refers to data that has been manually (by a human operator) flagged or annotated, e.g., flagging each customer record of a bank to denote whether a loan would be paid off.
Unlabeled: Mainly used for the analysis of data or information. Though the data is unlabeled, machine learning algorithms can process it and separate it into multiple clusters or groups of information depending on its properties, e.g., customer segmentation.
Partial: Only some of the data in the dataset is labeled; the rest is not. Algorithms that train on partially labeled data make it possible to use data that is easy to collect but expensive to label in full.
Which algorithm to train depends on the type of problem presented to the organization. These machine learning algorithms are covered in more detail in the next section, but the problems they are designed to address fall into the following broad classes:
Whenever a predicted outcome can be defined by a categorical variable, the problem can be described as a Classification problem. In other words, the data scientist or machine learning engineer first categorizes or classifies the training data manually as belonging to a particular category, e.g., cat or not cat; cat, dog, or neither; etc. Categorization or classification is of two types:
Binary Classification: Pick one from two classes or categories, e.g. iPhone or Not iPhone, Aerodynamic or Not Aerodynamic, etc.
Multi-class Classification: Select one from many classes or categories, e.g. emotion classification, identification of animals in the Serengeti ecosystem, etc.
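To make this concrete, here is a minimal, hedged TensorFlow.js sketch of a binary classifier; the feature count, placeholder data, and hyperparameters are my own assumptions for illustration, not values from the book:

```javascript
import * as tf from '@tensorflow/tfjs';

async function run() {
  // Four placeholder features in, one sigmoid probability out.
  const model = tf.sequential();
  model.add(tf.layers.dense({ inputShape: [4], units: 8, activation: 'relu' }));
  model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }));

  model.compile({
    optimizer: tf.train.adam(),
    loss: 'binaryCrossentropy',
    metrics: ['accuracy'],
  });

  // Tiny placeholder dataset: label 1 whenever the second feature is set.
  const xs = tf.tensor2d([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0]]);
  const ys = tf.tensor2d([[1], [0], [1], [0]]);

  await model.fit(xs, ys, { epochs: 20 });
  model.predict(tf.tensor2d([[0, 1, 0, 1]])).print(); // probability of class 1
}

run();
```

For a multi-class problem, the last layer would typically use one unit per class with a softmax activation and a categoricalCrossentropy loss; for the regression problems discussed next, a single linear-output unit with a meanSquaredError loss is the usual choice.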
Regression problems in machine learning are defined by a numeric (continuous-variable) predicted outcome. Examples include predicting a stock price based on historic data, or a sports team's final score based on playing history.
A Clustering problem in machine learning pertains to an unsupervised learning scenario characterized by predicting the category or group an outcome will fall into when the categories are not known in advance. This is in sharp contrast to a classification problem, where the data scientist knows the number and nature of the categories before running the machine learning model. Clustering is typically used to identify anomalies in data; examples include customer or product segmentation, city planning, natural disaster studies, etc.
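TensorFlow.js does not ship a clustering algorithm out of the box, so the following plain-JavaScript k-means sketch is my own minimal illustration of the idea (one-dimensional points, naive seeding, fixed iteration count):

```javascript
// A minimal k-means sketch: group 1-D points into k clusters.
function kMeans(points, k, iterations = 10) {
  // Start with the first k points as initial centroids (naive seeding).
  let centroids = points.slice(0, k);
  let labels = [];
  for (let iter = 0; iter < iterations; iter++) {
    // Assignment step: attach each point to its nearest centroid.
    labels = points.map(p =>
      centroids.reduce((best, c, i) =>
        Math.abs(p - c) < Math.abs(p - centroids[best]) ? i : best, 0));
    // Update step: move each centroid to the mean of its assigned points.
    centroids = centroids.map((c, i) => {
      const members = points.filter((_, j) => labels[j] === i);
      return members.length
        ? members.reduce((a, b) => a + b, 0) / members.length
        : c;
    });
  }
  return { centroids, labels };
}

// Two obvious groups; k-means should find centroids near 2 and 10.
console.log(kMeans([1, 2, 3, 9, 10, 11], 2));
```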
Embedding problems and algorithms in machine learning are mainly used for creating numeric variables from existing text values. Since machine learning algorithms work better with numerical values than with text, embedding allows textual columns or features to be converted into continuous or categorical values. Embedding or conversion in machine learning can also refer to dimensionality reduction, i.e., reducing or eliminating features in a dataset that are not needed.
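As a hedged sketch of the idea, TensorFlow.js provides an embedding layer that maps integer word indices to dense numeric vectors; the vocabulary size, vector length, and indices below are arbitrary placeholders:

```javascript
import * as tf from '@tensorflow/tfjs';

// Map a vocabulary of up to 1,000 word indices to 8-dimensional vectors.
const embedding = tf.layers.embedding({ inputDim: 1000, outputDim: 8 });

// Three word indices in, three 8-dimensional numeric vectors out.
const wordIndices = tf.tensor2d([[12, 47, 301]], [1, 3]);
const vectors = embedding.apply(wordIndices);
console.log(vectors.shape); // [1, 3, 8]
```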
When machine learning is used to generate an output, i.e., images, audio/video files, text, etc., such problems and algorithms are said to be Generative. This is in contrast with discriminative modeling (the opposite of generative), where the desired outcome is to discriminate or classify using a classification model. Generative models produce outcomes that did not exist before the algorithm was run; for example, they can be used to create new human faces based on facial data, or write new poetry based on past verses.
Simulation, or Reinforcement Learning, is characterized by an agent that performs an action and gets a reward as a result of that action. The reward signals to the agent whether or not the action was successful.
Note Although not mentioned in Figure 2-7b, there are two problem types, i.e., 'Simulation (Reinforcement Learning)' and 'Recognition (Detection)', that use supervised/semi-supervised learning and sequential data. Recognition is used to create recurrent neural networks (RNNs) for speech recognition, machine translation, classification, regression, and generative tasks. Both problem types are captured in Figures 2-8a and 2-8b.
The problem types discussed above are illustrated in Figure 2-8a below:
I took the liberty of combining the data types (mentioned earlier in Figure 2-6b) with each problem type in the above diagram.
Tip Regardless of the machine learning problem you are trying to solve, you have to train an algorithm with the available data to create a prediction model. Training an ML model involves splitting the data into two parts, training and test (the training set is larger than the test set); the algorithm or model is first trained with the training data and then tested for prediction accuracy with the test data. Refer to Figure 6-4 for a visual illustration.
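A minimal sketch of such a split in plain JavaScript, assuming an 80/20 ratio (the ratio and the shuffle step are conventional choices for the example, not mandates from the book):

```javascript
// Shuffle a copy of the rows (Fisher-Yates), then split them 80/20.
function trainTestSplit(rows, trainFraction = 0.8) {
  const shuffled = [...rows];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}

const { train, test } = trainTestSplit([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
console.log(train.length, test.length); // 8 2
```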
Before we can embark on our TensorFlow.js journey, let us first get a feel for the machine learning algorithms out there. This is, by no stretch of the imagination, an exhaustive list of algorithms. The intention here is to help you understand the categories of machine learning problems you will be faced with, and the types of algorithmic solutions you will be required to come up with.
Note Most of the existing literature on machine learning classifies the algorithms as supervised, unsupervised, or semi-supervised. While it is perfectly fine to categorize ML algorithms this way, machine learning problems are rarely simple enough to select an algorithm based solely on the available dataset, and significant effort is required to wrangle the data to fit the machine learning algorithm.
The machine learning algorithm types in Figure 2-7b and the problem types from Figure 2-8a can be combined to form the illustration in Figure 2-8b.
The above diagram can also be accessed on the book's GitHub page and is referenced again in Chapter 6. It is not, however, a list of all machine learning algorithms. Also note that the algorithm names marked with an asterisk (*) are the ones modeled using deep learning.
Note To view some problems or use-cases pertaining to machine learning, refer to Figure 6-4a, a diagram similar to the ones above in Figures 2-8a and 2-8b.
Deep learning is a subset of machine learning (see Figure 2-1) in that it uses artificial neural networks (ANN) to make sense of the data provided to it. Deep learning is composed of the following components:
Neural networks comprise neurons that transform data to solve a problem. Artificial neural networks (ANNs) borrow their terminology from the workings of the human brain, and that is the extent of their link with the genetic wiring inside the human body. Before continuing the discussion on neural networks, take a look at the illustrations below:
The illustrations in the above three figures are all neural networks, inspired by the human brain.
Each node in the network is called a Neuron, characterized here by a square, though it is also denoted by a circle in various neural network diagrams online and offline.
Each neural network has only one input layer and only one output layer.
The total number of layers does not include the input layer; therefore, the input layer is known as Layer 0 in the above diagrams.
Each input in the input layer is called a Feature, whereas each output in the output layer represents a single yield or output of the problem, e.g. the four neurons in the input layer correspond to four features, and the two outputs refer to two yields produced by the model (Figure 2-9c).
A neural network can have zero (Figure 2-9a), one (Figure 2-9b), or N (Figure 2-9c) hidden layers.
If a neural network has more than one hidden layer, the model is said to be a Deep Neural Network that uses Deep Learning to produce the output (Figure 2-9c).
If each neuron in a layer of the neural network is connected to every neuron in the previous layer, the current layer is said to be a Dense layer.
Every neuron connected to the next higher layer has a particular non-zero weight w associated with a connection.
The value a neuron contributes is equal to its value x times the weight w of its link to the next higher layer.
Not all neurons in a layer are active.
Given the three neurons in a layer in Figure 2-10, i.e., x1, x2, and x3, the output is the sum (also known as the weighted sum) of the value of each neuron in that layer multiplied by its corresponding weight, i.e., w1, w2, and w3 (also refer to Listing 2-1).
Each layer in a neural network model includes a constant value known as a Bias, which acts as the intercept in a linear equation model.
Each Bias value is associated with exactly one layer in a neural network model.
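Putting the weighted sum and the bias together, here is a minimal plain-JavaScript sketch (the variable names and values are mine, separate from the book's Listing 2-1):

```javascript
// Weighted sum of a layer's inputs, plus the layer's bias term.
function weightedSum(xs, ws, bias) {
  let sum = bias;
  for (let i = 0; i < xs.length; i++) {
    sum += xs[i] * ws[i];
  }
  return sum;
}

// Three neurons x1..x3 with weights w1..w3, as in Figures 2-10 and 2-11.
console.log(weightedSum([0.5, 0.1, 0.9], [0.4, 0.7, 0.2], 0.3)); // 0.75
```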
After adding the Bias value, Figure 2-10 can be redrawn as shown below in Figure 2-11.
Not all layers in a neural network model are counted; as noted earlier, the input layer is excluded from the count.
The Activation Function determines whether the neurons in a layer fire or not.
Activation Functions can be categorized as follows and are illustrated in Figure 2-12a:
Binary Step Function
Linear Activation Function
Non-linear Activation Function
Sigmoid
TanH (Hyperbolic Tangent)
ReLU (Rectified Linear Unit)
Leaky ReLU
Parametric ReLU
Softmax
Swish
Activation Functions exist in every layer of a neural network, but it is the activation function that produces the output of the neural network (the last-layer activation) that determines the prediction result produced by the model.
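As a brief illustration, TensorFlow.js exposes several of these activations as tensor operations; the input values below are arbitrary:

```javascript
import * as tf from '@tensorflow/tfjs';

const z = tf.tensor1d([-2, -0.5, 0, 0.5, 2]);

tf.relu(z).print();    // negatives clamped to 0: [0, 0, 0, 0.5, 2]
tf.sigmoid(z).print(); // squashed into (0, 1), useful for binary outputs
tf.softmax(z).print(); // a probability distribution summing to 1
```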
After adding the Activation Function, Figure 2-11 can be redrawn as illustrated in Figure 2-13.
Note In case you are wondering why Figure 2-13 follows Figure 2-12a, be advised that the letters in diagram numbers were used to group similar illustrations together, and Figures 2-12b, 2-12c, and 2-12d are included in this chapter.
Learning Rate is a scalar value that tells a model how quickly or slowly it learns.
A small learning rate means that the model takes a long time to learn, and a large learning rate implies that the model learns very quickly.
The trick is to pick a learning rate that is neither too high nor too low.
A learning rate is specified for a model at creation time, when an optimizer (explained in the next section) is defined. Note that the learning rate is optional for some optimizers and mandatory for others.
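For example, in TensorFlow.js the learning rate is passed to the optimizer factory; the 0.01 below is an arbitrary placeholder, not a recommendation:

```javascript
import * as tf from '@tensorflow/tfjs';

// SGD requires an explicit learning rate...
const sgd = tf.train.sgd(0.01);

// ...whereas Adam falls back to a default (0.001) if none is given.
const adam = tf.train.adam();
```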
Deep learning is an iterative process and each training iteration is known as an epoch.
In other words, an Epoch is one iteration where the model learns from the entire training dataset.
An Optimizer is an algorithm that is used to update the weights of a neural network model.
The optimizer is based on the loss function (explained in the next section).
The various optimizers are listed in Figure 2-12b below:
Before going into the details of the loss function, refer to Figure 2-14.
The Objective Function, or criterion, has to be either minimized or maximized (see Figure 2-14).
When it needs to be minimized, the objective function is also called a loss function or a cost function, and the value of the loss function is known as a loss.
Conversely, when it needs to be maximized, the objective function or criterion is called a reward function.
Figure 2-12c lists the loss functions used by a neural network model.
Metrics specify how a neural network model will be evaluated.
The choice of a metric depends on the type of machine learning problem being addressed.
A list of metrics is shown in Figure 2-12d.
Note Do not be troubled if you do not yet understand some of the components of deep learning, or why a certain building block is needed. All concepts pertaining to TensorFlow.js and machine learning in the web browser will be explained in due time.
The components of deep learning and neural networks explained in the previous sections culminate in the following equation for processing each layer in a neural network model:
Y = σ(∑(xi · wi) + b)
where:
x = Neuron Value
w = Weight
b = Bias
σ = Activation Function
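A hedged TensorFlow.js rendering of this per-layer equation follows; it is my own sketch (sigmoid is chosen arbitrarily as σ, and the values match the earlier weighted-sum example), separate from the book's pseudo-code listings:

```javascript
import * as tf from '@tensorflow/tfjs';

// One layer's forward pass: Y = sigma(sum(x_i * w_i) + b).
const x = tf.tensor1d([0.5, 0.1, 0.9]); // neuron values
const w = tf.tensor1d([0.4, 0.7, 0.2]); // weights
const b = tf.scalar(0.3);               // the layer's bias

const y = tf.sigmoid(x.mul(w).sum().add(b));
y.print(); // ~0.679, i.e., sigmoid(0.75)
```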
Listings 2-1a and 2-1b present the above equation in pseudo code.
Lastly, the pseudo-code in Listing 2-1c makes use of the code in the above two listings to create a neural network model.
Tip The theoretical aspects using TensorFlow.js are discussed in Chapter 4 in the section titled ‘TensorFlow.js Syntax’. Also, refer to the JavaScript code snippets in Listings 4-4a, 4-4b, and 4-5 to learn the syntax used by the TensorFlow.js library.
Once a neural network has been created, we can compile it, train it, and predict results, as shown in the pseudo-code in Listing 2-2 below.
A developer may choose to evaluate the model using test data. The pseudo-code in Listing 2-3 shows how a machine learning model is evaluated. While evaluation is not mandatory for predicting an outcome, it is highly recommended. Note that the compile step is necessary for evaluation, since it validates the model before evaluation and/or prediction.
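For reference, here is a hedged TensorFlow.js equivalent of this compile-train-evaluate-predict sequence; it is a minimal sketch with placeholder data approximating y = x1 + x2, not the book's Listings 2-2 and 2-3:

```javascript
import * as tf from '@tensorflow/tfjs';

async function run() {
  // A single-layer model: one continuous output from two features.
  const model = tf.sequential();
  model.add(tf.layers.dense({ inputShape: [2], units: 1 }));

  // Compile: wire up the optimizer and loss.
  model.compile({ optimizer: tf.train.sgd(0.1), loss: 'meanSquaredError' });

  // Train on placeholder data approximating y = x1 + x2.
  const trainXs = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
  const trainYs = tf.tensor2d([[0], [1], [1], [2]]);
  await model.fit(trainXs, trainYs, { epochs: 100 });

  // Evaluate on held-out test data (optional but recommended).
  const testXs = tf.tensor2d([[2, 1]]);
  const testYs = tf.tensor2d([[3]]);
  model.evaluate(testXs, testYs).print(); // test loss

  // Predict an outcome for new, unseen data.
  model.predict(tf.tensor2d([[3, 2]])).print(); // expect roughly 5
}

run();
```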
The pseudo-code in each of the above listings is illustrated visually in the flowcharts below.
Note Just to reiterate, many of the above topics are revisited later in the book. It is sufficient to understand these concepts only partially at this stage, as long as you understand the components that make up a deep learning solution.
Note See a larger version of the above diagram below (divided into 3 parts).
Note The flowcharts are meant for understanding the sequence of steps in the algorithm, so there is not necessarily a one-to-one relationship between the pseudo-code and the above flowchart diagrams. I followed a programming-like naming convention for the function calls in the flowcharts to make them easier to understand. You are perfectly fine using plain words to define the function calls, as long as you follow the correct sequence of statements in your flowchart diagrams.
For any neural network, for example the ones shown in Figures 2-9b and 2-9c, calculations are done from left to right, starting from the inputs, through the hidden layers in sequence, and on to the outputs. This is known as forward propagation, as shown in Figure 2-18a below.
However, training a neural network involves multiple iterations and includes updating the weights in the hidden layer(s) once it is realized that the current weights produce an error (do not produce the desired result or outcome). To update the weights, a reverse approach called backpropagation is taken, where the starting point is the output layer and the process works back up to the input layer, as shown in Figure 2-18b.
Note The above over-simplifies forward- and backward-propagation. The intent here is to just introduce you to the terms. If you want to learn more in-depth, I would recommend online resources.
The whole point of training a neural network with available data is to make predictions on completely new data (preferably from the same source), and the accuracy of those predictions determines how effective a machine learning model is. Sometimes, though, a model is too accurate in predicting the outcomes of the training data (but not the test and new data) because the data scientist or machine learning engineer has made the model fit the available data too well.
The concept of overfitting is shown in Figure 2-19a, where the training data (depicted using solid-black circles) exactly coincides with the created model (shown using a solid-black line). However, the test data, and any new data that the created model has not seen before, fall completely outside the model.
The opposite of overfitting in machine learning is underfitting, shown in Figure 2-19b, where the model does not even generalize to the training data used to create it. An underfit model will not fit new data either, leading to model inaccuracy and poor performance metrics.
To strike a balance between overfitting and underfitting, or simply to reduce overfitting in the model, the following approaches may be used:
Increase the amount of training data without decreasing the volume of test data by gathering more data values preferably from the same source(s).
Reduce the complexity of the network by decreasing the number of units or neurons, condensing the weight values, and/or reducing the number of layers in the model.
To prevent overfitting in a machine learning model, dropout is used. Dropout reduces a model's complexity by switching off certain units or neurons during training; it can apply to either the visible (input) layer or any of the hidden layers, but not the output layer. The 'switching off' of neurons is also called 'ignoring', and the neurons to be ignored are selected at random. The neural network model from Figure 2-9b is shown in Figure 2-20 with dropout, where the third neuron in the input layer and the second and fifth neurons in the hidden layer are ignored to limit complexity during one epoch of training; the same dropout will not necessarily be applied in other iterations or epochs.
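In TensorFlow.js, dropout is available as a layer; a minimal hedged sketch follows (the 0.2 rate, i.e., ignoring roughly 20% of the preceding layer's neurons during training, is a placeholder value):

```javascript
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
model.add(tf.layers.dense({ inputShape: [4], units: 16, activation: 'relu' }));
// Randomly switch off ~20% of the previous layer's neurons while training.
model.add(tf.layers.dropout({ rate: 0.2 }));
model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }));
```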
Before I end this chapter, I would like to summarize the components that make up a neural network model in the following table:
| Component | Mandatory or Optional | Defined For |
| --- | --- | --- |
| Input | Mandatory | Neural Network |
| Output | Mandatory | Neural Network |
| Layer | Mandatory | Neural Network |
| Neuron | Mandatory | Layer |
| Weight | Mandatory | Neuron |
| Bias | Mandatory | Layer |
| Activation Function | Mandatory | Layer |
| Epoch | Mandatory | Neural Network |
| Optimizer | Mandatory | Neural Network |
| Learning Rate | Optional | Neural Network |
| Loss | Optional | Neural Network |
| Metrics | Optional | Neural Network |
The chapter looks into the underlying topics of machine learning, deep learning, and neural networks, and covers the following:
Building blocks for deep learning solutions
Problem types and algorithms for machine learning
Components of a neural network model