Deep Learning Tools¶
Machine Learning¶
Differences and similarities between DL and ML. In ML, a model receives data, extracts patterns, and builds a representation that best fits that data. When new data is fed in, the model assigns a class or label to each data point. Learning, then, is storing collections of patterns that are used to make assumptions about new input. DL is a subtype of ML.
Deep¶
The "deep" refers to a deep structure of layers stacked on top of each other, similar to the structure of the brain, where numerous layers of neurons perform steps to identify patterns and categorize the things around us. This deep stack of abstract layers is interconnected in a particular way depending on the data fed into it.
Deep learning also differs from classical ML in that you do not always need to perform feature extraction for the model (which is usually necessary in ML): representation learning can learn directly from raw input.
Another difference between DL and ML is that a large amount of data is a crucial parameter for DL; without massive amounts of training data, a deep model can be less accurate than a simpler ML model.
A DL application may need to process thousands, sometimes millions, of objects before it starts reacting appropriately when such an object appears in the system.
Simultaneous Multiple Computations Along Neural Networks¶
Recognizing objects the way humans do was something computers found difficult until Rosenblatt's perceptron algorithm was developed in the late 1950s. It could solve simple image recognition problems. The Multilayer Perceptron (feedforward network) was later derived from it: input is processed sequentially through multiple layers of neurons, each making decisions and passing meaningful output onward.
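As a hedged illustration (a modern NumPy rewrite, not Rosenblatt's original implementation), a minimal perceptron with the classic learning rule might look like this:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """Classic perceptron learning rule on binary labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            # Weights only change when the prediction is wrong
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# Toy linearly separable data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
```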
GPU¶
The processing power of high-performance graphics processing units is a driving factor that makes DL run fast compared to execution on a CPU. GPUs are essential for deep learning (DL) because they can perform massively parallel operations on matrices, vectors, and scalars.
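A quick way to check whether TensorFlow will actually use a GPU (assuming TensorFlow 2.x is installed):

```python
import tensorflow as tf

# Lists the physical GPU devices TensorFlow can see; an empty list means CPU-only execution
gpus = tf.config.list_physical_devices('GPU')
print("GPUs available:", gpus)
```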
Common Deep Learning Concepts¶
Types of Deep Learning¶
Type | Input Data | Principle | Year of Breakthrough |
---|---|---|---|
Reinforcement Learning | Various | Trains agents to make sequential decisions by interacting with an environment to maximize cumulative rewards. | 1950s |
Recurrent Neural Networks (RNNs) | Sequential data (e.g., text) | Use recurrent layers to maintain state across inputs and time steps. | 1986 |
Convolutional Neural Networks (CNNs) | Grid-like data (e.g., images) | Use convolutional layers to extract features, pooling layers to reduce feature map size, and fully connected layers for classification. | 1989 |
Autoencoders | Various | Learn data representations by compressing and decompressing input data. | 2006 |
Transfer Learning | Various | Uses pre-trained models to solve new tasks by freezing some layers and retraining others. | 2014 |
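As a rough sketch of the CNN row above (layer sizes are illustrative assumptions, not tuned values), a small Keras CNN for 28x28 grayscale images could look like this:

```python
import tensorflow as tf

# Convolutional layer extracts features, pooling shrinks the feature maps,
# and the final dense layers perform the classification.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()
```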
Maths. Linear Algebra.¶
Understanding linear algebra is crucial for DL.
Chollet's book quickly covers the most relevant data structures and data types for TensorFlow (tensors are n-dimensional arrays), here shown as Python (NumPy) objects.
import numpy as np
# scalar
x = 1
# matrices
z = np.array([[1,2,3], [4,5,6], [7,8,9]])
# vectors
y = z[-1] # [7, 8, 9] (last row of z)
tensor = np.zeros((3, 2, 5)) # see below
"""
Level 1:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Level 2:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
Level 3:
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
"""
- The tensor `tensor` is a 3D array with shape `(3, 2, 5)`.
- This means it has 3 levels, each containing a 2x5 matrix (a list of lists).
Jake VanderPlas covers the dimensionality of nested numpy arrays (shapes) and matrix algebra (I assembled it in another article here: <>).
Scalars/Matrix Addition & Multiplication¶
# scalars/matrix addition & multiplication
# Given:
matrix_a = np.array([[1, 2, 3], [4, 5, 6]])
matrix_b = np.array([[7, 8], [9, 10], [11, 12]])
# multiplication
matrix_c = np.dot(matrix_a, matrix_b) # [[58, 64], [139, 154]]
# which written out explicitly would be:
matrix_c2 = [[0, 0], [0, 0]]
# for each row i of A and each column j of B:
#   multiply A[i][k] by B[k][j] for every k (k ranges over the columns of A)
#   and sum the products into C[i][j]
for i in range(matrix_a.shape[0]):
    for j in range(matrix_b.shape[1]):
        matrix_c2[i][j] = sum(
            matrix_a[i][k] * matrix_b[k][j] for k in range(matrix_a.shape[1])
        )
'''
Matrix A (2x3) and Matrix B (3x2)
Result matrix C will have shape (2x2) and vice versa B x A will have shape (3x3)
'''
from IPython.display import Image
Image(r'C:\thisAKcode.github.io\Pelican\content\images\matrix_mult.png', width = 600)
Transpose Matrices¶
# matrix m transpose
m = np.array([[1, 2, 3], [4, 5, 6]])
m_t = np.transpose(m) # [[1, 4], [2, 5], [3, 6]]
Neural Networks Concept Viewed as a Graph¶
- Neurons are like nodes in a graph.
- Weights are like edges connecting the nodes.
- Activation functions are like the nodes' output functions.
- Loss functions are like the cost functions.
- Optimizers are like the algorithms to minimize the cost functions.
Useful Analogy: Comparing TensorFlow Neural Networks to Execution Graphs¶
Aspect | Neural Network | Execution Graph |
---|---|---|
Structure | Layers (input, hidden, output) connected by weights. | Visual representation of computations. |
Nodes | Neurons that perform computations. | Operations (e.g., addition, multiplication). |
Edges | Weights that connect neurons. | Data paths that carry datastructures between operations. |
Data Flow | Data (tensors) flows from the input layer, through hidden layers, to the output layer. | Data flows from one operation to another, following the graph's structure. |
Training | Adjusts the weights to minimize error. | Updates the graph's parameters to optimize performance. |
Example | Input Layer -> Hidden Layer -> Output Layer | Operation A -> Operation B -> Operation C |
Visual Representation | Input -> [Weights] -> Hidden -> [Weights] -> Output | Node A -> [Data] -> Node B -> [Data] -> Node C |
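To make the execution-graph side of the analogy concrete, `tf.function` traces ordinary Python code into a TensorFlow graph (a minimal sketch; the shapes are arbitrary):

```python
import tensorflow as tf

@tf.function  # traces the Python function into a TensorFlow execution graph
def forward(x, w, b):
    # matmul -> add -> relu: a small chain of operations (nodes) connected by tensors (edges)
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((1, 4))
w = tf.random.normal((4, 3))
b = tf.zeros((3,))
print(forward(x, w, b))
```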
Why Are Activation Functions Applied to Tensors?¶
Activation functions are applied to tensors at each layer of a neural network. They bring non-linearity to the model. The most common activation functions used in DL:
- Sigmoid: `tf.keras.activations.sigmoid` (0 to 1), used in the output layer of a binary classification problem.
- Tanh: `tf.keras.activations.tanh` (-1 to 1).
- ReLU: `tf.keras.activations.relu` (0 to infinity), commonly used in the hidden layers of a neural network.
- Leaky ReLU: `tf.keras.layers.LeakyReLU` (like ReLU, but allows a small negative slope instead of zero for negative inputs).
- Softmax: `tf.keras.activations.softmax` (0 to 1, outputs sum to 1), used in the output layer of a multi-class classification problem.
As well as implementations in Python (NumPy):
```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp_x / exp_x.sum()
```
Loss Functions¶
Loss functions measure the error between the predicted value and the actual value; this error is then used to update the weights of the model so that it is minimized. Here are the most common loss functions used in DL:
1. Mean Squared Error: `tf.keras.losses.mean_squared_error`
2. Binary Crossentropy: `tf.keras.losses.binary_crossentropy`
3. Categorical Crossentropy: `tf.keras.losses.categorical_crossentropy`
4. Sparse Categorical Crossentropy: `tf.keras.losses.sparse_categorical_crossentropy`
5. Hinge: `tf.keras.losses.hinge`
Implementing MSE in python (numpy):
```python
def mean_squared_error(y_true, y_pred):
    return np.mean(np.square(y_true - y_pred))
```
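Similarly, a hedged NumPy sketch of binary crossentropy (the clipping with `eps` is my addition, to avoid log(0)):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Keep predictions away from exactly 0 and 1 so the logarithms stay finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```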
Optimizers¶
Optimizers are used to update the weights of the model to minimize the error (minimize the loss function). Here are the most common optimizers used in DL:
- SGD: `tf.keras.optimizers.SGD`
- RMSprop: `tf.keras.optimizers.RMSprop`
- Adagrad: `tf.keras.optimizers.Adagrad`
- Adadelta: `tf.keras.optimizers.Adadelta`
- Adam: `tf.keras.optimizers.Adam`
Implementing the update steps in Python (NumPy):
# Plain (stochastic) gradient descent: step against the gradient
def gradient_descent(weights, learning_rate, gradients):
    return weights - learning_rate * gradients

# Simplified Adam update for step number t
def adam(weights, learning_rate, gradients, m, v, t=1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradients        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * gradients**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                     # bias correction
    v_hat = v / (1 - beta2**t)
    return weights - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
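A tiny usage sketch of the `gradient_descent` step defined above (the quadratic target f(w) = (w - 3)^2 and the values are made up for illustration):

```python
import numpy as np

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = np.array(0.0)
for step in range(100):
    grad = 2 * (w - 3.0)
    w = gradient_descent(w, learning_rate=0.1, gradients=grad)
print(w)  # converges towards 3.0
```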
Regularization¶
Regularization reduces model complexity and prevents overfitting by adding a penalty to the loss function.
Types of Regularization:
- L1 Regularization: Adds a penalty proportional to the absolute value of the weights. `tf.keras.regularizers.L1(l1=0.01)`
- L2 Regularization: Adds a penalty proportional to the square of the weights. `tf.keras.regularizers.L2(l2=0.01)`
- L1 and L2 Regularization: Combines both L1 and L2 penalties. `tf.keras.regularizers.L1L2(l1=0.01, l2=0.01)`
Usage in Layers:
- Apply regularization to layers like Dense, Conv2D.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])
Impact on Training:
- Regularization terms are added to the loss function to control model complexity (see the NumPy sketch after this list).
Hyperparameter Tuning:
- Adjust `l1` and `l2` to balance underfitting and overfitting.
Dropout:
- Randomly ignores neurons during training: `tf.keras.layers.Dropout(rate=0.5)`
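To show how a regularization term enters the loss in practice, here is a minimal NumPy sketch of an L2-penalized MSE (the `l2` factor of 0.01 mirrors the Keras examples above; the function name is my own):

```python
import numpy as np

def l2_regularized_mse(y_true, y_pred, weights, l2=0.01):
    mse = np.mean(np.square(y_true - y_pred))
    penalty = l2 * np.sum(np.square(weights))  # sum of squared weights
    return mse + penalty
```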
Backpropagation¶
Backpropagation updates neural network weights by propagating the gradient of the loss function backwards through the layers (via the chain rule) and then taking a gradient step on each weight.
# The weight update itself is just a gradient step, once the gradients have been computed
def backpropagation(weights, learning_rate, gradients):
    return weights - learning_rate * gradients
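The function above only applies a precomputed gradient; as a hedged sketch of where those gradients come from, here is one backpropagation step for a one-hidden-layer network with a ReLU hidden layer, a linear output, and MSE loss (a simplification of the softmax example in the maths section below; all names are illustrative):

```python
import numpy as np

def backprop_step(x, y_true, W1, b1, W2, b2, lr=0.01):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)                        # ReLU hidden activation
    y_pred = W2 @ h + b2                         # linear output (MSE setting)
    # Backward pass: apply the chain rule from the loss back to each parameter
    d_y = 2 * (y_pred - y_true) / y_true.size    # dL/dy for MSE
    dW2 = np.outer(d_y, h)
    db2 = d_y
    d_h = W2.T @ d_y
    d_z1 = d_h * (z1 > 0)                        # ReLU derivative
    dW1 = np.outer(d_z1, x)
    db1 = d_z1
    # Gradient step on every parameter
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
```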
Mathematics¶
Traversing Neural Network with Mathematical Operations¶
A neural network can be represented mathematically using matrices and vectors.
For example, the output-layer bias vector used further below:
$$ \mathbf{b}_2 = \begin{bmatrix} b_{2,1} \\ b_{2,2} \\ \vdots \\ b_{2,10} \end{bmatrix} $$
Neural Network Tree Representation¶
Forward propagation in a neural network can be represented as a tree-like structure with three layers: input, hidden, and output. Below is the graph of a simple neural network with one hidden layer; the mathematical operations for each node and edge are given in the following sections.
A1  A2  A3      Input layer
 \\  |  //
  \ |X| /
  B1    B2      Hidden layer
    \   /
     C1         Output layer
Pseudocode¶
1. Define the input layer.
2. Define the hidden layer with a specified number of neurons and an activation function.
3. Define the output layer with the number of neurons corresponding to the number of classes and an activation function.
4. Compile the model with a loss function and an optimizer.
5. Train the model on the dataset.
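A hedged Keras translation of this pseudocode (the 784/128/10 sizes match the maths section below; `x_train` and `y_train` are placeholders for an actual dataset):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),              # input layer
    tf.keras.layers.Dense(128, activation='relu'),    # hidden layer
    tf.keras.layers.Dense(10, activation='softmax'),  # output layer: one neuron per class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5)  # x_train / y_train are placeholders for your dataset
```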
Mathematical Representation using Matrices¶
Input Layer:
- Neurons: Nodes that receive the input data. $$ \mathbf{x} \in \mathbb{R}^{784} $$
Hidden Layer:
- Weights: $ \mathbf{W}_1 \in \mathbb{R}^{784 \times 128} $ $$ \mathbf{W}_1 = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1,128} \\ w_{21} & w_{22} & \cdots & w_{2,128} \\ \vdots & \vdots & \ddots & \vdots \\ w_{784,1} & w_{784,2} & \cdots & w_{784,128} \end{bmatrix} $$
- Biases: $$ \mathbf{b}_1 = \begin{bmatrix} b_{11} \\ b_{12} \\ \vdots \\ b_{128} \end{bmatrix} $$
- Activation: $$ \mathbf{h} = \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) $$
Output Layer: $$ \mathbf{y} = \text{softmax}(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2) $$
Explanation Beyond the Maths¶
- Input Layer: Takes an input vector $ \ \mathbf{x} \ $ of size 784.
- Hidden Layer: Applies a linear transformation followed by a ReLU activation function.
- Output Layer: Applies another linear transformation followed by a softmax activation function to produce a probability distribution over the 10 classes.
Neural Network Layers¶
Layer | Pseudocode | Mathematical Representation | Explanation Beyond the Maths |
---|---|---|---|
Input Layer | `x = np.random.randn(input_size)` | $$ \mathbf{x} \in \mathbb{R}^{784} $$ | Takes an input vector $ \mathbf{x} $ of size 784, representing a 28x28 pixel image. |
Hidden Layer | `h = relu(np.dot(x, weight1) + bias1)` | $$ \mathbf{h} = \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) $$ | Applies a linear transformation followed by a ReLU activation function. |
Output Layer | `y = softmax(np.dot(h, weight2) + bias2)` | $$ \mathbf{y} = \text{softmax}(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2) $$ | Applies another linear transformation followed by a softmax activation function. |
Mathematical Representation Details¶
Component | Pseudocode | Mathematical Representation | Explanation Beyond the Maths |
---|---|---|---|
Weights (Hidden Layer) | `weight1 = np.random.randn(input_size, hidden_layer_size)` | $$ \mathbf{W}_1 \in \mathbb{R}^{784 \times 128} $$ | Matrix of weights connecting the input layer to the hidden layer. |
Biases (Hidden Layer) | `bias1 = np.random.randn(hidden_layer_size)` | $$ \mathbf{b}_1 \in \mathbb{R}^{128} $$ | Bias vector added to the hidden layer. |
Weights (Output Layer) | `weight2 = np.random.randn(hidden_layer_size, output_size)` | $$ \mathbf{W}_2 \in \mathbb{R}^{128 \times 10} $$ | Matrix of weights connecting the hidden layer to the output layer. |
Biases (Output Layer) | `bias2 = np.random.randn(output_size)` | $$ \mathbf{b}_2 \in \mathbb{R}^{10} $$ | Bias vector added to the output layer. |
Example¶
For a digit recognition problem, the input layer receives a 784-dimensional vector representing a 28x28 pixel image. The hidden layer applies a linear transformation followed by a ReLU activation function. The output layer applies another linear transformation followed by a softmax activation function to produce a probability distribution over the 10 classes (digits 0-9).
Training a Neural Network¶
Training involves repeatedly feeding input data through the network (the forward pass), measuring the error with a loss function, and updating the weights through backpropagation. The code below implements the forward pass.
import numpy as np
# Define the input layer
input_size = 784
hidden_layer_size = 128
output_size = 10
# Initialize weights and biases
weight1 = np.random.randn(input_size, hidden_layer_size) # Weights for input to hidden layer
bias1 = np.random.randn(hidden_layer_size) # Biases for hidden layer
weight2 = np.random.randn(hidden_layer_size, output_size) # Weights for hidden to output layer
bias2 = np.random.randn(output_size) # Biases for output layer
# Forward pass
def relu(x):
return np.maximum(0, x)
def softmax(x):
exp_x = np.exp(x - np.max(x))
return exp_x / exp_x.sum(axis=0)
# Input vector
x = np.random.randn(input_size)
# Hidden layer computation
h = relu(np.dot(x, weight1) + bias1)
# Output layer computation
y = softmax(np.dot(h, weight2) + bias2)
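To connect this forward pass to training, one could compare `y` against a one-hot label with categorical crossentropy (a hedged continuation of the block above; the "true" class index 3 is an arbitrary choice):

```python
# One-hot encode an arbitrary "true" class (digit 3) and measure the loss
y_true = np.zeros(output_size)
y_true[3] = 1.0
loss = -np.sum(y_true * np.log(y + 1e-9))  # categorical crossentropy
print("loss:", loss)
# A training loop would backpropagate this loss to update weight1, bias1, weight2, bias2.
```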
Conclusion¶
Deep Learning is a subset of Machine Learning that uses neural networks to model complex patterns in data. It involves activation functions, loss functions, optimizers, regularization, backpropagation, and training. Under the hood, it is mostly linear algebra translated into Python code.