
How to do it...
Now that we understand how the learning rate influences output values, let's see its impact in action on the MNIST dataset we saw earlier. We keep the same model architecture and change only the learning rate parameter.
Note that we will be using the same data-preprocessing steps as those of step 1 and step 2 in the Scaling input dataset recipe.
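For reference, a minimal sketch of the preprocessing we assume here (loading MNIST, flattening each 28 x 28 image into a 784-dimensional vector, scaling pixel values to the [0, 1] range, and one-hot encoding the labels) follows; the exact code lives in that recipe:
from keras.datasets import mnist
from keras.utils import np_utils

# Load MNIST and flatten each 28 x 28 image into a 784-dimensional vector
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 784).astype('float32')

# Scale pixel values to the [0, 1] range
X_train = X_train / 255.
X_test = X_test / 255.

# One-hot encode the labels into 10 classes
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)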
Once the dataset is preprocessed, we vary the learning rate of the model by specifying it in the optimizer, as shown in the next step:
- We change the learning rate as follows:
from keras import optimizers
adam = optimizers.Adam(lr=0.01)
With the preceding code, we have initialized the Adam optimizer with a specified learning rate of 0.01.
- We build, compile, and fit the model as follows:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1000, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=500, batch_size=1024, verbose=1)
The accuracy of the preceding network is ~90% at the end of 500 epochs. Let's have a look at how the loss and accuracy vary across epochs (the code to generate the plots in the following diagram remains the same as the code we used in step 8 of the Training a vanilla neural network recipe):
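As a reminder, a minimal sketch of that plotting code, built on the history object returned by fit, looks like the following; note that, depending on the Keras version, the accuracy keys in history.history may be 'acc'/'val_acc' or 'accuracy'/'val_accuracy':
import matplotlib.pyplot as plt

history_dict = history.history
epochs = range(1, len(history_dict['loss']) + 1)

# Training and validation loss over epochs
plt.subplot(211)
plt.plot(epochs, history_dict['loss'], 'b', label='Training loss')
plt.plot(epochs, history_dict['val_loss'], 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

# Training and validation accuracy over epochs ('acc' keys assume Keras 2.x)
plt.subplot(212)
plt.plot(epochs, history_dict['acc'], 'b', label='Training accuracy')
plt.plot(epochs, history_dict['val_acc'], 'r', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()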

Note that with the higher learning rate (0.01 here, compared to 0.0001 in the scenario considered in the Scaling input dataset recipe), the loss decreased less smoothly than it did for the low-learning-rate model.
The low-learning-rate model updates the weights in small steps, resulting in a smoothly decreasing loss, as well as a high accuracy that is reached gradually, over a larger number of epochs.
In contrast, the step changes in loss values at the higher learning rate occur because the loss gets stuck in a local minimum until the weight values jump to better values. A lower learning rate gives a better chance of arriving at the optimal weight values, as the weights change slowly, but steadily, in the right direction.
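To make the effect of the learning rate's magnitude concrete, here is a toy illustration (not part of the recipe; the quadratic loss and the values chosen are purely for demonstration) of the plain gradient descent update w = w - lr * gradient. A tiny learning rate barely moves the weight, a moderate one descends steadily, and an overly large one oscillates or diverges:
def gradient_descent(lr, n_steps=20, w=10.0):
    # Minimize the toy loss L(w) = w ** 2 with plain gradient descent
    for _ in range(n_steps):
        grad = 2 * w          # dL/dw for L(w) = w ** 2
        w = w - lr * grad     # weight update scaled by the learning rate
    return w

for lr in (0.0001, 0.01, 0.9, 1.1):
    print('lr = %-7s -> w after 20 steps = %.4f' % (lr, gradient_descent(lr)))
# 0.0001 barely moves the weight from its starting value of 10,
# 0.01 moves it steadily towards the minimum at w = 0,
# 0.9 oscillates around the minimum, and 1.1 diverges.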
In a similar manner, let's explore the network accuracy when the learning rate is as high as 0.1:
from keras import optimizers
adam = optimizers.Adam(lr=0.1)  # learning rate increased to 0.1

model = Sequential()
model.add(Dense(1000, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=500, batch_size=1024, verbose=1)
Note that the loss could not decrease much further, as the learning rate was high; that is, the weights potentially got stuck in a local minimum:

Thus, in general, it is a good idea to set the learning rate to a low value and let the network learn over a large number of epochs.
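If you want to compare the scenarios side by side in a single run, one possible sketch (not from the recipe; the learning rates and the reduced epoch count are arbitrary choices to keep the comparison quick) loops over the learning rates and records the final validation accuracy of each model:
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

results = {}
for lr in (0.0001, 0.01, 0.1):
    # Rebuild the same architecture for every learning rate
    model = Sequential()
    model.add(Dense(1000, input_dim=784, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.Adam(lr=lr),
                  metrics=['accuracy'])
    # Fewer epochs than in the recipe, just to compare trends quickly
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=50, batch_size=1024, verbose=0)
    # 'val_acc' assumes Keras 2.x; newer versions use 'val_accuracy'
    results[lr] = history.history['val_acc'][-1]

for lr, acc in sorted(results.items()):
    print('learning rate %-7s -> final validation accuracy %.4f' % (lr, acc))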