Comparison of Gradient Descent, Stochastic Gradient Descent and L-BFGS

By Z.H. Fu

切问录 www.fuzihao.org

Abstract

This article compares four optimization approaches for logistic regression on the MNIST dataset. The results of Gradient Descent (GD), Stochastic Gradient Descent (SGD), and L-BFGS are discussed in detail. We also propose a Combined Stochastic Gradient Descent with L-BFGS (CL-BFGS), which merges the batch scheme of SGD with L-BFGS updates. We conclude that when the dataset is small, L-BFGS performs best; when the dataset is large, SGD is recommended.

Dataset

The MNIST dataset is one of the most popular datasets in machine learning, especially for handwriting recognition systems. It contains 50000 training images of size 28x28, plus 10000 validation images and 10000 test images. The benchmark error rate on MNIST for a traditional linear classifier is 7.6% [1]. The results improved dramatically once deep learning methods were introduced: 0.35% for DNNs [2] and 0.23% for CNNs [3]. In this article, we only compare optimization methods for logistic regression.
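For reference, the 50000/10000/10000 split above matches the classic pickled MNIST distribution. Below is a minimal loading sketch; it assumes a `mnist.pkl.gz` file laid out as in the well-known deeplearning.net tutorial, and the file path is an assumption.

```python
# A minimal sketch of loading the 50000/10000/10000 MNIST split.
# Assumes the deeplearning.net-style pickle; the path is illustrative.
import gzip
import pickle

with gzip.open('mnist.pkl.gz', 'rb') as f:
    (train_X, train_y), (valid_X, valid_y), (test_X, test_y) = \
        pickle.load(f, encoding='latin1')

print(train_X.shape)  # (50000, 784): 28x28 images flattened into 784 floats
```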

Experiment

Gradient Descent

GD is the most basic optimization method for machine learning problems. At each epoch, the gradient is computed over the whole training set, and the parameter vector is moved along the negative gradient to approach a local minimum (or along the gradient, for a maximum). However, GD turns out to be the worst choice among the methods compared here.
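To make the update concrete, here is a minimal full-batch GD sketch for softmax (multinomial logistic) regression in NumPy. The learning rate, epoch count, and zero initialization are illustrative assumptions, not the exact settings of the experiment.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gd_train(X, y, n_classes=10, lr=0.1, epochs=100):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]              # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ W)                # class probabilities for all data
        grad = X.T @ (P - Y) / n          # gradient of mean cross-entropy
        W -= lr * grad                    # step along the negative gradient
    return W
```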

Stochastic Gradient Descent

SGD does better than GD, especially when the dataset is large. Sometimes the data is too large to process at one time, so it is necessary to work in batches. SGD divides the data into batches and optimizes over one batch at each iteration; the updated parameters are then passed on to the next batch. The advantage of this mechanism is that, compared with GD, the parameters improve after every batch within an epoch, whereas in GD the parameters stay fixed until the whole epoch is finished. A sketch of this loop is given below.
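This is a minimal sketch of the mini-batch loop, reusing the `softmax` helper from the GD sketch; the batch size and learning rate are illustrative assumptions.

```python
import numpy as np

def sgd_train(X, y, n_classes=10, lr=0.1, epochs=100, batch_size=600):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        perm = np.random.permutation(n)       # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            P = softmax(X[idx] @ W)           # softmax from the GD sketch
            grad = X[idx].T @ (P - Y[idx]) / len(idx)
            W -= lr * grad                    # parameters improve every batch
    return W
```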

L-BFGS

L-BFGS stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory, and it can often reach a better solution than the two methods above in fewer iterations.
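Rather than implementing L-BFGS from scratch, one can call SciPy's built-in implementation. This is a minimal sketch on the full-batch loss; the cross-entropy objective and the `maxiter` value are assumptions, and `softmax` comes from the GD sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def loss_and_grad(w_flat, X, Y, d, k):
    W = w_flat.reshape(d, k)
    P = softmax(X @ W)                        # softmax from the GD sketch
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    grad = X.T @ (P - Y) / X.shape[0]
    return loss, grad.ravel()                 # scalar loss, flat gradient

def lbfgs_train(X, y, k=10, maxiter=100):
    d = X.shape[1]
    Y = np.eye(k)[y]
    res = minimize(loss_and_grad, np.zeros(d * k), args=(X, Y, d, k),
                   method='L-BFGS-B', jac=True,
                   options={'maxiter': maxiter})
    return res.x.reshape(d, k)
```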

Combined Stochastic Gradient Descent with L-BFGS

CL-BFGS works the same way as SGD, except that the optimization within each batch is performed with L-BFGS instead of a single gradient step. This gives a sharp improvement in the short term, but costs a lot of time, because L-BFGS has to be re-initialized for every batch.
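A minimal sketch of this combination is given below: the SGD batch loop is kept, but each batch is optimized with a short L-BFGS run, reusing `loss_and_grad` from the previous sketch. The inner iteration count and the warm-starting of parameters across batches are assumptions; note that SciPy rebuilds the L-BFGS curvature history from scratch on every call, which is exactly the re-initialization cost described above.

```python
import numpy as np
from scipy.optimize import minimize

def cl_bfgs_train(X, y, k=10, epochs=10, batch_size=600, inner_iter=10):
    n, d = X.shape
    Y = np.eye(k)[y]
    w = np.zeros(d * k)                       # flat parameter vector
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            Xb = X[start:start + batch_size]
            Yb = Y[start:start + batch_size]
            res = minimize(loss_and_grad, w, args=(Xb, Yb, d, k),
                           method='L-BFGS-B', jac=True,
                           options={'maxiter': inner_iter})
            w = res.x                         # warm-start the next batch
    return w.reshape(d, k)
```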

Results

The results are as follows: the error rate (fraction misclassified) of each method on the training, validation, and test sets is listed per epoch.

epoch GD train GD valid GD test SGD train SGD valid SGD test L-BFGS train L-BFGS valid L-BFGS test CL-BFGS train CL-BFGS valid CL-BFGS test
1 0.89404 0.8998 0.8966 0.46814 0.4499 0.4605 0.74056 0.7311 0.7273 0.15724 0.144 0.1531
10 0.81976 0.8195 0.817 0.17524 0.1645 0.1687 0.1587 0.1396 0.1494 0.08622 0.087 0.0925
20 0.79408 0.7964 0.7944 0.14498 0.1355 0.1373 0.12376 0.1173 0.1215 0.07998 0.084 0.0896
30 0.75974 0.7627 0.7588 0.13136 0.1243 0.1249 0.10712 0.1007 0.1094 0.07684 0.0821 0.089
40 0.71346 0.7171 0.7145 0.1229 0.1177 0.1182 0.09956 0.0982 0.102 0.0748 0.0818 0.0878
50 0.66438 0.664 0.6648 0.11716 0.112 0.1129 0.09642 0.0918 0.1001 0.07376 0.0816 0.0865
60 0.61738 0.6167 0.6184 0.11248 0.1084 0.11 0.08876 0.0872 0.0939 0.073 0.0809 0.0862
70 0.57716 0.5732 0.5755 0.10916 0.1062 0.1078 0.08532 0.0848 0.0907 0.07254 0.0806 0.0866
80 0.54128 0.5339 0.5393 0.10592 0.1042 0.1051 0.08094 0.0823 0.0881 0.07204 0.0812 0.0863
90 0.50982 0.4978 0.509 0.10296 0.1018 0.1029 0.08016 0.0803 0.0879 0.0718 0.0813 0.0865
100 0.48344 0.47 0.4815 0.10148 0.1004 0.1073 0.0782 0.0807 0.0877 0.07146 0.0809 0.0871
200 0.36656 0.3436 0.3437 0.08698 0.0891 0.0916 0.0635 0.0766 0.0791 0.06916 0.0814 0.0864
300 0.30336 0.2822 0.2827 0.08034 0.0843 0.084 0.06106 0.0743 0.0774 0.0687 0.0821 0.0881
400 0.26512 0.2451 0.245 0.07718 0.0801 0.0803 0.06016 0.0749 0.077 0.0687 0.0816 0.0891
500 0.23936 0.2206 0.2226 0.07414 0.0778 0.0788 0.05924 0.0755 0.0775 0.0684 0.0821 0.0892
600 0.21986 0.2048 0.2052 0.07164 0.0771 0.0778 0.05916 0.0757 0.0782 0.06826 0.0821 0.0897
700 0.20634 0.1932 0.1909 0.07002 0.0762 0.077 0.05848 0.0761 0.0784 0.0683 0.082 0.0894
800 0.1954 0.1833 0.1794 0.06876 0.075 0.0768 0.05834 0.0754 0.0789 0.06832 0.082 0.0893
900 0.18702 0.1756 0.1717 0.06782 0.0741 0.0767 0.05818 0.0765 0.0797 0.06828 0.0827 0.0902
1000 0.17968 0.1696 0.1655 0.06694 0.0733 0.0768 0.05804 0.0763 0.0796 0.06832 0.083 0.0905
1100 0.17342 0.1646 0.1586 0.06642 0.073 0.0773 0.05788 0.0769 0.0802 0.06842 0.0821 0.0909
1200 0.16904 0.1597 0.154 0.06568 0.0728 0.0776 0.05776 0.0766 0.0794 0.06836 0.0825 0.0916
1300 0.16414 0.1557 0.1503 0.06528 0.0729 0.0779 0.05756 0.0766 0.08 0.06838 0.0831 0.0907
1400 0.15982 0.1527 0.1457 0.06468 0.0729 0.0778 0.05768 0.0767 0.0796 0.06834 0.0832 0.0906
1500 0.15606 0.1479 0.143 0.06442 0.0729 0.0778 0.0577 0.0771 0.0797 0.06826 0.0828 0.0916
1600 0.15316 0.1439 0.1397 0.06402 0.0728 0.0777 0.05732 0.0771 0.08 0.06826 0.0828 0.0918
1700 0.15028 0.142 0.137 0.06362 0.0732 0.0775 0.05748 0.0771 0.0798 0.06832 0.0831 0.0915
1800 0.1476 0.1404 0.1354 0.0634 0.0731 0.0775 0.05718 0.0764 0.0798 0.06822 0.0832 0.0909
1900 0.14506 0.1386 0.1334 0.06296 0.0733 0.0777 0.05706 0.0771 0.0801 0.0686 0.0839 0.0917
2000 0.14276 0.1367 0.1314 0.0628 0.0732 0.0776 0.05676 0.0766 0.0797 0.06826 0.0837 0.0907
2100 0.14062 0.1345 0.1301 0.06278 0.0734 0.0776 0.0567 0.0772 0.0802 0.06808 0.0841 0.0915
2200 0.13868 0.1331 0.1287 0.06258 0.0736 0.0774 0.0562 0.0769 0.08 0.0679 0.0835 0.0923
2300 0.13676 0.1311 0.1274 0.06232 0.0736 0.0775 0.05662 0.0778 0.0802 0.06784 0.0832 0.0922
2400 0.1352 0.1295 0.1266 0.06222 0.0734 0.0774 0.05618 0.0772 0.0805 0.0678 0.0831 0.0919
2500 0.1338 0.1287 0.1257 0.06208 0.0733 0.0778 0.0564 0.0777 0.0803 0.0678 0.0832 0.0922
2600 0.1325 0.127 0.1244 0.06182 0.0735 0.0776 0.0563 0.0775 0.0804 0.06788 0.0831 0.0923
2700 0.1314 0.1257 0.1237 0.0616 0.0736 0.0772 0.05614 0.0781 0.0804 0.06788 0.0832 0.092
2800 0.1301 0.1247 0.1221 0.06148 0.0737 0.0773 0.05608 0.0776 0.0804 0.06786 0.0834 0.0921
2900 0.12886 0.1235 0.1214 0.06132 0.0733 0.077 0.05604 0.0779 0.0807 0.06766 0.0829 0.092
3000 0.12766 0.1232 0.1205 0.06118 0.0733 0.0769 0.0559 0.0778 0.0807 0.06756 0.083 0.0925
3100 0.12696 0.1217 0.1201 0.06104 0.0733 0.0771 0.05592 0.0781 0.081 0.06754 0.083 0.0922
3200 0.12618 0.1211 0.1186 0.06076 0.0737 0.0769 0.05586 0.0779 0.0805 0.0676 0.083 0.0921
3300 0.12548 0.1208 0.1179 0.06066 0.0735 0.077 0.05586 0.0781 0.0813 0.06758 0.083 0.092
3400 0.12442 0.1199 0.1174 0.06054 0.0737 0.077 0.05586 0.0779 0.0808 0.06732 0.0832 0.0919
3500 0.12364 0.1195 0.1161 0.06048 0.0734 0.0774 0.05588 0.078 0.0814 0.06762 0.0832 0.0922
3600 0.12294 0.1187 0.1158 0.06046 0.0733 0.0774 0.05584 0.078 0.0807 0.06808 0.0829 0.0916
3700 0.12218 0.1178 0.1149 0.06032 0.0734 0.0777 0.05578 0.0779 0.0812 0.06762 0.0833 0.0925
3800 0.12152 0.1169 0.1142 0.06026 0.0734 0.0778 0.05578 0.0779 0.0812 0.06746 0.0835 0.0924
3900 0.12072 0.1158 0.1135 0.06026 0.0736 0.0775 0.05578 0.0779 0.0812 0.06808 0.0832 0.0924
4000 0.11994 0.1145 0.1136 0.0602 0.0735 0.0773 0.05578 0.0779 0.0812 0.06772 0.0835 0.0927
4100 0.11926 0.1135 0.1132 0.06018 0.0736 0.0773 0.05578 0.0779 0.0812 0.06794 0.0833 0.0924
4200 0.11896 0.1128 0.1122 0.06008 0.0737 0.0772 0.05578 0.0779 0.0812 0.06786 0.0838 0.0924
4300 0.11804 0.1125 0.1115 0.06002 0.0739 0.0774 0.05578 0.0779 0.0812 0.06776 0.0835 0.0927
4400 0.1173 0.1122 0.1113 0.05986 0.0738 0.0774 0.05578 0.0779 0.0812 0.06756 0.0836 0.0926
4500 0.11664 0.1119 0.111 0.05976 0.0737 0.0774 0.05578 0.0779 0.0812 0.06792 0.0836 0.0923
4600 0.11614 0.1117 0.1104 0.05972 0.0738 0.0775 0.05578 0.0779 0.0812 0.06768 0.0838 0.0925
4700 0.11538 0.1109 0.1096 0.05968 0.0738 0.0774 0.05578 0.0779 0.0812 0.06768 0.0839 0.0925
4800 0.11502 0.1107 0.1094 0.05964 0.0739 0.0774 0.05578 0.0779 0.0812 0.06778 0.0839 0.0925
4900 0.11434 0.1101 0.1086 0.05966 0.074 0.0775 0.05578 0.0779 0.0812 0.06762 0.0839 0.0926
5000 0.11382 0.1088 0.1082 0.05968 0.0741 0.0775 0.05578 0.0779 0.0812 0.06766 0.0838 0.0927
5100 0.1134 0.1087 0.1081 0.05964 0.0743 0.0775 0.05578 0.0779 0.0812 0.06766 0.0839 0.0928
5200 0.11286 0.1083 0.1077 0.05962 0.0745 0.0776 0.05578 0.0779 0.0812 0.06762 0.0838 0.0928
5300 0.11262 0.1083 0.1075 0.05952 0.0744 0.0778 0.05578 0.0779 0.0812 0.0676 0.0839 0.0929
5400 0.1121 0.1079 0.107 0.05954 0.0744 0.0777 0.05578 0.0779 0.0812 0.06766 0.0843 0.0931
5500 0.11164 0.1074 0.1064 0.05954 0.0745 0.0777 0.05578 0.0779 0.0812 0.06772 0.0842 0.0927
5600 0.11128 0.1069 0.1059 0.05956 0.0745 0.0776 0.05578 0.0779 0.0812 0.06772 0.0844 0.093
5700 0.111 0.1069 0.1058 0.05958 0.0747 0.0774 0.05578 0.0779 0.0812 0.06768 0.0843 0.093
5800 0.11032 0.1068 0.1051 0.0595 0.0745 0.0775 0.05578 0.0779 0.0812 0.06772 0.0844 0.0931
5900 0.10988 0.1063 0.1047 0.0595 0.0744 0.0774 0.05578 0.0779 0.0812 0.06776 0.0844 0.0931
6000 0.10946 0.106 0.104 0.0595 0.0744 0.0776 0.05578 0.0779 0.0812 0.06774 0.0844 0.0931
6100 0.1091 0.1057 0.1036 0.0595 0.0744 0.0778 0.05578 0.0779 0.0812 0.06752 0.0846 0.0932
6200 0.10868 0.1055 0.1036 0.05944 0.0744 0.0778 0.05578 0.0779 0.0812 0.06764 0.0845 0.0931
6300 0.10822 0.1055 0.1035 0.05942 0.0741 0.0778 0.05578 0.0779 0.0812 0.06762 0.0848 0.0933
6400 0.10794 0.105 0.103 0.05938 0.0744 0.0779 0.05578 0.0779 0.0812 0.0676 0.0845 0.0926

Conclusion

We can see from the results above:

  • In the short term, batch methods work better than using the whole dataset at once: the parameters improve during the processing of each batch, so the error rate decreases faster.
  • In the long term, however, using the whole dataset is better than the batch method, because ultimately the classifier must balance the error over all of the data; optimizing the parameters on one batch can make the other batches fit worse.
  • L-BFGS works better than both GD and SGD.
  • CL-BFGS works better than L-BFGS in the short term.

So, if the data is small enough to be handled at one time, L-BFGS is recommended. However, for many problems the data is too large and a batch method is necessary; CL-BFGS performs better than SGD at each epoch but costs much more time, so SGD is recommended under such circumstances.

[1] LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791
[2] Cireșan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jürgen (December 2010). "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". Neural Computation 22 (12). doi:10.1162/NECO_a_00052
[3] Cireșan, Dan; Meier, Ueli; Schmidhuber, Jürgen (2012). "Multi-column Deep Neural Networks for Image Classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition: 3642–3649. arXiv:1202.2745. doi:10.1109/CVPR.2012.6248110