Comparison of Gradient Descent, Stochastic Gradient Descent and L-BFGS

By Z.H. Fu
切问录 www.fuzihao.org

Abstract

This article compares four optimization approaches for logistic regression on the MNIST dataset. The results of Gradient Descent (GD), Stochastic Gradient Descent (SGD), and L-BFGS are discussed in detail. We also propose a Combined Stochastic Gradient Descent with L-BFGS (CL-BFGS), which combines SGD's batching with L-BFGS updates. We conclude that when the dataset is small, L-BFGS performs best; when the dataset is large, SGD is recommended.

Dataset

The MNIST dataset is one of the most popular datasets in machine learning, especially for handwriting recognition. It contains 50000 pictures of size 28x28 as the training set, 10000 as the validation set, and 10000 as the test set. The benchmark error rate on MNIST with a traditional linear classifier is 7.6% [1]. However, the results improved dramatically after deep learning methods were introduced: 0.35% with a DNN [2] and 0.23% with a CNN [3]. In this article, we only compare optimization methods for logistic regression.
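For reference, a minimal loading sketch is given below. It assumes the dataset is available as the classic mnist.pkl.gz pickle with exactly this 50000/10000/10000 split; the file name and layout are assumptions, not something specified in this article.

```python
# A minimal loading sketch, assuming the dataset is stored as the classic
# "mnist.pkl.gz" pickle (three (images, labels) tuples). The file name and
# layout are assumptions, not specified in this article.
import gzip
import pickle

with gzip.open("mnist.pkl.gz", "rb") as f:
    # latin-1 is needed to unpickle the original Python 2 era file in Python 3.
    (train_x, train_y), (valid_x, valid_y), (test_x, test_y) = pickle.load(
        f, encoding="latin-1"
    )

print(train_x.shape)  # expected (50000, 784): 28x28 images flattened
print(valid_x.shape)  # expected (10000, 784)
print(test_x.shape)   # expected (10000, 784)
```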

Experiment

Gradient Descent

GD is the most basic optimization method in machine learning. At each epoch, the gradient is computed over the whole training set, and the parameter vector is moved along the gradient direction to approach a local minimum (or maximum). However, as the results below show, GD is perhaps the worst choice among the methods compared here.
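To make the update concrete, here is a minimal full-batch gradient descent sketch for softmax (multinomial logistic) regression in NumPy; the learning rate and epoch count are illustrative assumptions, not the settings used for the results below.

```python
# A minimal full-batch gradient descent sketch for softmax regression.
# Learning rate and epoch count are illustrative assumptions.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradient_descent(X, y, n_classes, lr=0.1, epochs=100):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)              # class probabilities on the whole set
        grad_W = X.T @ (P - Y) / n          # gradient of the mean cross-entropy loss
        grad_b = (P - Y).mean(axis=0)
        W -= lr * grad_W                    # step against the gradient
        b -= lr * grad_b
    return W, b
```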

Stochastic Gradient Descent

SGD does better than GD, especially when the dataset is large. Sometimes the data is too large to process at once, so it is necessary to work in batches. SGD divides the data into batches and optimizes over one batch at each iteration; the updated parameters are then passed on to the next batch. The advantage of this mechanism, compared with GD, is that the parameters improve after every batch within an epoch, whereas in GD the parameters stay fixed until the whole epoch is finished.
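A corresponding minibatch SGD sketch is shown below, reusing the softmax() helper from the previous snippet; the batch size of 600 is an assumption for illustration and is not stated in this article.

```python
# A minibatch SGD sketch for the same softmax regression model.
# Batch size and learning rate are illustrative assumptions.
import numpy as np

def sgd(X, y, n_classes, lr=0.1, epochs=100, batch_size=600):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(epochs):
        order = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            P = softmax(Xb @ W + b)               # softmax() from the GD sketch
            # Parameters are updated after every batch, so later batches
            # in the same epoch already see improved weights.
            W -= lr * Xb.T @ (P - Yb) / len(idx)
            b -= lr * (P - Yb).mean(axis=0)
    return W, b
```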

L-BFGS

L-BFGS stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of memory. It can often reach a better solution than the two methods above in fewer iterations.
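As a sketch of how the same model can be fitted with an off-the-shelf limited-memory quasi-Newton routine, the snippet below uses SciPy's L-BFGS-B implementation; flattening the weight matrix, dropping the bias term, and the maxiter value are illustrative choices rather than the exact setup behind the results below.

```python
# A sketch of fitting softmax regression with SciPy's L-BFGS-B routine.
# The bias term is omitted and maxiter is an illustrative assumption.
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y, n_classes, maxiter=100):
    n, d = X.shape
    Y = np.eye(n_classes)[y]                  # one-hot targets

    def loss_and_grad(w_flat):
        W = w_flat.reshape(d, n_classes)
        P = softmax(X @ W)                    # softmax() from the GD sketch
        loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
        grad = (X.T @ (P - Y) / n).ravel()    # gradient flattened to match w_flat
        return loss, grad

    w0 = np.zeros(d * n_classes)
    res = minimize(loss_and_grad, w0, jac=True,
                   method="L-BFGS-B", options={"maxiter": maxiter})
    return res.x.reshape(d, n_classes)
```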

Combined Stochastic Gradient Descent with L-BFGS

CL-BFGS works the same way as SGD, except that the optimization within each batch is replaced by L-BFGS. It improves sharply in the short term, but costs a lot of time, because L-BFGS is re-initialized from scratch on every batch of every epoch.
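A rough sketch of this combined scheme follows, under the assumption that a short L-BFGS run is restarted on every batch; the batch size and per-batch iteration budget are illustrative, not the article's exact settings.

```python
# A sketch of the combined SGD + L-BFGS scheme: each minibatch is optimized
# with a short L-BFGS run, and the resulting weights seed the next batch.
import numpy as np
from scipy.optimize import minimize

def cl_bfgs(X, y, n_classes, epochs=10, batch_size=600, inner_iter=10):
    n, d = X.shape
    Y = np.eye(n_classes)[y]                  # one-hot targets
    w = np.zeros(d * n_classes)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            Xb, Yb = X[start:start + batch_size], Y[start:start + batch_size]

            def loss_and_grad(w_flat, Xb=Xb, Yb=Yb):
                W = w_flat.reshape(d, n_classes)
                P = softmax(Xb @ W)           # softmax() from the GD sketch
                loss = -np.mean(np.sum(Yb * np.log(P + 1e-12), axis=1))
                grad = (Xb.T @ (P - Yb) / len(Xb)).ravel()
                return loss, grad

            # L-BFGS is restarted from scratch on every batch, which is
            # where the extra wall-clock cost mentioned above comes from.
            res = minimize(loss_and_grad, w, jac=True, method="L-BFGS-B",
                           options={"maxiter": inner_iter})
            w = res.x                         # carry the weights to the next batch
    return w.reshape(d, n_classes)
```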

Results

The results are as follows; the error rate of each method at each listed epoch is given.

epoch GD train GD valid GD test SGD train SGD valid SGD test L-BFGS train L-BFGS valid L-BFGS test CL-BFGS train CL-BFGS valid
1 0.89404 0.8998 0.8966 0.46814 0.4499 0.4605 0.74056 0.7311 0.7273 0.15724 0.144
10 0.81976 0.8195 0.817 0.17524 0.1645 0.1687 0.1587 0.1396 0.1494 0.08622 0.087
20 0.79408 0.7964 0.7944 0.14498 0.1355 0.1373 0.12376 0.1173 0.1215 0.07998 0.084
30 0.75974 0.7627 0.7588 0.13136 0.1243 0.1249 0.10712 0.1007 0.1094 0.07684 0.0821
40 0.71346 0.7171 0.7145 0.1229 0.1177 0.1182 0.09956 0.0982 0.102 0.0748 0.0818
50 0.66438 0.664 0.6648 0.11716 0.112 0.1129 0.09642 0.0918 0.1001 0.07376 0.0816
60 0.61738 0.6167 0.6184 0.11248 0.1084 0.11 0.08876 0.0872 0.0939 0.073 0.0809
70 0.57716 0.5732 0.5755 0.10916 0.1062 0.1078 0.08532 0.0848 0.0907 0.07254 0.0806
80 0.54128 0.5339 0.5393 0.10592 0.1042 0.1051 0.08094 0.0823 0.0881 0.07204 0.0812
90 0.50982 0.4978 0.509 0.10296 0.1018 0.1029 0.08016 0.0803 0.0879 0.0718 0.0813
100 0.48344 0.47 0.4815 0.10148 0.1004 0.1073 0.0782 0.0807 0.0877 0.07146 0.0809
200 0.36656 0.3436 0.3437 0.08698 0.0891 0.0916 0.0635 0.0766 0.0791 0.06916 0.0814
300 0.30336 0.2822 0.2827 0.08034 0.0843 0.084 0.06106 0.0743 0.0774 0.0687 0.0821
400 0.26512 0.2451 0.245 0.07718 0.0801 0.0803 0.06016 0.0749 0.077 0.0687 0.0816
500 0.23936 0.2206 0.2226 0.07414 0.0778 0.0788 0.05924 0.0755 0.0775 0.0684 0.0821
600 0.21986 0.2048 0.2052 0.07164 0.0771 0.0778 0.05916 0.0757 0.0782 0.06826 0.0821
700 0.20634 0.1932 0.1909 0.07002 0.0762 0.077 0.05848 0.0761 0.0784 0.0683 0.082
800 0.1954 0.1833 0.1794 0.06876 0.075 0.0768 0.05834 0.0754 0.0789 0.06832 0.082
900 0.18702 0.1756 0.1717 0.06782 0.0741 0.0767 0.05818 0.0765 0.0797 0.06828 0.0827
1000 0.17968 0.1696 0.1655 0.06694 0.0733 0.0768 0.05804 0.0763 0.0796 0.06832 0.083
1100 0.17342 0.1646 0.1586 0.06642 0.073 0.0773 0.05788 0.0769 0.0802 0.06842 0.0821
1200 0.16904 0.1597 0.154 0.06568 0.0728 0.0776 0.05776 0.0766 0.0794 0.06836 0.0825
1300 0.16414 0.1557 0.1503 0.06528 0.0729 0.0779 0.05756 0.0766 0.08 0.06838 0.0831
1400 0.15982 0.1527 0.1457 0.06468 0.0729 0.0778 0.05768 0.0767 0.0796 0.06834 0.0832
1500 0.15606 0.1479 0.143 0.06442 0.0729 0.0778 0.0577 0.0771 0.0797 0.06826 0.0828
1600 0.15316 0.1439 0.1397 0.06402 0.0728 0.0777 0.05732 0.0771 0.08 0.06826 0.0828
1700 0.15028 0.142 0.137 0.06362 0.0732 0.0775 0.05748 0.0771 0.0798 0.06832 0.0831
1800 0.1476 0.1404 0.1354 0.0634 0.0731 0.0775 0.05718 0.0764 0.0798 0.06822 0.0832
1900 0.14506 0.1386 0.1334 0.06296 0.0733 0.0777 0.05706 0.0771 0.0801 0.0686 0.0839
2000 0.14276 0.1367 0.1314 0.0628 0.0732 0.0776 0.05676 0.0766 0.0797 0.06826 0.0837
2100 0.14062 0.1345 0.1301 0.06278 0.0734 0.0776 0.0567 0.0772 0.0802 0.06808 0.0841
2200 0.13868 0.1331 0.1287 0.06258 0.0736 0.0774 0.0562 0.0769 0.08 0.0679 0.0835
2300 0.13676 0.1311 0.1274 0.06232 0.0736 0.0775 0.05662 0.0778 0.0802 0.06784 0.0832
2400 0.1352 0.1295 0.1266 0.06222 0.0734 0.0774 0.05618 0.0772 0.0805 0.0678 0.0831
2500 0.1338 0.1287 0.1257 0.06208 0.0733 0.0778 0.0564 0.0777 0.0803 0.0678 0.0832
2600 0.1325 0.127 0.1244 0.06182 0.0735 0.0776 0.0563 0.0775 0.0804 0.06788 0.0831
2700 0.1314 0.1257 0.1237 0.0616 0.0736 0.0772 0.05614 0.0781 0.0804 0.06788 0.0832
2800 0.1301 0.1247 0.1221 0.06148 0.0737 0.0773 0.05608 0.0776 0.0804 0.06786 0.0834
2900 0.12886 0.1235 0.1214 0.06132 0.0733 0.077 0.05604 0.0779 0.0807 0.06766 0.0829
3000 0.12766 0.1232 0.1205 0.06118 0.0733 0.0769 0.0559 0.0778 0.0807 0.06756 0.083
3100 0.12696 0.1217 0.1201 0.06104 0.0733 0.0771 0.05592 0.0781 0.081 0.06754 0.083
3200 0.12618 0.1211 0.1186 0.06076 0.0737 0.0769 0.05586 0.0779 0.0805 0.0676 0.083
3300 0.12548 0.1208 0.1179 0.06066 0.0735 0.077 0.05586 0.0781 0.0813 0.06758 0.083
3400 0.12442 0.1199 0.1174 0.06054 0.0737 0.077 0.05586 0.0779 0.0808 0.06732 0.0832
3500 0.12364 0.1195 0.1161 0.06048 0.0734 0.0774 0.05588 0.078 0.0814 0.06762 0.0832
3600 0.12294 0.1187 0.1158 0.06046 0.0733 0.0774 0.05584 0.078 0.0807 0.06808 0.0829
3700 0.12218 0.1178 0.1149 0.06032 0.0734 0.0777 0.05578 0.0779 0.0812 0.06762 0.0833
3800 0.12152 0.1169 0.1142 0.06026 0.0734 0.0778 0.05578 0.0779 0.0812 0.06746 0.0835
3900 0.12072 0.1158 0.1135 0.06026 0.0736 0.0775 0.05578 0.0779 0.0812 0.06808 0.0832
4000 0.11994 0.1145 0.1136 0.0602 0.0735 0.0773 0.05578 0.0779 0.0812 0.06772 0.0835
4100 0.11926 0.1135 0.1132 0.06018 0.0736 0.0773 0.05578 0.0779 0.0812 0.06794 0.0833
4200 0.11896 0.1128 0.1122 0.06008 0.0737 0.0772 0.05578 0.0779 0.0812 0.06786 0.0838
4300 0.11804 0.1125 0.1115 0.06002 0.0739 0.0774 0.05578 0.0779 0.0812 0.06776 0.0835
4400 0.1173 0.1122 0.1113 0.05986 0.0738 0.0774 0.05578 0.0779 0.0812 0.06756 0.0836
4500 0.11664 0.1119 0.111 0.05976 0.0737 0.0774 0.05578 0.0779 0.0812 0.06792 0.0836
4600 0.11614 0.1117 0.1104 0.05972 0.0738 0.0775 0.05578 0.0779 0.0812 0.06768 0.0838
4700 0.11538 0.1109 0.1096 0.05968 0.0738 0.0774 0.05578 0.0779 0.0812 0.06768 0.0839
4800 0.11502 0.1107 0.1094 0.05964 0.0739 0.0774 0.05578 0.0779 0.0812 0.06778 0.0839
4900 0.11434 0.1101 0.1086 0.05966 0.074 0.0775 0.05578 0.0779 0.0812 0.06762 0.0839
5000 0.11382 0.1088 0.1082 0.05968 0.0741 0.0775 0.05578 0.0779 0.0812 0.06766 0.0838
5100 0.1134 0.1087 0.1081 0.05964 0.0743 0.0775 0.05578 0.0779 0.0812 0.06766 0.0839
5200 0.11286 0.1083 0.1077 0.05962 0.0745 0.0776 0.05578 0.0779 0.0812 0.06762 0.0838
5300 0.11262 0.1083 0.1075 0.05952 0.0744 0.0778 0.05578 0.0779 0.0812 0.0676 0.0839
5400 0.1121 0.1079 0.107 0.05954 0.0744 0.0777 0.05578 0.0779 0.0812 0.06766 0.0843
5500 0.11164 0.1074 0.1064 0.05954 0.0745 0.0777 0.05578 0.0779 0.0812 0.06772 0.0842
5600 0.11128 0.1069 0.1059 0.05956 0.0745 0.0776 0.05578 0.0779 0.0812 0.06772 0.0844
5700 0.111 0.1069 0.1058 0.05958 0.0747 0.0774 0.05578 0.0779 0.0812 0.06768 0.0843
5800 0.11032 0.1068 0.1051 0.0595 0.0745 0.0775 0.05578 0.0779 0.0812 0.06772 0.0844
5900 0.10988 0.1063 0.1047 0.0595 0.0744 0.0774 0.05578 0.0779 0.0812 0.06776 0.0844
6000 0.10946 0.106 0.104 0.0595 0.0744 0.0776 0.05578 0.0779 0.0812 0.06774 0.0844
6100 0.1091 0.1057 0.1036 0.0595 0.0744 0.0778 0.05578 0.0779 0.0812 0.06752 0.0846
6200 0.10868 0.1055 0.1036 0.05944 0.0744 0.0778 0.05578 0.0779 0.0812 0.06764 0.0845
6300 0.10822 0.1055 0.1035 0.05942 0.0741 0.0778 0.05578 0.0779 0.0812 0.06762 0.0848
6400 0.10794 0.105 0.103 0.05938 0.0744 0.0779 0.05578 0.0779 0.0812 0.0676 0.0845

Conclusion

We can see from the results above:

- In the short term, the batch methods work better than using the whole dataset at once. The parameters improve during the processing of each batch, so the error rate decreases faster.
- In the long term, however, using the whole dataset is better than the batch methods, because the classifier ultimately has to balance the error over all the data; optimizing the parameters on one batch can make them worse on the other batches.
- L-BFGS works better than GD and SGD.
- CL-BFGS works better than L-BFGS in the short term.

So, if the data can be handled all at once, L-BFGS is recommended. For many problems, however, the data is too large and a batch method is necessary. CL-BFGS performs better than SGD at each epoch but costs much more time, so SGD is recommended in such circumstances.

[1] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791
[2] Cireșan, Dan Claudiu; Ueli Meier; Luca Maria Gambardella; Jürgen Schmidhuber (December 2010). "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". Neural Computation 22 (12). doi:10.1162/NECO_a_00052
[3] Cireșan, Dan; Ueli Meier; Jürgen Schmidhuber (2012). "Multi-column Deep Neural Networks for Image Classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition: 3642–3649. arXiv:1202.2745. doi:10.1109/CVPR.2012.6248110