1 00:00:00,270 --> 00:00:02,640 Hello and welcome to this new tutorial. 2 00:00:02,640 --> 00:00:08,190 All right so now we're going to build the last method of this policy class which is a method that is 3 00:00:08,190 --> 00:00:14,660 going to do the most important step of the whole algorithm which is basically great innocent. 4 00:00:14,730 --> 00:00:20,700 More specifically it's an approximation of grain in the sense because we don't compute directly the 5 00:00:20,700 --> 00:00:26,910 gradient of you know the reward with respect to the weights we approximate this gradient by measuring 6 00:00:26,910 --> 00:00:32,100 the difference of the reward between the reward and the positive direction and the reward in the negative 7 00:00:32,100 --> 00:00:33,000 direction. 8 00:00:33,000 --> 00:00:35,010 So we're going to have a look at it in detail. 9 00:00:35,010 --> 00:00:42,270 Now I will quickly explain the purpose of this great in this sense and we'll implement this very important 10 00:00:42,270 --> 00:00:47,130 step which is the updates step one step of the great and descent. 11 00:00:47,340 --> 00:00:53,550 So before we do this let's go back to the article to remind us what we have to implement exactly. 12 00:00:53,550 --> 00:01:02,100 So as a reminder we just implemented this point five here by applying the perturbations on the matrix 13 00:01:02,100 --> 00:01:02,770 of weight. 14 00:01:02,910 --> 00:01:09,750 And we also did the sampling of the perturbations following a gash in distribution of variance the noise 15 00:01:10,170 --> 00:01:15,990 which we implemented here and now we're going to move on to Step 7 make the update step which corresponds 16 00:01:15,990 --> 00:01:18,660 to exactly one step of graden in the sense. 17 00:01:18,930 --> 00:01:20,700 So why do we need to do this. 18 00:01:20,730 --> 00:01:21,840 And how does it work. 19 00:01:21,840 --> 00:01:24,750 Let me quickly remind the explanation. 20 00:01:24,750 --> 00:01:30,570 Well in class egregiousness And you know what we do we want to optimize something which is generally 21 00:01:30,840 --> 00:01:37,510 the last function you know we want to reduce the last error between a prediction and a target. 22 00:01:37,560 --> 00:01:38,170 Right. 23 00:01:38,190 --> 00:01:44,070 And we do that with respect to the weight so we're going to differentiate the less with respect to the 24 00:01:44,070 --> 00:01:46,270 weight to reduce this loss error. 25 00:01:46,440 --> 00:01:52,500 And then we will update the weight by adding this differentiation of the last with respect to the weight 26 00:01:52,800 --> 00:01:53,450 to the weight. 27 00:01:53,460 --> 00:01:59,270 And that will update the weights in a direction of last reduction here we don't have any loss. 28 00:01:59,280 --> 00:02:02,340 We don't have any target and prediction here. 29 00:02:02,340 --> 00:02:07,040 What we have is some we want we want to optimize the reward. 30 00:02:07,050 --> 00:02:13,260 Therefore what we're going to do is not to try to reduce unless what we're going to do is try to increase 31 00:02:13,350 --> 00:02:19,620 the reward and therefore we're not going to differentiate the loss with respect the weight we are going 32 00:02:19,620 --> 00:02:23,630 to differentiate that we want with respect to the weight. 33 00:02:23,700 --> 00:02:30,840 And that's why here what you see is exactly this differentiation but it's not a classic mathematical 34 00:02:30,840 --> 00:02:37,410 differentiation it's an approximation of the differentiation of the reward because this reward that 35 00:02:37,410 --> 00:02:44,890 you see here is the reward obtained by applying some perturbations to your policy in the positive direction 36 00:02:45,330 --> 00:02:52,560 and this reward that you see here is the reward obtained by applying some perturbations to that same 37 00:02:52,560 --> 00:02:56,460 policy in the opposite direction as this one. 38 00:02:56,460 --> 00:03:03,000 And since the preservation consists of adding to the weight a very small number you know like a small 39 00:03:03,060 --> 00:03:05,650 Epsilon following the normal distribution. 40 00:03:05,820 --> 00:03:13,590 Well this difference of the words here is almost the same as differentiating the rewards with respect 41 00:03:13,590 --> 00:03:19,530 to the weight because a real differentiation would be the partial derivative of the reward divided by 42 00:03:19,530 --> 00:03:22,480 the partial derivative of the weights. 43 00:03:22,500 --> 00:03:28,680 But since this difference is a difference of words with respect to directions that are opposite and 44 00:03:28,680 --> 00:03:32,320 also with respect to some very small values of the perturbations. 45 00:03:32,340 --> 00:03:38,760 Well this is exactly the approximation of the gradient of the word with respect to the weight. 46 00:03:38,760 --> 00:03:39,160 All right. 47 00:03:39,180 --> 00:03:41,640 And this method has a specific name. 48 00:03:41,670 --> 00:03:49,770 It's called the method of Finot differences and that exactly happens right here with this Finot difference 49 00:03:50,120 --> 00:03:56,340 of the words in two opposite directions by playing some very small perturbations in that approximates 50 00:03:56,680 --> 00:04:01,680 the gradient with respect to the weights and then we multiply indeed by the value of the perturbation 51 00:04:02,130 --> 00:04:04,770 as the way we do it with great in the sense. 52 00:04:04,830 --> 00:04:11,640 So that's exactly one step of gradient descent but by approximating the gradient. 53 00:04:11,790 --> 00:04:16,170 So it's important to get this because this is really the heart of the algorithm. 54 00:04:16,170 --> 00:04:22,380 This is really where the added value is indeed the algorithm that we're implementing right now is called 55 00:04:22,560 --> 00:04:28,650 commented random search but it's so important this is so crucial in the algorithm that we could call 56 00:04:28,650 --> 00:04:30,390 it the methods of any differences. 57 00:04:30,390 --> 00:04:35,430 In fact the method of any differences is at the heart of augmented random search. 58 00:04:35,430 --> 00:04:39,910 All right there was an explanation of the purpose of this because this is really important. 59 00:04:39,960 --> 00:04:41,780 And now let's implement this. 60 00:04:41,790 --> 00:04:45,170 It's going to be very simple we just need to implement that sum. 61 00:04:45,180 --> 00:04:51,150 So by the way we do this some of the finit differences of the words for all the best directions you 62 00:04:51,150 --> 00:04:57,750 know the bigger response here to the best directions which I remind is one of our hyper parameters and 63 00:04:57,750 --> 00:05:01,270 we set it equal to 16 but that can be modifiable. 64 00:05:01,450 --> 00:05:06,550 So we're going to some of these finit differences multiplied by the perturbation itself. 65 00:05:06,850 --> 00:05:14,410 For the 16 best erections and we are going to add this factor here alférez the learning rate divided 66 00:05:14,410 --> 00:05:15,470 by B. 67 00:05:15,570 --> 00:05:21,380 The total number of best directions and sigma are which is the variance of the words. 68 00:05:21,610 --> 00:05:25,790 Because indeed in 3.1 we haven't taken care of 3.1 yet. 69 00:05:25,930 --> 00:05:33,540 But in order to improve the algorithm we need to scale by the standard deviation Sigma or so here we 70 00:05:33,540 --> 00:05:37,640 are taking Cygne are in fact to normalize what we want. 71 00:05:37,810 --> 00:05:38,560 All right. 72 00:05:38,680 --> 00:05:41,060 So let's do this let's implement this. 73 00:05:41,200 --> 00:05:49,490 And so now I'm going to go back to Python to make this final method of our policy class let's do this. 74 00:05:49,720 --> 00:05:57,100 So this method we're going to call it a date because indeed it's the update step of the paper and that 75 00:05:57,100 --> 00:06:01,880 corresponds to exactly one step of approximated great in the sense. 76 00:06:02,180 --> 00:06:07,540 So date and now it's going to take three arguments the first one is self as usual because we're going 77 00:06:07,540 --> 00:06:11,650 to use our future object or future instances of the policy class. 78 00:06:11,950 --> 00:06:16,630 Then the second argument you cannot really guess where it's going to be but it's going to be rolled 79 00:06:16,630 --> 00:06:17,490 out. 80 00:06:17,560 --> 00:06:24,520 Actually they talk about rollouts in the paper and a rollout is basically going to be a list of several 81 00:06:24,520 --> 00:06:31,420 triplets and the triplets will contain First the word obtained in the positive direction. 82 00:06:31,540 --> 00:06:35,940 Then second the reward obtained in the opposite direction. 83 00:06:35,980 --> 00:06:43,240 And third the perturbation in this specific direction that gave you these two first elements that we 84 00:06:43,240 --> 00:06:46,740 were in the positive direction and that we were in a negative direction. 85 00:06:46,960 --> 00:06:54,100 So basically it's a session that will contain the best triplets of rewards obtained by applying some 86 00:06:54,100 --> 00:06:56,870 perturbations in the best directions. 87 00:06:56,920 --> 00:06:57,660 Right. 88 00:06:57,700 --> 00:07:00,660 So roll out and then the last arguments. 89 00:07:00,700 --> 00:07:06,540 Remember in the paper we're going to use this Sigma are here the standard deviation of the word that 90 00:07:06,560 --> 00:07:09,180 we will take care of later on in the implementation. 91 00:07:09,250 --> 00:07:16,240 But we need to put it here as argument because it is not one of our hyper parameters we will compute 92 00:07:16,240 --> 00:07:17,530 it later on. 93 00:07:17,560 --> 00:07:24,250 So we're going to call that sigma on this core are to specify that it is a standard deviation of the 94 00:07:24,250 --> 00:07:25,260 reward. 95 00:07:25,330 --> 00:07:28,490 All right to self rule out Sigma are all good. 96 00:07:28,720 --> 00:07:31,650 Let's make this a data step. 97 00:07:31,660 --> 00:07:39,670 So speaking of step I'm going to introduce here first variable step that is going to be exactly this 98 00:07:39,670 --> 00:07:46,840 part which is the some of the best directions being the number of reservation's of this finit difference 99 00:07:46,960 --> 00:07:52,660 of the words obtained in respectively the positive direction and the opposite or negative direction 100 00:07:53,260 --> 00:07:58,520 multiplied by the value of the perturbation step is going to be exactly this. 101 00:07:58,540 --> 00:08:08,950 So that's why I'm going to initialize step as a matrix of zeros and the dimensions of this matrix will 102 00:08:08,950 --> 00:08:15,750 be exactly the same as the dimensions of the matrix of weights theta. 103 00:08:15,810 --> 00:08:22,570 Now you might be wondering why didn't I use to star you that simply because the run and function is 104 00:08:22,570 --> 00:08:29,800 expecting a different format than the format expected by the zero's function the random function expects 105 00:08:30,100 --> 00:08:33,080 this couple of two dimensions. 106 00:08:33,130 --> 00:08:39,310 You know self-deceit at that shape of the next zero for the number of lines and then come on then said 107 00:08:39,320 --> 00:08:44,560 that they did that shape and the next one for the number of columns whereas the zeros function here 108 00:08:44,560 --> 00:08:49,310 directly expect selve that feature that shape basically the shape of theta. 109 00:08:49,500 --> 00:08:50,020 OK. 110 00:08:50,170 --> 00:08:53,420 So just some different formats expected by two functions. 111 00:08:53,560 --> 00:09:01,600 And now we've initialized this matrix step that represent all the sum here with the right format because 112 00:09:01,600 --> 00:09:06,310 we're going to sum this to the current matrix of weight theta. 113 00:09:06,670 --> 00:09:15,510 So now we're going to do exactly this some here and the best way to do a sum like that is through a 114 00:09:15,510 --> 00:09:16,460 for loop. 115 00:09:16,470 --> 00:09:24,830 So now we're going to do a full loop and we need to iterate on all the rollouts because I remind that 116 00:09:24,840 --> 00:09:31,590 the rollouts contain exactly the triplets of the puzzle we want there is the word think in the positive 117 00:09:31,590 --> 00:09:37,060 direction the negative reward there is we were obtained in the opposite direction. 118 00:09:37,260 --> 00:09:44,520 And then the value of the perturbation you see the purpose of creating the structure rollout that gathers 119 00:09:44,520 --> 00:09:49,240 the three elements because that's exactly the three elements we need to do this. 120 00:09:49,500 --> 00:09:55,590 And then the rural adds there are the triplets and B response to the number of best erections which 121 00:09:55,590 --> 00:09:58,660 is part already of our hyper parameters. 122 00:09:58,860 --> 00:10:06,570 So that's why now in order to make this some We're going to loop over all the positive words which I'm 123 00:10:06,570 --> 00:10:08,190 going to call our Pazz. 124 00:10:08,370 --> 00:10:10,020 All the best but the rewards actually. 125 00:10:10,100 --> 00:10:15,140 That is all the rewards obtained in the best directions then. 126 00:10:15,270 --> 00:10:18,540 Same for the negative reward our neck. 127 00:10:18,540 --> 00:10:25,710 But these are negs we were trained in the opposite direction as the direction that was used to obtain 128 00:10:26,010 --> 00:10:27,060 this are post here. 129 00:10:27,060 --> 00:10:29,100 So they kind of go together. 130 00:10:29,250 --> 00:10:34,430 And then of course the value of the perturbation itself which we can call deep. 131 00:10:34,680 --> 00:10:43,680 So that's the triplets values that are all into rollouts which is one of our arguments here but we will 132 00:10:43,950 --> 00:10:46,090 build to roll out later on. 133 00:10:46,380 --> 00:10:52,640 So for all these triplets of positive reward negative reward and perturbations. 134 00:10:52,860 --> 00:11:01,230 Well now we can have data step by taking the step which is right now equal to zero and we're going to 135 00:11:01,590 --> 00:11:03,340 add to this step. 136 00:11:03,390 --> 00:11:05,050 That's why I'm using plus equals. 137 00:11:05,280 --> 00:11:12,960 We're going to add exactly this element here that is the finit difference of the word obtained in the 138 00:11:12,960 --> 00:11:18,990 positive direction and the reward obtained in negative direction multiplied by the perturbation. 139 00:11:18,990 --> 00:11:21,900 And we already have all these things to rule out. 140 00:11:21,900 --> 00:11:23,240 This is our pose. 141 00:11:23,370 --> 00:11:24,380 This is our neg. 142 00:11:24,540 --> 00:11:25,420 And this is D. 143 00:11:25,680 --> 00:11:34,760 So what we're simply going to add to our current value of our matrix step here is our pas plus every 144 00:11:34,800 --> 00:11:42,150 word minus our neg the negative word or should I say we word obtained in the opposite direction as the 145 00:11:42,150 --> 00:11:51,420 direction that was used to obtain Arpels multiplied by D which is the perturbation that was applied 146 00:11:51,690 --> 00:11:56,300 in the same direction that led us to obtain our POWs and RNA. 147 00:11:56,320 --> 00:12:00,830 But this was obtained in the opposite direction of this perturbation. 148 00:12:00,840 --> 00:12:01,650 All right. 149 00:12:01,650 --> 00:12:04,670 So we have just a data step that's great. 150 00:12:04,660 --> 00:12:06,600 And now we have one final thing to do. 151 00:12:06,780 --> 00:12:13,680 It's just of course the most important it is of course to update the matrix of weight our matrix of 152 00:12:13,680 --> 00:12:21,510 weight theta is going to become theta plus all this which we already computed thanks to this for loop 153 00:12:21,810 --> 00:12:28,200 but multiplied by this factor here which is the learning rate Alpha bit which we called learning rate 154 00:12:28,620 --> 00:12:34,020 divided by the total number of best directions times the standard deviation of the reward. 155 00:12:34,200 --> 00:12:34,680 All right. 156 00:12:34,680 --> 00:12:41,250 We already have all this in our arguments so let's make that final computation we need to get of the 157 00:12:41,250 --> 00:12:47,980 for loop because the for loop was just to compute this some here so that's done. 158 00:12:48,120 --> 00:12:54,480 So now we just need to add this factor here and this is exactly what we're going to do as the last line 159 00:12:54,570 --> 00:12:56,340 of this update function. 160 00:12:56,760 --> 00:13:06,360 Let's do this let's update our matrix of weight theta so self that theta is going to become the previous 161 00:13:06,810 --> 00:13:13,740 self that theta Plus that's what I'm adding plus you call our learning rate which we can get by taking 162 00:13:13,800 --> 00:13:20,390 our hyper parameter object that was not created yet but which we will create in the future HP. 163 00:13:20,520 --> 00:13:24,450 And then we take its proper variable learning rates. 164 00:13:24,480 --> 00:13:25,280 Here we go. 165 00:13:25,410 --> 00:13:33,910 That's the learning rate which we have to divide by Remember the total number of resurrections be times 166 00:13:34,240 --> 00:13:36,090 the standard deviation of the word. 167 00:13:36,460 --> 00:13:42,250 Let's go back let's get first of all number of resurrections B which is one of our hyper parameters 168 00:13:42,250 --> 00:13:44,370 so I'm taking H-P again. 169 00:13:44,500 --> 00:13:46,320 Our hyper barrière object. 170 00:13:46,480 --> 00:13:48,710 And from this object I'm it's variable. 171 00:13:48,970 --> 00:13:52,330 And the best directions. 172 00:13:52,510 --> 00:13:59,890 And then we multiply this by Sigma R which is a standard deviation of the word and which is one of our 173 00:13:59,980 --> 00:14:02,560 argument here we'll compute that later. 174 00:14:02,800 --> 00:14:09,310 But at least we have everything to make this formula to have the matrix of weight in a direction that 175 00:14:09,400 --> 00:14:15,190 increases the most that we want which is exactly what we want to do with our AI because the higher the 176 00:14:15,190 --> 00:14:19,030 we word the better the AI has the ability to walk. 177 00:14:19,060 --> 00:14:19,630 All right. 178 00:14:19,660 --> 00:14:26,980 And then of course final step we need to multiply this by the step that is the sum of the finit differences 179 00:14:26,980 --> 00:14:29,960 of the word multiplied by the perturbation. 180 00:14:30,340 --> 00:14:31,470 Congratulations. 181 00:14:31,480 --> 00:14:38,290 That was the most important step of this implementation of the step the step of approximated great in 182 00:14:38,290 --> 00:14:44,110 the sense you did it and now we have two final things to do first. 183 00:14:44,170 --> 00:14:50,440 Another function which is to explore function and that will compute either an average reward or some 184 00:14:50,440 --> 00:14:57,190 of reward on one full episode because we imagine that we're not going to compare the words in the different 185 00:14:57,190 --> 00:14:59,630 directions on one single action. 186 00:14:59,650 --> 00:15:03,450 We want to compare them on a large series of actions in a row. 187 00:15:03,580 --> 00:15:07,330 Otherwise this wouldn't be very relevant to compare the rewards to each other. 188 00:15:07,330 --> 00:15:11,590 You know we want to have a relevant measure of that we want and so we're going to measure the words 189 00:15:11,650 --> 00:15:13,020 on one full episode. 190 00:15:13,020 --> 00:15:20,770 And I remind that an episode is just a session where either the AI manages to walk up to the episode 191 00:15:20,770 --> 00:15:23,320 length or it's when it fails. 192 00:15:23,360 --> 00:15:25,760 You know when it fails that's the end of the episode. 193 00:15:25,840 --> 00:15:29,680 We reset the environment and the air tries to walk again. 194 00:15:29,890 --> 00:15:37,300 So we're going to measure the words in each of the 16 directions on one full episode. 195 00:15:37,300 --> 00:15:44,770 And so in the next day Doyle will make this explore function that will simply return either the average 196 00:15:44,770 --> 00:15:51,460 word or the accumulated reward by playing the perturbations in one specific direction over one full 197 00:15:51,520 --> 00:15:59,200 episode and then we'll compare all the accumulated rewards for the different directions and we'll sort 198 00:15:59,200 --> 00:16:05,140 them by the best ones you know the best couples of rewards of ten are POWs in our neck. 199 00:16:05,470 --> 00:16:09,640 And that's exactly what we'll do as this step 6 year. 200 00:16:09,670 --> 00:16:16,140 So the directions by Max Pozza reward up as it were reward over the different directions. 201 00:16:16,150 --> 00:16:16,690 All right. 202 00:16:16,690 --> 00:16:18,870 So we'll do all this in the next tutorials. 203 00:16:18,940 --> 00:16:20,620 And until then enjoy AI.