1
00:00:00,490 --> 00:00:02,540
Hello and welcome to this new tutorial.

2
00:00:02,590 --> 00:00:07,980
So we've just got the positive rewards obtained by applying the perturbations in the positive directions.

3
00:00:08,140 --> 00:00:09,850
And now I want to get the negative rewards,

4
00:00:09,850 --> 00:00:15,570
that is, the rewards obtained by applying the perturbations in the directions opposite to the previous ones.

5
00:00:15,560 --> 00:00:21,520
This one is going to be very easy, because we've already done this, and we're going to get these rewards

6
00:00:21,550 --> 00:00:27,520
with the same values of the perturbations, that is, deltas[k] here, for each of the 16 directions.

7
00:00:27,760 --> 00:00:33,370
The only thing that will change is, of course, the direction, which this time

8
00:00:33,370 --> 00:00:34,750
will be negative,

9
00:00:34,870 --> 00:00:41,440
that is, opposite to the previous ones, so that we will be in this else branch, which corresponds to

10
00:00:41,530 --> 00:00:43,910
the case direction == "negative".

11
00:00:44,170 --> 00:00:50,320
And therefore it will return the matrix of weights minus (the opposite direction) the

12
00:00:50,320 --> 00:00:55,710
noise times the perturbation, but with the same perturbation value, deltas[k].

13
00:00:56,020 --> 00:01:02,650
So when we explore the policy over a full episode for each of the 16 directions, the action

14
00:01:02,650 --> 00:01:08,890
here will be played by evaluating the policy on the current state,

15
00:01:08,890 --> 00:01:14,680
with the same perturbation value delta that was sampled, but in the opposite direction, because

16
00:01:14,680 --> 00:01:17,340
we are going to specify that the direction is negative.

17
00:01:17,650 --> 00:01:22,580
So the only thing we have to do is take these two lines.

18
00:01:24,420 --> 00:01:24,920
All right.
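The negative-direction evaluation described above can be sketched as follows. This is a minimal sketch, assuming a linear policy whose weights `theta` are a NumPy matrix and an exploration noise `noise` (the nu in theta ± nu·delta); the class name, the default noise value, and the exact signature are illustrative assumptions, not necessarily the course code.

```python
import numpy as np


class Policy:
    """Hypothetical minimal linear policy: action = weights @ state."""

    def __init__(self, input_size, output_size, noise=0.03):
        self.theta = np.zeros((output_size, input_size))  # policy weights
        self.noise = noise  # assumed exploration noise scale (nu)

    def evaluate(self, state, delta=None, direction=None):
        if direction is None:
            # Unperturbed policy.
            return self.theta.dot(state)
        elif direction == "positive":
            # Weights plus noise times the perturbation.
            return (self.theta + self.noise * delta).dot(state)
        else:
            # direction == "negative": same delta, opposite direction.
            return (self.theta - self.noise * delta).dot(state)
```

Because the same `delta` is passed for both directions, the two evaluations differ only in the sign of the `noise * delta` term.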
19
00:01:24,940 --> 00:01:29,630
I just copy them and paste them here; everything has to be well aligned.

20
00:01:29,890 --> 00:01:33,710
Once pasted, it's the exact same for loop,

21
00:01:33,910 --> 00:01:41,600
but instead of specifying the positive direction, we're going to specify the negative direction.

22
00:01:41,620 --> 00:01:42,660
And here you go.

23
00:01:42,760 --> 00:01:45,150
That's it for the negative rewards.

24
00:01:45,430 --> 00:01:53,890
And of course we need to replace positive with negative, to update our negative rewards list

25
00:01:54,010 --> 00:02:00,430
that contains all the rewards obtained by applying the perturbations in the directions opposite to

26
00:02:00,430 --> 00:02:02,950
those applied here in the previous step.

27
00:02:02,950 --> 00:02:08,130
All right, so here we go: we now have our positive rewards and our negative rewards.

28
00:02:08,450 --> 00:02:11,590
And now what we're going to do is gather all of them,

29
00:02:11,620 --> 00:02:16,180
all of these positive and negative rewards, in one single list.

30
00:02:16,180 --> 00:02:19,130
So it will be a list of 32 rewards.

31
00:02:19,330 --> 00:02:20,740
And why do we need to do this?

32
00:02:20,750 --> 00:02:27,490
It's to apply this 3.1 feature: scaling by the standard deviation.

33
00:02:27,550 --> 00:02:32,270
We want to scale the rewards, again, for performance purposes.

34
00:02:32,560 --> 00:02:39,640
So by scaling by the standard deviation, we will get sigma_R, which is just the standard deviation of

35
00:02:39,640 --> 00:02:47,800
the rewards, and we'll integrate that factor here, in the method that corresponds to our one step of gradient

36
00:02:47,800 --> 00:02:49,770
descent to update the policy.

37
00:02:50,110 --> 00:02:52,290
We'll do that in the next tutorial.

38
00:02:52,300 --> 00:02:53,700
It's going to be pretty easy.
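The update step in which sigma_R is integrated can be sketched as below, following the ARS-style update theta ← theta + alpha / (b · sigma_R) · Σ (r+ − r−) · delta. This sketch assumes every direction is used (b equals the number of directions); the ARS paper also describes keeping only the best-performing directions. The function name `update` and the learning rate of 0.02 are hypothetical choices for illustration.

```python
import numpy as np


def update(theta, positive_rewards, negative_rewards, deltas, sigma_r,
           learning_rate=0.02):
    """Hypothetical ARS-style update step scaled by sigma_r (uses all directions)."""
    step = np.zeros_like(theta)
    for r_pos, r_neg, delta in zip(positive_rewards, negative_rewards, deltas):
        # Each direction contributes its reward difference times its perturbation.
        step += (r_pos - r_neg) * delta
    # Dividing by sigma_r keeps the step size stable when reward magnitudes vary.
    return theta + learning_rate / (len(deltas) * sigma_r) * step
```

Returning a new matrix instead of modifying `theta` in place keeps the sketch easy to test; a class-based version would simply assign the result back to `self.theta`.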
39
00:02:53,800 --> 00:03:00,220
We'll just gather all the rewards in one single list, and then, using the std function, we'll compute the standard

40
00:03:00,220 --> 00:03:03,420
deviation of all the rewards in that list.

41
00:03:03,520 --> 00:03:05,270
Let's do that in the next tutorial.

42
00:03:05,290 --> 00:03:06,810
And until then, enjoy AI!
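Collecting the negative rewards with the same loop, gathering both lists, and computing sigma_R can be sketched like this. The `explore` function here is a hypothetical stand-in for playing a full episode (a toy quadratic reward rather than a real environment), and the values of 16 directions and noise 0.03 are assumptions; 16 directions are what produce a list of 32 rewards.

```python
import numpy as np

NB_DIRECTIONS = 16  # assumed number of directions (gives 32 rewards)
NOISE = 0.03        # assumed exploration noise (nu)
rng = np.random.default_rng(0)

theta = np.zeros((2, 3))
target = np.ones((2, 3))  # hypothetical "good" weights for the toy reward


def explore(direction, delta):
    """Hypothetical stand-in for one full episode; returns a scalar reward."""
    sign = 1.0 if direction == "positive" else -1.0
    weights = theta + sign * NOISE * delta
    return -np.sum((weights - target) ** 2)


deltas = [rng.standard_normal(theta.shape) for _ in range(NB_DIRECTIONS)]

positive_rewards = [0.0] * NB_DIRECTIONS
negative_rewards = [0.0] * NB_DIRECTIONS
for k in range(NB_DIRECTIONS):
    positive_rewards[k] = explore("positive", deltas[k])
# Same loop, same deltas[k], opposite direction.
for k in range(NB_DIRECTIONS):
    negative_rewards[k] = explore("negative", deltas[k])

# Gather all rewards into one array and take its standard deviation.
all_rewards = np.array(positive_rewards + negative_rewards)
sigma_r = all_rewards.std()
```

`np.std` (and the equivalent `ndarray.std` method) computes the population standard deviation by default (`ddof=0`).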