1
00:00:00,270 --> 00:00:02,640
Hello and welcome to this new tutorial.

2
00:00:02,640 --> 00:00:08,190
All right so now we're going to build the last method of this policy class which is a method that is

3
00:00:08,190 --> 00:00:14,660
going to do the most important step of the whole algorithm which is basically great innocent.

4
00:00:14,730 --> 00:00:20,700
More specifically it's an approximation of grain in the sense because we don't compute directly the

5
00:00:20,700 --> 00:00:26,910
gradient of you know the reward with respect to the weights we approximate this gradient by measuring

6
00:00:26,910 --> 00:00:32,100
the difference of the reward between the reward and the positive direction and the reward in the negative

7
00:00:32,100 --> 00:00:33,000
direction.

8
00:00:33,000 --> 00:00:35,010
So we're going to have a look at it in detail.

9
00:00:35,010 --> 00:00:42,270
Now I will quickly explain the purpose of this great in this sense and we'll implement this very important

10
00:00:42,270 --> 00:00:47,130
step which is the updates step one step of the great and descent.

11
00:00:47,340 --> 00:00:53,550
So before we do this let's go back to the article to remind us what we have to implement exactly.

12
00:00:53,550 --> 00:01:02,100
So as a reminder we just implemented this point five here by applying the perturbations on the matrix

13
00:01:02,100 --> 00:01:02,770
of weight.

14
00:01:02,910 --> 00:01:09,750
And we also did the sampling of the perturbations following a gash in distribution of variance the noise

15
00:01:10,170 --> 00:01:15,990
which we implemented here and now we're going to move on to Step 7 make the update step which corresponds

16
00:01:15,990 --> 00:01:18,660
to exactly one step of graden in the sense.

17
00:01:18,930 --> 00:01:20,700
So why do we need to do this.

18
00:01:20,730 --> 00:01:21,840
And how does it work.

19
00:01:21,840 --> 00:01:24,750
Let me quickly remind the explanation.

20
00:01:24,750 --> 00:01:30,570
Well in class egregiousness And you know what we do we want to optimize something which is generally

21
00:01:30,840 --> 00:01:37,510
the last function you know we want to reduce the last error between a prediction and a target.

22
00:01:37,560 --> 00:01:38,170
Right.

23
00:01:38,190 --> 00:01:44,070
And we do that with respect to the weight so we're going to differentiate the less with respect to the

24
00:01:44,070 --> 00:01:46,270
weight to reduce this loss error.

25
00:01:46,440 --> 00:01:52,500
And then we will update the weight by adding this differentiation of the last with respect to the weight

26
00:01:52,800 --> 00:01:53,450
to the weight.

27
00:01:53,460 --> 00:01:59,270
And that will update the weights in a direction of last reduction here we don't have any loss.

28
00:01:59,280 --> 00:02:02,340
We don't have any target and prediction here.

29
00:02:02,340 --> 00:02:07,040
What we have is some we want we want to optimize the reward.

30
00:02:07,050 --> 00:02:13,260
Therefore what we're going to do is not to try to reduce unless what we're going to do is try to increase

31
00:02:13,350 --> 00:02:19,620
the reward and therefore we're not going to differentiate the loss with respect the weight we are going

32
00:02:19,620 --> 00:02:23,630
to differentiate that we want with respect to the weight.

33
00:02:23,700 --> 00:02:30,840
And that's why here what you see is exactly this differentiation but it's not a classic mathematical

34
00:02:30,840 --> 00:02:37,410
differentiation it's an approximation of the differentiation of the reward because this reward that

35
00:02:37,410 --> 00:02:44,890
you see here is the reward obtained by applying some perturbations to your policy in the positive direction

36
00:02:45,330 --> 00:02:52,560
and this reward that you see here is the reward obtained by applying some perturbations to that same

37
00:02:52,560 --> 00:02:56,460
policy in the opposite direction as this one.

38
00:02:56,460 --> 00:03:03,000
And since the preservation consists of adding to the weight a very small number you know like a small

39
00:03:03,060 --> 00:03:05,650
Epsilon following the normal distribution.

40
00:03:05,820 --> 00:03:13,590
Well this difference of the words here is almost the same as differentiating the rewards with respect

41
00:03:13,590 --> 00:03:19,530
to the weight because a real differentiation would be the partial derivative of the reward divided by

42
00:03:19,530 --> 00:03:22,480
the partial derivative of the weights.

43
00:03:22,500 --> 00:03:28,680
But since this difference is a difference of words with respect to directions that are opposite and

44
00:03:28,680 --> 00:03:32,320
also with respect to some very small values of the perturbations.

45
00:03:32,340 --> 00:03:38,760
Well this is exactly the approximation of the gradient of the word with respect to the weight.

46
00:03:38,760 --> 00:03:39,160
All right.

47
00:03:39,180 --> 00:03:41,640
And this method has a specific name.

48
00:03:41,670 --> 00:03:49,770
It's called the method of Finot differences and that exactly happens right here with this Finot difference

49
00:03:50,120 --> 00:03:56,340
of the words in two opposite directions by playing some very small perturbations in that approximates

50
00:03:56,680 --> 00:04:01,680
the gradient with respect to the weights and then we multiply indeed by the value of the perturbation

51
00:04:02,130 --> 00:04:04,770
as the way we do it with great in the sense.

52
00:04:04,830 --> 00:04:11,640
So that's exactly one step of gradient descent but by approximating the gradient.

53
00:04:11,790 --> 00:04:16,170
So it's important to get this because this is really the heart of the algorithm.

54
00:04:16,170 --> 00:04:22,380
This is really where the added value is indeed the algorithm that we're implementing right now is called

55
00:04:22,560 --> 00:04:28,650
commented random search but it's so important this is so crucial in the algorithm that we could call

56
00:04:28,650 --> 00:04:30,390
it the methods of any differences.

57
00:04:30,390 --> 00:04:35,430
In fact the method of any differences is at the heart of augmented random search.

58
00:04:35,430 --> 00:04:39,910
All right there was an explanation of the purpose of this because this is really important.

59
00:04:39,960 --> 00:04:41,780
And now let's implement this.

60
00:04:41,790 --> 00:04:45,170
It's going to be very simple we just need to implement that sum.

61
00:04:45,180 --> 00:04:51,150
So by the way we do this some of the finit differences of the words for all the best directions you

62
00:04:51,150 --> 00:04:57,750
know the bigger response here to the best directions which I remind is one of our hyper parameters and

63
00:04:57,750 --> 00:05:01,270
we set it equal to 16 but that can be modifiable.

64
00:05:01,450 --> 00:05:06,550
So we're going to some of these finit differences multiplied by the perturbation itself.

65
00:05:06,850 --> 00:05:14,410
For the 16 best erections and we are going to add this factor here alférez the learning rate divided

66
00:05:14,410 --> 00:05:15,470
by B.

67
00:05:15,570 --> 00:05:21,380
The total number of best directions and sigma are which is the variance of the words.

68
00:05:21,610 --> 00:05:25,790
Because indeed in 3.1 we haven't taken care of 3.1 yet.

69
00:05:25,930 --> 00:05:33,540
But in order to improve the algorithm we need to scale by the standard deviation Sigma or so here we

70
00:05:33,540 --> 00:05:37,640
are taking Cygne are in fact to normalize what we want.

71
00:05:37,810 --> 00:05:38,560
All right.

72
00:05:38,680 --> 00:05:41,060
So let's do this let's implement this.

73
00:05:41,200 --> 00:05:49,490
And so now I'm going to go back to Python to make this final method of our policy class let's do this.

74
00:05:49,720 --> 00:05:57,100
So this method we're going to call it a date because indeed it's the update step of the paper and that

75
00:05:57,100 --> 00:06:01,880
corresponds to exactly one step of approximated great in the sense.

76
00:06:02,180 --> 00:06:07,540
So date and now it's going to take three arguments the first one is self as usual because we're going

77
00:06:07,540 --> 00:06:11,650
to use our future object or future instances of the policy class.

78
00:06:11,950 --> 00:06:16,630
Then the second argument you cannot really guess where it's going to be but it's going to be rolled

79
00:06:16,630 --> 00:06:17,490
out.

80
00:06:17,560 --> 00:06:24,520
Actually they talk about rollouts in the paper and a rollout is basically going to be a list of several

81
00:06:24,520 --> 00:06:31,420
triplets and the triplets will contain First the word obtained in the positive direction.

82
00:06:31,540 --> 00:06:35,940
Then second the reward obtained in the opposite direction.

83
00:06:35,980 --> 00:06:43,240
And third the perturbation in this specific direction that gave you these two first elements that we

84
00:06:43,240 --> 00:06:46,740
were in the positive direction and that we were in a negative direction.

85
00:06:46,960 --> 00:06:54,100
So basically it's a session that will contain the best triplets of rewards obtained by applying some

86
00:06:54,100 --> 00:06:56,870
perturbations in the best directions.

87
00:06:56,920 --> 00:06:57,660
Right.

88
00:06:57,700 --> 00:07:00,660
So roll out and then the last arguments.

89
00:07:00,700 --> 00:07:06,540
Remember in the paper we're going to use this Sigma are here the standard deviation of the word that

90
00:07:06,560 --> 00:07:09,180
we will take care of later on in the implementation.

91
00:07:09,250 --> 00:07:16,240
But we need to put it here as argument because it is not one of our hyper parameters we will compute

92
00:07:16,240 --> 00:07:17,530
it later on.

93
00:07:17,560 --> 00:07:24,250
So we're going to call that sigma on this core are to specify that it is a standard deviation of the

94
00:07:24,250 --> 00:07:25,260
reward.

95
00:07:25,330 --> 00:07:28,490
All right to self rule out Sigma are all good.

96
00:07:28,720 --> 00:07:31,650
Let's make this a data step.

97
00:07:31,660 --> 00:07:39,670
So speaking of step I'm going to introduce here first variable step that is going to be exactly this

98
00:07:39,670 --> 00:07:46,840
part which is the some of the best directions being the number of reservation's of this finit difference

99
00:07:46,960 --> 00:07:52,660
of the words obtained in respectively the positive direction and the opposite or negative direction

100
00:07:53,260 --> 00:07:58,520
multiplied by the value of the perturbation step is going to be exactly this.

101
00:07:58,540 --> 00:08:08,950
So that's why I'm going to initialize step as a matrix of zeros and the dimensions of this matrix will

102
00:08:08,950 --> 00:08:15,750
be exactly the same as the dimensions of the matrix of weights theta.

103
00:08:15,810 --> 00:08:22,570
Now you might be wondering why didn't I use to star you that simply because the run and function is

104
00:08:22,570 --> 00:08:29,800
expecting a different format than the format expected by the zero's function the random function expects

105
00:08:30,100 --> 00:08:33,080
this couple of two dimensions.

106
00:08:33,130 --> 00:08:39,310
You know self-deceit at that shape of the next zero for the number of lines and then come on then said

107
00:08:39,320 --> 00:08:44,560
that they did that shape and the next one for the number of columns whereas the zeros function here

108
00:08:44,560 --> 00:08:49,310
directly expect selve that feature that shape basically the shape of theta.

109
00:08:49,500 --> 00:08:50,020
OK.

110
00:08:50,170 --> 00:08:53,420
So just some different formats expected by two functions.

111
00:08:53,560 --> 00:09:01,600
And now we've initialized this matrix step that represent all the sum here with the right format because

112
00:09:01,600 --> 00:09:06,310
we're going to sum this to the current matrix of weight theta.

113
00:09:06,670 --> 00:09:15,510
So now we're going to do exactly this some here and the best way to do a sum like that is through a

114
00:09:15,510 --> 00:09:16,460
for loop.

115
00:09:16,470 --> 00:09:24,830
So now we're going to do a full loop and we need to iterate on all the rollouts because I remind that

116
00:09:24,840 --> 00:09:31,590
the rollouts contain exactly the triplets of the puzzle we want there is the word think in the positive

117
00:09:31,590 --> 00:09:37,060
direction the negative reward there is we were obtained in the opposite direction.

118
00:09:37,260 --> 00:09:44,520
And then the value of the perturbation you see the purpose of creating the structure rollout that gathers

119
00:09:44,520 --> 00:09:49,240
the three elements because that's exactly the three elements we need to do this.

120
00:09:49,500 --> 00:09:55,590
And then the rural adds there are the triplets and B response to the number of best erections which

121
00:09:55,590 --> 00:09:58,660
is part already of our hyper parameters.

122
00:09:58,860 --> 00:10:06,570
So that's why now in order to make this some We're going to loop over all the positive words which I'm

123
00:10:06,570 --> 00:10:08,190
going to call our Pazz.

124
00:10:08,370 --> 00:10:10,020
All the best but the rewards actually.

125
00:10:10,100 --> 00:10:15,140
That is all the rewards obtained in the best directions then.

126
00:10:15,270 --> 00:10:18,540
Same for the negative reward our neck.

127
00:10:18,540 --> 00:10:25,710
But these are negs we were trained in the opposite direction as the direction that was used to obtain

128
00:10:26,010 --> 00:10:27,060
this are post here.

129
00:10:27,060 --> 00:10:29,100
So they kind of go together.

130
00:10:29,250 --> 00:10:34,430
And then of course the value of the perturbation itself which we can call deep.

131
00:10:34,680 --> 00:10:43,680
So that's the triplets values that are all into rollouts which is one of our arguments here but we will

132
00:10:43,950 --> 00:10:46,090
build to roll out later on.

133
00:10:46,380 --> 00:10:52,640
So for all these triplets of positive reward negative reward and perturbations.

134
00:10:52,860 --> 00:11:01,230
Well now we can have data step by taking the step which is right now equal to zero and we're going to

135
00:11:01,590 --> 00:11:03,340
add to this step.

136
00:11:03,390 --> 00:11:05,050
That's why I'm using plus equals.

137
00:11:05,280 --> 00:11:12,960
We're going to add exactly this element here that is the finit difference of the word obtained in the

138
00:11:12,960 --> 00:11:18,990
positive direction and the reward obtained in negative direction multiplied by the perturbation.

139
00:11:18,990 --> 00:11:21,900
And we already have all these things to rule out.

140
00:11:21,900 --> 00:11:23,240
This is our pose.

141
00:11:23,370 --> 00:11:24,380
This is our neg.

142
00:11:24,540 --> 00:11:25,420
And this is D.

143
00:11:25,680 --> 00:11:34,760
So what we're simply going to add to our current value of our matrix step here is our pas plus every

144
00:11:34,800 --> 00:11:42,150
word minus our neg the negative word or should I say we word obtained in the opposite direction as the

145
00:11:42,150 --> 00:11:51,420
direction that was used to obtain Arpels multiplied by D which is the perturbation that was applied

146
00:11:51,690 --> 00:11:56,300
in the same direction that led us to obtain our POWs and RNA.

147
00:11:56,320 --> 00:12:00,830
But this was obtained in the opposite direction of this perturbation.

148
00:12:00,840 --> 00:12:01,650
All right.

149
00:12:01,650 --> 00:12:04,670
So we have just a data step that's great.

150
00:12:04,660 --> 00:12:06,600
And now we have one final thing to do.

151
00:12:06,780 --> 00:12:13,680
It's just of course the most important it is of course to update the matrix of weight our matrix of

152
00:12:13,680 --> 00:12:21,510
weight theta is going to become theta plus all this which we already computed thanks to this for loop

153
00:12:21,810 --> 00:12:28,200
but multiplied by this factor here which is the learning rate Alpha bit which we called learning rate

154
00:12:28,620 --> 00:12:34,020
divided by the total number of best directions times the standard deviation of the reward.

155
00:12:34,200 --> 00:12:34,680
All right.

156
00:12:34,680 --> 00:12:41,250
We already have all this in our arguments so let's make that final computation we need to get of the

157
00:12:41,250 --> 00:12:47,980
for loop because the for loop was just to compute this some here so that's done.

158
00:12:48,120 --> 00:12:54,480
So now we just need to add this factor here and this is exactly what we're going to do as the last line

159
00:12:54,570 --> 00:12:56,340
of this update function.

160
00:12:56,760 --> 00:13:06,360
Let's do this let's update our matrix of weight theta so self that theta is going to become the previous

161
00:13:06,810 --> 00:13:13,740
self that theta Plus that's what I'm adding plus you call our learning rate which we can get by taking

162
00:13:13,800 --> 00:13:20,390
our hyper parameter object that was not created yet but which we will create in the future HP.

163
00:13:20,520 --> 00:13:24,450
And then we take its proper variable learning rates.

164
00:13:24,480 --> 00:13:25,280
Here we go.

165
00:13:25,410 --> 00:13:33,910
That's the learning rate which we have to divide by Remember the total number of resurrections be times

166
00:13:34,240 --> 00:13:36,090
the standard deviation of the word.

167
00:13:36,460 --> 00:13:42,250
Let's go back let's get first of all number of resurrections B which is one of our hyper parameters

168
00:13:42,250 --> 00:13:44,370
so I'm taking H-P again.

169
00:13:44,500 --> 00:13:46,320
Our hyper barrière object.

170
00:13:46,480 --> 00:13:48,710
And from this object I'm it's variable.

171
00:13:48,970 --> 00:13:52,330
And the best directions.

172
00:13:52,510 --> 00:13:59,890
And then we multiply this by Sigma R which is a standard deviation of the word and which is one of our

173
00:13:59,980 --> 00:14:02,560
argument here we'll compute that later.

174
00:14:02,800 --> 00:14:09,310
But at least we have everything to make this formula to have the matrix of weight in a direction that

175
00:14:09,400 --> 00:14:15,190
increases the most that we want which is exactly what we want to do with our AI because the higher the

176
00:14:15,190 --> 00:14:19,030
we word the better the AI has the ability to walk.

177
00:14:19,060 --> 00:14:19,630
All right.

178
00:14:19,660 --> 00:14:26,980
And then of course final step we need to multiply this by the step that is the sum of the finit differences

179
00:14:26,980 --> 00:14:29,960
of the word multiplied by the perturbation.

180
00:14:30,340 --> 00:14:31,470
Congratulations.

181
00:14:31,480 --> 00:14:38,290
That was the most important step of this implementation of the step the step of approximated great in

182
00:14:38,290 --> 00:14:44,110
the sense you did it and now we have two final things to do first.

183
00:14:44,170 --> 00:14:50,440
Another function which is to explore function and that will compute either an average reward or some

184
00:14:50,440 --> 00:14:57,190
of reward on one full episode because we imagine that we're not going to compare the words in the different

185
00:14:57,190 --> 00:14:59,630
directions on one single action.

186
00:14:59,650 --> 00:15:03,450
We want to compare them on a large series of actions in a row.

187
00:15:03,580 --> 00:15:07,330
Otherwise this wouldn't be very relevant to compare the rewards to each other.

188
00:15:07,330 --> 00:15:11,590
You know we want to have a relevant measure of that we want and so we're going to measure the words

189
00:15:11,650 --> 00:15:13,020
on one full episode.

190
00:15:13,020 --> 00:15:20,770
And I remind that an episode is just a session where either the AI manages to walk up to the episode

191
00:15:20,770 --> 00:15:23,320
length or it's when it fails.

192
00:15:23,360 --> 00:15:25,760
You know when it fails that's the end of the episode.

193
00:15:25,840 --> 00:15:29,680
We reset the environment and the air tries to walk again.

194
00:15:29,890 --> 00:15:37,300
So we're going to measure the words in each of the 16 directions on one full episode.

195
00:15:37,300 --> 00:15:44,770
And so in the next day Doyle will make this explore function that will simply return either the average

196
00:15:44,770 --> 00:15:51,460
word or the accumulated reward by playing the perturbations in one specific direction over one full

197
00:15:51,520 --> 00:15:59,200
episode and then we'll compare all the accumulated rewards for the different directions and we'll sort

198
00:15:59,200 --> 00:16:05,140
them by the best ones you know the best couples of rewards of ten are POWs in our neck.

199
00:16:05,470 --> 00:16:09,640
And that's exactly what we'll do as this step 6 year.

200
00:16:09,670 --> 00:16:16,140
So the directions by Max Pozza reward up as it were reward over the different directions.

201
00:16:16,150 --> 00:16:16,690
All right.

202
00:16:16,690 --> 00:16:18,870
So we'll do all this in the next tutorials.

203
00:16:18,940 --> 00:16:20,620
And until then enjoy AI.