1
00:00:00,390 --> 00:00:02,550
Hello and welcome to this new tutorial.

2
00:00:02,730 --> 00:00:09,360
So we made good progress so far because we built the AI and now we're going to make an additional function

3
00:00:09,600 --> 00:00:15,530
which we'll call the explore function, and that will not belong to any class but is just a function

4
00:00:15,530 --> 00:00:17,900
we will use to explore.

5
00:00:18,030 --> 00:00:24,040
As you can see, it explores our policy on one specific direction and over one episode.

6
00:00:24,300 --> 00:00:27,420
So let me quickly remind you why we have to do that.

7
00:00:27,450 --> 00:00:34,910
Remember, the principle of this whole ARS algorithm is that we are testing several directions to update

8
00:00:34,940 --> 00:00:40,020
the weights, and we want to take the direction that increases the reward the most.

9
00:00:40,020 --> 00:00:44,070
And then, you know, we take the opposite direction too, because we're going to do a finite-difference approximation of the

10
00:00:44,070 --> 00:00:48,620
gradient to update the weights in that best optimal direction.

11
00:00:48,900 --> 00:00:55,920
But when we compare the directions with each other, we need to find a relevant measure of the reward; we want

12
00:00:55,920 --> 00:01:02,520
to assess in a relevant way which direction is the best and therefore we're not going to compare the

13
00:01:02,530 --> 00:01:10,400
rewards on one single action, you know, by getting the reward obtained by playing one specific action after applying

14
00:01:10,410 --> 00:01:12,660
the perturbation on the weights.

15
00:01:12,660 --> 00:01:19,470
No, we're going to do that on one full episode, because one full episode is composed of several actions

16
00:01:19,650 --> 00:01:24,510
actually a lot of actions because you know it can last a few seconds and there are a lot of actions

17
00:01:24,510 --> 00:01:29,490
happening in a few seconds and therefore we'll get lots of rewards for each direction.

18
00:01:29,490 --> 00:01:36,150
And then by just taking the average of those rewards, or the sum of the rewards, like the cumulative reward

19
00:01:36,210 --> 00:01:37,630
over the full episode.

20
00:01:37,830 --> 00:01:45,210
Well, we will have a set of several rewards, each reward corresponding to one specific direction, and this

21
00:01:45,210 --> 00:01:51,510
will be a much more relevant measure of the reward, to compare them with each other and get the

22
00:01:51,510 --> 00:01:54,350
highest ones and therefore get the best directions.

23
00:01:54,360 --> 00:02:02,130
So that's what we have to do now and we'll do it through a function which we'll call the Explore function.

24
00:02:02,130 --> 00:02:04,590
So this function is going to take several arguments.

25
00:02:04,590 --> 00:02:10,170
The first one is the environment of course because we're exploring in one specific environment and I

26
00:02:10,170 --> 00:02:12,610
remind you that we will test many of them.

27
00:02:12,900 --> 00:02:18,600
The second one is the normalizer, because we will indeed use the normalizer to normalize the states.

28
00:02:18,600 --> 00:02:25,170
I remind you that the states are the inputs of the policy, or of the perceptron of the policy, and each time we're

29
00:02:25,170 --> 00:02:30,360
going to feed those inputs to the policy we will normalize them first, so that's why I'm taking the

30
00:02:30,360 --> 00:02:31,840
normalizer here.

31
00:02:31,950 --> 00:02:34,430
Then of course we're going to take our policy.

32
00:02:34,460 --> 00:02:40,560
We're exploring our policy, so we need to specify in the arguments which policy we want to explore,

33
00:02:41,190 --> 00:02:46,390
then the direction, which we'll set by default equal to None.

34
00:02:46,530 --> 00:02:49,380
Because again we will use the evaluate method.

35
00:02:49,410 --> 00:02:55,470
Remember, from the Policy class, and I just want to remind you that the direction is set by default to None

36
00:02:55,770 --> 00:03:01,230
because the evaluate method is not only used to apply some perturbations and getting the output when

37
00:03:01,230 --> 00:03:07,620
we feed the input, but also when we simply want to get the output by feeding the inputs and without applying

38
00:03:07,650 --> 00:03:10,730
any perturbation. So direction equals None.

39
00:03:10,950 --> 00:03:13,300
And then one final argument.

40
00:03:13,560 --> 00:03:16,470
delta equals None as well.

41
00:03:16,530 --> 00:03:18,780
Right, delta is the perturbation.

42
00:03:18,780 --> 00:03:25,050
All right, so it takes these five arguments, and now we're going to do the whole exploration on one full

43
00:03:25,050 --> 00:03:25,870
episode.

44
00:03:26,220 --> 00:03:32,760
So the first thing we have to do of course is to reset the environment because we have to start from

45
00:03:32,760 --> 00:03:38,140
scratch; we have to start from the very beginning of a new episode, and the very beginning of a new episode

46
00:03:38,180 --> 00:03:39,880
is when the AI starts.

47
00:03:39,930 --> 00:03:46,020
You know, I remind you that an episode is when the AI starts to try to walk, and then at some point it either

48
00:03:46,080 --> 00:03:51,810
falls and that's the end of the episode or it manages to walk for a few seconds.

49
00:03:51,810 --> 00:03:57,060
And so in that case the end of the episode will be when the AI reaches the episode length, which

50
00:03:57,060 --> 00:03:58,980
is one of our hyperparameters.

51
00:03:59,100 --> 00:04:07,190
So we are going to reset the environment and to do this we're going to take our environment and we're

52
00:04:07,200 --> 00:04:11,180
going to add the reset method.

53
00:04:11,340 --> 00:04:16,180
But the reset method actually returns the first state when you reset the environment.

54
00:04:16,320 --> 00:04:23,220
And therefore here I'm going to add state equals to get that first state, and that's perfect.

55
00:04:23,370 --> 00:04:27,840
That's going to be our first input of this exploration on one full episode.

56
00:04:27,840 --> 00:04:36,150
All right so we have our state and now the next step is to introduce a done variable which is a Boolean

57
00:04:36,510 --> 00:04:42,110
that is equal to either true or false and which will be equal to true when the episode is done.

58
00:04:42,110 --> 00:04:48,180
That is, when we reach the end of an episode, and False if we haven't yet reached the end of the episode,

59
00:04:48,780 --> 00:04:54,390
and we will initialize it of course to False, because at the very beginning of course the episode is not

60
00:04:54,540 --> 00:04:56,220
done yet.

61
00:04:56,220 --> 00:05:04,370
Next, we have to introduce the number of actions played, and we will call this num_plays.

62
00:05:04,400 --> 00:05:10,140
So that's the number of actions played, and therefore of course we have to initialize it to zero because at

63
00:05:10,140 --> 00:05:17,760
the very beginning no action was played yet, and we will even initialize this to 0. (zero point), because later

64
00:05:17,760 --> 00:05:22,760
on you'll see that this number of actions played will be used in a float computation.

65
00:05:22,750 --> 00:05:23,310
All right.

66
00:05:23,340 --> 00:05:30,390
And then one final variable we'll introduce and initialize to zero, and it's of course what we want to

67
00:05:30,390 --> 00:05:31,270
get in the end.

68
00:05:31,350 --> 00:05:34,620
So that's the most important one, which is the sum of the rewards.

69
00:05:34,800 --> 00:05:37,750
Or if you want you can call it the accumulated reward.

70
00:05:38,040 --> 00:05:42,420
And it doesn't have to be the sum of the rewards; it could also be the average reward.

71
00:05:42,480 --> 00:05:48,590
The only thing that matters is that we need a relevant measure of the reward over the full episode.

72
00:05:48,660 --> 00:05:51,950
And, you know, the reward can be either positive or negative.

73
00:05:51,960 --> 00:05:55,600
Well, we take the accumulated reward, you know, summing the rewards.

74
00:05:55,680 --> 00:06:02,610
Each time we play a new action well this will get us a relevant measure at the end of how high the final

75
00:06:02,610 --> 00:06:05,740
reward is going to be over the full episode.

76
00:06:05,870 --> 00:06:13,050
And so we will initialize this sum of rewards to zero, and we don't have to set it as a float because it

77
00:06:13,050 --> 00:06:15,630
won't be used in a float computation.

78
00:06:15,630 --> 00:06:18,030
All right so perfect.

79
00:06:18,030 --> 00:06:21,850
Now we're done with the initialization of the variables.
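[Editor's sketch] The initialization described so far might look like the lines below. The variable names num_plays and sum_rewards are reconstructed from the audio, and StubEnv is a hypothetical stand-in for the real environment, used only so the snippet runs on its own.

```python
class StubEnv:
    """Hypothetical stand-in for the real environment (illustration only)."""

    def reset(self):
        # A real environment returns its first observation here.
        return [0.0, 0.0, 0.0]


env = StubEnv()

state = env.reset()  # reset returns the first state of the new episode
done = False         # the episode is not finished at the very beginning
num_plays = 0.       # number of actions played, kept as a float
sum_rewards = 0      # accumulated reward over the episode
```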

80
00:06:21,970 --> 00:06:30,300
And so now the next natural step is to do a loop because indeed an episode will contain a lot of actions

81
00:06:30,300 --> 00:06:36,780
to play and we will do some kind of loop until we reach the episode length that is until the number

82
00:06:36,780 --> 00:06:39,570
of actions reaches the episode length.

83
00:06:39,600 --> 00:06:44,010
The episode length is basically the total number of actions you can play in an episode.

84
00:06:44,190 --> 00:06:45,410
You can see it this way.

85
00:06:45,700 --> 00:06:53,430
So not only are we going to loop until the number of actions played reaches the episode length, but also

86
00:06:53,760 --> 00:06:57,970
we have to do this loop as long as we don't reach the end of the episode.

87
00:06:58,020 --> 00:07:03,190
That is, as long as done here is equal to False. As soon as done is equal to True,

88
00:07:03,240 --> 00:07:09,900
The exploration of the policy will be over and therefore the most natural loop that we should use here

89
00:07:10,140 --> 00:07:16,710
is not a for loop, because of this done variable here, but a while loop, since in the while loop we can specify

90
00:07:16,740 --> 00:07:21,440
easily that you want to do the loop as long as done is equal to False.

91
00:07:21,450 --> 00:07:22,060
All right.

92
00:07:22,170 --> 00:07:33,240
So the way to do this while loop is: while done is equal to False, so as long as done is not equal to True,

93
00:07:33,630 --> 00:07:40,980
and by typing while not done like that, it's exactly saying as long as done is not equal to True, because

94
00:07:40,980 --> 00:07:43,780
a Boolean written on its own in a condition is tested for being True.

95
00:07:43,860 --> 00:07:48,320
So by just writing done here, it's like writing done == True.

96
00:07:48,650 --> 00:07:52,040
So while not done == True, basically.

97
00:07:52,450 --> 00:08:01,080
And the number of actions played is lower than the full episode

98
00:08:02,110 --> 00:08:09,520
length. And as you can notice, I didn't forget the hp hyperparameters object, which we'll create later

99
00:08:09,520 --> 00:08:14,800
on, because episode_length is a variable of our hyperparameters class,

100
00:08:14,810 --> 00:08:17,950
you know, the class containing all the hyperparameters.

101
00:08:18,190 --> 00:08:26,260
So while we haven't reached the end of the episode and while the number of actions played is lower than

102
00:08:26,260 --> 00:08:29,800
the total episode length, which I remind you was set to 1000,

103
00:08:29,890 --> 00:08:36,420
well, in that case we are going to compute the accumulated reward on the full episode.
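[Editor's sketch] The loop condition just described can be isolated in a small helper to see how the two tests combine. The helper name keep_playing is hypothetical; in the tutorial the condition sits directly in the while statement and the length is read from the hp object.

```python
EPISODE_LENGTH = 1000  # value quoted in the video; normally hp.episode_length


def keep_playing(done, num_plays, episode_length=EPISODE_LENGTH):
    """Return True while the episode should continue.

    `not done` is the idiomatic way to write `done == False`.
    """
    return (not done) and num_plays < episode_length
```

The loop stops as soon as either test fails: the environment reports the episode is done, or the number of actions played reaches the episode length.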

104
00:08:36,640 --> 00:08:38,110
And so let's do this.

105
00:08:38,110 --> 00:08:43,270
You're going to see the steps to do it are going to feel very natural now that we've done the hardest

106
00:08:43,270 --> 00:08:44,370
part.

107
00:08:44,410 --> 00:08:52,060
So the first natural step is to feed the perceptron of the policy with the input, because indeed we want

108
00:08:52,060 --> 00:08:58,600
to return the output, which is the action to play, because that action to play will get us our first reward.

109
00:08:58,810 --> 00:09:05,200
But before we feed the input to the perceptron, remember, we need to normalize it.

110
00:09:05,380 --> 00:09:12,820
And before normalizing it we need, remember, the mean and the variance, because in order to normalize it

111
00:09:13,240 --> 00:09:18,200
we indeed need the mean and the standard deviation, which is the square root of the variance.

112
00:09:18,400 --> 00:09:26,800
So the very first step here is to take our normalizer which is one of our arguments here and then use

113
00:09:26,800 --> 00:09:34,990
the observe method, which we apply to the state, because, remember, we normalize the state, and the

114
00:09:34,990 --> 00:09:40,090
observe method, remember, will give us exactly the mean and the variance.

115
00:09:40,300 --> 00:09:43,570
So now we have the mean and the variance.

116
00:09:43,630 --> 00:09:47,600
And so that's when we can normalize our state.

117
00:09:47,650 --> 00:09:54,610
And so this time we're going to use the normalize method, which directly returns, remember, the normalized

118
00:09:54,730 --> 00:10:00,540
state. It takes as input the state and returns the normalized state.

119
00:10:00,550 --> 00:10:08,350
Therefore, since it returns something here, and it returns the state, we'll directly update our state, which

120
00:10:08,380 --> 00:10:11,790
so far was just the first state of the environment.

121
00:10:12,040 --> 00:10:18,120
And this state will be equal to what is returned by this normalize method of the Normalizer class, and

122
00:10:18,130 --> 00:10:25,150
therefore I'm taking the normalizer object of the Normalizer class and I'm going to use the normalize

123
00:10:25,810 --> 00:10:30,030
method which I'm going to apply to the state.

124
00:10:30,040 --> 00:10:35,800
The previous version of the state, which by applying the normalize method will become the normalized

125
00:10:35,950 --> 00:10:36,860
state.

126
00:10:36,940 --> 00:10:42,580
And now we can feed the perceptron with this state, because it was indeed normalized.

127
00:10:42,850 --> 00:10:48,190
And we want to work with the normalized version of the ARS of the paper.

128
00:10:48,610 --> 00:10:53,190
So now that we have this state, we have our input, so we can feed it to the perceptron.

129
00:10:53,410 --> 00:11:01,330
And so now we're going to naturally use the evaluate method in of course that specific direction that

130
00:11:01,330 --> 00:11:02,410
we want to explore.

131
00:11:02,410 --> 00:11:05,420
By applying the perturbation in that direction.

132
00:11:05,620 --> 00:11:12,910
So right now what we have to do is apply the evaluate method by taking as input the state, which

133
00:11:12,910 --> 00:11:20,710
is normalized, then the perturbation delta, which is one of our arguments, and the direction, which is another

134
00:11:20,710 --> 00:11:21,820
one of our arguments.

135
00:11:21,970 --> 00:11:25,990
And since the evaluate method of course returns the action to play.

136
00:11:26,200 --> 00:11:33,670
Well I'm going to get that action that we're going to play now by taking the evaluate method but be

137
00:11:33,670 --> 00:11:38,500
careful the evaluate method belongs to our policy object.

138
00:11:38,500 --> 00:11:46,780
So we take policy.evaluate. This evaluate method takes three arguments: first the input state,

139
00:11:47,350 --> 00:11:55,660
which was well normalized, then the perturbation delta, and then the direction of the perturbation.

140
00:11:55,660 --> 00:12:01,360
So right now the direction is set by default to None and delta is set by default to None, but of course when

141
00:12:01,360 --> 00:12:07,270
we are going to explore the policy on one specific direction and for one specific perturbation,

142
00:12:07,270 --> 00:12:09,650
well, these will not be equal to None.

143
00:12:09,940 --> 00:12:19,330
OK, so here we evaluate our policy by feeding it the input state and by applying the perturbation delta

144
00:12:19,360 --> 00:12:26,460
on one specific direction and all of this returns the action played in that situation.

145
00:12:26,500 --> 00:12:34,290
And now since we played an action, well, that will lead us to a next state, that will lead us to a reward,

146
00:12:34,540 --> 00:12:40,290
and also that will lead us to know whether done is still equal to False or is now True.

147
00:12:40,540 --> 00:12:46,750
And luckily for us we're going to get these three updates of the variables very easily.

148
00:12:46,750 --> 00:12:54,110
It is thanks to the step method of the environment class of the PyBullet library, which just takes as input

149
00:12:54,120 --> 00:13:00,620
one argument, which is the action played that we've just got, and that will return the state,

150
00:13:00,640 --> 00:13:06,320
the reward and this done variable, you know, when we've just played the action and we reach the new state

151
00:13:06,320 --> 00:13:13,820
of the environment. So that's why, in order to get this next state and reward, well, we simply need to

152
00:13:14,540 --> 00:13:18,400
update our variables here: state, reward and done.

153
00:13:18,650 --> 00:13:24,590
And then there's even a fourth variable but that we don't use here so no need to worry about this.

154
00:13:24,830 --> 00:13:30,650
And here we go: to get this next state, reward and done, well, we simply need to take our environment,

155
00:13:31,070 --> 00:13:38,750
which is an object of the PyBullet library, then the step method, which only takes as input one argument,

156
00:13:38,840 --> 00:13:41,870
which is of course the action that was just played.

157
00:13:42,200 --> 00:13:47,900
And this step function, this step method from the environment object of the PyBullet library, returns the

158
00:13:47,900 --> 00:13:50,090
next state of the environment,

159
00:13:50,160 --> 00:13:55,700
the reward obtained after playing that action, and whether or not the episode is done.

160
00:13:56,090 --> 00:13:56,390
All right.

161
00:13:56,390 --> 00:13:57,250
Perfect.

162
00:13:57,560 --> 00:14:04,330
And now that we've just got the reward, we have to do a trick that is very classic in reinforcement learning.

163
00:14:04,460 --> 00:14:12,860
It's the fact that we don't want to be biased by any super high reward or super low negative reward

164
00:14:12,890 --> 00:14:14,310
you know imagine that.

165
00:14:14,480 --> 00:14:20,450
Imagine that most of the rewards we get are taking small values, like for example between minus 1 and plus

166
00:14:20,450 --> 00:14:20,910
1.

167
00:14:21,140 --> 00:14:27,040
But in the same episode you find one or two rewards that are taking some very high values.

168
00:14:27,290 --> 00:14:34,460
Well, imagine the bias in the final accumulated reward or the final average reward that it would cause

169
00:14:34,730 --> 00:14:41,660
to keep these very high values; that would disturb the measure of the reward, and therefore for those

170
00:14:41,870 --> 00:14:48,500
outliers, we call them outliers, these very high rewards, or very high positive or negative rewards, that

171
00:14:48,500 --> 00:14:51,480
we can get during the exploration of one full episode,

172
00:14:51,500 --> 00:14:58,160
we want to ignore them, or rather we want to set them equal to either plus one for the very high positive rewards

173
00:14:58,400 --> 00:15:01,400
or minus one for the very high negative rewards.

174
00:15:01,610 --> 00:15:07,680
And by doing this we will protect ourselves from any bias due to the outliers.

175
00:15:07,910 --> 00:15:13,450
And the trick to do this is actually to force all the super high positive rewards to be equal to one

176
00:15:13,730 --> 00:15:17,640
and the super high negative rewards to be equal to minus one.

177
00:15:17,750 --> 00:15:24,620
So we can either do it with an if condition, but there is a better, faster trick, which is: you take

178
00:15:24,620 --> 00:15:36,950
your reward and you take the max of the min between the reward and 1, and minus 1, and you can check that

179
00:15:37,040 --> 00:15:42,770
this will get you what we want: all the rewards between minus 1 and plus 1, or more specifically all the

180
00:15:42,770 --> 00:15:49,370
super high positive rewards will be forced to be equal to one and, you know, the super low negative rewards will

181
00:15:49,370 --> 00:15:51,490
be forced to be equal to minus 1.

182
00:15:51,500 --> 00:15:53,740
That's a classic trick in reinforcement learning.

183
00:15:53,780 --> 00:15:54,560
Keep it in mind.

184
00:15:54,590 --> 00:16:00,320
You might have to use it again if you want to go further in reinforcement learning.
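[Editor's sketch] The clipping trick described above, written as a small standalone function. The name clip_reward and the low/high parameters are illustrative assumptions; the tutorial writes the max/min expression inline.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a reward to the interval [low, high].

    min() caps values above `high`; max() then raises values below `low`.
    """
    return max(min(reward, high), low)


# A few sample rewards, including two outliers.
clipped = [clip_reward(r) for r in (0.3, 25.0, -40.0)]
```

Rewards already inside the interval pass through unchanged, so only outliers are affected.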

185
00:16:00,470 --> 00:16:01,190
Perfect.

186
00:16:01,190 --> 00:16:03,850
And so now we have a few more steps to go.

187
00:16:03,860 --> 00:16:09,890
The first one is of course, now that we have the reward, to update the accumulated reward, or the sum

188
00:16:09,890 --> 00:16:11,650
of rewards and to obtain it.

189
00:16:11,660 --> 00:16:20,990
Well, naturally we need to take our sum_rewards and increment it by this last reward that we've just got.

190
00:16:21,110 --> 00:16:29,140
So plus equals reward. Perfect. Then, almost over: one last step we have to do inside this while loop.

191
00:16:29,180 --> 00:16:30,430
Can you guess what it is?

192
00:16:30,530 --> 00:16:32,680
It is of course num_plays.

193
00:16:32,810 --> 00:16:38,990
That is the number of actions played, which needs to be incremented by 1, because indeed we need to check that the

194
00:16:39,200 --> 00:16:43,640
number of actions played is lower than this episode length.

195
00:16:43,670 --> 00:16:49,280
That's needed to do all this. So we need to increment it by 1 in order to check that we haven't reached the end

196
00:16:49,310 --> 00:16:56,610
of the episode, and therefore I'm taking the number of actions played, which we called num_plays, and

197
00:16:56,990 --> 00:16:59,430
incrementing it by one.

198
00:16:59,540 --> 00:17:00,320
Perfect.

199
00:17:00,560 --> 00:17:08,240
And now we can get out of the while loop and return of course the sum of the rewards because that's

200
00:17:08,240 --> 00:17:11,050
exactly the purpose of this explore function.

201
00:17:11,250 --> 00:17:13,030
So now we have sum_rewards.
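[Editor's sketch] Putting the steps of this video together, the explore function might look like the following. The names num_plays, sum_rewards and hp.episode_length are reconstructed from the audio. Hp, StubEnv, IdentityNormalizer and ConstantPolicy are hypothetical stand-ins for the classes built in earlier videos, included only so the snippet runs by itself. The four-value return of env.step follows what is described in the video; newer environment APIs may return a different tuple.

```python
class Hp:
    """Minimal stand-in for the hyperparameters class (assumption)."""

    def __init__(self, episode_length=1000):
        self.episode_length = episode_length


hp = Hp()


def explore(env, normalizer, policy, direction=None, delta=None):
    state = env.reset()          # first state of a new episode
    done = False
    num_plays = 0.               # number of actions played so far
    sum_rewards = 0              # accumulated (clipped) reward
    while not done and num_plays < hp.episode_length:
        normalizer.observe(state)                # update running statistics
        state = normalizer.normalize(state)      # normalize the input
        action = policy.evaluate(state, delta, direction)
        state, reward, done, _ = env.step(action)
        reward = max(min(reward, 1), -1)         # clip outliers to [-1, 1]
        sum_rewards += reward
        num_plays += 1
    return sum_rewards


# --- Hypothetical stubs, for demonstration only ---
class StubEnv:
    def __init__(self, rewards):
        self.rewards = list(rewards)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        reward = self.rewards[self.t]
        self.t += 1
        done = self.t >= len(self.rewards)
        return [float(self.t)], reward, done, {}


class IdentityNormalizer:
    def observe(self, x):
        pass

    def normalize(self, x):
        return x


class ConstantPolicy:
    def evaluate(self, state, delta=None, direction=None):
        return 0.0


total = explore(StubEnv([0.5, 10.0, -7.0]), IdentityNormalizer(), ConstantPolicy())
```

With the stub rewards 0.5, 10.0 and -7.0, the two outliers are clipped to 1 and -1 before being summed.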

202
00:17:13,220 --> 00:17:19,070
Don't worry about this warning here, that's just because we haven't defined hp yet, but hp is going to be

203
00:17:19,370 --> 00:17:22,330
our future object of the hyperparameters class.

204
00:17:22,460 --> 00:17:23,340
So all good.

205
00:17:23,570 --> 00:17:30,230
And now we have this explore function that will give us a relevant measure of the reward over one full

206
00:17:30,230 --> 00:17:34,070
episode and for one specific direction of perturbation.

207
00:17:34,100 --> 00:17:41,480
So we will be able to compare in a relevant way the different rewards we obtain in different directions

208
00:17:41,540 --> 00:17:49,250
of perturbations. And now, mostly, the good news is that we are ready for the major step that will make

209
00:17:49,310 --> 00:17:57,270
our AI smart, or should I say able to walk on the field, and I'm talking of course about the training.

210
00:17:57,290 --> 00:18:02,780
The next section we'll implement now is going to be the training of the AI.

211
00:18:02,940 --> 00:18:08,430
And so we're going to make a train function that will take as input the environment, the policy, because

212
00:18:08,430 --> 00:18:15,330
we want to train that policy, the normalizer and our hyperparameters, and we will do this whole for loop

213
00:18:15,750 --> 00:18:21,710
which will basically do all the steps here from 1 to 10.

214
00:18:21,930 --> 00:18:26,970
And to do that we're going to use of course the tools we made so far, you know, the different classes that

215
00:18:26,970 --> 00:18:33,990
contain methods and therefore tools to implement all this, including the normalization of the states.

216
00:18:34,110 --> 00:18:39,540
So I can't wait to do that because that means that we're also getting closer to the final demo when

217
00:18:39,540 --> 00:18:42,860
we see our final AI walk on the field.

218
00:18:42,900 --> 00:18:44,280
It's going to be very exciting.

219
00:18:44,280 --> 00:18:46,230
I can't wait to show you all this.

220
00:18:46,260 --> 00:18:47,900
And until then enjoy AI.