1
00:00:00,450 --> 00:00:02,610
Hello and welcome to this new tutorial.

2
00:00:02,670 --> 00:00:06,500
Now we are one or two tutorials away from getting the final results.

3
00:00:06,720 --> 00:00:11,500
Well, we'll indeed need a first tutorial to run the main code, as you see here.

4
00:00:11,610 --> 00:00:17,310
And, you know, create all the different objects, get our policy well initialized with all its weights

5
00:00:17,340 --> 00:00:18,380
equal to zero.

6
00:00:18,450 --> 00:00:24,750
You know, at the beginning, and then apply all the updates thanks to the train function, which indeed has

7
00:00:24,780 --> 00:00:28,950
integrated all the tools that we've developed in this implementation.

8
00:00:28,950 --> 00:00:30,870
And that's the finish line of the journey.

9
00:00:30,870 --> 00:00:37,170
Then we'll have nothing to do: the training will execute and we will see the rewards of each training loop

10
00:00:37,420 --> 00:00:44,320
populating here in the console, and we'll see mostly whether the reward is increasing over the training.

11
00:00:44,380 --> 00:00:51,390
I remind you that there are going to be one thousand training loops, 1000 steps, or, you know, 1000

12
00:00:51,490 --> 00:00:53,960
updates of the policy.

13
00:00:53,970 --> 00:00:55,240
All right so let's do this.

14
00:00:55,410 --> 00:01:01,160
As you can notice, as I told you in the previous tutorial, I included this code here, which is, you know, the

15
00:01:01,170 --> 00:01:07,350
classic code you can find on the Internet, because indeed mkdir means make directory.

16
00:01:07,470 --> 00:01:13,910
And there is even a Wikipedia page for mkdir, because it is a Linux tool to indeed make a directory.

17
00:01:14,010 --> 00:01:18,990
And you know since this has nothing to do with AI I prefer to do it here.

18
00:01:19,020 --> 00:01:20,740
You can find it on the Internet.

19
00:01:20,760 --> 00:01:26,520
Basically what it will do is that it will create you know when we execute the code the first folder

20
00:01:26,730 --> 00:01:33,570
which will be exp, then a second subfolder which will be brs, and a third subfolder which will

21
00:01:33,660 --> 00:01:41,850
be monitor, and inside this monitor subfolder you will have all the videos of your AI trying to walk on

22
00:01:41,850 --> 00:01:43,080
the field.
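The folder creation described here can be sketched as follows. This is a minimal illustration, not the tutorial's exact code: the helper is written from scratch with the standard library, the folder names exp, brs and monitor are assumed from the narration, and a temporary root is used only so the demo does not write into the current directory.

```python
import os
import tempfile


def mkdir(base, name):
    """Create the folder base/name (and any missing parents) and return its path."""
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Temporary root used for the demo only; a real run could start from the
# current working directory instead.
root = tempfile.mkdtemp()
work_dir = mkdir(os.path.join(root, 'exp'), 'brs')   # exp/brs
monitor_dir = mkdir(work_dir, 'monitor')             # exp/brs/monitor
```

The returned monitor_dir is the path that is later handed to the video recorder.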

23
00:01:43,080 --> 00:01:47,010
All right so let's do this let's run the main code.

24
00:01:47,010 --> 00:01:50,020
So, the first thing we have to do, according to you, what is it?

25
00:01:50,190 --> 00:01:57,370
Well, very naturally, the first thing we create here is our hyperparameters object.

26
00:01:57,370 --> 00:02:04,740
Our HP object of the HP class because by creating this object we'll get all these hyper parameters here

27
00:02:05,140 --> 00:02:08,340
well-defined and which will be fixed during the training.

28
00:02:08,340 --> 00:02:15,420
You know, we won't be able to change them anyhow. So let's do this, let's create this object. And to create such

29
00:02:15,420 --> 00:02:16,140
an object.

30
00:02:16,140 --> 00:02:21,000
The only thing that we need to do is first give a name to that object which we're going to call simply

31
00:02:21,300 --> 00:02:29,030
hp, and then we call the Hp class, which is differentiated thanks to the capital H here.

32
00:02:29,200 --> 00:02:36,980
The Hp class, with some parentheses, and there you go: that creates an object of your hyperparameters Hp

33
00:02:36,990 --> 00:02:43,800
class and therefore by creating this object all these variables are defined and well initialized and

34
00:02:43,800 --> 00:02:45,990
fixed during the whole training.
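As a rough sketch of what creating this object looks like: the class below is a stand-in, since the real Hp class was written in an earlier tutorial. Only the 1000 training steps, the seed and the still-empty environment name come from the narration; the attribute names and the seed value are assumptions.

```python
class Hp:
    """Container for the hyperparameters, set once when the object is created."""

    def __init__(self):
        self.nb_steps = 1000  # number of training loops mentioned in the tutorial
        self.seed = 1         # hypothetical value, used later to seed NumPy
        self.env_name = ''    # left empty here; chosen at the very end


# Lowercase hp is the object, capital-H Hp is the class. No arguments are
# needed because __init__ takes none besides self.
hp = Hp()
```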

35
00:02:45,990 --> 00:02:49,940
I know we haven't chosen an environment name so far.

36
00:02:49,950 --> 00:02:51,900
We will do that at the very end, you know.

37
00:02:51,990 --> 00:02:57,700
We haven't executed anything yet so we still have time to choose a first environment to work on.

38
00:02:57,840 --> 00:02:59,700
But we'll do that in the end.

39
00:02:59,730 --> 00:03:02,450
All right so we have our hyper parameters now.

40
00:03:02,470 --> 00:03:03,940
And now, what is the next step?

41
00:03:04,110 --> 00:03:11,820
Well the next step is actually optional but the next step could be to you know choose a seed for the

42
00:03:11,820 --> 00:03:18,180
training and that's mostly for one purpose it's so that we have you know both you the student and instructor

43
00:03:18,450 --> 00:03:24,600
have the same results and the same policy in the end, because indeed, by fixing a seed, we will fix a seed

44
00:03:24,600 --> 00:03:31,440
with NumPy, and therefore for all the random operations that we do with NumPy, such as the sampling of

45
00:03:31,440 --> 00:03:33,140
the perturbations deltas here.

46
00:03:33,330 --> 00:03:35,120
Well we'll get the same sampling.

47
00:03:35,130 --> 00:03:42,120
You know, these are random values following a normal distribution, but by fixing the same seed in our code

48
00:03:42,420 --> 00:03:47,550
we'll get the same random values, therefore the same perturbations, therefore the same updates of the

49
00:03:47,550 --> 00:03:50,570
weights, and therefore, in the end, the same policy.

50
00:03:50,570 --> 00:03:56,190
All right, so you can choose another seed if you want, but if you want to get the same results as mine, well,

51
00:03:56,280 --> 00:03:57,410
let's fix a seed.

52
00:03:57,780 --> 00:04:02,880
And to fix this with NumPy, well, we take our NumPy library first, which has the shortcut np,

53
00:04:03,330 --> 00:04:10,260
and then we take the random module because indeed the seed is applied on all the random functions we

54
00:04:10,260 --> 00:04:12,300
use in our implementation.

55
00:04:12,480 --> 00:04:20,740
And then from this random module we take the seed function, which will take as input exactly this seed

56
00:04:20,880 --> 00:04:29,340
hyperparameter that we not only created, but that is now existing because we created that object

57
00:04:29,610 --> 00:04:32,980
which includes this hyper parameter.

58
00:04:33,210 --> 00:04:37,800
OK, so hp.seed: that fixes the seed, and we'll get the same results.
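A small sketch of why fixing the seed gives identical runs. The seed value here is a hypothetical stand-in for hp.seed, and randn is used only as an example of a normally distributed draw like the perturbations.

```python
import numpy as np

seed = 1  # hypothetical value standing in for hp.seed

np.random.seed(seed)
first = np.random.randn(3)   # three samples from a standard normal distribution

np.random.seed(seed)         # re-seeding restarts the same random sequence
second = np.random.randn(3)

# Same seed, same draws: both arrays hold identical values.
print(np.array_equal(first, second))
```

Anything downstream that depends only on these draws (perturbations, weight updates, final policy) is then reproducible as well, as long as the rest of the code and the library versions match.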

59
00:04:38,130 --> 00:04:45,500
And the important new step is about connecting the environment to our AI.

60
00:04:45,500 --> 00:04:49,970
In other words, to choose the environment on which the AI will be trained.

61
00:04:50,280 --> 00:04:56,700
So I won't hide from you that I spent quite a lot of time on that, because indeed we always do our best so

62
00:04:56,700 --> 00:05:01,350
that you don't get any, you know, running error or installation issue.

63
00:05:01,480 --> 00:05:07,760
And there are actually many ways to connect your AI to the PyBullet environment. You have a PyBullet module

64
00:05:07,800 --> 00:05:09,080
that you can install.

65
00:05:09,160 --> 00:05:14,500
But I noticed that this would cause some issues on the different systems.

66
00:05:14,500 --> 00:05:19,570
You might be having like you know Linux Windows or Mac and it had some incompatibilities with one of

67
00:05:19,570 --> 00:05:20,010
them.

68
00:05:20,200 --> 00:05:26,020
So this wasn't the best solution but fortunately I found a much better solution which is direct and

69
00:05:26,020 --> 00:05:31,750
which won't cause any installation issue because now it is well developed and therefore there is a way

70
00:05:31,750 --> 00:05:34,110
to install it smoothly on the three systems.

71
00:05:34,170 --> 00:05:36,830
It's through the gym environment.

72
00:05:36,940 --> 00:05:42,530
You know, we're going to connect our PyBullet environment through Gym.

73
00:05:42,640 --> 00:05:44,300
That's the easiest way to do it.

74
00:05:44,410 --> 00:05:47,140
And that will work the best for the three systems.

75
00:05:47,230 --> 00:05:47,830
OK.

76
00:05:47,950 --> 00:05:49,990
So let's take the solution.

77
00:05:49,990 --> 00:05:51,340
Of course I will do a

78
00:05:51,350 --> 00:05:57,870
tutorial on how to install Gym and PyBullet. I tested it on the three systems and no issue was encountered.

79
00:05:57,910 --> 00:06:02,650
You shouldn't have any problem but of course if you have any problem we'll provide some debug solution

80
00:06:02,650 --> 00:06:05,830
and I will actually make a full tutorial on that.

81
00:06:05,830 --> 00:06:10,980
All right, so let's do this, let's get our environment through Gym.

82
00:06:11,200 --> 00:06:18,820
And therefore now what we're going to do is you know since we called our hyper parameter for the environment

83
00:06:19,000 --> 00:06:25,210
env_name, well, we're going to take the gym library and use the make function, which will take as

84
00:06:25,210 --> 00:06:32,590
input this name, and which will make this environment ready through the gym library. And how are we going

85
00:06:32,590 --> 00:06:33,040
to do this?

86
00:06:33,040 --> 00:06:41,740
Well, let's first see that here, env_name. And now what we're going to do is, here on the next line, we're going

87
00:06:41,740 --> 00:06:44,830
to introduce a new variable, which will be env.

88
00:06:44,860 --> 00:06:51,340
And that will represent exactly the environment, but made by Gym. And therefore, since we want it to be

89
00:06:51,430 --> 00:06:58,630
made by Gym, well, we're going to take gym first, and then that's where we use the make function, in which

90
00:06:58,630 --> 00:06:59,140
we put,

91
00:06:59,140 --> 00:07:07,420
indeed, our hyperparameters object, and specifically the env_name variable from this hp object.

92
00:07:07,420 --> 00:07:07,710
All right.

93
00:07:07,720 --> 00:07:12,040
But then, as you can notice, we have the warning: undefined name 'gym'.

94
00:07:12,180 --> 00:07:14,740
Well that's because we have to import it.

95
00:07:14,770 --> 00:07:16,480
It wasn't imported yet.

96
00:07:16,480 --> 00:07:22,630
So we're going to add it here because this is where we have all the libraries that we need to import

97
00:07:22,630 --> 00:07:24,050
for the implementation.

98
00:07:24,130 --> 00:07:29,290
And there is nothing simpler than importing gym: you just need to do the same

99
00:07:29,290 --> 00:07:31,960
as for the others: import gym.

100
00:07:32,230 --> 00:07:34,540
Right now the warning should disappear.

101
00:07:34,540 --> 00:07:35,430
Here we go.

102
00:07:35,440 --> 00:07:36,460
Disappeared.

103
00:07:36,880 --> 00:07:41,190
OK, so now we have our environment made by Gym.

104
00:07:41,470 --> 00:07:44,060
And later on we'll specify the name of the environment.

105
00:07:44,140 --> 00:07:46,070
Let's do that at the very end.

106
00:07:46,240 --> 00:07:49,650
So it's not over yet to get the environment ready.

107
00:07:49,660 --> 00:07:55,520
The next thing we have to do, now that env is our environment made by Gym with the right name,

108
00:07:55,760 --> 00:08:02,680
well, what we want to do now is to indeed check out the videos of our different AIs walking on the field

109
00:08:03,040 --> 00:08:03,800
on our screen.

110
00:08:03,980 --> 00:08:08,070
You know, on our monitor. And so what we're going to do now is use wrappers.

111
00:08:08,080 --> 00:08:12,280
You know, we have the environment, but we want to watch the training happening in that environment,

112
00:08:12,610 --> 00:08:15,980
and to do that we need to take the wrappers module of Gym.

113
00:08:16,030 --> 00:08:27,350
So not only are we going to import gym, but from gym we will import these wrappers, which will allow us

114
00:08:27,350 --> 00:08:34,850
to see our AIs walking on the field on our monitor, you know, inside this sub-subfolder that we created

115
00:08:34,850 --> 00:08:37,140
thanks to the mkdir function.

116
00:08:37,250 --> 00:08:37,850
OK.

117
00:08:38,060 --> 00:08:45,700
So, from gym import wrappers. Now we have a warning saying that it isn't used, but we are about to use

118
00:08:45,700 --> 00:08:53,380
it now, because indeed, in order to basically see the videos of our AIs on our monitor, well, we need

119
00:08:53,390 --> 00:09:01,170
to take that wrappers module of Gym, and then we take the Monitor class, which will take as input our environment

120
00:09:01,600 --> 00:09:08,980
made ready by Gym, then the location where we want to have these videos of our AI walking on the field on

121
00:09:08,980 --> 00:09:15,850
our computer, and this location is exactly this monitor_dir, which basically is the directory folder

122
00:09:16,180 --> 00:09:21,520
of the future videos that will be populated once we launch the training. So, monitor_dir here.

123
00:09:21,520 --> 00:09:23,620
That's the second argument here.

124
00:09:24,040 --> 00:09:24,730
Perfect.

125
00:09:24,850 --> 00:09:30,940
And then we're going to use the following trick so that we are not stopped by any warning, because

126
00:09:30,940 --> 00:09:34,130
you'll see that when we execute the code we'll have some warnings.

127
00:09:34,240 --> 00:09:40,190
And in order to prevent those warnings from stopping the training Well we're going to add this trick.

128
00:09:40,210 --> 00:09:43,400
force equals True.

129
00:09:43,400 --> 00:09:51,010
All right so we're going to force the training and videos to be populated inside this monitor directory

130
00:09:51,010 --> 00:09:52,320
folder.

131
00:09:52,330 --> 00:09:52,660
All right.

132
00:09:52,660 --> 00:10:01,750
And that's the trick to not only get the environment of PyBullet but also to get the videos of our AI trying

133
00:10:01,750 --> 00:10:05,950
to walk on the field on a specific monitor directory folder.
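The environment setup just described can be sketched as a small function. This is hedged: wrappers.Monitor exists only in older Gym releases (later releases dropped it in favour of other recording wrappers), the function name make_monitored_env is made up for illustration, and PyBullet's environments typically have to be registered with Gym before gym.make can find them by name.

```python
def make_monitored_env(env_name, monitor_dir):
    """Build a Gym environment and record videos of its episodes into monitor_dir."""
    # Imported inside the function so that defining it does not require Gym
    # to be installed; calling it does.
    import gym
    from gym import wrappers

    env = gym.make(env_name)
    # force=True tells the Monitor to clear earlier recordings found in
    # monitor_dir instead of refusing to start.
    env = wrappers.Monitor(env, monitor_dir, force=True)
    return env
```

With the objects from the tutorial, the call would look like make_monitored_env(hp.env_name, monitor_dir), once an environment name has been chosen.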

134
00:10:06,120 --> 00:10:06,720
OK.

135
00:10:06,820 --> 00:10:09,680
That's the best we could use; this is the simplest one.

136
00:10:09,770 --> 00:10:12,240
And that's what works best for the three systems.

137
00:10:12,620 --> 00:10:15,820
OK so we have now our environment.

138
00:10:15,820 --> 00:10:16,680
Perfect.

139
00:10:16,870 --> 00:10:19,200
And now, according to you, what do we need?

140
00:10:19,240 --> 00:10:23,770
You know, we made a lot of tools, but what do we need to take exactly to launch this training?

141
00:10:23,950 --> 00:10:25,210
Well that's very simple.

142
00:10:25,220 --> 00:10:34,600
See, the train function as we made it before is taking as arguments the environment, the policy, the normalizer

143
00:10:34,750 --> 00:10:36,570
and our hyper parameters.

144
00:10:36,670 --> 00:10:40,230
And so you know to launch this training of our AI.

145
00:10:40,450 --> 00:10:44,440
Well the only thing to do is to get these four arguments ready.

146
00:10:44,440 --> 00:10:47,020
We already made our hyper parameters ready.

147
00:10:47,350 --> 00:10:50,890
That's what we did here by creating this object of the Hp class.

148
00:10:51,010 --> 00:10:53,960
Then we made our environment ready through Gym.

149
00:10:54,280 --> 00:10:59,880
And not only we did that but also we used a trick to populate the videos of our AI trying to walk in

150
00:10:59,880 --> 00:11:00,510
a field.

151
00:11:00,700 --> 00:11:08,290
So the only thing remaining that we have to do now is to, of course, get our policy ready, you know, creating

152
00:11:08,590 --> 00:11:13,180
the policy object of the policy class that's for sure.

153
00:11:13,180 --> 00:11:19,660
Because indeed, I remind you that with ARS, the whole principle of ARS is that we are doing an exploration

154
00:11:19,960 --> 00:11:26,860
on the space of policies, and therefore, you know, we're starting with a policy that is a perceptron with weights

155
00:11:26,920 --> 00:11:28,570
all initialized to zero.

156
00:11:28,840 --> 00:11:36,100
So that's the beginning of our exploration and then we're going to explore different policies by trying

157
00:11:36,100 --> 00:11:39,100
some different updates of the weights in different directions.

158
00:11:39,160 --> 00:11:43,310
And that's where the exploration in the space of policies happens.

159
00:11:43,480 --> 00:11:49,630
So indeed we're going to create one object for the policy but then the exploration of space of policies

160
00:11:49,930 --> 00:11:56,500
happens by, you know, trying different updates of the weights of this policy in different directions.

161
00:11:56,500 --> 00:11:58,350
All right so it's important to get that.

162
00:11:58,510 --> 00:12:02,950
The next step we have to do is of course to build our policy object.

163
00:12:02,980 --> 00:12:03,770
OK.

164
00:12:03,820 --> 00:12:08,950
And then of course the last step and this will really be the last step before we you know execute the

165
00:12:08,950 --> 00:12:13,660
train function is to get our normalizer ready as well.

166
00:12:13,660 --> 00:12:14,010
All right.

167
00:12:14,020 --> 00:12:16,120
So it's going to be done pretty soon.

168
00:12:16,120 --> 00:12:16,970
We're almost there.

169
00:12:16,970 --> 00:12:20,650
The final exciting step. Let's take care of our policy first.

170
00:12:20,650 --> 00:12:25,390
So now my next question is: what do we need to create our policy?

171
00:12:25,540 --> 00:12:27,690
Well, let's scroll back up.

172
00:12:27,790 --> 00:12:28,300
Here we go.

173
00:12:28,300 --> 00:12:29,830
That's our policy class.

174
00:12:29,950 --> 00:12:36,190
And now to see what we need to create a new instance of this class or you know this policy object of

175
00:12:36,190 --> 00:12:36,940
this class.

176
00:12:37,150 --> 00:12:42,910
Well we simply need to look at the init method and see the arguments that it needs and it needs you

177
00:12:42,910 --> 00:12:49,570
know, the input size, which is the number of elements in the input state vector, and the output

178
00:12:49,570 --> 00:12:55,960
size which is of course the number of actions the AI has to predict at each time.

179
00:12:56,350 --> 00:12:57,400
Again that's it.

180
00:12:57,460 --> 00:13:02,920
That's all you need to have to create your policy and by the way for the hyper parameters Well indeed

181
00:13:02,920 --> 00:13:09,590
we didn't have anything to specify in the arguments of the HP class to create our HP object.

182
00:13:09,670 --> 00:13:16,420
Because indeed, in the __init__ method we are not taking any arguments. But for the Policy class, indeed, we will

183
00:13:16,420 --> 00:13:19,910
need input size and the output size so let's get them right now.

184
00:13:19,990 --> 00:13:21,420
It's quite easy.

185
00:13:21,490 --> 00:13:26,920
You know, thanks to what we did with making the environment ready by Gym and then using these wrappers

186
00:13:26,920 --> 00:13:32,210
to get not only the videos but also all the tools that we have in the PyBullet library.

187
00:13:32,380 --> 00:13:34,300
Basically what we did here is a shortcut.

188
00:13:34,480 --> 00:13:42,070
So let's get first the number of inputs which I remind is the number of elements in your input state

189
00:13:42,070 --> 00:13:42,980
vector.

190
00:13:43,220 --> 00:13:45,290
And so to get it it's quite easy.

191
00:13:45,330 --> 00:13:52,740
You know, it's just how the PyBullet library works: you take your environment first, then you take the observation

192
00:13:53,250 --> 00:13:58,860
space, which is something that exists already, you know, it's part of the whole PyBullet library: observation

193
00:13:58,860 --> 00:14:04,270
space, the space of your observations and therefore the space of your inputs, observation_space. And then

194
00:14:04,350 --> 00:14:12,600
you get the shape attribute, because indeed that's what will get you, by taking index 0, the number of

195
00:14:12,650 --> 00:14:18,650
inputs. Right, the number of inputs is contained at that specific index.

196
00:14:18,810 --> 00:14:26,940
Then we're going to do the same for the number of outputs which is of course the number of actions that

197
00:14:26,940 --> 00:14:30,020
the AI is predicting to play at each time t.

198
00:14:30,330 --> 00:14:36,960
And to get them, again, we take our environment, then a dot, and then we're going to take, well, here the action

199
00:14:36,960 --> 00:14:38,540
space exactly that.

200
00:14:38,940 --> 00:14:48,500
And then again shape, and in square brackets zero. Basically, index zero returns the number of elements of the space you're looking

201
00:14:48,500 --> 00:14:48,890
at.

202
00:14:48,890 --> 00:14:49,430
OK.

203
00:14:49,670 --> 00:14:53,170
So here the action space, and here the space of inputs.
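To make the shape[0] lookup concrete, here is a sketch with a stand-in environment. The observation_space and action_space attributes and their shape tuples mirror what a Gym environment exposes; the stand-in object and the sizes 26 and 6 are made up for illustration.

```python
from types import SimpleNamespace

# Stand-in for a Gym environment. A real env would come from gym.make;
# the sizes below are hypothetical.
env = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(26,)),
    action_space=SimpleNamespace(shape=(6,)),
)

# shape is a tuple; for a flat vector space its first entry is the length.
nb_inputs = env.observation_space.shape[0]   # elements in the input state vector
nb_outputs = env.action_space.shape[0]       # actions predicted at each time t
print(nb_inputs, nb_outputs)                 # prints "26 6"
```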

204
00:14:53,540 --> 00:14:54,460
Perfect.

205
00:14:54,470 --> 00:14:54,770
Great.

206
00:14:54,770 --> 00:15:01,400
So we have a number of inputs and the number of outputs now and therefore we are ready to create our

207
00:15:01,610 --> 00:15:07,820
policy object, because indeed the __init__ method is asking for the input size and the output size.

208
00:15:07,820 --> 00:15:09,450
So how are we going to do this?

209
00:15:09,590 --> 00:15:15,230
Well, again, we have to give a name to that object, which we're going to call policy, and then we take

210
00:15:15,290 --> 00:15:22,040
our policy class this time with capital P and then our two arguments are required as we can see here

211
00:15:22,350 --> 00:15:24,860
nb_inputs first, for the input size.

212
00:15:24,860 --> 00:15:28,880
You know we just gave some different names for the arguments.

213
00:15:28,970 --> 00:15:31,540
That's quite safer to do.

214
00:15:31,790 --> 00:15:41,700
And then nb_outputs for, of course, the output size. Here we go, we have our AI ready and initialized

215
00:15:41,940 --> 00:15:44,850
as a perceptron with weights equal to zero.
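A minimal sketch of such a zero-initialized perceptron policy. The real Policy class was built in an earlier tutorial and has more methods; the attribute name theta, the evaluate method and the sizes used here are assumptions for illustration.

```python
import numpy as np


class Policy:
    """Linear policy: one row of weights per action, one column per input."""

    def __init__(self, input_size, output_size):
        # All weights start at zero; training then perturbs and updates them.
        self.theta = np.zeros((output_size, input_size))

    def evaluate(self, state):
        # A perceptron without activation: actions are a linear map of the state.
        return self.theta.dot(state)


nb_inputs, nb_outputs = 4, 2              # made-up sizes
policy = Policy(nb_inputs, nb_outputs)    # capital-P class, lowercase object
actions = policy.evaluate(np.ones(nb_inputs))
```

With all weights at zero, every predicted action is zero until the first update, which is the starting point of the exploration in the space of policies.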

216
00:15:45,120 --> 00:15:48,500
And then when we run the train function now this function here.

217
00:15:48,750 --> 00:15:54,950
Well you know after initializing the perturbations in the different directions Well we get you know

218
00:15:54,990 --> 00:16:00,440
the positive rewards and negative rewards, and then we sort them by the maximum of each couple of rewards.

219
00:16:00,450 --> 00:16:05,040
And then here at this specific time of the implementation or the training.

220
00:16:05,160 --> 00:16:07,980
Well that's where the policy will be updated.

221
00:16:07,980 --> 00:16:15,360
And therefore that's where the weights will be updated in different directions, to explore the weights that

222
00:16:15,360 --> 00:16:17,790
will lead to the highest rewards.

223
00:16:17,790 --> 00:16:23,960
All right so you have to understand now we're just you know taking a step back to understand how all

224
00:16:24,040 --> 00:16:27,130
this will work you know how all this will happen.

225
00:16:27,310 --> 00:16:27,910
OK.

226
00:16:28,050 --> 00:16:30,820
So here we go.

227
00:16:30,900 --> 00:16:32,340
Our policy is ready.

228
00:16:32,460 --> 00:16:40,320
And now let's go back to the train function. The train function takes the environment, which we have,

229
00:16:40,600 --> 00:16:44,480
the policy which we have the hyper parameters which we also have.

230
00:16:44,640 --> 00:16:48,070
And also the normalizer which we don't have yet.

231
00:16:48,120 --> 00:16:49,960
That's the only remaining thing we need.

232
00:16:50,040 --> 00:16:52,440
And that's what we're going to get right now.

233
00:16:52,650 --> 00:17:01,440
So let's get our normalizer which is going to be an object or you know an instance of this normalizer

234
00:17:01,440 --> 00:17:02,240
class.

235
00:17:02,310 --> 00:17:03,570
So we're going to call this class.

236
00:17:03,570 --> 00:17:10,500
And now, to know which arguments we will need as input of this class, well, we just need to look at the __init__

237
00:17:10,500 --> 00:17:16,260
method, and the __init__ method only takes one argument, which is the number of inputs, and which we already

238
00:17:16,260 --> 00:17:16,950
have.

239
00:17:16,950 --> 00:17:18,350
So perfect.

240
00:17:18,570 --> 00:17:27,750
Let's create our normalizer object by calling this Normalizer class and by taking as input of this

241
00:17:27,750 --> 00:17:32,640
Normalizer class the number of inputs.

242
00:17:32,970 --> 00:17:33,540
Perfect.

243
00:17:33,540 --> 00:17:34,400
Here we go.

244
00:17:34,590 --> 00:17:39,390
We have everything ready to use the train function because indeed.

245
00:17:39,390 --> 00:17:47,160
Now to get this training ready the only thing we have to do is just to take the train function and input

246
00:17:47,440 --> 00:17:51,070
the four arguments it is expecting.

247
00:17:51,180 --> 00:17:58,830
So first the environment then second our policy then third our normalizer.

248
00:17:59,490 --> 00:18:03,530
And finally our hyper parameters and there we go.

249
00:18:03,600 --> 00:18:04,810
Congratulations.

250
00:18:04,820 --> 00:18:07,430
This implementation is over.

251
00:18:07,590 --> 00:18:11,960
So, just getting the final code ready.

252
00:18:11,960 --> 00:18:20,180
Here we go: 140 lines of code. Not bad, you know, especially for a NumPy implementation. And you're going

253
00:18:20,180 --> 00:18:26,740
to see that you know this contrast between the simplicity of this implementation you know in 140 lines

254
00:18:26,750 --> 00:18:34,880
that's not much, and the results we'll get in the end is mind-blowing. So, AI 2018, get ready for it.