1 00:00:00,450 --> 00:00:02,610 Hello and welcome to this new tutorial. 2 00:00:02,670 --> 00:00:06,500 Now we are one or two tutorials away from getting the final results. 3 00:00:06,720 --> 00:00:11,500 Well we'll when needed first toile to indeed run the main code as you see here. 4 00:00:11,610 --> 00:00:17,310 And you know create all the different objects get our policy allwell initialized with all its weights 5 00:00:17,340 --> 00:00:18,380 equal to zero. 6 00:00:18,450 --> 00:00:24,750 You know at the beginning and then applying other updates things to the chain function that indeed has 7 00:00:24,780 --> 00:00:28,950 integrated all the tools that we've developed in this implementation. 8 00:00:28,950 --> 00:00:30,870 And that's the finish line of the journey. 9 00:00:30,870 --> 00:00:37,170 Then we'll have nothing to do the training will execute and we will see the words of each training loop 10 00:00:37,420 --> 00:00:44,320 populating here in the council and we'll see mostly if the word is Will increasing over the training. 11 00:00:44,380 --> 00:00:51,390 I remind that there are going to be one thousand training loops 1000 number of steps or you know 1000 12 00:00:51,490 --> 00:00:53,960 updates of the policy. 13 00:00:53,970 --> 00:00:55,240 All right so let's do this. 14 00:00:55,410 --> 00:01:01,160 As you can notice I as I tell you in the previous the toile included this code here that you know the 15 00:01:01,170 --> 00:01:07,350 classic code you can find on the Internet because indeed and Kadeer means make directory. 16 00:01:07,470 --> 00:01:13,910 And there is even a we could be page 4 and Cadia because it is a Linux tool to indeed make the directory. 17 00:01:14,010 --> 00:01:18,990 And you know since this has nothing to do with AI I prefer to do it here. 18 00:01:19,020 --> 00:01:20,740 You can find it on the Internet. 19 00:01:20,760 --> 00:01:26,520 Basically what it will do is that it will create you know when we execute the code the first folder 20 00:01:26,730 --> 00:01:33,570 which will be X then a second subfolder which will be before us and another third subfolder which will 21 00:01:33,660 --> 00:01:41,850 monitor and inside this monitor subfolder you will have all the videos of your AI trying to walk on 22 00:01:41,850 --> 00:01:43,080 the field. 23 00:01:43,080 --> 00:01:47,010 All right so let's do this let's run the main code. 24 00:01:47,010 --> 00:01:50,020 So the first thing we have to do according to you what is it. 25 00:01:50,190 --> 00:01:57,370 Well very naturally the first thing we integrate here is our hyper parameters object. 26 00:01:57,370 --> 00:02:04,740 Our HP object of the HP class because by creating this object we'll get all these hyper parameters here 27 00:02:05,140 --> 00:02:08,340 well-defined and which will be fixed during the training. 28 00:02:08,340 --> 00:02:15,420 No we won't be able to change them anyhow so let's do this let's create this object and to create such 29 00:02:15,420 --> 00:02:16,140 an object. 30 00:02:16,140 --> 00:02:21,000 The only thing that we need to do is first give a name to that object which we're going to call simply 31 00:02:21,300 --> 00:02:29,030 HP and then we call the age Reclast which differentiates things to the Keble h here. 32 00:02:29,200 --> 00:02:36,980 HP glass and with some parenthesis and there you go that create an object of your hyper parameters HP 33 00:02:36,990 --> 00:02:43,800 class and therefore by creating this object all these variables are defined and well initialized and 34 00:02:43,800 --> 00:02:45,990 fixed during the whole training. 35 00:02:45,990 --> 00:02:49,940 I know we haven't chosen an environment name so far. 36 00:02:49,950 --> 00:02:51,900 We will do that at the very end with no. 37 00:02:51,990 --> 00:02:57,700 We haven't executed anything yet so we still have time to choose a first environment to work on. 38 00:02:57,840 --> 00:02:59,700 But we'll do that in the end. 39 00:02:59,730 --> 00:03:02,450 All right so we have our hyper parameters now. 40 00:03:02,470 --> 00:03:03,940 And now what is the next step. 41 00:03:04,110 --> 00:03:11,820 Well the next step is actually optional but the next step could be to you know choose a seed for the 42 00:03:11,820 --> 00:03:18,180 training and that's mostly for one purpose it's so that we have you know both you the student and instructor 43 00:03:18,450 --> 00:03:24,600 have the same results and the same policy in the end because indeed by fixing a seed we will fix a c 44 00:03:24,600 --> 00:03:31,440 with none by and therefore all the random operations that we do with non-bio such as the sampling of 45 00:03:31,440 --> 00:03:33,140 the perturbations deltas here. 46 00:03:33,330 --> 00:03:35,120 Well we'll get the same sampling. 47 00:03:35,130 --> 00:03:42,120 You know these are random values following a normal distribution but by fixing the same C in our code 48 00:03:42,420 --> 00:03:47,550 we'll get the same random values there for the same perturbations therefore the same updates of the 49 00:03:47,550 --> 00:03:50,570 weight and therefore in the end the same policy. 50 00:03:50,570 --> 00:03:56,190 All right so you can choose another city if you want but if you want to get same results as mine well 51 00:03:56,280 --> 00:03:57,410 let's fix a seat. 52 00:03:57,780 --> 00:04:02,880 And to fix this with an umpire where we take our non-Thai libraries first which has a shortcut entry 53 00:04:03,330 --> 00:04:10,260 and then we take the random module because indeed the seed is applied on all the random functions we 54 00:04:10,260 --> 00:04:12,300 use in our implementation. 55 00:04:12,480 --> 00:04:20,740 And then from the Saranda module we take the seed function which will take us and put exactly this seed 56 00:04:20,880 --> 00:04:29,340 high per parameter that we not only created but that is now existing because we created that object 57 00:04:29,610 --> 00:04:32,980 which includes this hyper parameter. 58 00:04:33,210 --> 00:04:37,800 OK so HPC that fixes the C will get the same result. 59 00:04:38,130 --> 00:04:45,500 And the important new step is a new step is about connecting the environment to our AI. 60 00:04:45,500 --> 00:04:49,970 In other words to choose the environment on which the AI AI will be trained. 61 00:04:50,280 --> 00:04:56,700 So I won't hide you that I spend quite a lot of time on that because indeed we always do our best so 62 00:04:56,700 --> 00:05:01,350 that you don't get any you know rounding error or installation issue. 63 00:05:01,480 --> 00:05:07,760 And there are actually many ways to connect your AI to the pilot environment you have a module of pilot 64 00:05:07,800 --> 00:05:09,080 that you can install. 65 00:05:09,160 --> 00:05:14,500 But I noticed that this would cause some issues on the different systems. 66 00:05:14,500 --> 00:05:19,570 You might be having like you know Linux Windows or Mac and it had some incompatibilities with one of 67 00:05:19,570 --> 00:05:20,010 them. 68 00:05:20,200 --> 00:05:26,020 So this wasn't the best solution but fortunately I found a much better solution which is direct and 69 00:05:26,020 --> 00:05:31,750 which won't cause any installation issue because now it is well developed and therefore there is a way 70 00:05:31,750 --> 00:05:34,110 to install it smoothly on the three systems. 71 00:05:34,170 --> 00:05:36,830 It's through the gym environment. 72 00:05:36,940 --> 00:05:42,530 You know we're going to connect our Bible that environment through Jane. 73 00:05:42,640 --> 00:05:44,300 That's the easiest way to do. 74 00:05:44,410 --> 00:05:47,140 And that will work the best for the three systems. 75 00:05:47,230 --> 00:05:47,830 OK. 76 00:05:47,950 --> 00:05:49,990 So let's take the solution. 77 00:05:49,990 --> 00:05:51,340 Of course I will do it. 78 00:05:51,350 --> 00:05:57,870 Tauriel on how to install Jim and by bullet I tested it on the three systems and no issue has encountered. 79 00:05:57,910 --> 00:06:02,650 You shouldn't have any problem but of course if you have any problem we'll provide some debug solution 80 00:06:02,650 --> 00:06:05,830 and I will actually make it to toile fully on that. 81 00:06:05,830 --> 00:06:10,980 All right so let's do this let's get our environment through Jim. 82 00:06:11,200 --> 00:06:18,820 And therefore now what we're going to do is you know since we called our hyper parameter for the environment 83 00:06:19,000 --> 00:06:25,210 and its name well we're going to take that gym library and use the make function which will take us 84 00:06:25,210 --> 00:06:32,590 and put this name and which will make this environment ready through the gym library and how we're going 85 00:06:32,590 --> 00:06:33,040 to do this. 86 00:06:33,040 --> 00:06:41,740 Well let's first say that a year and name and now what we're going to do is here next line we're going 87 00:06:41,740 --> 00:06:44,830 to introduce a new variable which will be on. 88 00:06:44,860 --> 00:06:51,340 And that will represent exactly the environment but made by Jim and therefore since we want it to be 89 00:06:51,430 --> 00:06:58,630 made by Jim where we're going to take Jim first and then that's where we use the make function in which 90 00:06:58,630 --> 00:06:59,140 we put. 91 00:06:59,140 --> 00:07:07,420 Indeed our hyper parameter object and especially the alpha name farmable from this HPI object. 92 00:07:07,420 --> 00:07:07,710 All right. 93 00:07:07,720 --> 00:07:12,040 But then as you can notice We'll have the warning undefine named Jim. 94 00:07:12,180 --> 00:07:14,740 Well that's because we have to import it. 95 00:07:14,770 --> 00:07:16,480 It wasn't imported yet. 96 00:07:16,480 --> 00:07:22,630 So we're going to add it here because this is where we have all the libraries that we need to import 97 00:07:22,630 --> 00:07:24,050 for the implementation. 98 00:07:24,130 --> 00:07:29,290 And there is nothing more simple than importing gem you just need to do the same. 99 00:07:29,290 --> 00:07:31,960 As for the others import. 100 00:07:32,230 --> 00:07:34,540 Right now the warning should disappear. 101 00:07:34,540 --> 00:07:35,430 Here we go. 102 00:07:35,440 --> 00:07:36,460 Disappeared. 103 00:07:36,880 --> 00:07:41,190 OK so now we have our environment made by. 104 00:07:41,470 --> 00:07:44,060 And later on we'll specify the name of the environment. 105 00:07:44,140 --> 00:07:46,070 Let's do that at the very end. 106 00:07:46,240 --> 00:07:49,650 So that's not over to get in Varman ready. 107 00:07:49,660 --> 00:07:55,520 The next thing we have to do now that end is our environment made by Jim with the right name. 108 00:07:55,760 --> 00:08:02,680 Well what we want to do now is to indeed check out the videos of our different eyes working on the fields 109 00:08:03,040 --> 00:08:03,800 on our screen. 110 00:08:03,980 --> 00:08:08,070 Now on our monitor and so what we're going to do now is use a wrappers. 111 00:08:08,080 --> 00:08:12,280 You know we have the environment but we want to watch the trainings happen happening in that environment 112 00:08:12,610 --> 00:08:15,980 and to do that we need to take the Raptors module by Jim. 113 00:08:16,030 --> 00:08:27,350 So not only we're going to import Jim but from Jim we will import these rappers which will allow us 114 00:08:27,350 --> 00:08:34,850 to see our AI's walking on the fields on our monitor you know inside this sub subfolder that we created 115 00:08:34,850 --> 00:08:37,140 thanks to the M-K function. 116 00:08:37,250 --> 00:08:37,850 OK. 117 00:08:38,060 --> 00:08:45,700 So from Jim import wrappers now we have a warning saying that it isn't news but we are about to use 118 00:08:45,700 --> 00:08:53,380 it now because indeed in order to basically see the videos of our AI's and or Mulliner Well we need 119 00:08:53,390 --> 00:09:01,170 to say that rappers by Jim and then we take the money class which will take us and put our environment 120 00:09:01,600 --> 00:09:08,980 made ready by Gen then the location where we want to have these videos or am working on the field in 121 00:09:08,980 --> 00:09:15,850 our computer and this location is exactly this monitor there that basically is the directory folder 122 00:09:16,180 --> 00:09:21,520 of the future videos that will be populated once we launched the training some money this year. 123 00:09:21,520 --> 00:09:23,620 That's the second argument here. 124 00:09:24,040 --> 00:09:24,730 Perfect. 125 00:09:24,850 --> 00:09:30,940 And then we're going to use the following trick so that we can not be stopped by any warning because 126 00:09:30,940 --> 00:09:34,130 you'll see that when we execute the code we'll have some warnings. 127 00:09:34,240 --> 00:09:40,190 And in order to prevent those warnings from stopping the training Well we're going to add this trick. 128 00:09:40,210 --> 00:09:43,400 Force equals true. 129 00:09:43,400 --> 00:09:51,010 All right so we're going to force the training and videos to be populated inside this monitor directory 130 00:09:51,010 --> 00:09:52,320 folder. 131 00:09:52,330 --> 00:09:52,660 All right. 132 00:09:52,660 --> 00:10:01,750 And that's the trick to not only get the environment of bullets but also to get the videos are trying 133 00:10:01,750 --> 00:10:05,950 to walk on the field on a specific monitor directory folder. 134 00:10:06,120 --> 00:10:06,720 OK. 135 00:10:06,820 --> 00:10:09,680 That's the best we could use this is the simplest one. 136 00:10:09,770 --> 00:10:12,240 And that's what works best for the three systems. 137 00:10:12,620 --> 00:10:15,820 OK so we have now our environment. 138 00:10:15,820 --> 00:10:16,680 Perfect. 139 00:10:16,870 --> 00:10:19,200 And now according to you what do we need. 140 00:10:19,240 --> 00:10:23,770 You know we made a lot of tools but what do we need to take exactly to launch this training. 141 00:10:23,950 --> 00:10:25,210 Well that's very simple. 142 00:10:25,220 --> 00:10:34,600 See the train function as we made before is second as arguments the environment the policy the normalizer 143 00:10:34,750 --> 00:10:36,570 and our hyper parameters. 144 00:10:36,670 --> 00:10:40,230 And so you know to launch this training of our AI. 145 00:10:40,450 --> 00:10:44,440 Well the only thing to do is to get these four arguments ready. 146 00:10:44,440 --> 00:10:47,020 We already made our hyper parameters ready. 147 00:10:47,350 --> 00:10:50,890 That's what we did here by creating this object of the HP glass. 148 00:10:51,010 --> 00:10:53,960 Then we made our environment ready through Jim. 149 00:10:54,280 --> 00:10:59,880 And not only we did that but also we used a trick to populate the videos of our AI trying to walk in 150 00:10:59,880 --> 00:11:00,510 a field. 151 00:11:00,700 --> 00:11:08,290 So the only things remaining that we had to do now is to of course get our policy ready you know creating 152 00:11:08,590 --> 00:11:13,180 the policy object of the policy class that's for sure. 153 00:11:13,180 --> 00:11:19,660 Because indeed I remind that the SARS the whole principle of the IRS is that we are doing an exploration 154 00:11:19,960 --> 00:11:26,860 on the space of policies and therefore you know we're starting with a policy that is perception of weight 155 00:11:26,920 --> 00:11:28,570 all initialized to zero. 156 00:11:28,840 --> 00:11:36,100 So that's the beginning of our exploration and then we're going to explore different policies by trying 157 00:11:36,100 --> 00:11:39,100 some different updates of the weights in different directions. 158 00:11:39,160 --> 00:11:43,310 And that's where the exploration in space of policies happen. 159 00:11:43,480 --> 00:11:49,630 So indeed we're going to create one object for the policy but then the exploration of space of policies 160 00:11:49,930 --> 00:11:56,500 happens by you know trying different updates of the weight of this policy in different directions. 161 00:11:56,500 --> 00:11:58,350 All right so it's important to get that. 162 00:11:58,510 --> 00:12:02,950 The next step we have to do is of course to build our policy object. 163 00:12:02,980 --> 00:12:03,770 OK. 164 00:12:03,820 --> 00:12:08,950 And then of course the last step and this will really be the last step before we you know execute the 165 00:12:08,950 --> 00:12:13,660 train function is to get our normalizer ready as well. 166 00:12:13,660 --> 00:12:14,010 All right. 167 00:12:14,020 --> 00:12:16,120 So it's going to be done pretty soon. 168 00:12:16,120 --> 00:12:16,970 We're almost there. 169 00:12:16,970 --> 00:12:20,650 The final exciting step let's take of our policy first. 170 00:12:20,650 --> 00:12:25,390 So now my next question is what do we need to create our policy. 171 00:12:25,540 --> 00:12:27,690 Well let's crawl back up. 172 00:12:27,790 --> 00:12:28,300 Here we go. 173 00:12:28,300 --> 00:12:29,830 That's our policy class. 174 00:12:29,950 --> 00:12:36,190 And now to see what we need to create a new instance of this class or you know this policy object of 175 00:12:36,190 --> 00:12:36,940 this class. 176 00:12:37,150 --> 00:12:42,910 Well we simply need to look at the init method and see the arguments that it needs and it needs you 177 00:12:42,910 --> 00:12:49,570 know the input size which is the number of elements in the code in input state vector and the output 178 00:12:49,570 --> 00:12:55,960 size which is of course the number of actions the AI has to predict at each time. 179 00:12:56,350 --> 00:12:57,400 Again that's it. 180 00:12:57,460 --> 00:13:02,920 That's all you need to have to create your policy and by the way for the hyper parameters Well indeed 181 00:13:02,920 --> 00:13:09,590 we didn't have anything to specify in the arguments of the HP class to create our HP object. 182 00:13:09,670 --> 00:13:16,420 Because indeed and in a method we are not calling any arguments but for the policy class indeed we will 183 00:13:16,420 --> 00:13:19,910 need input size and the output size so let's get them right now. 184 00:13:19,990 --> 00:13:21,420 It's quite easy. 185 00:13:21,490 --> 00:13:26,920 You know thanks to what we did with making the environment ready by Jim and then using these wrappers 186 00:13:26,920 --> 00:13:32,210 to get not only the videos but also all the tools that we have in the public library. 187 00:13:32,380 --> 00:13:34,300 Basically what we did here is a shortcut. 188 00:13:34,480 --> 00:13:42,070 So let's get first the number of inputs which I remind is the number of elements in your input state 189 00:13:42,070 --> 00:13:42,980 vector. 190 00:13:43,220 --> 00:13:45,290 And so to get it it's quite easy. 191 00:13:45,330 --> 00:13:52,740 You know it's just how the public library works you take your environment first then you take the observation 192 00:13:53,250 --> 00:13:58,860 space which is something that exists already you know it's part of the whole Bible library observation 193 00:13:58,860 --> 00:14:04,270 space the space of your observations and therefore the space of your input observation space and then 194 00:14:04,350 --> 00:14:12,600 to get the shape attribute because indeed that's what will get you by taking the next 0 the number of 195 00:14:12,650 --> 00:14:18,650 inputs right and number of input is contained in that specific indexed location. 196 00:14:18,810 --> 00:14:26,940 Then we're going to do the same for the number of outputs which is of course the number of actions that 197 00:14:26,940 --> 00:14:30,020 the is predicting to play at each time t. 198 00:14:30,330 --> 00:14:36,960 And to get them again we take our environment than that and then we're going to take well here the action 199 00:14:36,960 --> 00:14:38,540 space exactly that. 200 00:14:38,940 --> 00:14:48,500 And then again shape in square brackets zero basically zero returns the number of the space you're looking 201 00:14:48,500 --> 00:14:48,890 at. 202 00:14:48,890 --> 00:14:49,430 OK. 203 00:14:49,670 --> 00:14:53,170 So here the action space in here the space of inputs. 204 00:14:53,540 --> 00:14:54,460 Perfect. 205 00:14:54,470 --> 00:14:54,770 Great. 206 00:14:54,770 --> 00:15:01,400 So we have a number of inputs and the number of outputs now and therefore we are ready to create our 207 00:15:01,610 --> 00:15:07,820 policy class because indeed in method it is calling for the input size and the output size. 208 00:15:07,820 --> 00:15:09,450 So how are we going to do this. 209 00:15:09,590 --> 00:15:15,230 Well again we have to give a name to that object which we're going to called policy and then we take 210 00:15:15,290 --> 00:15:22,040 our policy class this time with capital P and then our two arguments are required as we can see here 211 00:15:22,350 --> 00:15:24,860 and the inputs first for the input side. 212 00:15:24,860 --> 00:15:28,880 You know we just gave some different names for the arguments. 213 00:15:28,970 --> 00:15:31,540 That's quite safer to do. 214 00:15:31,790 --> 00:15:41,700 And then the number of output for of course the output size here we go we have our AI ready and initialized 215 00:15:41,940 --> 00:15:44,850 as a perception of weights equal to zero. 216 00:15:45,120 --> 00:15:48,500 And then when we run the train function now this function here. 217 00:15:48,750 --> 00:15:54,950 Well you know after initializing the perturbations in the different directions Well we get you know 218 00:15:54,990 --> 00:16:00,440 the positive words and negative rewards and then we sort them by the maximum of the couple of words. 219 00:16:00,450 --> 00:16:05,040 And then here at this specific time of the implementation or the training. 220 00:16:05,160 --> 00:16:07,980 Well that's where the policy will be updated. 221 00:16:07,980 --> 00:16:15,360 And therefore that's where the way it will be updated in different directions to explore the ways that 222 00:16:15,360 --> 00:16:17,790 will lead to the highest rewards. 223 00:16:17,790 --> 00:16:23,960 All right so you have to understand now we're just you know taking a step back to understand how all 224 00:16:24,040 --> 00:16:27,130 this will work you know how all this will happen. 225 00:16:27,310 --> 00:16:27,910 OK. 226 00:16:28,050 --> 00:16:30,820 So here we go. 227 00:16:30,900 --> 00:16:32,340 Our place is ready. 228 00:16:32,460 --> 00:16:40,320 And now let's go back to the train function the train function takes with the environment which we have 229 00:16:40,600 --> 00:16:44,480 the policy which we have the hyper parameters which we also have. 230 00:16:44,640 --> 00:16:48,070 And also the normalizer which we don't have yet. 231 00:16:48,120 --> 00:16:49,960 That's the only remaining thing we need. 232 00:16:50,040 --> 00:16:52,440 And that's what we're going to get right now. 233 00:16:52,650 --> 00:17:01,440 So let's get our normalizer which is going to be an object or you know an instance of this normalizer 234 00:17:01,440 --> 00:17:02,240 class. 235 00:17:02,310 --> 00:17:03,570 So we're going to call this class. 236 00:17:03,570 --> 00:17:10,500 And now to know which arguments we will need as input at this class will we just need to look at init 237 00:17:10,500 --> 00:17:16,260 method and the init method only takes one argument which is a number of inputs and which we already 238 00:17:16,260 --> 00:17:16,950 have. 239 00:17:16,950 --> 00:17:18,350 So perfect. 240 00:17:18,570 --> 00:17:27,750 Let's create our normalizer object and by calling this normalize a class and by taking input of this 241 00:17:27,750 --> 00:17:32,640 normalized class the number of inputs. 242 00:17:32,970 --> 00:17:33,540 Perfect. 243 00:17:33,540 --> 00:17:34,400 Here we go. 244 00:17:34,590 --> 00:17:39,390 We have everything ready to use the train function because indeed. 245 00:17:39,390 --> 00:17:47,160 Now to get this training ready the only thing we have to do is just to take the train function and input 246 00:17:47,440 --> 00:17:51,070 the four environments it is expecting. 247 00:17:51,180 --> 00:17:58,830 So first the environment then second our policy then third our normalizer. 248 00:17:59,490 --> 00:18:03,530 And finally our hyper parameters and there we go. 249 00:18:03,600 --> 00:18:04,810 Congratulations. 250 00:18:04,820 --> 00:18:07,430 This implementation is over. 251 00:18:07,590 --> 00:18:11,960 So just getting the final good ready. 252 00:18:11,960 --> 00:18:20,180 Here we go 140 lines of code not bad you know especially for a non-pilot implementation and you're going 253 00:18:20,180 --> 00:18:26,740 to see that you know this contrast between the simplicity of this implementation you know in 140 lines 254 00:18:26,750 --> 00:18:34,880 that's not much and done results will get me and is mind blowing so I 20:18 get ready for it.