1 00:00:00,680 --> 00:00:04,420 Hello and welcome to the final step of the training. 2 00:00:04,520 --> 00:00:10,130 So this final step is quite easy and it's just so that we can have a look at how the training is going 3 00:00:10,190 --> 00:00:16,190 in the council when we execute the training because indeed what we want to do now is print the final 4 00:00:16,190 --> 00:00:19,420 word of the policy after the update. 5 00:00:19,420 --> 00:00:26,030 So after we apply this date method here from our policy I object on the rollout and the standard deviation 6 00:00:26,090 --> 00:00:32,090 are made make one step of sand which will bear the weight of the policy in the best directions that 7 00:00:32,090 --> 00:00:33,580 increase the most do we word. 8 00:00:33,800 --> 00:00:35,490 And so we want to test. 9 00:00:35,570 --> 00:00:39,860 We want to test out the policy after we made that up there to see how it is going. 10 00:00:39,950 --> 00:00:47,600 On one full episode that is by playing 1000 action or by falling at some certain point and we can do 11 00:00:47,600 --> 00:00:54,230 this because indeed in the Explore function we have the option to take no direction and no perturbation 12 00:00:54,490 --> 00:00:57,820 there is basically just applying the policy on the direction. 13 00:00:57,830 --> 00:00:59,960 But over one full episode. 14 00:00:59,960 --> 00:01:07,730 And so that's why exactly that we're going to do now is take our explore function and apply it to of 15 00:01:07,730 --> 00:01:14,450 course our environments than our normalizer same we might test several normalizer. 16 00:01:14,660 --> 00:01:20,130 And of course our policy that was just updated thanks to the update function here. 17 00:01:20,490 --> 00:01:20,910 All right. 18 00:01:20,930 --> 00:01:29,510 And remember this explorer function returns at the end the cumulated reward we get over a one full episode 19 00:01:29,960 --> 00:01:32,330 with no directional perturbation applied. 20 00:01:32,330 --> 00:01:40,220 So here indeed we'll get this accumulated reward which we can call reward evaluation because that's 21 00:01:40,220 --> 00:01:42,530 just to evaluate our policy. 22 00:01:42,590 --> 00:01:44,650 After making that date. 23 00:01:44,840 --> 00:01:52,540 So we want evaluation is equal to what is returned by this explore function applied to the current environment. 24 00:01:52,550 --> 00:01:57,320 We're testing our normalizer in our policy after the date was made. 25 00:01:57,430 --> 00:02:03,440 And you can notice that I didn't even have to add the two other arguments here which were the direction 26 00:02:03,530 --> 00:02:07,450 and Delta because by default they're equal to none and none here. 27 00:02:07,550 --> 00:02:12,640 And therefore since that's exactly what I want there is you know no direction and no perturbation. 28 00:02:12,680 --> 00:02:15,960 What I didn't even have to specify duration equals none. 29 00:02:16,010 --> 00:02:17,650 And Delta was not. 30 00:02:17,660 --> 00:02:25,490 So you see the purpose of specifying a value here is just to not having to input them whenever we don't 31 00:02:25,490 --> 00:02:29,840 want to apply any perturbation in some specific direction. 32 00:02:29,840 --> 00:02:30,070 All right. 33 00:02:30,060 --> 00:02:36,500 Perfect So we have this accumulated reward of the one episode after the date of our policy and with 34 00:02:36,500 --> 00:02:38,000 no perturbations. 35 00:02:38,120 --> 00:02:42,480 And so now we are going to print this. 36 00:02:42,710 --> 00:02:43,720 And how can we do that. 37 00:02:43,730 --> 00:02:51,320 Well first we're going to put a string which will be step you need to specify which step of the training 38 00:02:51,320 --> 00:02:59,240 we are when the training is going and all this is printed in the console so step here then we need to 39 00:02:59,240 --> 00:03:06,450 add indeed the value this time of the step which is nothing else and this variable year step. 40 00:03:06,620 --> 00:03:13,970 And then of course we want to specify with the string that we're going to print the word we can specify 41 00:03:13,980 --> 00:03:15,590 We were like that you will understand. 42 00:03:15,590 --> 00:03:17,260 It is the word of the step. 43 00:03:17,300 --> 00:03:18,160 So we wanted. 44 00:03:18,200 --> 00:03:25,340 And then of course we need to add the variable that will get this value which is we can just copy it. 45 00:03:25,410 --> 00:03:28,370 Word evaluation. 46 00:03:28,370 --> 00:03:32,900 All right let's base that here and here we go. 47 00:03:32,950 --> 00:03:38,760 We will have the word printed in the console as soon as we execute the training. 48 00:03:38,770 --> 00:03:39,360 Perfect. 49 00:03:39,370 --> 00:03:42,510 And now we're getting closer and closer to the final demo. 50 00:03:42,520 --> 00:03:44,210 I'm super excited to start. 51 00:03:44,440 --> 00:03:47,320 So we just have one remaining code section to implement. 52 00:03:47,320 --> 00:03:54,050 And actually it will be very easy you know it's just to create objects and to run the train function. 53 00:03:54,160 --> 00:03:57,020 So it will just be a couple of lines to type. 54 00:03:57,100 --> 00:04:02,490 I will include in the code that we won't implement together but that's just to you know prepare for 55 00:04:02,570 --> 00:04:08,290 that that will be populated in the working directory for then that will contain all the videos of your 56 00:04:08,290 --> 00:04:10,410 AI trying to work on the field. 57 00:04:10,420 --> 00:04:14,580 So it's just something that will take from the internet and paste here. 58 00:04:14,680 --> 00:04:18,960 I will quickly explain how it works but that has nothing to do with artificial intelligence. 59 00:04:18,960 --> 00:04:22,390 And that's why I don't want us to focus this energy on it. 60 00:04:22,390 --> 00:04:28,270 But then of course we'll create the objects together and gather the functions and in the end run them. 61 00:04:28,270 --> 00:04:30,600 So let's start doing that in the next. 62 00:04:30,730 --> 00:04:32,290 And until then enjoy AI.