1
00:00:00,680 --> 00:00:04,420
Hello and welcome to the final step of the training.

2
00:00:04,520 --> 00:00:10,130
So this final step is quite easy and it's just so that we can have a look at how the training is going

3
00:00:10,190 --> 00:00:16,190
in the console when we execute the training, because what we want to do now is print the final

4
00:00:16,190 --> 00:00:19,420
reward of the policy after the update.

5
00:00:19,420 --> 00:00:26,030
So after we apply this update method here from our policy object on the rollouts and the standard deviation

6
00:00:26,090 --> 00:00:32,090
of the rewards, we make one step that will update the weights of the policy in the best directions that

7
00:00:32,090 --> 00:00:33,580
increase the reward the most.

8
00:00:33,800 --> 00:00:35,490
And so we want to test.

9
00:00:35,570 --> 00:00:39,860
We want to test out the policy after we made that update to see how it is going.

10
00:00:39,950 --> 00:00:47,600
On one full episode, that is, by playing 1000 actions or by falling at some point. And we can do

11
00:00:47,600 --> 00:00:54,230
this because in the explore function we have the option to take no direction and no perturbation,

12
00:00:54,490 --> 00:00:57,820
that is, basically just applying the policy with no direction.

13
00:00:57,830 --> 00:00:59,960
But over one full episode.

14
00:00:59,960 --> 00:01:07,730
And so that's exactly what we're going to do now: take our explore function and apply it to, of

15
00:01:07,730 --> 00:01:14,450
course, our environment, then our normalizer (same, we might test several normalizers).

16
00:01:14,660 --> 00:01:20,130
And of course our policy that was just updated thanks to the update function here.

17
00:01:20,490 --> 00:01:20,910
All right.

18
00:01:20,930 --> 00:01:29,510
And remember, this explore function returns at the end the accumulated reward we get over one full episode

19
00:01:29,960 --> 00:01:32,330
with no direction or perturbation applied.

20
00:01:32,330 --> 00:01:40,220
So here we'll get this accumulated reward, which we can call reward_evaluation, because that's

21
00:01:40,220 --> 00:01:42,530
just to evaluate our policy.

22
00:01:42,590 --> 00:01:44,650
After making that update.

23
00:01:44,840 --> 00:01:52,540
So reward_evaluation is equal to what is returned by this explore function applied to the current environment.

24
00:01:52,550 --> 00:01:57,320
We're testing it with our normalizer and our policy after the update was made.

25
00:01:57,430 --> 00:02:03,440
And you can notice that I didn't even have to add the two other arguments here which were the direction

26
00:02:03,530 --> 00:02:07,450
and delta, because by default they're equal to None and None here.

27
00:02:07,550 --> 00:02:12,640
And therefore, since that's exactly what I want, that is, no direction and no perturbation,

28
00:02:12,680 --> 00:02:15,960
I didn't even have to specify direction equals None,

29
00:02:16,010 --> 00:02:17,650
and delta equals None.

30
00:02:17,660 --> 00:02:25,490
So you see, the purpose of specifying a default value here is just to not have to input them whenever we don't

31
00:02:25,490 --> 00:02:29,840
want to apply any perturbation in some specific direction.

32
00:02:29,840 --> 00:02:30,070
All right.

33
00:02:30,060 --> 00:02:36,500
Perfect. So we have this accumulated reward over one episode after the update of our policy and with

34
00:02:36,500 --> 00:02:38,000
no perturbations.

35
00:02:38,120 --> 00:02:42,480
And so now we are going to print this.

36
00:02:42,710 --> 00:02:43,720
And how can we do that?

37
00:02:43,730 --> 00:02:51,320
Well, first we're going to put a string, which will be 'Step', to specify which step of the training

38
00:02:51,320 --> 00:02:59,240
we are at while the training is going, and all this is printed in the console. So 'Step' here, then we need to

39
00:02:59,240 --> 00:03:06,450
add the value this time of the step, which is nothing else than this variable here, step.

40
00:03:06,620 --> 00:03:13,970
And then of course we want to specify with a string that we're going to print the reward. We can specify

41
00:03:13,980 --> 00:03:15,590
'Reward'; like that, you will understand

42
00:03:15,590 --> 00:03:17,260
it is the reward of the step.

43
00:03:17,300 --> 00:03:18,160
So, 'Reward'.

44
00:03:18,200 --> 00:03:25,340
And then of course we need to add the variable that holds this value, which is, we can just copy it,

45
00:03:25,410 --> 00:03:28,370
reward_evaluation.

46
00:03:28,370 --> 00:03:32,900
All right, let's paste that here, and here we go.

47
00:03:32,950 --> 00:03:38,760
We will have the reward printed in the console as soon as we execute the training.

48
00:03:38,770 --> 00:03:39,360
Perfect.

49
00:03:39,370 --> 00:03:42,510
And now we're getting closer and closer to the final demo.

50
00:03:42,520 --> 00:03:44,210
I'm super excited to start.

51
00:03:44,440 --> 00:03:47,320
So we just have one remaining code section to implement.

52
00:03:47,320 --> 00:03:54,050
And actually it will be very easy; it's just to create objects and to run the train function.

53
00:03:54,160 --> 00:03:57,020
So it will just be a couple of lines to type.

54
00:03:57,100 --> 00:04:02,490
I will include some code that we won't implement together, but that's just to prepare for

55
00:04:02,570 --> 00:04:08,290
the folder that will be populated in the working directory and that will contain all the videos of your

56
00:04:08,290 --> 00:04:10,410
AI trying to walk on the field.

57
00:04:10,420 --> 00:04:14,580
So it's just something that we'll take from the internet and paste here.

58
00:04:14,680 --> 00:04:18,960
I will quickly explain how it works but that has nothing to do with artificial intelligence.

59
00:04:18,960 --> 00:04:22,390
And that's why I don't want us to focus our energy on it.

60
00:04:22,390 --> 00:04:28,270
But then of course we'll create the objects together and gather the functions and in the end run them.

61
00:04:28,270 --> 00:04:30,600
So let's start doing that in the next tutorial.

62
00:04:30,730 --> 00:04:32,290
And until then enjoy AI.