1
00:00:00,420 --> 00:00:02,640
Hello and welcome this news it's horrible.

2
00:00:02,640 --> 00:00:09,470
So we've gathered the positive words and negative words we computed the standard deviation of all these

3
00:00:09,480 --> 00:00:17,460
we want and now we have to do a step which we haven't done any time so far during the training and I'm

4
00:00:17,460 --> 00:00:25,350
talking of course about that step six just before we made the update step and the steps six consist

5
00:00:25,440 --> 00:00:33,150
of soaring the directions by the maximum of the couple of positive words and negative words meaning

6
00:00:33,450 --> 00:00:38,820
the words we get by applying the perturbations in a positive direction and the reward we get by applying

7
00:00:38,820 --> 00:00:41,470
the perturbation in the opposite direction.

8
00:00:41,520 --> 00:00:49,200
So we have these couples of positive and negative rewards for each of the 16 directions that we're testing

9
00:00:49,430 --> 00:00:54,440
and for each of the 16 directions we're going to get these positive reward and negative word.

10
00:00:54,510 --> 00:01:00,240
We're going to take the next of them and then we'll sort all the directions by the highest of these

11
00:01:00,330 --> 00:01:01,760
maximum of words.

12
00:01:01,950 --> 00:01:03,000
That's what we need to do.

13
00:01:03,000 --> 00:01:09,660
And of course we want to do this because we looking for the directions that increased the most the word

14
00:01:09,990 --> 00:01:16,160
because indeed the higher is the reward we reach the better the I will have the ability to walk.

15
00:01:16,170 --> 00:01:20,310
So it's always about trying to optimize and increase the reward.

16
00:01:20,310 --> 00:01:26,940
And that's why right after the step six you have this update step one step of great the sense which

17
00:01:26,940 --> 00:01:32,910
will update your weight in these best directions then increase the rewards.

18
00:01:32,940 --> 00:01:38,820
And here this is the approximated grade in the sense that we're doing by approximating the gradient

19
00:01:38,970 --> 00:01:41,580
of the word with respect to the weight.

20
00:01:41,580 --> 00:01:43,380
All right so let's do this let's do this.

21
00:01:43,380 --> 00:01:47,800
Step 6 soling the directions by the max of the rewards.

22
00:01:48,230 --> 00:01:50,160
OK so how are we going to do this.

23
00:01:50,160 --> 00:01:56,640
Well first we are going to introduce a variable that we're going to call scores and that will get these

24
00:01:57,020 --> 00:02:03,690
you know maximums of the reward in the positive direction and the reward in negative direction or the

25
00:02:03,690 --> 00:02:04,870
opposite direction.

26
00:02:05,130 --> 00:02:09,810
So we're going to get these maximum first for each of the 16 directions.

27
00:02:09,810 --> 00:02:10,050
All right.

28
00:02:10,050 --> 00:02:12,350
So let's do this let's go back to Python.

29
00:02:12,420 --> 00:02:17,220
So our scores here how are we going to gather all these maximums.

30
00:02:17,430 --> 00:02:24,480
Well since then after gathering this course here we're going to use the Soledad function which is a

31
00:02:24,480 --> 00:02:29,130
function that allows you to sort the values of a dictionary.

32
00:02:29,280 --> 00:02:35,760
Well the thing we're going to do with course here is set it as a dictionary in which the keys will be

33
00:02:36,060 --> 00:02:44,610
just the numbers from 1 to 16 or no from 0 to 15 and the values of this case will be these maximums

34
00:02:45,030 --> 00:02:46,580
that we are trying to sort.

35
00:02:46,590 --> 00:02:52,100
So that then it will be much easier for us to sort these maximums.

36
00:02:52,110 --> 00:02:58,620
So the first thing we're going to do here is add some brackets like that because that's the syntax of

37
00:02:58,620 --> 00:03:00,240
a dictionary in Python.

38
00:03:00,630 --> 00:03:06,570
So we had some brackets and then we first need to define the keys which I remind are just going to be

39
00:03:06,570 --> 00:03:13,950
some integers starting from 0 to 15 because each cue will correspond to one of the 16 directions that

40
00:03:13,950 --> 00:03:15,090
we're testing.

41
00:03:15,090 --> 00:03:20,510
So clear here will represent this key and integer going from zero to 15.

42
00:03:20,790 --> 00:03:28,020
And then you need to add a column and then you specify the value of that key and the value of that key

43
00:03:28,040 --> 00:03:34,680
is going to be exactly the maximum of the reward we get in the positive direction.

44
00:03:34,680 --> 00:03:41,870
For one specific direction then the reward we get in the negative direction meaning the opposite direction

45
00:03:41,890 --> 00:03:43,920
as this positive direction.

46
00:03:44,160 --> 00:03:46,400
That's our values the maximum of this.

47
00:03:46,470 --> 00:03:51,000
We will we get in the positive direction and that's where we get in the negative direction.

48
00:03:51,150 --> 00:03:56,870
And of course we need to get these keys and maximum of these rewards for each of the six indirections.

49
00:03:56,880 --> 00:04:01,150
So now what we're going to do is a for loop inside the dictionary.

50
00:04:01,290 --> 00:04:06,110
We can do that to classical Trigon python instead of putting the dictionary inside a for loop.

51
00:04:06,200 --> 00:04:11,240
You put the for loop inside a dictionary and you can also do that to populate a list.

52
00:04:11,310 --> 00:04:19,130
And so I'm going to add here of four and then you know we want to generate this for each of the 16 keys.

53
00:04:19,260 --> 00:04:22,100
Each of the key corresponding to one of the six indirections.

54
00:04:22,470 --> 00:04:30,030
So since our keys here are the case I'm going to do K but also the word in the positive direction and

55
00:04:30,030 --> 00:04:34,970
the reward in the negative direction because we need to get them as well in.

56
00:04:35,270 --> 00:04:41,660
And now we're going to get everything using two tricks to enumerate function which will enumerate the

57
00:04:41,670 --> 00:04:42,120
integers.

58
00:04:42,120 --> 00:04:47,890
Want to get that is the case from 0 to 15 and the zip function together.

59
00:04:48,000 --> 00:04:53,010
The positive words and the negative words because indeed we need the our post we want in the positive

60
00:04:53,010 --> 00:04:56,410
direction and our nega we work in the negative direction.

61
00:04:56,670 --> 00:04:57,940
So here is what we have to do.

62
00:04:57,960 --> 00:05:05,420
We take first enumerate to generate integers and then inside enumerate we take the zip function that

63
00:05:05,420 --> 00:05:12,770
will gather together the positive reward which we already gathered here all the positive words.

64
00:05:13,010 --> 00:05:23,970
And of course the negative reward this all of this here will get me the case that is integers from 0

65
00:05:23,970 --> 00:05:27,820
to 15 corresponding to the 0 to 15 directions.

66
00:05:27,840 --> 00:05:33,110
Then the puzzle was that is the words we get by applying the perturbations in opposite directions.

67
00:05:33,150 --> 00:05:39,030
In other words these guys here and the negative words that is the reward we get by playing the perturbations

68
00:05:39,060 --> 00:05:42,650
in the negative directions are the opposite directions as these ones.

69
00:05:42,990 --> 00:05:45,610
And these are these ones negative word.

70
00:05:45,630 --> 00:05:46,530
So all good.

71
00:05:46,530 --> 00:05:55,200
We just made a dictionary taking as the keys to integers from 0 to 15 corresponding to the 16 directions

72
00:05:55,200 --> 00:06:01,970
that we're testing and as values the maximum of these words in a positive direction and we words in

73
00:06:01,970 --> 00:06:03,430
a negative direction.

74
00:06:03,630 --> 00:06:04,070
All right.

75
00:06:04,110 --> 00:06:07,270
So that's the first thing done then next step.

76
00:06:07,410 --> 00:06:13,470
Now that we have the scores we want to sort them and because we made a dictionary it's going to be very

77
00:06:13,470 --> 00:06:19,650
easy for us to sort them indeed we're going to use the sorted function which is a function by Peyton

78
00:06:20,100 --> 00:06:27,960
to sort just the keys of the dictionaries by their values you know from their highest value to their

79
00:06:27,960 --> 00:06:31,910
lowest value which means that we're not going to get.

80
00:06:31,920 --> 00:06:37,360
Now this time the maximums themselves we're just going to get these indexes in the list.

81
00:06:37,520 --> 00:06:43,010
But instead of having the list from you know 0 1 2 3 to 16 we'll get the list.

82
00:06:43,050 --> 00:06:47,600
In this specific order sold by the maximum of these we want.

83
00:06:47,610 --> 00:06:53,580
So for example if the highest of these maximums is is next three and if the second highest is in the

84
00:06:53,580 --> 00:07:02,010
next 7 for example well the order list that we're about to get will start with three and seven.

85
00:07:02,130 --> 00:07:07,950
And then because we have a dictionary with all the keys and values well organized well we'll be able

86
00:07:07,950 --> 00:07:13,260
to find the highest word things to the indexes which are the keys.

87
00:07:13,260 --> 00:07:18,810
All right so what I'm going to do now is introduce a new variable which will be order and which will

88
00:07:18,810 --> 00:07:22,880
be a list that will contain these keys of the highest.

89
00:07:22,950 --> 00:07:28,330
Couple of words right of the highest maximum of positive reward and negative reward.

90
00:07:28,590 --> 00:07:32,000
So how are we going to get these keys and indexes.

91
00:07:32,250 --> 00:07:33,380
Well it's easy.

92
00:07:33,420 --> 00:07:39,750
We need to take the sorted function by Python which will sort your dictionary.

93
00:07:40,080 --> 00:07:46,290
And in this sort function the first thing that to specify or input or what you want to get in order

94
00:07:46,630 --> 00:07:49,480
do you want to get the maximum of the rewards.

95
00:07:49,500 --> 00:07:51,590
Or do you want to get the keys.

96
00:07:51,770 --> 00:07:56,790
Are you going to say do we want to get the keys because these are the keys that will help us find everything

97
00:07:57,150 --> 00:08:03,120
meaning the directions and the words in the positive direction and opposite directions for that specific

98
00:08:03,120 --> 00:08:04,100
direction.

99
00:08:04,110 --> 00:08:10,890
So what you want are the keys and therefore the first thing I'm going to specify here is this chorus

100
00:08:11,220 --> 00:08:16,130
that keys that is the keys of my scores dictionary.

101
00:08:16,560 --> 00:08:24,210
Then we add some parenthesis here and then here we go we get the keys and now you add a comma and that's

102
00:08:24,210 --> 00:08:29,670
when you specify by what you want to sort the keys of your dictionary.

103
00:08:29,970 --> 00:08:36,210
So you get the keys of your dictionary but then you want to sort them by these maximums for each of

104
00:08:36,210 --> 00:08:42,030
the directions starting from 0 to 15 because we have 16 directions in total.

105
00:08:42,570 --> 00:08:45,760
And so to do this you're going to do it through function.

106
00:08:45,840 --> 00:08:53,010
It's basically the argument expected by the sort of function here and this argument is called qi but

107
00:08:53,010 --> 00:08:57,300
be careful it's not to be confused with the keys or dictionary.

108
00:08:57,300 --> 00:09:04,620
So we take key and then we add equals and then Lunda which is just to specify that we're about to do

109
00:09:04,620 --> 00:09:11,940
a function then the argument of the function which we call expert which will be the key of the dictionary

110
00:09:12,750 --> 00:09:20,610
then colon and then what this Lunda function will return which are this course right the scores of the

111
00:09:20,610 --> 00:09:22,150
keys xx.

112
00:09:22,380 --> 00:09:31,320
So that right now thanks to this function here we know that we want to sort our scores dictionary by

113
00:09:31,320 --> 00:09:37,440
discours here which are the maximums of the word we get in the positive direction in the word we get

114
00:09:37,440 --> 00:09:44,790
in the negative direction for the 16 different directions and by specifying scores that keys here will

115
00:09:44,790 --> 00:09:52,530
get not these maximums returned in this order list but the keys of these dictionaries correspond into

116
00:09:52,530 --> 00:09:57,810
the keys of the highest word among the 16 positive and opposite directions.

117
00:09:57,810 --> 00:10:04,540
All right so that's a very useful trick in Python if you want to sort a mapping of indexes and values.

118
00:10:04,680 --> 00:10:10,470
You know the indexes here were just the indexes of the directions from 0 to 15 and the values where

119
00:10:10,560 --> 00:10:15,380
the maximums of these words correspond to each of these 16 directions.

120
00:10:15,390 --> 00:10:23,090
So that's a useful trick to know with them that's not over here we get a list here but we won't remember

121
00:10:23,390 --> 00:10:26,890
to get the best directions.

122
00:10:26,930 --> 00:10:27,560
Right.

123
00:10:27,560 --> 00:10:32,100
We have the total number of directions but we also have the total number of best directions.

124
00:10:32,180 --> 00:10:40,100
And remember that in the paper here we're not summing on all the directions but the best directions

125
00:10:40,130 --> 00:10:42,480
be the number of best directions here.

126
00:10:42,920 --> 00:10:49,520
So what we simply need to add here is you know since this is a list well we're going to add some brackets

127
00:10:49,940 --> 00:10:53,200
to take the best directions on.

128
00:10:53,540 --> 00:10:54,680
And how can we do that.

129
00:10:54,680 --> 00:11:02,090
Well since this list is already sorted by the maximums of the reward in the positive direction and we

130
00:11:02,090 --> 00:11:03,690
were in the opposite direction.

131
00:11:03,920 --> 00:11:12,050
Well what we simply need to do here to get the best solutions is just to go from zero to the number

132
00:11:12,920 --> 00:11:20,600
of best directions and you know we don't even have to specify the Zero Year because the lower bound

133
00:11:20,630 --> 00:11:22,650
of a range is by default zero.

134
00:11:22,940 --> 00:11:31,670
So here we are taking the first and the best directions elements which corresponds to the highest maximums

135
00:11:31,670 --> 00:11:38,320
of positive reward and opposite reward because in the way this was are sorted by these maximums.

136
00:11:38,330 --> 00:11:40,640
All right so that's the simple thing we have to do here.

137
00:11:40,820 --> 00:11:46,440
And this gives us exactly what we want or at least what exactly is said in the paper.

138
00:11:46,460 --> 00:11:47,260
Perfect.

139
00:11:47,570 --> 00:11:54,080
So now we have one final thing to do now that we have these indexes but it's not all of course.

140
00:11:54,090 --> 00:12:01,010
Now we want to use these indexes to get Indeed the reward we get by applying the perturbations in the

141
00:12:01,040 --> 00:12:02,930
best directions that are sorted.

142
00:12:03,170 --> 00:12:09,080
And also the reward we get by applying the perturbations in the opposite directions of these resurrections

143
00:12:09,410 --> 00:12:11,180
and of course the perturbations.

144
00:12:11,180 --> 00:12:15,860
Why do we want to get these three things that of course because in the next step will make the update

145
00:12:15,860 --> 00:12:22,670
step which will take the words we get the positive best directions and the rewards we get in the negative

146
00:12:22,700 --> 00:12:30,490
best directions and also the perturbations the values of the perturbations for each of these desperations.

147
00:12:30,740 --> 00:12:37,160
So that's why when together all this now and then we'll be able to make that data step which corresponds

148
00:12:37,160 --> 00:12:43,760
to one step of and descend to approximate the gradient in order to date the weight in these best directions.

149
00:12:43,760 --> 00:12:47,120
All right so now what are we going to do.

150
00:12:47,120 --> 00:12:51,250
Remember we talked about a new concept before which are the rollout.

151
00:12:51,260 --> 00:12:55,650
Well here we go you know that's the update function which takes us in put the roll out.

152
00:12:55,880 --> 00:13:01,250
And therefore since the next step will be to apply the update function to make this one step of gradient

153
00:13:01,250 --> 00:13:01,900
descent.

154
00:13:02,090 --> 00:13:09,800
Well we need to gather these roll out and I remind that these roles are nothing else than the best directions

155
00:13:09,810 --> 00:13:16,250
triplets of the words and the positive resurrections the words in the opposite desecrations and the

156
00:13:16,250 --> 00:13:17,580
perturbation D.

157
00:13:17,750 --> 00:13:23,180
So what we need to do now is prepare those roll out and this will be very easy for us because indeed

158
00:13:23,540 --> 00:13:28,700
we have all the keys of dictionaries for the best directions and therefore will use these keys to get

159
00:13:28,940 --> 00:13:31,980
the positive word the negative reward and the preservation.

160
00:13:32,330 --> 00:13:34,660
So let's do it let's prepare here.

161
00:13:34,700 --> 00:13:44,750
Our rollouts are rollout which will be the least of the following triplets composed of First the positive

162
00:13:45,500 --> 00:13:46,430
rewards.

163
00:13:46,430 --> 00:13:52,030
That is the words attained by playing the perturbations in a positive direction for the best directions.

164
00:13:52,030 --> 00:13:59,210
So here since we to take the best actions of course we're going to add here the index K and K you will

165
00:13:59,210 --> 00:14:05,810
see will be all the case in the order list because the list contains indeed the indexes of the best

166
00:14:05,810 --> 00:14:06,900
directions.

167
00:14:06,930 --> 00:14:08,040
It's puzzle with words.

168
00:14:08,070 --> 00:14:16,790
K then of course I'm going to copy this because the next element of the triplet will be the negative

169
00:14:17,360 --> 00:14:18,630
words OK.

170
00:14:18,920 --> 00:14:21,090
So I'm replacing that here.

171
00:14:21,110 --> 00:14:22,590
Negative three words OK.

172
00:14:22,940 --> 00:14:29,370
And then of course the third element of the triplet as we said is the perturbation deltas.

173
00:14:29,540 --> 00:14:37,010
And again we take in excess K of the best directions and in order to get these preservations the final

174
00:14:37,010 --> 00:14:42,940
touch of what we're doing now is to add of course a new full loop inside the list.

175
00:14:42,950 --> 00:14:46,180
Now I tell you that we can do it inside a dictionary or inside a list.

176
00:14:46,250 --> 00:14:48,170
It's a very classic trick in Python.

177
00:14:48,290 --> 00:14:56,120
And of course we're going to get all the case in the order list composed of the indexes of the best

178
00:14:56,120 --> 00:14:59,510
directions leading to the highest we want.

179
00:14:59,510 --> 00:15:06,680
All right so here we go we have the rollouts now and so we are ready to make that data step to do this

180
00:15:06,680 --> 00:15:10,690
one step of greatness and to have data weight in these best directions.

181
00:15:11,000 --> 00:15:12,650
Let's do that and the next is oil.

182
00:15:12,650 --> 00:15:14,540
And until then enjoy AI.