1 00:00:00,420 --> 00:00:02,640 Hello and welcome to this new tutorial. 2 00:00:02,640 --> 00:00:09,470 So we've gathered the positive rewards and negative rewards, we computed the standard deviation of all these 3 00:00:09,480 --> 00:00:17,460 rewards, and now we have to do a step which we haven't done at any time so far during the training, and I'm 4 00:00:17,460 --> 00:00:25,350 talking of course about step 6, just before we make the update step. Step 6 consists 5 00:00:25,440 --> 00:00:33,150 of sorting the directions by the maximum of the couple of positive reward and negative reward, meaning 6 00:00:33,450 --> 00:00:38,820 the reward we get by applying the perturbation in the positive direction and the reward we get by applying 7 00:00:38,820 --> 00:00:41,470 the perturbation in the opposite direction. 8 00:00:41,520 --> 00:00:49,200 So we have these couples of positive and negative rewards for each of the 16 directions that we're testing, 9 00:00:49,430 --> 00:00:54,440 and for each of the 16 directions we're going to get this positive reward and negative reward. 10 00:00:54,510 --> 00:01:00,240 We're going to take the max of them, and then we'll sort all the directions by the highest of these 11 00:01:00,330 --> 00:01:01,760 maximums of rewards. 12 00:01:01,950 --> 00:01:03,000 That's what we need to do. 13 00:01:03,000 --> 00:01:09,660 And of course we want to do this because we're looking for the directions that increase the reward the most, 14 00:01:09,990 --> 00:01:16,160 because indeed the higher the reward we reach, the better the AI will be able to walk. 15 00:01:16,170 --> 00:01:20,310 So it's always about trying to optimize and increase the reward. 16 00:01:20,310 --> 00:01:26,940 And that's why right after step 6 you have this update step, one step of gradient descent, which 17 00:01:26,940 --> 00:01:32,910 will update your weights in these best directions that increase the rewards.
18 00:01:32,940 --> 00:01:38,820 And here this is the approximated gradient descent that we're doing, by approximating the gradient 19 00:01:38,970 --> 00:01:41,580 of the reward with respect to the weights. 20 00:01:41,580 --> 00:01:43,380 All right, so let's do this, let's do this 21 00:01:43,380 --> 00:01:47,800 step 6, sorting the directions by the max of the rewards. 22 00:01:48,230 --> 00:01:50,160 OK, so how are we going to do this? 23 00:01:50,160 --> 00:01:56,640 Well, first we are going to introduce a variable that we're going to call scores, and that will get these, 24 00:01:57,020 --> 00:02:03,690 you know, maximums of the reward in the positive direction and the reward in the negative direction, or the 25 00:02:03,690 --> 00:02:04,870 opposite direction. 26 00:02:05,130 --> 00:02:09,810 So we're going to get these maximums first for each of the 16 directions. 27 00:02:09,810 --> 00:02:10,050 All right. 28 00:02:10,050 --> 00:02:12,350 So let's do this, let's go back to Python. 29 00:02:12,420 --> 00:02:17,220 So our scores here, how are we going to gather all these maximums? 30 00:02:17,430 --> 00:02:24,480 Well, since after gathering the scores here we're going to use the sorted function, which is a 31 00:02:24,480 --> 00:02:29,130 function that allows you to sort the values of a dictionary, 32 00:02:29,280 --> 00:02:35,760 well, the thing we're going to do with scores here is set it as a dictionary in which the keys will be 33 00:02:36,060 --> 00:02:44,610 just the numbers from 1 to 16, or no, from 0 to 15, and the values of these keys will be these maximums 34 00:02:45,030 --> 00:02:46,580 that we are trying to sort. 35 00:02:46,590 --> 00:02:52,100 So that then it will be much easier for us to sort these maximums. 36 00:02:52,110 --> 00:02:58,620 So the first thing we're going to do here is add some curly brackets like that, because that's the syntax of 37 00:02:58,620 --> 00:03:00,240 a dictionary in Python.
38 00:03:00,630 --> 00:03:06,570 So we add some curly brackets, and then we first need to define the keys, which I remind you are just going to be 39 00:03:06,570 --> 00:03:13,950 some integers going from 0 to 15, because each key will correspond to one of the 16 directions that 40 00:03:13,950 --> 00:03:15,090 we're testing. 41 00:03:15,090 --> 00:03:20,510 So k here will represent this key, an integer going from 0 to 15. 42 00:03:20,790 --> 00:03:28,020 And then you need to add a colon, and then you specify the value of that key, and the value of that key 43 00:03:28,040 --> 00:03:34,680 is going to be exactly the maximum of the reward we get in the positive direction 44 00:03:34,680 --> 00:03:41,870 for one specific direction and the reward we get in the negative direction, meaning the opposite direction 45 00:03:41,890 --> 00:03:43,920 to this positive direction. 46 00:03:44,160 --> 00:03:46,400 That's our value: the maximum of the 47 00:03:46,470 --> 00:03:51,000 reward we get in the positive direction and the reward we get in the negative direction. 48 00:03:51,150 --> 00:03:56,870 And of course we need to get these keys and maximums of these rewards for each of the 16 directions. 49 00:03:56,880 --> 00:04:01,150 So now what we're going to do is a for loop inside the dictionary. 50 00:04:01,290 --> 00:04:06,110 We can do that, it's a classic trick in Python: instead of putting the dictionary inside a for loop, 51 00:04:06,200 --> 00:04:11,240 you put the for loop inside the dictionary, and you can also do that to populate a list. 52 00:04:11,310 --> 00:04:19,130 And so I'm going to add here a for, and then, you know, we want to generate this for each of the 16 keys, 53 00:04:19,260 --> 00:04:22,100 each key corresponding to one of the 16 directions.
54 00:04:22,470 --> 00:04:30,030 So since our keys here are the k's, I'm going to do for k, but also the reward in the positive direction and 55 00:04:30,030 --> 00:04:34,970 the reward in the negative direction, because we need to get them as well, in. 56 00:04:35,270 --> 00:04:41,660 And now we're going to get everything using two tricks: the enumerate function, which will enumerate the 57 00:04:41,670 --> 00:04:42,120 integers 58 00:04:42,120 --> 00:04:47,890 we want to get, that is the k's from 0 to 15, and the zip function to gather together 59 00:04:48,000 --> 00:04:53,010 the positive rewards and the negative rewards, because indeed we need r_pos, the reward in the positive 60 00:04:53,010 --> 00:04:56,410 direction, and r_neg, the reward in the negative direction. 61 00:04:56,670 --> 00:04:57,940 So here is what we have to do. 62 00:04:57,960 --> 00:05:05,420 We take first enumerate to generate the integers, and then inside enumerate we take the zip function that 63 00:05:05,420 --> 00:05:12,770 will gather together the positive rewards, which we already gathered here, all the positive rewards, 64 00:05:13,010 --> 00:05:23,970 and of course the negative rewards. All of this here will get me the k's, that is the integers from 0 65 00:05:23,970 --> 00:05:27,820 to 15 corresponding to the directions 0 to 15, 66 00:05:27,840 --> 00:05:33,110 then the positive rewards, that is the rewards we get by applying the perturbations in the positive directions, 67 00:05:33,150 --> 00:05:39,030 in other words these guys here, and the negative rewards, that is the rewards we get by applying the perturbations 68 00:05:39,060 --> 00:05:42,650 in the negative directions, or the opposite directions to these ones. 69 00:05:42,990 --> 00:05:45,610 And these are these ones, the negative rewards. 70 00:05:45,630 --> 00:05:46,530 So all good.
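The dictionary described above can be sketched as follows. This is a minimal illustration, not the course file itself: the names positive_rewards and negative_rewards are assumed spellings of the spoken names, the reward values are made up, and only 4 directions are used instead of 16 to keep the output short.

```python
# Made-up rewards for 4 directions (the walkthrough uses 16).
# positive_rewards[k]: reward with the perturbation applied in the positive direction.
# negative_rewards[k]: reward with the same perturbation applied in the opposite direction.
positive_rewards = [1.0, 5.0, -2.0, 3.0]
negative_rewards = [4.0, 0.5, -1.0, 3.5]

# zip pairs the two rewards of each direction, enumerate supplies the key k,
# and the for loop sits inside the curly brackets (a dictionary comprehension).
scores = {
    k: max(r_pos, r_neg)
    for k, (r_pos, r_neg) in enumerate(zip(positive_rewards, negative_rewards))
}

print(scores)  # {0: 4.0, 1: 5.0, 2: -1.0, 3: 3.5}
```

Each key is a direction index and each value is the larger of that direction's two rewards, which is the quantity the directions get sorted by.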
71 00:05:46,530 --> 00:05:55,200 We just made a dictionary taking as keys the integers from 0 to 15 corresponding to the 16 directions 72 00:05:55,200 --> 00:06:01,970 that we're testing, and as values the maximums of the rewards in the positive direction and the rewards in 73 00:06:01,970 --> 00:06:03,430 the negative direction. 74 00:06:03,630 --> 00:06:04,070 All right. 75 00:06:04,110 --> 00:06:07,270 So that's the first thing done. Then, next step. 76 00:06:07,410 --> 00:06:13,470 Now that we have the scores we want to sort them, and because we made a dictionary it's going to be very 77 00:06:13,470 --> 00:06:19,650 easy for us to sort them. Indeed we're going to use the sorted function, which is a function by Python, 78 00:06:20,100 --> 00:06:27,960 to sort just the keys of the dictionary by their values, you know, from their highest value to their 79 00:06:27,960 --> 00:06:31,910 lowest value, which means that we're not going to get, 80 00:06:31,920 --> 00:06:37,360 this time, the maximums themselves, we're just going to get these indexes in a list. 81 00:06:37,520 --> 00:06:43,010 But instead of having the list from, you know, 0, 1, 2, 3 to 15, we'll get the list 82 00:06:43,050 --> 00:06:47,600 in this specific order, sorted by the maximums of these rewards. 83 00:06:47,610 --> 00:06:53,580 So for example if the highest of these maximums is at index 3 and if the second highest is at 84 00:06:53,580 --> 00:07:02,010 index 7, for example, well, the order list that we're about to get will start with 3 and 7. 85 00:07:02,130 --> 00:07:07,950 And then because we have a dictionary with all the keys and values well organized, well, we'll be able 86 00:07:07,950 --> 00:07:13,260 to find the highest rewards thanks to the indexes, which are the keys. 87 00:07:13,260 --> 00:07:18,810 All right, so what I'm going to do now is introduce a new variable which will be order, and which will 88 00:07:18,810 --> 00:07:22,880 be a list that will contain these keys of the highest
89 00:07:22,950 --> 00:07:28,330 couples of rewards, right, of the highest maximums of positive reward and negative reward. 90 00:07:28,590 --> 00:07:32,000 So how are we going to get these keys, these indexes? 91 00:07:32,250 --> 00:07:33,380 Well, it's easy. 92 00:07:33,420 --> 00:07:39,750 We need to take the sorted function by Python, which will sort your dictionary. 93 00:07:40,080 --> 00:07:46,290 And in this sorted function the first thing to specify, or input, is what you want to get in order: 94 00:07:46,630 --> 00:07:49,480 do you want to get the maximums of the rewards, 95 00:07:49,500 --> 00:07:51,590 or do you want to get the keys? 96 00:07:51,770 --> 00:07:56,790 Well, we're going to say that we want to get the keys, because these are the keys that will help us find everything, 97 00:07:57,150 --> 00:08:03,120 meaning the directions and the rewards in the positive and opposite directions for that specific 98 00:08:03,120 --> 00:08:04,100 direction. 99 00:08:04,110 --> 00:08:10,890 So what you want are the keys, and therefore the first thing I'm going to specify here is 100 00:08:11,220 --> 00:08:16,130 scores.keys, that is the keys of my scores dictionary. 101 00:08:16,560 --> 00:08:24,210 Then we add some parentheses here, and then here we go, we get the keys. And now you add a comma, and that's 102 00:08:24,210 --> 00:08:29,670 when you specify by what you want to sort the keys of your dictionary. 103 00:08:29,970 --> 00:08:36,210 So you get the keys of your dictionary, but then you want to sort them by these maximums for each of 104 00:08:36,210 --> 00:08:42,030 the directions going from 0 to 15, because we have 16 directions in total. 105 00:08:42,570 --> 00:08:45,760 And so to do this you're going to do it through a function. 106 00:08:45,840 --> 00:08:53,010 It's basically the argument expected by the sorted function here, and this argument is called key, but 107 00:08:53,010 --> 00:08:57,300 be careful, it's not to be confused with the keys of the dictionary.
108 00:08:57,300 --> 00:09:04,620 So we take key, and then we add equals, and then lambda, which is just to specify that we're about to make 109 00:09:04,620 --> 00:09:11,940 a function, then the argument of the function, which we call x, which will be the key of the dictionary, 110 00:09:12,750 --> 00:09:20,610 then a colon, and then what this lambda function will return, which are the scores, right, the scores of the 111 00:09:20,610 --> 00:09:22,150 keys x. 112 00:09:22,380 --> 00:09:31,320 So that right now, thanks to this function here, we know that we want to sort our scores dictionary by 113 00:09:31,320 --> 00:09:37,440 the scores here, which are the maximums of the reward we get in the positive direction and the reward we get 114 00:09:37,440 --> 00:09:44,790 in the negative direction for the 16 different directions, and by specifying scores.keys here we'll 115 00:09:44,790 --> 00:09:52,530 get not these maximums returned in this order list, but the keys of the dictionary corresponding to 116 00:09:52,530 --> 00:09:57,810 the highest rewards among the 16 positive and opposite directions. 117 00:09:57,810 --> 00:10:04,540 All right, so that's a very useful trick in Python if you want to sort a mapping of indexes and values. 118 00:10:04,680 --> 00:10:10,470 You know, the indexes here were just the indexes of the directions from 0 to 15, and the values were 119 00:10:10,560 --> 00:10:15,380 the maximums of the rewards corresponding to each of these 16 directions. 120 00:10:15,390 --> 00:10:23,090 So that's a useful trick to know. But that's not over here: we get a list here, but we want, remember, 121 00:10:23,390 --> 00:10:26,890 to get the best directions. 122 00:10:26,930 --> 00:10:27,560 Right. 123 00:10:27,560 --> 00:10:32,100 We have the total number of directions but we also have the total number of best directions.
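The sorting call described above can be sketched like this, using a small made-up scores dictionary with 4 directions rather than 16. One detail worth flagging as an addition of this sketch: Python's sorted returns ascending order by default, so reverse=True is passed here to get the keys from highest score to lowest, as the walkthrough intends. That argument is not mentioned in the spoken explanation.

```python
# Made-up scores: key = direction index, value = max(positive reward, negative reward).
scores = {0: 4.0, 1: 5.0, 2: -1.0, 3: 3.5}

# Sort the KEYS, but compare them by their VALUES.
# key=... is the argument of sorted, unrelated to dictionary keys;
# lambda x: scores[x] maps a direction index x to its score.
ascending = sorted(scores.keys(), key=lambda x: scores[x])
order = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

print(ascending)  # [2, 3, 0, 1]  (default: lowest score first)
print(order)      # [1, 0, 3, 2]  (reverse=True: highest score first)
```

Because the result holds direction indexes rather than scores, each entry can still be used to look up that direction's rewards and perturbation.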
124 00:10:32,180 --> 00:10:40,100 And remember that in the paper here we're not summing over all the directions but over the best directions, 125 00:10:40,130 --> 00:10:42,480 b being the number of best directions here. 126 00:10:42,920 --> 00:10:49,520 So what we simply need to add here is, you know, since this is a list, well, we're going to add some brackets 127 00:10:49,940 --> 00:10:53,200 to take the best directions only. 128 00:10:53,540 --> 00:10:54,680 And how can we do that? 129 00:10:54,680 --> 00:11:02,090 Well, since this list is already sorted by the maximums of the reward in the positive direction and the 130 00:11:02,090 --> 00:11:03,690 reward in the opposite direction, 131 00:11:03,920 --> 00:11:12,050 well, what we simply need to do here to get the best directions is just to go from zero to the number 132 00:11:12,920 --> 00:11:20,600 of best directions, and you know, we don't even have to specify the zero here because the lower bound 133 00:11:20,630 --> 00:11:22,650 of a range is by default zero. 134 00:11:22,940 --> 00:11:31,670 So here we are taking the first elements, as many as the number of best directions, which correspond to the highest maximums 135 00:11:31,670 --> 00:11:38,320 of positive reward and opposite reward, because indeed the list is sorted by these maximums. 136 00:11:38,330 --> 00:11:40,640 All right, so that's the simple thing we have to do here. 137 00:11:40,820 --> 00:11:46,440 And this gives us exactly what we want, or at least exactly what is said in the paper. 138 00:11:46,460 --> 00:11:47,260 Perfect. 139 00:11:47,570 --> 00:11:54,080 So now we have one final thing to do. Now we have these indexes, but that's not all of course. 140 00:11:54,090 --> 00:12:01,010 Now we want to use these indexes to get indeed the rewards we get by applying the perturbations in the 141 00:12:01,040 --> 00:12:02,930 best directions that are sorted.
142 00:12:03,170 --> 00:12:09,080 And also the rewards we get by applying the perturbations in the opposite directions of these best directions, 143 00:12:09,410 --> 00:12:11,180 and of course the perturbations. 144 00:12:11,180 --> 00:12:15,860 Why do we want to get these three things? Well, of course because in the next step we'll make the update 145 00:12:15,860 --> 00:12:22,670 step, which will take the rewards we get in the positive best directions and the rewards we get in the negative 146 00:12:22,700 --> 00:12:30,490 best directions, and also the perturbations, the values of the perturbations for each of these best directions. 147 00:12:30,740 --> 00:12:37,160 So that's why we want to gather all this now, and then we'll be able to make that update step, which corresponds 148 00:12:37,160 --> 00:12:43,760 to one step of gradient descent, with an approximated gradient, in order to update the weights in these best directions. 149 00:12:43,760 --> 00:12:47,120 All right, so now what are we going to do? 150 00:12:47,120 --> 00:12:51,250 Remember we talked about a new concept before, which are the rollouts. 151 00:12:51,260 --> 00:12:55,650 Well, here we go, you know, that's the update function, which takes as input the rollouts. 152 00:12:55,880 --> 00:13:01,250 And therefore since the next step will be to apply the update function to make this one step of gradient 153 00:13:01,250 --> 00:13:01,900 descent, 154 00:13:02,090 --> 00:13:09,800 well, we need to gather these rollouts, and I remind you that these rollouts are nothing else than, for the best directions, the 155 00:13:09,810 --> 00:13:16,250 triplets of the rewards in the positive best directions, the rewards in the opposite best directions and the 156 00:13:16,250 --> 00:13:17,580 perturbation d.
157 00:13:17,750 --> 00:13:23,180 So what we need to do now is prepare those rollouts, and this will be very easy for us because indeed 158 00:13:23,540 --> 00:13:28,700 we have all the keys of the dictionary for the best directions, and therefore we'll use these keys to get 159 00:13:28,940 --> 00:13:31,980 the positive reward, the negative reward and the perturbation. 160 00:13:32,330 --> 00:13:34,660 So let's do it, let's prepare here 161 00:13:34,700 --> 00:13:44,750 our rollouts. So, rollouts, which will be the list of the following triplets, composed of, first, the positive 162 00:13:45,500 --> 00:13:46,430 rewards, 163 00:13:46,430 --> 00:13:52,030 that is the rewards obtained by applying the perturbations in the positive direction for the best directions. 164 00:13:52,030 --> 00:13:59,210 So here, since we want to take the best directions, of course we're going to add here the index k, and k, you will 165 00:13:59,210 --> 00:14:05,810 see, will be all the k's in the order list, because the order list contains indeed the indexes of the best 166 00:14:05,810 --> 00:14:06,900 directions. 167 00:14:06,930 --> 00:14:08,040 So positive rewards, 168 00:14:08,070 --> 00:14:16,790 k. Then of course I'm going to copy this because the next element of the triplet will be the negative 169 00:14:17,360 --> 00:14:18,630 rewards, k. 170 00:14:18,920 --> 00:14:21,090 So I'm replacing that here: 171 00:14:21,110 --> 00:14:22,590 negative rewards, k. 172 00:14:22,940 --> 00:14:29,370 And then of course the third element of the triplet, as we said, is the perturbation, deltas. 173 00:14:29,540 --> 00:14:37,010 And again we take index k of the best directions in order to get these perturbations. The final 174 00:14:37,010 --> 00:14:42,940 touch of what we're doing now is to add of course a new for loop inside the list. 175 00:14:42,950 --> 00:14:46,180 I told you that we can do it inside a dictionary or inside a list. 176 00:14:46,250 --> 00:14:48,170 It's a very classic trick in Python.
177 00:14:48,290 --> 00:14:56,120 And of course we're going to get all the k's in the order list, composed of the indexes of the best 178 00:14:56,120 --> 00:14:59,510 directions leading to the highest rewards. 179 00:14:59,510 --> 00:15:06,680 All right, so here we go, we have the rollouts now, and so we are ready to make that update step, to do this 180 00:15:06,680 --> 00:15:10,690 one step of gradient descent to update the weights in these best directions. 181 00:15:11,000 --> 00:15:12,650 Let's do that in the next tutorial. 182 00:15:12,650 --> 00:15:14,540 And until then, enjoy AI.
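The last two operations of this step, keeping only the best directions and building the rollouts, can be sketched as follows. Everything concrete here is an assumption for illustration: the variable names (positive_rewards, negative_rewards, deltas, nb_best_directions) are assumed spellings of the spoken names, the rewards are made up, the perturbations are replaced by placeholder strings, and 4 directions with 2 best directions stand in for the 16 directions of the walkthrough.

```python
# Made-up data for 4 directions.
positive_rewards = [1.0, 5.0, -2.0, 3.0]
negative_rewards = [4.0, 0.5, -1.0, 3.5]
# Placeholders for the perturbations; in practice these would be
# arrays shaped like the weights being trained.
deltas = ["delta_0", "delta_1", "delta_2", "delta_3"]
nb_best_directions = 2  # hypothetical value, must not exceed the number of directions

scores = {
    k: max(r_pos, r_neg)
    for k, (r_pos, r_neg) in enumerate(zip(positive_rewards, negative_rewards))
}

# Sort the direction indexes by score, highest first (reverse=True is an
# addition of this sketch), then slice: [:n] is the same as [0:n].
order = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)[:nb_best_directions]

# One triplet per best direction: (positive reward, negative reward, perturbation).
rollouts = [(positive_rewards[k], negative_rewards[k], deltas[k]) for k in order]

print(order)     # [1, 0]
print(rollouts)  # [(5.0, 0.5, 'delta_1'), (1.0, 4.0, 'delta_0')]
```

The rollouts list keeps the three pieces of each best direction together, which is the input the update step is described as taking.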