1
00:00:01,240 --> 00:00:05,830
Hello and welcome back to the course on augmented random search, and in today's tutorial we're going

2
00:00:05,830 --> 00:00:09,060
to compare basic versus augmented random search.

3
00:00:09,190 --> 00:00:13,990
So we're going to find out why augmented random search is called augmented random search.

4
00:00:13,990 --> 00:00:17,440
What does the word "augmented" in the name stand for?

5
00:00:17,470 --> 00:00:18,700
So let's have a look.

6
00:00:18,700 --> 00:00:26,230
We're going to look at three main updates that were made in ARS to make it augmented.

7
00:00:26,370 --> 00:00:32,640
The first one is scaling the update step by the standard deviation of the rewards.

8
00:00:32,850 --> 00:00:35,640
The second one is online normalization of the states.

9
00:00:35,640 --> 00:00:41,230
And the third one is discarding the directions that yield the lowest rewards.

10
00:00:41,400 --> 00:00:51,000
So the first one, which is about scaling the update step by the standard deviation of the rewards, is as simple as taking

11
00:00:51,060 --> 00:00:57,840
this calculation, which we discussed in the previous tutorial, and dividing it by the standard deviation

12
00:00:57,840 --> 00:01:01,070
of the rewards involved in this calculation.
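A minimal sketch of this first update, assuming the weights are flattened into a plain Python list and that r_pos[k] and r_neg[k] are the rewards of the rollouts run with theta + nu * delta_k and theta - nu * delta_k (nu being the exploration noise). The function and parameter names are hypothetical, not taken from the authors' code. The step is the sum of (r+ - r-) * delta_k, divided by the number of directions b and by the standard deviation sigma_R of all the rewards used in the step.

```python
from statistics import pstdev


def scaled_update(theta, deltas, r_pos, r_neg, alpha):
    """One update step scaled by the standard deviation of the rewards.

    theta  -- current weights, flattened into a list of floats
    deltas -- b perturbation directions, each the same length as theta
    r_pos  -- reward of the rollout with theta + nu * delta_k, for each k
    r_neg  -- reward of the rollout with theta - nu * delta_k, for each k
    alpha  -- step size (learning rate)
    """
    b = len(deltas)
    # sigma_R: standard deviation of all 2b rewards collected for this step
    sigma_r = pstdev(list(r_pos) + list(r_neg))
    if sigma_r == 0:
        sigma_r = 1.0  # assumed guard: avoid division by zero if all rewards are equal

    # Sum of (r+ - r-) * delta_k over all directions
    step = [0.0] * len(theta)
    for delta, rp, rn in zip(deltas, r_pos, r_neg):
        for i, d in enumerate(delta):
            step[i] += (rp - rn) * d

    return [t + alpha / (b * sigma_r) * s for t, s in zip(theta, step)]
```

Because sigma_R is computed from the same rewards, multiplying every reward by a positive constant leaves the step unchanged, which is one visible effect of this scaling.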

13
00:01:01,080 --> 00:01:05,330
Now it's not as important for the purposes of our intuition.

14
00:01:05,340 --> 00:01:10,610
There's a technical reason behind this, but we're not going to dive into it.

15
00:01:10,800 --> 00:01:15,200
You can find more information on this in Section 3.1

16
00:01:15,230 --> 00:01:20,070
of the research paper, which we'll reference at the end of this tutorial. It's the same research paper that

17
00:01:20,190 --> 00:01:24,700
we referenced in the previous article, the main research paper.

18
00:01:24,720 --> 00:01:27,770
So this is not as important for us as the other two.

19
00:01:27,770 --> 00:01:29,570
So let's move on from here.

20
00:01:29,580 --> 00:01:34,410
But it's still something important to keep in mind, something to look up in the paper or

21
00:01:34,700 --> 00:01:37,050
to look out for in the practical tutorials.

22
00:01:37,260 --> 00:01:43,940
The second update is online normalization of states, and "online" basically means real-time normalization

23
00:01:43,980 --> 00:01:50,070
of states as the AI is learning, as the AI is going through these environments.

24
00:01:50,070 --> 00:01:54,690
The states are normalized. So what does normalizing the states mean?

25
00:01:54,690 --> 00:02:02,010
It means normalizing the values that are the inputs over here, and not just normalizing them based on what

26
00:02:02,010 --> 00:02:11,280
they are, but normalizing them based on the values we've already seen, so that everything is treated in

27
00:02:11,280 --> 00:02:11,960
a similar way.

28
00:02:11,970 --> 00:02:16,350
They talk about this in more detail in their research paper, and this will be one of the first

29
00:02:16,350 --> 00:02:18,000
things that we'll code.

30
00:02:18,000 --> 00:02:21,980
I'll just give a quick overview of how to intuitively think about this.

31
00:02:21,990 --> 00:02:24,940
And this is the same example as they give in the research paper.

32
00:02:25,080 --> 00:02:32,430
So imagine that in one case these input states range between 90 and 100, so these values

33
00:02:32,430 --> 00:02:38,010
are things like 91, 92 and maybe 99.

34
00:02:38,100 --> 00:02:40,860
In another case they're between minus 1 and 1.

35
00:02:40,860 --> 00:02:48,140
Now the thing is about how we change the weights. As you remember,

36
00:02:48,140 --> 00:02:54,870
we learn by perturbing the weights, changing them slightly by something like 0.1.

37
00:02:55,250 --> 00:03:02,750
In the case where this value is 100, that is going to yield a much more drastic change in the output than

38
00:03:02,750 --> 00:03:08,810
when this value is something like 1 or 0.5.
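The effect described above can be checked with a little arithmetic. This sketch assumes a single linear unit, y = sum of w_i * x_i (ARS works with linear policies); the helper names are hypothetical. Nudging one weight by the same 0.1 moves the output by 0.1 times the input that weight multiplies.

```python
def linear_output(weights, state):
    # A single linear unit: y = sum_i w_i * x_i
    return sum(w * x for w, x in zip(weights, state))


def output_change(weights, state, i, eps):
    # Nudge weight i by eps and report how much the output moves
    perturbed = list(weights)
    perturbed[i] += eps
    return linear_output(perturbed, state) - linear_output(weights, state)


# The same 0.1 nudge, applied when the input is 100 versus 0.5
print(output_change([0.2], [100.0], 0, 0.1))
print(output_change([0.2], [0.5], 0, 0.1))
```

The first call moves the output by about 10 and the second by about 0.05, a 200-fold difference caused only by the range of the input.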

39
00:03:08,810 --> 00:03:16,880
So basically, the environment might be described with

40
00:03:16,880 --> 00:03:21,200
values in different ranges, which is fair. The environment might be changing; there might be

41
00:03:21,200 --> 00:03:26,800
different forces acting, different types of terrain, or things like that.

42
00:03:26,960 --> 00:03:33,590
And because these values might be in different ranges, what can happen is that a

43
00:03:33,590 --> 00:03:39,950
slight perturbation of the weights might have a different magnitude of impact on the output

44
00:03:39,950 --> 00:03:47,780
values, even though the only reason for that is that the agent is in a different environment. So we

45
00:03:47,780 --> 00:03:49,970
want to minimize that effect.

46
00:03:50,120 --> 00:03:58,100
We want the perturbations in the weights to be fair, for the perturbations to

47
00:03:58,100 --> 00:04:01,270
be treated fairly across the whole training process.

48
00:04:01,400 --> 00:04:05,930
That's why we're going to normalize the states, and not just normalize them in isolation:

49
00:04:05,930 --> 00:04:11,600
we're going to normalize them online, which is also called whitening of states, meaning we're

50
00:04:11,600 --> 00:04:18,350
going to take the values that we've already seen for these states and we're going to normalize these

51
00:04:18,350 --> 00:04:20,520
values alongside what we've already seen.

52
00:04:20,690 --> 00:04:28,220
That way the weights won't be affected as drastically by the different ranges

53
00:04:28,220 --> 00:04:29,390
of these input values.
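A minimal sketch of online state normalization. The paper describes this in terms of the mean and covariance of the states seen so far; the sketch below keeps only a running per-input mean and standard deviation (Welford's algorithm), which is an assumption about the bookkeeping, not the authors' code. The class and method names are hypothetical.

```python
import math


class OnlineNormalizer:
    """Running per-input mean and standard deviation, updated one state at a time."""

    def __init__(self, n_inputs):
        self.n = 0
        self.mean = [0.0] * n_inputs
        self.m2 = [0.0] * n_inputs  # running sum of squared deviations (Welford)

    def observe(self, state):
        # Fold one newly seen state into the running statistics
        self.n += 1
        for i, x in enumerate(state):
            old_mean = self.mean[i]
            self.mean[i] += (x - old_mean) / self.n
            self.m2[i] += (x - old_mean) * (x - self.mean[i])

    def normalize(self, state):
        # Standardize a state using everything observed so far
        out = []
        for i, x in enumerate(state):
            var = self.m2[i] / self.n if self.n > 1 else 0.0
            std = math.sqrt(var) if var > 1e-8 else 1.0  # assumed guard for constant inputs
            out.append((x - self.mean[i]) / std)
        return out
```

If we observe the states [90, -1], [95, 0] and [100, 1], so that the first input lives around 90 to 100 and the second between -1 and 1, then normalizing [100, 1] gives the same value (about 1.22) in both coordinates, and a weight perturbation now has a comparable effect on either input.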

54
00:04:29,600 --> 00:04:37,700
So that was one of the biggest effects. If you read the blog post that was in the additional reading at the

55
00:04:37,700 --> 00:04:42,500
start of this section, you would have noticed that the supervisor of

56
00:04:45,530 --> 00:04:53,130
Horia Mania and Aurelia Guy actually mentioned that this was one of the biggest changes that helped

57
00:04:53,570 --> 00:04:58,610
them tackle challenges such as humanoid walking.

58
00:04:58,610 --> 00:05:00,320
All right so that's important to understand.

59
00:05:00,320 --> 00:05:04,970
And again, you'll see more of that on the practical side of things

60
00:05:04,970 --> 00:05:13,390
and in the research paper. Finally, the third update was discarding the directions that yield the lowest rewards.

61
00:05:13,430 --> 00:05:15,550
Quite a bold move.

62
00:05:15,920 --> 00:05:19,850
And it turned out to be a good bet for the authors.

63
00:05:19,850 --> 00:05:24,890
So here we've got our four, or actually eight, different perturbations:

64
00:05:24,890 --> 00:05:31,940
four positive and four negative, the ones that we talked about in the previous tutorial. And here we've got the results

65
00:05:32,600 --> 00:05:36,140
of how they went, and we've got the rewards.

66
00:05:36,320 --> 00:05:40,760
As you recall, previously we would use all of them, but now they're saying,

67
00:05:40,790 --> 00:05:48,320
"All right, we're only going to take the top k of these rewards, the top k of these perturbations,

68
00:05:48,320 --> 00:05:52,940
and actually work with them, and we're going to discard the ones that weren't part of

69
00:05:52,940 --> 00:05:53,290
the top k."

70
00:05:53,320 --> 00:05:58,060
OK, so we're going to take this one, we're going to take this one, and we're going to throw away these

71
00:05:58,220 --> 00:06:03,860
two, or these four, perturbations. We're not going to include them in our calculation, and

72
00:06:03,860 --> 00:06:05,010
that changes the formula

73
00:06:05,010 --> 00:06:06,280
we discussed over here.

74
00:06:06,590 --> 00:06:13,160
The formula won't include these anymore. Basically, the weights are going to evolve in the direction

75
00:06:13,160 --> 00:06:18,520
of only the most successful results that we saw.
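A minimal sketch of keeping only the best directions. In the paper the number of kept directions is written b out of N sampled directions (the "top k" above), and each direction is ranked by the better of its two rollouts, max(r+, r-); the sigma_R scaling is then computed from the rewards of the kept directions only. The helper names are hypothetical, and the weights are again assumed to be a flat Python list.

```python
from statistics import pstdev


def select_top_directions(r_pos, r_neg, b):
    # Score each direction by its better rollout, max(r+, r-), and keep the b best
    scores = [max(rp, rn) for rp, rn in zip(r_pos, r_neg)]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)[:b]


def top_b_update(theta, deltas, r_pos, r_neg, alpha, b):
    keep = select_top_directions(r_pos, r_neg, b)

    # sigma_R is computed only from the rewards of the kept directions
    kept_rewards = [r_pos[k] for k in keep] + [r_neg[k] for k in keep]
    sigma_r = pstdev(kept_rewards) or 1.0  # fall back to 1.0 if all kept rewards are equal

    step = [0.0] * len(theta)
    for k in keep:  # discarded directions contribute nothing
        diff = r_pos[k] - r_neg[k]
        for i, d in enumerate(deltas[k]):
            step[i] += diff * d
    return [t + alpha / (b * sigma_r) * s for t, s in zip(theta, step)]
```

With four directions (eight rollouts) and b = 2, only the two best-scoring directions move the weights; setting b equal to the number of directions recovers the update that uses all of them.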

76
00:06:19,230 --> 00:06:27,440
So that was also a significant update that helped this method, and it's quite an intuitive

77
00:06:27,440 --> 00:06:28,240
one as well.

78
00:06:28,250 --> 00:06:28,510
Why

79
00:06:28,520 --> 00:06:36,270
not take the directions of evolution that have the highest potential,

80
00:06:36,290 --> 00:06:42,630
evolve just in those directions, and discard the ones with the lowest potential? It sounds a bit like natural selection.

81
00:06:42,950 --> 00:06:50,330
So those are the three main differences. If you'd like to get more detail, of course you'll

82
00:06:50,330 --> 00:06:53,300
see all of these things in the practical tutorials.

83
00:06:53,300 --> 00:06:57,710
But if you'd like to read more about them and understand them not just on an intuitive basis

84
00:06:57,740 --> 00:07:05,810
but more philosophically or more mathematically, then once again I'd like to refer you to

85
00:07:06,140 --> 00:07:12,700
the research paper by Horia Mania and Aurelia Guy on augmented random search, the same research paper

86
00:07:12,710 --> 00:07:13,770
we referenced in the previous tutorial.

87
00:07:13,780 --> 00:07:19,310
But just in case you skipped that tutorial because it was quite advanced, this is a good opportunity

88
00:07:19,370 --> 00:07:23,530
to get back to that research paper and have a look at some of these things.

89
00:07:23,660 --> 00:07:27,530
And on that note, I look forward to seeing you in the next tutorial.

90
00:07:27,540 --> 00:07:29,240
And until then enjoy AI.