1
00:00:00,330 --> 00:00:05,400
Hello and welcome back to the course on augmented a random search in today's tutorial we're going to

2
00:00:05,400 --> 00:00:08,920
be comparing Erris versus other AI algorithms.

3
00:00:09,090 --> 00:00:15,600
And the reason why we included this tutorial in the course is to highlight the main differences between

4
00:00:15,600 --> 00:00:20,930
areas and other standard or more conventional AI algorithms.

5
00:00:20,940 --> 00:00:25,590
It's pretty hard to say conventional because all of this is cutting edge technology and cutting edge

6
00:00:26,670 --> 00:00:32,580
models and algorithms but nevertheless we want to contrast areas which is brand new and how is it different

7
00:00:32,970 --> 00:00:37,820
to other Algor algorithms that exist out there that we've seen before.

8
00:00:37,920 --> 00:00:44,130
And hopefully this will give you a good overview of how the contrast and how you'll be more prepared

9
00:00:44,130 --> 00:00:45,150
for these conversations.

10
00:00:45,150 --> 00:00:49,800
Or know what the advantages of areas are with a difference.

11
00:00:50,100 --> 00:00:55,080
So let's have a look on the left you can have a column for features or areas that are going to be three

12
00:00:55,080 --> 00:00:57,550
main distinctions between the two.

13
00:00:57,650 --> 00:01:00,880
There's the full list but those are the ones that we're going to cover.

14
00:01:01,110 --> 00:01:03,610
And those are the ones we find the most important ones.

15
00:01:03,690 --> 00:01:05,810
And on the road we're going to have other area.

16
00:01:06,180 --> 00:01:13,500
So number one is exploration Eris performs the exploration in the policy space weather whereas other

17
00:01:13,500 --> 00:01:17,000
AI usually performs exploration in action space.

18
00:01:17,250 --> 00:01:21,490
So let's have a look at what this actually means in Eris.

19
00:01:21,510 --> 00:01:23,200
We've already discussed that.

20
00:01:23,220 --> 00:01:30,750
We've got this Perceptor on and then we wait until the agent gets to the end of the episode and then

21
00:01:30,750 --> 00:01:38,400
after based on the result whether they successfully got to the end and one or they fell over at some

22
00:01:38,400 --> 00:01:43,390
point we get a reward which we calque which we use to adjust the weights.

23
00:01:43,590 --> 00:01:47,220
Well the thing is environment's actually are set up.

24
00:01:47,370 --> 00:01:52,860
For instance this merger go environment is set up independent of the algorithm that's going to be applied.

25
00:01:52,860 --> 00:01:58,140
So there's this environment and you can apply arrest to it you can apply a different AI or any other

26
00:01:58,140 --> 00:01:59,400
algorithm to it.

27
00:01:59,520 --> 00:02:04,710
So and the thing is that the environment is set up to provide it provide rewards not just at the end

28
00:02:04,710 --> 00:02:07,450
of the episode but after every single action.

29
00:02:07,450 --> 00:02:09,360
Just that's just how the environment works.

30
00:02:09,360 --> 00:02:14,040
Every time the agent does something the environment provides a reward.

31
00:02:14,040 --> 00:02:19,110
For instance if it's getting closer to the target then it's getting a positive reward if it's falling

32
00:02:19,140 --> 00:02:23,060
over or if it's getting further than the target might be getting a negative reward.

33
00:02:23,070 --> 00:02:27,370
In any case there's a reward that's provided after every single action.

34
00:02:27,500 --> 00:02:35,430
What Erez does though is it doesn't take advantage of the opportunity to look at the reward after every

35
00:02:35,430 --> 00:02:42,650
single action arrest accumulates the total reward and then looks at the total reward after the episode.

36
00:02:42,780 --> 00:02:50,460
That's important to understand and remember that it's kind of like not using it could be looking at

37
00:02:50,550 --> 00:02:52,530
the reward after every action but it's not.

38
00:02:52,530 --> 00:03:00,390
So it's actually not using the all of the possibilities that the environment provides and yet is still

39
00:03:00,390 --> 00:03:02,850
stronger than other AI's surprising.

40
00:03:03,690 --> 00:03:11,570
And this is kind in contrast to how other AI usually works and the way they work is the Jesus lives

41
00:03:11,570 --> 00:03:13,000
from artificial intelligence.

42
00:03:13,010 --> 00:03:16,620
It is of course if you've been part of that course then you'll remember them.

43
00:03:16,800 --> 00:03:22,650
Normally AI performs actions and gets rewards so it goes into a new state and gets a reward right away

44
00:03:22,650 --> 00:03:27,060
so that action gets a reward performs an action gets reward and it analyzes the reward.

45
00:03:27,120 --> 00:03:33,890
In contrast to Erris normally the AI reinforcement learning and keep learning things like that.

46
00:03:33,990 --> 00:03:38,060
They analyze the reward right away as soon as the action is performed they get the reward.

47
00:03:38,070 --> 00:03:44,400
They analyze the reward and what this allows them to do is apply the Belman equation and build a value

48
00:03:44,400 --> 00:03:45,730
function for states.

49
00:03:45,750 --> 00:03:52,080
So then they know the values of different states of being or of being in different states and therefore

50
00:03:52,110 --> 00:03:57,930
after that they can use those value functions in order to make their decisions and create the maps are

51
00:03:57,930 --> 00:04:00,470
they going to get through these environments.

52
00:04:00,960 --> 00:04:08,700
So that's the main difference in Erris we focus on the total reward for the whole episode.

53
00:04:08,970 --> 00:04:13,710
And basically we're analyzing we're exploring therefore we're not exploring every single action.

54
00:04:13,710 --> 00:04:15,390
We're exploring the whole policy.

55
00:04:15,390 --> 00:04:18,950
So the whole approach of how to go through an episode.

56
00:04:18,960 --> 00:04:25,530
So we're and we're assessing the whole episode the whole episode and therefore it's we're exploring

57
00:04:25,530 --> 00:04:32,760
in the policy state space whereas in other AI Normally we're exploring actions and therefore we're exploring

58
00:04:32,760 --> 00:04:33,770
in action space.

59
00:04:33,840 --> 00:04:36,800
So that's what we have exploration in the policy space.

60
00:04:36,800 --> 00:04:41,370
We know from start to finish for the whole episode versus exploration action space where we explore

61
00:04:41,370 --> 00:04:46,070
every individual action and that's the difference in a way.

62
00:04:46,070 --> 00:04:51,390
If you have an artificial intelligence that is of course you remember that we had a tutorial on eligibility

63
00:04:51,390 --> 00:04:59,550
trace in in there for AI and that's exactly that's kind of similar That's what IRS does is kind of similar

64
00:04:59,580 --> 00:05:02,260
to disability trace we're in eligibility trace.

65
00:05:02,270 --> 00:05:09,140
We didn't look at every single action but we matched up the actions that we did had five or 10 or 15

66
00:05:09,140 --> 00:05:14,570
actions in a row and then we only looked at the reward and we remember from there that that actually

67
00:05:14,570 --> 00:05:17,540
improved the performance of other.

68
00:05:17,580 --> 00:05:23,420
We won't go into detail on this but just just a hint that if you are aware of eligibility trace arrest

69
00:05:23,420 --> 00:05:29,200
is kind of like similar thinking in the similar action as eligibility trait.

70
00:05:29,390 --> 00:05:29,960
All right.

71
00:05:29,960 --> 00:05:32,700
Next step two is what.

72
00:05:32,720 --> 00:05:40,620
How is that updating of weights performed in as it's the method of finite differences in other AI it's

73
00:05:40,820 --> 00:05:46,900
Grady and the famous or famous gradient descent algorithm.

74
00:05:46,940 --> 00:05:56,070
So we talked about the matter of finite differences or here where we basically have a small point a

75
00:05:56,090 --> 00:06:02,690
positive shift a negative shift or a positive delta negative delta we have the rewards and based on

76
00:06:02,690 --> 00:06:06,660
that we can calculate how to adjust our weights.

77
00:06:06,710 --> 00:06:15,550
Overall the difference is that in other AI AI we don't use Mithal like we actually use gradient Paul's

78
00:06:15,590 --> 00:06:19,200
descent which is proper differentiation.

79
00:06:19,290 --> 00:06:22,560
All of the loss that you're getting.

80
00:06:22,700 --> 00:06:27,940
So in Erris you get a reward in not normally in a you get you get a loss.

81
00:06:28,250 --> 00:06:36,140
So it's a proper differentiation of the last based on the weights and going backwards through through

82
00:06:36,140 --> 00:06:36,920
the neck.

83
00:06:37,490 --> 00:06:46,820
So that is what we do normally I approach skol also known as a back propagation of the error through

84
00:06:46,820 --> 00:06:48,680
the network to update the weights.

85
00:06:48,680 --> 00:06:55,580
The reason why we can do that in Erris why we have to do the method of finite differences is because

86
00:06:55,580 --> 00:06:58,160
we simply don't have a value function.

87
00:06:58,160 --> 00:07:04,130
We just don't because we're not exploring an actual space we don't have that value function that we

88
00:07:04,130 --> 00:07:10,010
can then we're going to apply this method to and therefore we have to deal with what we have.

89
00:07:10,010 --> 00:07:14,160
And there that's why the method of finite differences is used.

90
00:07:14,160 --> 00:07:14,730
All right.

91
00:07:14,810 --> 00:07:17,650
And it works totally fine and works really great.

92
00:07:17,750 --> 00:07:25,100
In fact in that main research paper for Erris they talk about that this is a good enough approximation

93
00:07:25,160 --> 00:07:27,360
of the gradient.

94
00:07:27,830 --> 00:07:28,520
All right.

95
00:07:28,520 --> 00:07:31,650
And number three is shallow learning.

96
00:07:31,670 --> 00:07:35,780
In the case of arrests in the case of other AI it's usually deep learning.

97
00:07:35,780 --> 00:07:43,270
So if we have a look as we recall this was the percept drawn for a dress it might have a lot of inputs.

98
00:07:43,280 --> 00:07:48,080
I have a lot of outputs but the main point is that there's just one the input layer which is Connect

99
00:07:48,080 --> 00:07:54,860
the right way to the output there they're an AI that is based on deep learning usually have a at least

100
00:07:54,860 --> 00:08:02,900
a hidden layer like this at least one hidden layer where these inputs are entered combined and then

101
00:08:02,900 --> 00:08:04,090
they're going to the output.

102
00:08:04,280 --> 00:08:10,550
In fact more often than not you will have multiple hidden layers and therefore it is actually proper

103
00:08:10,550 --> 00:08:17,470
deep learning that is used because as you can see the neural network is very deep in this case.

104
00:08:17,930 --> 00:08:18,850
And so that's the difference.

105
00:08:18,860 --> 00:08:24,410
And obviously that takes more attention more training as more weights that need to adjusted and so on

106
00:08:25,250 --> 00:08:26,570
and so there we go.

107
00:08:26,570 --> 00:08:32,710
We've got arrest exploration the poses space murtherer of finite differences in shallow learning.

108
00:08:32,770 --> 00:08:35,570
Either way I is explosion in action space.

109
00:08:35,570 --> 00:08:42,710
Gradient descent algorithm and back propagation and three usually deportee.

110
00:08:43,040 --> 00:08:45,240
And so what does this sum up to.

111
00:08:45,440 --> 00:08:53,570
Well even no arrests in many ways is kind of like less involved then other AI as we just discussed.

112
00:08:53,960 --> 00:08:56,940
As a result Erris is still much stronger.

113
00:08:56,940 --> 00:09:03,590
It's up to 15 times faster and at the same time yields higher rewards on specific applications.

114
00:09:03,590 --> 00:09:12,380
And I put in specific applications here because airiest has been a critique that is like it finds real

115
00:09:12,380 --> 00:09:13,910
inefficiencies for instance or

116
00:09:16,700 --> 00:09:23,450
inaccuracies in the magical algorithm and it like exploits them and so therefore somebody someone might

117
00:09:23,450 --> 00:09:29,320
argue that deep AI is based on deep learning even though it's slower might be more versatile and be

118
00:09:29,320 --> 00:09:32,880
more applicable and smart in different ways.

119
00:09:32,990 --> 00:09:37,900
But at the end of the day if you have a specific application and you need to get it sorted out and you

120
00:09:37,900 --> 00:09:45,410
need to solve the problem why does it matter if you know what what's how you get to the end result like

121
00:09:45,410 --> 00:09:51,710
you know as long as you are getting the result that you want then and you know the problem is solved

122
00:09:51,770 --> 00:10:00,440
then you have a good outcome and so therefore a arrest can be a great solution in many different applications

123
00:10:00,470 --> 00:10:06,790
just something to keep in mind that you might not be as versatile but that is still yet to discover.

124
00:10:06,790 --> 00:10:10,430
You need to discover that all the different specific application that you're using.

125
00:10:11,090 --> 00:10:13,870
Overall it's been showing some fantastic results.

126
00:10:13,880 --> 00:10:19,940
And on these benchmark tests such as Mujer coaches beating other artificial intelligence algorithms

127
00:10:19,970 --> 00:10:27,830
that exist out there by it's 15 times faster and still gets higher rewards and diminish up today's tutorial

128
00:10:27,860 --> 00:10:29,790
we've got some additional reading.

129
00:10:30,150 --> 00:10:35,350
This paper is called the Bush strategies as a scalable alternative to reinforcement learning by Tim

130
00:10:35,450 --> 00:10:42,950
Salomon's and others published via open AI in 2017.

131
00:10:42,950 --> 00:10:47,920
The link is on the course notes and papers are archived.

132
00:10:48,110 --> 00:10:56,420
So basically this paper talks about ebullitions strategies which is another term for random search.

133
00:10:56,420 --> 00:10:59,930
It's exactly the same thing just different terminology.

134
00:11:00,080 --> 00:11:09,980
And this paper came before the Erris paper and there they kind of compared evolution strategies or random

135
00:11:09,980 --> 00:11:15,970
search to other reinforcement learning algorithms are the artificial intelligence.

136
00:11:15,980 --> 00:11:19,690
So it's in line with the topic of today's tutorial.

137
00:11:19,700 --> 00:11:27,320
If you'd like to have a look and this paper is also referenced in the press paper that we touched on

138
00:11:27,320 --> 00:11:28,150
before.

139
00:11:28,280 --> 00:11:30,310
On that note I hope you enjoyed today's tutorial.

140
00:11:30,350 --> 00:11:32,170
I was forcing him back here next time.

141
00:11:32,250 --> 00:11:34,350
And until then enjoy AI.