1 00:00:00,330 --> 00:00:05,400 Hello and welcome back to the course on Augmented Random Search. In today's tutorial we're going to 2 00:00:05,400 --> 00:00:08,920 be comparing ARS versus other AI algorithms. 3 00:00:09,090 --> 00:00:15,600 And the reason why we included this tutorial in the course is to highlight the main differences between 4 00:00:15,600 --> 00:00:20,930 ARS and other standard or more conventional AI algorithms. 5 00:00:20,940 --> 00:00:25,590 It's pretty hard to say conventional because all of this is cutting-edge technology and cutting-edge 6 00:00:26,670 --> 00:00:32,580 models and algorithms, but nevertheless we want to contrast ARS, which is brand new, and how it is different 7 00:00:32,970 --> 00:00:37,820 from other algorithms that exist out there that we've seen before. 8 00:00:37,920 --> 00:00:44,130 And hopefully this will give you a good overview of how they contrast, and you'll be more prepared 9 00:00:44,130 --> 00:00:45,150 for these conversations, 10 00:00:45,150 --> 00:00:49,800 or know what the advantages of ARS are and where it differs. 11 00:00:50,100 --> 00:00:55,080 So let's have a look. On the left we're going to have a column for the features of ARS; there are going to be three 12 00:00:55,080 --> 00:00:57,550 main distinctions between the two. 13 00:00:57,650 --> 00:01:00,880 This isn't the full list, but those are the ones that we're going to cover, 14 00:01:01,110 --> 00:01:03,610 and those are the ones we find the most important. 15 00:01:03,690 --> 00:01:05,810 And on the right we're going to have other AI. 16 00:01:06,180 --> 00:01:13,500 So number one is exploration. ARS performs exploration in the policy space, whereas other 17 00:01:13,500 --> 00:01:17,000 AI usually performs exploration in action space. 18 00:01:17,250 --> 00:01:21,490 So let's have a look at what this actually means. In ARS, 19 00:01:21,510 --> 00:01:23,200 we've already discussed this.
20 00:01:23,220 --> 00:01:30,750 We've got this perceptron, and then we wait until the agent gets to the end of the episode, and then 21 00:01:30,750 --> 00:01:38,400 after, based on the result, whether they successfully got to the end and won or they fell over at some 22 00:01:38,400 --> 00:01:43,390 point, we get a reward which we use to adjust the weights. 23 00:01:43,590 --> 00:01:47,220 Well, the thing is, environments are actually set up, 24 00:01:47,370 --> 00:01:52,860 for instance this MuJoCo environment is set up, independent of the algorithm that's going to be applied. 25 00:01:52,860 --> 00:01:58,140 So there's this environment and you can apply ARS to it, you can apply a different AI or any other 26 00:01:58,140 --> 00:01:59,400 algorithm to it. 27 00:01:59,520 --> 00:02:04,710 And the thing is that the environment is set up to provide rewards not just at the end 28 00:02:04,710 --> 00:02:07,450 of the episode but after every single action. 29 00:02:07,450 --> 00:02:09,360 That's just how the environment works. 30 00:02:09,360 --> 00:02:14,040 Every time the agent does something, the environment provides a reward. 31 00:02:14,040 --> 00:02:19,110 For instance, if it's getting closer to the target then it's getting a positive reward; if it's falling 32 00:02:19,140 --> 00:02:23,060 over or if it's getting further from the target it might be getting a negative reward. 33 00:02:23,070 --> 00:02:27,370 In any case there's a reward that's provided after every single action. 34 00:02:27,500 --> 00:02:35,430 What ARS does, though, is it doesn't take advantage of the opportunity to look at the reward after every 35 00:02:35,430 --> 00:02:42,650 single action. ARS accumulates the total reward and then looks at the total reward after the episode.
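To make the point about per-action rewards concrete, here is a minimal sketch in Python. The `ToyEnv` class, its `reset`/`step` methods, and `run_episode` are hypothetical names made up for this illustration; they are not the MuJoCo environment or the course code. The environment hands back a reward after every action, and an ARS-style rollout simply adds those rewards up and only reports the total at the end of the episode.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the agent starts at 0 and should reach `target`."""

    def __init__(self, target=5, max_steps=20):
        self.target = target
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # The environment provides a reward after every single action:
        # +1 for getting closer to the target, -1 otherwise.
        old_distance = abs(self.target - self.position)
        self.position += action
        self.steps += 1
        new_distance = abs(self.target - self.position)
        reward = 1.0 if new_distance < old_distance else -1.0
        done = self.position == self.target or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy):
    """ARS-style rollout: ignore individual rewards, return only their sum."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        state, reward, done = env.step(policy(state))
        total_reward += reward
    return total_reward


print(run_episode(ToyEnv(), lambda state: 1))   # always step right: 5.0
print(run_episode(ToyEnv(), lambda state: -1))  # always step left: -20.0
```

The per-step rewards are still there inside `step`; the rollout just chooses not to inspect them one by one, which is the behaviour described above.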
36 00:02:42,780 --> 00:02:50,460 That's important to understand and remember: it could be looking at 37 00:02:50,550 --> 00:02:52,530 the reward after every action, but it's not. 38 00:02:52,530 --> 00:03:00,390 So it's actually not using all of the possibilities that the environment provides, and yet it's still 39 00:03:00,390 --> 00:03:02,850 stronger than other AIs. Surprising. 40 00:03:03,690 --> 00:03:11,570 And this is kind of in contrast to how other AI usually works. And the way they work is, these are slides 41 00:03:11,570 --> 00:03:13,000 from the Artificial Intelligence 42 00:03:13,010 --> 00:03:16,620 A-Z course; if you've been part of that course then you'll remember them. 43 00:03:16,800 --> 00:03:22,650 Normally AI performs actions and gets rewards, so it goes into a new state and gets a reward right away. 44 00:03:22,650 --> 00:03:27,060 So it performs an action, gets a reward, performs an action, gets a reward, and it analyzes the reward. 45 00:03:27,120 --> 00:03:33,890 In contrast to ARS, normally the AI, reinforcement learning and deep Q-learning, things like that, 46 00:03:33,990 --> 00:03:38,060 they analyze the reward right away. As soon as the action is performed they get the reward. 47 00:03:38,070 --> 00:03:44,400 They analyze the reward, and what this allows them to do is apply the Bellman equation and build a value 48 00:03:44,400 --> 00:03:45,730 function for states. 49 00:03:45,750 --> 00:03:52,080 So then they know the values of different states, or of being in different states, and therefore 50 00:03:52,110 --> 00:03:57,930 after that they can use those value functions in order to make their decisions and create the maps of how 51 00:03:57,930 --> 00:04:00,470 they're going to get through these environments. 52 00:04:00,960 --> 00:04:08,700 So that's the main difference. In ARS we focus on the total reward for the whole episode.
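As a rough illustration of what "analyze the reward right away and build a value function" can look like, here is a small Python sketch of a tabular TD(0) update, which is one simple way of applying the Bellman equation after every action. This is an assumption chosen for illustration; the tutorial does not say which exact update the other algorithms use, and the function name `td0_update` and the values of `alpha` and `gamma` are made up here.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Update the value of `state` right after one action, using the Bellman target.

    values: dict mapping state -> estimated value (missing states count as 0.0)
    alpha:  learning rate (illustrative value)
    gamma:  discount factor (illustrative value)
    """
    # Bellman target: immediate reward plus discounted value of the next state.
    target = reward + gamma * values.get(next_state, 0.0)
    old = values.get(state, 0.0)
    # Move the current estimate a small step towards the target.
    values[state] = old + alpha * (target - old)
    return values[state]


values = {}
# One action taken in state 0, reward 1.0 received, agent lands in state 1.
print(td0_update(values, state=0, reward=1.0, next_state=1))
```

The contrast with ARS is that this update runs once per action and needs the individual reward, whereas ARS only ever sees the sum of rewards for a whole episode.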
53 00:04:08,970 --> 00:04:13,710 And basically, therefore, we're not exploring every single action. 54 00:04:13,710 --> 00:04:15,390 We're exploring the whole policy, 55 00:04:15,390 --> 00:04:18,950 so the whole approach of how to go through an episode. 56 00:04:18,960 --> 00:04:25,530 So we're assessing the whole episode, and therefore we're exploring 57 00:04:25,530 --> 00:04:32,760 in the policy space, whereas in other AI normally we're exploring actions and therefore we're exploring 58 00:04:32,760 --> 00:04:33,770 in action space. 59 00:04:33,840 --> 00:04:36,800 So that's what we have: exploration in the policy space, 60 00:04:36,800 --> 00:04:41,370 you know, from start to finish for the whole episode, versus exploration in action space, where we explore 61 00:04:41,370 --> 00:04:46,070 every individual action. And that's the difference. In a way, 62 00:04:46,070 --> 00:04:51,390 if you've taken the Artificial Intelligence A-Z course, you'll remember that we had a tutorial on eligibility 63 00:04:51,390 --> 00:04:59,550 trace in there for AI, and that's kind of similar. What ARS does is kind of similar 64 00:04:59,580 --> 00:05:02,260 to eligibility trace, where in eligibility trace 65 00:05:02,270 --> 00:05:09,140 we didn't look at every single action but we batched up the actions, so we had five or 10 or 15 66 00:05:09,140 --> 00:05:14,570 actions in a row and then we only looked at the reward. And we remember from there that that actually 67 00:05:14,570 --> 00:05:17,540 improved the performance of other AI. 68 00:05:17,580 --> 00:05:23,420 We won't go into detail on this, but just a hint that if you are aware of eligibility trace, ARS 69 00:05:23,420 --> 00:05:29,200 is kind of similar thinking, in a similar fashion to eligibility trace. 70 00:05:29,390 --> 00:05:29,960 All right. 71 00:05:29,960 --> 00:05:32,700 Next up, number two.
72 00:05:32,720 --> 00:05:40,620 How is the updating of weights performed? In ARS it's the method of finite differences; in other AI it's 73 00:05:40,820 --> 00:05:46,900 the famous gradient descent algorithm. 74 00:05:46,940 --> 00:05:56,070 So we talked about the method of finite differences here, where we basically have a small 75 00:05:56,090 --> 00:06:02,690 positive shift and a negative shift, or a positive delta and a negative delta, we have the rewards, and based on 76 00:06:02,690 --> 00:06:06,660 that we can calculate how to adjust our weights. 77 00:06:06,710 --> 00:06:15,550 Overall the difference is that in other AI we don't use this method; we actually use gradient 78 00:06:15,590 --> 00:06:19,200 descent, which is proper differentiation 79 00:06:19,290 --> 00:06:22,560 of the loss that you're getting. 80 00:06:22,700 --> 00:06:27,940 So in ARS you get a reward; normally in AI you get a loss. 81 00:06:28,250 --> 00:06:36,140 So it's a proper differentiation of the loss with respect to the weights, going backwards through 82 00:06:36,140 --> 00:06:36,920 the network. 83 00:06:37,490 --> 00:06:46,820 So that is what we do normally in AI approaches, also known as backpropagation of the error through 84 00:06:46,820 --> 00:06:48,680 the network to update the weights. 85 00:06:48,680 --> 00:06:55,580 The reason why we can't do that in ARS, why we have to use the method of finite differences, is because 86 00:06:55,580 --> 00:06:58,160 we simply don't have a value function. 87 00:06:58,160 --> 00:07:04,130 We just don't. Because we're not exploring in action space, we don't have that value function that we 88 00:07:04,130 --> 00:07:10,010 can then apply this method to, and therefore we have to deal with what we have. 89 00:07:10,010 --> 00:07:14,160 And that's why the method of finite differences is used. 90 00:07:14,160 --> 00:07:14,730 All right.
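The positive-delta / negative-delta update described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the course implementation: `reward` is a made-up stand-in for "total reward of one episode played with these weights", the names and the values of `n_directions`, `noise` and `step_size` are chosen for the example, and the extra scaling by the standard deviation of the rewards that the ARS paper applies is left out.

```python
import random


def reward(theta, target=(1.0, -2.0, 0.5)):
    # Hypothetical stand-in for an episode's total reward: higher is better,
    # and the maximum (0.0) is reached when theta equals target.
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def finite_difference_step(theta, rng, n_directions=8, noise=0.1, step_size=0.5):
    """One weight update using rewards from positive and negative perturbations."""
    update = [0.0] * len(theta)
    for _ in range(n_directions):
        # Random direction delta, tried in both the positive and negative sense.
        delta = [rng.gauss(0.0, 1.0) for _ in theta]
        r_pos = reward([t + noise * d for t, d in zip(theta, delta)])
        r_neg = reward([t - noise * d for t, d in zip(theta, delta)])
        # Directions where the positive shift did better get a positive push.
        for i, d in enumerate(delta):
            update[i] += (r_pos - r_neg) * d
    return [t + step_size / n_directions * u for t, u in zip(theta, update)]


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
start = reward(theta)
for _ in range(100):
    theta = finite_difference_step(theta, rng)
print(start, reward(theta))  # the second number should be much closer to 0
```

Note that nothing here differentiates a loss or walks backwards through a network: the only information used is the pair of rewards for each perturbed set of weights.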
91 00:07:14,810 --> 00:07:17,650 And it works totally fine, it works really great. 92 00:07:17,750 --> 00:07:25,100 In fact, in the main research paper for ARS they explain that this is a good enough approximation 93 00:07:25,160 --> 00:07:27,360 of the gradient. 94 00:07:27,830 --> 00:07:28,520 All right. 95 00:07:28,520 --> 00:07:31,650 And number three is shallow learning 96 00:07:31,670 --> 00:07:35,780 in the case of ARS; in the case of other AI it's usually deep learning. 97 00:07:35,780 --> 00:07:43,270 So if we have a look, as we recall, this was the perceptron for ARS. It might have a lot of inputs, 98 00:07:43,280 --> 00:07:48,080 it might have a lot of outputs, but the main point is that there's just the input layer, which is connected 99 00:07:48,080 --> 00:07:54,860 right away to the output layer. AIs that are based on deep learning usually have at least 100 00:07:54,860 --> 00:08:02,900 a hidden layer like this, at least one hidden layer, where these inputs are entered, combined, and then 101 00:08:02,900 --> 00:08:04,090 they go to the output. 102 00:08:04,280 --> 00:08:10,550 In fact, more often than not you will have multiple hidden layers, and therefore it is actually proper 103 00:08:10,550 --> 00:08:17,470 deep learning that is used because, as you can see, the neural network is very deep in this case. 104 00:08:17,930 --> 00:08:18,850 And so that's the difference. 105 00:08:18,860 --> 00:08:24,410 And obviously that takes more attention, more training, as there are more weights that need to be adjusted, and so on. 106 00:08:25,250 --> 00:08:26,570 And so there we go. 107 00:08:26,570 --> 00:08:32,710 We've got ARS: exploration in the policy space, method of finite differences, and shallow learning. 108 00:08:32,770 --> 00:08:35,570 Other AI is exploration in action space, 109 00:08:35,570 --> 00:08:42,710 gradient descent algorithm and backpropagation, and three, usually deep learning.
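To show the shallow versus deep distinction in code, here is a small Python sketch. The function names, the example weights, and the choice of `tanh` as the hidden activation are assumptions made for this illustration (bias terms are left out for brevity). The shallow policy connects inputs straight to outputs; the deeper one passes them through one hidden layer first and therefore has more weights to adjust.

```python
import math


def linear_policy(weights, inputs):
    # Shallow case: each output is a weighted sum of the inputs.
    # weights has one row per output and one column per input.
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def one_hidden_layer_policy(w_hidden, w_out, inputs):
    # Deeper case: inputs are combined in a hidden layer, squashed by a
    # nonlinearity, and only then mapped to the outputs.
    hidden = [math.tanh(h) for h in linear_policy(w_hidden, inputs)]
    return linear_policy(w_out, hidden)


def count_weights(*layers):
    return sum(len(row) for layer in layers for row in layer)


inputs = [1.0, 2.0]
shallow = [[0.5, -1.0]]                         # 2 inputs -> 1 output
w_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 inputs -> 3 hidden units
w_out = [[1.0, 1.0, 1.0]]                        # 3 hidden units -> 1 output

print(linear_policy(shallow, inputs))            # [-1.5]
print(one_hidden_layer_policy(w_hidden, w_out, inputs))
print(count_weights(shallow), count_weights(w_hidden, w_out))  # 2 9
```

Even in this tiny example the network with a hidden layer has several times as many weights, which is the extra training cost mentioned above.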
110 00:08:43,040 --> 00:08:45,240 And so what does this sum up to? 111 00:08:45,440 --> 00:08:53,570 Well, even though ARS in many ways is kind of less involved than other AI, as we just discussed, 112 00:08:53,960 --> 00:08:56,940 as a result ARS is still much stronger. 113 00:08:56,940 --> 00:09:03,590 It's up to 15 times faster and at the same time yields higher rewards on specific applications. 114 00:09:03,590 --> 00:09:12,380 And I put "on specific applications" here because ARS has had a critique that it finds real 115 00:09:12,380 --> 00:09:13,910 inefficiencies, for instance, or 116 00:09:16,700 --> 00:09:23,450 inaccuracies in the MuJoCo algorithm and it exploits them, and so therefore someone might 117 00:09:23,450 --> 00:09:29,320 argue that AI based on deep learning, even though it's slower, might be more versatile and be 118 00:09:29,320 --> 00:09:32,880 more applicable and smart in different ways. 119 00:09:32,990 --> 00:09:37,900 But at the end of the day, if you have a specific application and you need to get it sorted out and you 120 00:09:37,900 --> 00:09:45,410 need to solve the problem, why does it matter how you get to the end result? 121 00:09:45,410 --> 00:09:51,710 You know, as long as you are getting the result that you want and the problem is solved, 122 00:09:51,770 --> 00:10:00,440 then you have a good outcome, and so therefore ARS can be a great solution in many different applications. 123 00:10:00,470 --> 00:10:06,790 Just something to keep in mind is that it might not be as versatile, but that is still yet to be discovered. 124 00:10:06,790 --> 00:10:10,430 You need to discover that for the specific application that you're using. 125 00:10:11,090 --> 00:10:13,870 Overall it's been showing some fantastic results.
126 00:10:13,880 --> 00:10:19,940 And on these benchmark tests such as MuJoCo it's beating other artificial intelligence algorithms 127 00:10:19,970 --> 00:10:27,830 that exist out there: it's up to 15 times faster and still gets higher rewards. And to finish up today's tutorial 128 00:10:27,860 --> 00:10:29,790 we've got some additional reading. 129 00:10:30,150 --> 00:10:35,350 This paper is called Evolution Strategies as a Scalable Alternative to Reinforcement Learning, by Tim 130 00:10:35,450 --> 00:10:42,950 Salimans and others, published via OpenAI in 2017. 131 00:10:42,950 --> 00:10:47,920 The link is in the course notes and the paper is on arXiv. 132 00:10:48,110 --> 00:10:56,420 So basically this paper talks about evolution strategies, which is another term for random search. 133 00:10:56,420 --> 00:10:59,930 It's exactly the same thing, just different terminology. 134 00:11:00,080 --> 00:11:09,980 And this paper came before the ARS paper, and there they kind of compared evolution strategies, or random 135 00:11:09,980 --> 00:11:15,970 search, to other reinforcement learning algorithms or artificial intelligence. 136 00:11:15,980 --> 00:11:19,690 So it's in line with the topic of today's tutorial. 137 00:11:19,700 --> 00:11:27,320 If you'd like to have a look, this paper is also referenced in the ARS paper that we touched on 138 00:11:27,320 --> 00:11:28,150 before. 139 00:11:28,280 --> 00:11:30,310 On that note, I hope you enjoyed today's tutorial. 140 00:11:30,350 --> 00:11:32,170 I look forward to seeing you back here next time. 141 00:11:32,250 --> 00:11:34,350 And until then, enjoy AI.