
Step 1 – MDP in natural language

Step 1 of any artificial intelligence problem is to transpose it into something you know in your everyday life (work or personal), something you are a subject matter expert (SME) in. If you have a driver's license, then you are an SME of driving; you are certified. If you do not have a driver's license or never drive, you can easily replace moving around in a car with moving around on foot.

Let's say you are an e-commerce business driver delivering a package in an area you do not know. You are the operator of a self-driving vehicle. You have a GPS system with a beautiful color map on it. The areas around you are represented by the letters A to F, as shown in the simplified map in the following diagram. You are presently at F. Your goal is to reach area C. You are happy, listening to the radio. Everything is going smoothly, and it looks like you are going to be there on time. The following graph represents the locations and routes that you can possibly cover.

The guiding system's state indicates the complete path to reach C. It is telling you that you are going to go from F to B to D and then to C. It looks good!
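As a minimal sketch, the map could be held in code as a simple set of routes. Only the connections explicitly mentioned in the text are included below; the book's diagram may contain more, and the variable names are illustrative assumptions:

```python
# Hypothetical sketch of the delivery map: each key is an area,
# each value lists the areas directly reachable from it.
# Only routes mentioned in the text are included here.
routes = {
    'A': [],
    'B': ['D'],        # part of the proposed path F -> B -> D -> C
    'C': [],           # the goal area
    'D': ['C', 'E'],   # from D, you can head to C (the goal) or E (the taxi option)
    'E': [],
    'F': ['B'],        # you are presently at F
}

# The complete path proposed by the guiding system
proposed_path = ['F', 'B', 'D', 'C']
```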

To break things down further, let's say:

  • The present state is the letter s.
  • Your next action is the letter a (action). Note that this action a is not location A.
  • The next action a (not location A) is to go to location B. You look at your guiding system; it tells you there is no traffic, and that going from your present state F to your next state B will take you only a few minutes. Let's say that the next state, location B, is represented by the letter s'.

At this point, you are still quite happy, and we can sum up your situation with the following sequence of events: s → a → s'.

The letter s is your present state, your present situation. The letter a is the action you're deciding, which is to go to the next area; there you will be in another state, s'. We can say that thanks to the action a, you will go from s to s'.
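As a minimal sketch, this sequence can be written as a single transition in code; the variable names are assumptions for illustration only:

```python
# One step of the process: present state s, chosen action a, next state s'.
s = 'F'             # present state: area F
a = 'go to B'       # the action you decide to take
s_prime = 'B'       # the state you end up in after taking action a

transition = (s, a, s_prime)
print(transition)   # ('F', 'go to B', 'B')
```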

Now, imagine that the driver is not you anymore. You are tired for some reason. That is when a self-driving vehicle comes in handy. You set your car to autopilot. Now you are not driving anymore; the system is. Let's call that system the agent. At point F, you set your car to autopilot and let the self-driving agent take over.

The agent now sees what you have asked it to do and checks its mapping environment, which represents all the areas in the previous diagram from A to F.

In the meantime, you are rightly worried. Is the agent going to make it or not? You are wondering whether its strategy meets yours. You have your policy P, your way of thinking, which is to take the shortest paths possible. Will the agent agree? What's going on in its mind? You observe and begin to realize things you never noticed before. Since this is the first time you are using this car and guiding system, the agent is memoryless, which is a feature of a Markov Decision Process (MDP). This means the agent just doesn't know anything about what went on before. It seems to be happy with just calculating from this state s at area F. It will use machine power to run as many calculations as necessary to reach its goal.
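One way to picture this memoryless behavior is a decision function that looks only at the current state and never at the history of areas already visited. This is an illustrative sketch under that assumption, not the agent's actual code:

```python
# Hypothetical sketch of a memoryless decision: the choice depends only on
# the current state s, not on how the vehicle got there.
def choose_next_area(s, routes):
    """Pick the next area using nothing but the current state s."""
    options = routes.get(s, [])
    return options[0] if options else None

routes = {'F': ['B'], 'B': ['D'], 'D': ['C', 'E']}
print(choose_next_area('F', routes))  # 'B' -- no memory of earlier areas needed
```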

Another thing you are watching is the total distance from F to C to check whether things are OK. That means that the agent is calculating all the states from F to C.

In this case, state F is state 1, which we can simplify by writing s1; B is state 2 (s2), D is s3, and C is s4. The agent is calculating all of these possible states to make a decision.
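In code, this renumbering is just a mapping from areas to state numbers; a small sketch with assumed names:

```python
# Hypothetical mapping of the areas on the F -> B -> D -> C route
# to the state numbers s1, s2, s3, s4 used in the text.
state_number = {'F': 1, 'B': 2, 'D': 3, 'C': 4}
print(state_number['F'])  # 1, that is, s1
```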

The agent knows that when it reaches D, C will be better because the reward for going to C will be higher than for going anywhere else. Since it cannot eat a piece of cake to reward itself, the agent uses numbers. Our agent is a real number cruncher. When it is wrong, it gets a poor reward or nothing in this model. When it is right, it gets a reward represented by the letter R. This action-value (reward) transition, often named the Q function, is the core of many reinforcement learning algorithms.

When our agent goes from one state to another, it performs a transition and gets a reward. For example, the transition can be from F to B, state 1 to state 2, or s1 to s2.
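A minimal sketch of how such rewards could be stored is a small reward table indexed by (state, next state). The values below are assumptions for illustration (a small reward for an allowed move, a large one for reaching the goal C); they are not the book's actual matrix:

```python
import numpy as np

# Areas in a fixed order so each one gets an index.
areas = ['A', 'B', 'C', 'D', 'E', 'F']
ix = {area: i for i, area in enumerate(areas)}

# Hypothetical reward table R: R[state, next_state].
# Only transitions mentioned in the text are given a reward here.
R = np.zeros((6, 6))
R[ix['F'], ix['B']] = 1     # F -> B (s1 -> s2): allowed move, small reward
R[ix['B'], ix['D']] = 1     # B -> D (s2 -> s3): allowed move, small reward
R[ix['D'], ix['E']] = 1     # D -> E: the taxi option
R[ix['D'], ix['C']] = 100   # D -> C (s3 -> s4): reaching the goal, high reward

print(R[ix['F'], ix['B']])  # reward for the transition F -> B
```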

You are feeling great and are going to be on time. You are beginning to understand how the machine learning agent in your self-driving car is thinking. Suddenly your guiding system breaks down. All you can see on the screen is a static image of the areas from the last calculation. You look up and see that a traffic jam is building up. Area D is still far away, and now you do not know whether it would be better to go from D to C, or from D to E to get a taxi that can take special lanes. You are going to need your agent!

The agent takes the traffic jam into account, is stubborn, and increases its reward to get to C by the shortest way. Its policy is to stick to the initial plan. You do not agree. You have another policy.

You stop the car. You both have to agree before continuing. You have your opinion and policy; the agent does not agree. Before continuing, your views need to converge. Convergence is the key to making sure that your calculations are correct. This is the kind of problem that people delivering parcels, and soon self-driving vehicles (not to speak of drone traffic jams), encounter all day long to get the workload done. The number of parcels to deliver per hour is an example of the workload that needs to be taken into account when making a decision.

The best way to represent the problem at this point is to express the whole process mathematically.
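As a preview of that mathematical form, the action-value (Q) function mentioned earlier is commonly written as a Bellman-style recursion, where γ is a discount factor. The exact formulation used for this example follows in the next step, so the line below is only a standard sketch:

$$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$$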