Congestion Control on Conveyor Lines with Deep Reinforcement Learning and Bayesian Optimization


TAKAHASHI Kengo, SHIKAYAMA Hiroyuki

TAKAHASHI Kengo : Electrical & Control Design Group, Project Department, Logistics System Business Unit, IHI Logistics & Machinery Corporation
SHIKAYAMA Hiroyuki : Manager, Electrical & Control Design Group, Project Department, Logistics System Business Unit, IHI Logistics & Machinery Corporation

The characteristics of congestion control on conveyor lines make the problem difficult to handle with classical control theories. In this study, we addressed it by combining deep reinforcement learning with Bayesian optimization, a method for optimizing parameters. The agent trained with our method successfully controlled congestion on the conveyor line and outperformed classical PI control. This method, which is less dependent on the designer, is expected to provide customers with added value such as reduced person-hours and lead time and improved energy efficiency of their equipment.

1. Introduction

Classical control theories, which were established in the 1950s, are still a key approach to operating industrial equipment today. PID (Proportional-Integral-Derivative) control is one of the most commonly used types of feedback control among classical control theories. It determines the input value from the difference between the current output value and the target value, the time integral of that difference, and its time derivative. This method is easy to handle because its parameters have clear meanings, but determining appropriate input values requires the control designer to proceed by trial and error or to rely on experience and intuition to deepen their understanding of the problem. In addition, PID control is difficult to apply to certain types of problems.
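
For reference, the PID law described above is usually written in the following textbook form. The symbols e(t), r(t), y(t), u(t), K_p, K_i, and K_d are notation introduced here for illustration and are not taken from the original text.

```latex
e(t) = r(t) - y(t), \qquad
u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{d e(t)}{dt}
```

Here r(t) is the target value, y(t) is the current output value, and u(t) is the input value. The three gains K_p, K_i, and K_d are the parameters that the designer has to tune.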

One such problem is workpiece congestion control on conveyor lines in logistics systems. Congestion on conveyor lines causes an event called a “drop,” which hinders the loading of new workpieces (for details, see Section 2.1). Drops should be avoided mainly because they reduce transportation efficiency, but they cannot easily be prevented by treating the drops themselves as the controlled quantity. This is because, for example, when follow-up control is used, the control acts to prevent drops only after a drop has occurred, which means that drops cannot be avoided in principle. Therefore, in order to avoid congestion on conveyor lines, which is the factor that causes drops, it is necessary to control how workpieces are distributed. With classical control theories, however, it is difficult to handle such a distribution directly. For example, the previously mentioned PID control needs the difference between the current output value and the target value, but a difference between distributions cannot easily be defined. In addition, the target distribution itself is not always known in advance.
In this study, we worked on optimizing conveyor line control with little human intervention by combining deep reinforcement learning, which applies deep learning to reinforcement learning, with Bayesian optimization, an optimization method. Neural networks, which are used in deep reinforcement learning, can handle the distribution of workpieces on conveyor lines directly and, when combined with Bayesian optimization, enable the creation of control logic that is less dependent on designers.

IHI Logistics & Machinery Corporation has been engaged in developments that contribute to streamlining, automation, and labor savings of customers’ equipment, including the automation of piece picking and assorting work with robots and deployment of image recognition AI (Artificial Intelligence) for depalletizing systems. This study constitutes part of such development and is aimed at offering customers added value by taking advantage of the features of deep reinforcement learning, including reducing person-hours and lead time and operating equipment with higher energy efficiency than before.

2. Implementation method

2.1 Conveyor line model

Figure 1 shows a conveyor line model and an example of workpiece transportation. In this study, the conveyor line shown in Fig. 1-(a) is configured in simulation. The squares (units) arranged in one line indicate the stop positions of individual workpieces, and the distance between the centers of neighboring units is 1 m. Workpieces are supplied, one by one, into the loading port at certain intervals T (s), conveyed from one unit to another toward the downstream side, and unloaded by a robot at the most downstream position. Multiple workpieces cannot be put in one unit at the same time. Colored units 4 and 12 start counting 60 s each time L₄ and L₁₂ workpieces, respectively, have been conveyed. When a unit becomes empty after the 60 s have been counted, it transitions to the maintenance state for a time length of M₄ or M₁₂, respectively. No workpieces are conveyed to a unit that has transitioned to the maintenance state.

Figure 1-(b) shows an example of the time history of workpiece transportation. As shown in the figure, once a unit transitions to the maintenance state, the transportation of workpieces stops on the upstream side of the unit, causing congestion. If the congestion reaches the most upstream position, the unit at the loading port is occupied, and no new workpieces can be supplied. In this study, such an event is referred to as a “drop.”
The time history of workpiece transportation shown in Fig. 1-(b) is plotted in two dimensions as shown in Fig. 1-(c). The horizontal axis indicates the unit number on the conveyor line and the vertical axis indicates the time flow from top to bottom.

Each unit can be instructed to operate at a speed v (m/s) of 0 to v_max. As shown in Equation (1), the time t_f (s), that is, the time from when a unit receives a workpiece to when the workpiece is conveyed to the next unit on the downstream side, is determined by v and the acceleration a (a > 0) (m/s²) specified for each unit.

In the model used in this study, the conveyor line is roughly divided into three control blocks (Fig. 1-(a)), and the units belonging to the same block are instructed to operate at the same speed. This means that only three different instruction speeds are necessary to control all the control blocks.

The simplest control measure to prevent drops is to operate all the units at the maximum transportation speed. In this case, however, the units operate at the maximum speed even when there is no congestion on the conveyor line. Transportation at speeds higher than necessary wastes energy and risks damaging workpieces. Therefore, this study aims to minimize drops on the conveyor line while reducing the transportation speed.

2.2 Deep reinforcement learning

2.2.1 Overview

Consider an agent placed in a certain environment. The agent determines its action based on the environmental state, and the environment gives the agent a value called a reward according to the result of the action. Reinforcement learning is a machine learning method that, within this framework, seeks the actions the agent should take to maximize the total reward (return).
Q-learning is a representative algorithm for reinforcement learning. The purpose of Q-learning is to obtain the expected return (when the best action is taken) for all combinations of environmental states and agent actions. This procedure is equivalent to creating a table of expected values whose columns and rows correspond to environmental states and agent actions, respectively. Once such a table is obtained, each time a state is given to the model, the best action can be found by tracing the column corresponding to that state and selecting the action with the highest expected value.
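
The table-based procedure can be sketched as follows. This is a minimal illustration of the standard Q-learning update, not code from this study. The learning rate `alpha`, the discount factor `gamma`, and the state names are assumptions chosen for the example, and the table is stored here as one list of action values per state.

```python
from collections import defaultdict


def make_q_table(n_actions):
    """Table of expected returns: one list of action values per state."""
    return defaultdict(lambda: [0.0] * n_actions)


def greedy_action(q, state):
    """Select the action with the highest expected return in this state."""
    values = q[state]
    return max(range(len(values)), key=values.__getitem__)


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


# Hypothetical two-action problem with a single observed transition.
q = make_q_table(n_actions=2)
q_update(q, state="s0", action=1, reward=1.0, next_state="s1")
print(q["s0"], greedy_action(q, "s0"))  # [0.0, 0.5] 1
```

Every distinct state needs its own entry in `q`, which is why the table grows with the number of states.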

However, it is difficult to apply this method to problems having many environmental states and actions to select from. This is because handling such a problem requires a table with many columns and rows, and too large a table cannot be stored in the memory space of a computer (1). In the case of the game of Go, for example, there are said to be nearly 10^172 possible states on the board. Even if one board state could be represented by one byte, a memory space of 10^160 TB would be required to create one column in the table. In addition, Q-learning cannot be applied to problems whose states and actions are represented by continuous values.
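
The memory figure for Go follows from a one-line unit conversion, assuming the decimal definition 1 TB = 10^12 bytes:

```latex
10^{172}\ \text{states} \times 1\ \text{byte/state}
  = 10^{172}\ \text{bytes}
  = \frac{10^{172}}{10^{12}}\ \text{TB}
  = 10^{160}\ \text{TB}
```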

Therefore, methods that use a neural network as a function approximator (2) have been attracting attention in recent years. In general, a neural network is guaranteed to be able to approximate a complicated function even with a simple structure (Universal Approximation Theorem). Using this advantage, these methods approximate a function that outputs the expected value from the state and action, and a function that outputs the optimal action directly from the environmental state, omitting the process of obtaining the expected value. These methods can handle problems without creating tables and do not require a large memory space. In addition, they can handle states and actions represented by continuous values. Many of the methods using a neural network are also more advantageous than Q-learning in terms of calculation time. This is because the optimal parameters for approximating a function can be obtained efficiently by using backpropagation and a general-purpose GPU (Graphics Processing Unit). In particular, a method that incorporates neural networks (deep learning) into reinforcement learning is referred to as deep reinforcement learning.

2.2.2 Application to logistics transportation problems

As described in Subsection 2.2.1, handling a problem by reinforcement learning requires defining an environment and its state, an agent and its action, and a reward calculation method. In this study, they are defined as follows.

(1) Environment and its state

To define an environment, the conveyor line model described in Section 2.1 is used. Table 1 shows the parameters for the conveyor line model. The environmental state is defined as a 19-dimensional vector consisting of the following elements:
- Presence flags for units 1 to 13 on the conveyor line
- Countdown values of units 4 and 12
- Elapsed maintenance time of units 4 and 12
- Flags indicating whether unit 4 and unit 12 are in the maintenance state
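
The elements listed above add up to 13 + 2 + 2 + 2 = 19 values. The sketch below shows one way such a vector could be assembled. The function name `build_state`, the ordering of the elements, the use of one maintenance flag per unit, and the absence of any normalization are assumptions made for illustration; the original text does not specify them.

```python
def build_state(occupied, countdown, maint_elapsed, in_maintenance):
    """Flatten the conveyor status into a 19-element list of floats.

    occupied:        13 booleans, presence flags for units 1 to 13
    countdown:       (unit 4, unit 12) countdown values in seconds
    maint_elapsed:   (unit 4, unit 12) elapsed maintenance time in seconds
    in_maintenance:  (unit 4, unit 12) booleans
    """
    if len(occupied) != 13:
        raise ValueError("expected presence flags for 13 units")
    state = [1.0 if flag else 0.0 for flag in occupied]
    state += [float(c) for c in countdown]
    state += [float(e) for e in maint_elapsed]
    state += [1.0 if m else 0.0 for m in in_maintenance]
    return state


# Hypothetical snapshot: alternating occupancy, unit 12 under maintenance.
state = build_state(
    occupied=[True, False] * 6 + [True],
    countdown=(60.0, 12.5),
    maint_elapsed=(0.0, 3.0),
    in_maintenance=(False, True),
)
print(len(state))  # 19
```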

(2) Agent and its action

In this study, PPO (Proximal Policy Optimization) (3) is adopted as an optimization algorithm for the agent. With this method, the agent has two neural networks, a critic network and an actor network, and works to optimize them simultaneously.
These networks receive the above-mentioned state vector as an input. The critic network outputs an estimated return value, and the actor network outputs three speed instruction values, one for each of control blocks 1 to 3. The estimated return value is used later to update the network parameters. The speed instruction values correspond to the action passed from the agent to the environment.
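
For orientation, the core of PPO is the clipped surrogate objective defined in the PPO paper, which limits how far a single update can move the actor away from the policy that collected the data. The sketch below computes only that term. The clip range `eps=0.2` and the sample numbers are illustrative assumptions, since the settings used in this study are not stated, and a full implementation would also include the critic loss.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     pi_new(a|s) / pi_old(a|s) for each sampled step
    advantages: advantage estimates for the same steps
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any gain from pushing the ratio
        # outside the interval [1 - eps, 1 + eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


# Two hypothetical samples: one with positive and one with negative advantage.
objective = ppo_clip_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
print(objective)
```

In this example the first sample contributes 1.2 instead of 1.5, and the second contributes -0.8 instead of -0.5, so the objective is about 0.2.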

(3) Reward calculation method

Let x_{t,catch} be a variable that takes the value 1 or 0 depending on whether or not a workpiece is conveyed to the most downstream position at a certain point of time t, let x_{t,drop} be a variable that indicates whether or not a workpiece drop has occurred, and let v_{t,i} (i = 1 to 13) be the speed instruction given to the i-th unit. The reward r_t at time t is defined by Equation (2).

Coefficients A, B, and C (> 0) are hyperparameters.

The reward is designed as above for the following reason. The first term in Equation (2) is a positive reward given each time a workpiece is conveyed, and it is necessary to ensure that the conveyor line model created in this study acts correctly as a conveyor line. This study is intended to develop controls that minimize the number of drops and at the same time reduce the operation speed (energy consumption). For this purpose, the second term gives a negative reward each time a drop occurs, and the third term gives a greater negative reward as the operation speed is increased.
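
Equation (2) itself is not reproduced in this text, so the following is only a sketch of a reward with the three terms described above. It assumes that the speed penalty is C times the sum of the 13 instructed speeds; the source states only that this term grows with the operation speed. The coefficient values are hypothetical.

```python
def reward(x_catch, x_drop, speeds, A, B, C):
    """Reward with a catch bonus, a drop penalty, and a speed penalty.

    ASSUMPTION: the speed penalty is C * sum(speeds). The exact form of
    the third term in Equation (2) is not given in this text.
    """
    return A * x_catch - B * x_drop - C * sum(speeds)


# Hypothetical step: one workpiece unloaded, no drop, all units at 0.25 m/s.
r = reward(x_catch=1, x_drop=0, speeds=[0.25] * 13, A=1.0, B=10.0, C=0.1)
print(r)
```

With these numbers the speed penalty is 0.1 × 3.25 = 0.325, so the reward is about 0.675. Raising C relative to A and B shifts the balance toward slower operation.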

2.2.3 Process flow of learning

Figure 2 is a flowchart of deep reinforcement learning on a conveyor line.

First, the neural networks in the agent and the conveyor line model are initialized appropriately. Then, the initial state of the conveyor line is given to the agent, and based on the received information, the agent calculates the estimated return value and speed instruction values with the neural networks. The speed instruction values are passed to the conveyor line model as an action. Based on these values, the conveyor line model calculates the state after one unit of time has passed, and then calculates the reward accompanying the change in the state. The calculated state and reward are returned to the agent.

Each time this exchange has been repeated a certain number of times, the critic and actor network parameters are updated according to the PPO algorithm. This procedure is repeated until the optimal networks are obtained.

Fig. 2　Schematic diagram of deep reinforcement learning process on conveyor line
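
The interaction loop described above can be sketched as follows. `ToyLine` and `RandomAgent` are hypothetical stand-ins that exist only to make the loop runnable; they are not the conveyor line model or the PPO agent of this study, and the placeholder reward has no physical meaning.

```python
import random


class ToyLine:
    """Stand-in environment returning a 19-element state and a reward."""

    def reset(self):
        return [0.0] * 19

    def step(self, action):
        state = [random.random() for _ in range(19)]
        reward = -sum(action)  # placeholder reward only
        return state, reward


class RandomAgent:
    """Stand-in agent: random speeds and no learning."""

    def __init__(self):
        self.updates = 0

    def act(self, state):
        value_estimate = 0.0  # a critic network would compute this
        speeds = [random.random() for _ in range(3)]  # one per control block
        return value_estimate, speeds

    def update(self, batch):
        # A PPO update of the critic and actor networks would go here.
        self.updates += 1


def train(env, agent, total_steps, steps_per_update):
    """Alternate between acting in the environment and periodic updates."""
    state = env.reset()
    batch = []
    for _ in range(total_steps):
        value, action = agent.act(state)
        next_state, reward = env.step(action)
        batch.append((state, action, reward, value))
        state = next_state
        if len(batch) == steps_per_update:
            agent.update(batch)
            batch = []
    return agent


agent = train(ToyLine(), RandomAgent(), total_steps=100, steps_per_update=20)
print(agent.updates)  # 5
```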

2.2.4 Evaluation

The trained agent is evaluated based on the number of drops and the average maximum speed ū when the conveyor line model is operated for one hour in simulation. The average maximum speed ū is defined in Equation (3) below.

where N is the total number of workpieces supplied when the model is operated for one hour, the suffix j identifies each workpiece and is assigned as 1, 2, 3, ..., N to the workpieces in the order they are supplied from the start of the simulation, and u_{i,j} indicates the maximum speed at which workpiece j passes the i-th unit.
The number of drops should be as small as possible, and among agents that incur the same number of drops, the one that operates at a lower average maximum speed is superior.
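
The evaluation rule can be sketched as follows. Equation (3) is not reproduced in this text, so `average_max_speed` assumes a plain mean of u_{i,j} over all workpieces and all units, which is consistent with the definitions of N and u_{i,j} but remains an assumption. The function names and sample values are hypothetical.

```python
def average_max_speed(u):
    """Mean of u[j][i], the maximum speed of workpiece j at unit i (m/s).

    ASSUMPTION: a plain average over all workpiece-unit pairs.
    """
    n_workpieces = len(u)
    n_units = len(u[0])
    total = sum(sum(row) for row in u)
    return total / (n_workpieces * n_units)


def better(result_a, result_b):
    """Each result is (drops, u_bar): fewer drops wins, then lower u_bar."""
    return min(result_a, result_b)


# Two hypothetical workpieces passing 13 units at constant speeds.
u = [[0.25] * 13, [0.5] * 13]
print(average_max_speed(u))        # 0.375
print(better((2, 0.20), (0, 0.30)))  # (0, 0.3)
```

Comparing `(drops, u_bar)` tuples gives the ordering stated above, because Python compares tuples element by element.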

2.3 Bayesian optimization

To operate the conveyor line appropriately, the reward parameters A, B, and C in Equation (2) must be set appropriately. As one extreme example, if the first and second terms are far greater than the third term, the reward that the agent can obtain by reducing the speed is extremely small, and the agent may be trained to always instruct every unit to operate at the maximum speed. Conversely, if the third term is far greater than the first and second terms, the reward obtained by conveying workpieces or reducing the number of drops is smaller than the penalty (negative reward) incurred by increasing the speed, and as a result, the agent may decide not to convey workpieces at all.

Since the A, B, and C values required to achieve the desired operation are unknown, many values need to be tried. In general, deep reinforcement learning takes a long time, so it is desirable to find good parameters with as few attempts as possible.

Therefore, this study used Bayesian optimization, which is an optimization method. With Bayesian optimization, the maximum value (or the minimum value) of a function whose shape is unknown can be obtained efficiently. For example, a one-dimensional function f (x) is optimized by the iterative calculation below (4).

(1) First, determine x randomly.
(2) For the x determined previously, evaluate f (x) and store the pair (x, f (x)) as data.
(3) Create a statistical model for predicting the shape of f (x) based on the data obtained so far.
(4) Using the statistical model, determine the x to check next.
(5) Go back to step (2).
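
The five steps above can be sketched with a small, self-contained example. It assumes a zero-mean Gaussian process with a squared-exponential kernel as the statistical model and an upper confidence bound (UCB) rule for choosing the next x. The original text does not say which model or selection rule was used, and the kernel length, `kappa`, the search grid, and `toy_objective` are all illustrative choices.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(mat, rhs):
    """Solve mat @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_predict(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_new, solve(gram, ys)))
    var = 1.0 - sum(k * w for k, w in zip(k_new, solve(gram, k_new)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt(f, n_evals=12, kappa=2.0, seed=0):
    """Maximize f on [0, 1] using a GP model and a UCB selection rule."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(101)]
    x_next = rng.choice(grid)                 # step (1): random start
    xs, ys = [], []
    for _ in range(n_evals):                  # step (5): repeat
        xs.append(x_next)                     # step (2): evaluate and store
        ys.append(f(x_next))

        def ucb(x):                           # step (3): model from data so far
            mean, std = gp_predict(xs, ys, x)
            return mean + kappa * std

        # step (4): choose the unvisited x with the highest UCB score
        x_next = max((x for x in grid if x not in xs), key=ucb)
    best = max(range(len(ys)), key=ys.__getitem__)
    return xs[best], ys[best]


def toy_objective(x):
    """Stand-in for an expensive evaluation; its maximum is at x = 0.3."""
    return -(x - 0.3) ** 2


best_x, best_y = bayes_opt(toy_objective)
print(best_x, best_y)
```

Each call to `f` stands for one expensive evaluation, so the point of the method is that only `n_evals` such calls are made.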

In this study, the parameters were determined by performing the above procedure with x replaced by A, B, and C, and with the function f (x) replaced by the “performance of the agent obtained by deep reinforcement learning with A, B, and C fixed at certain values.”

3. Results

3.1 Training the agent

Figure 3 shows a typical learning curve of the agent. The figure shows that the return increases as the number of agent training steps increases, indicating stable progress of agent training.

Figure 4 compares the conveyor line control between the untrained agent and the trained agent. The time history of workpiece transportation for 30 minutes is plotted in two dimensions. With the untrained agent, the workpieces were not conveyed smoothly, causing many drops. With the trained agent, the workpieces were conveyed smoothly, and no drops occurred.

Fig. 4　Comparison of conveyor line control by agent before learning and after learning

Figure 5 shows how the instruction speed of the trained agent changed with time. The time elapsed is plotted for one hour. The three graphs in Figs. 5-(a) to (c) correspond to control blocks 1 to 3, and the gray areas in the graphs indicate the periods in which maintenance is in progress in unit 4 or 12. These graphs suggest that the agent adjusts the instruction speed before and after maintenance, when congestion is likely to occur, thereby achieving efficient workpiece transportation while avoiding drops.

Fig. 5　Time-dependent change of speed order values given by trained agent

3.2 Comparison with PI control

To examine the performance of deep reinforcement learning, we simulated conveyor line control using PI (Proportional Integral) control, which is PID control without the derivative term. PI control was configured with the occupancy rate as the controlled variable, based on the finding from studies of congestion (5) that congestion occurs when the occupancy rate exceeds 50%. Figure 6 shows the block diagram of PI control on a conveyor line.
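
A PI controller acting on the occupancy rate could look like the sketch below. The block diagram in Fig. 6 is not reproduced in this text, so the structure here is an assumption: the target occupancy of 0.5 is taken from the 50% threshold mentioned above, the speed is raised when the line is more crowded than the target, and the gains, time step, and speed limit are hypothetical values.

```python
class PIController:
    """Discrete-time PI controller with its output clamped to [0, v_max]."""

    def __init__(self, kp, ki, dt, v_max):
        self.kp, self.ki, self.dt, self.v_max = kp, ki, dt, v_max
        self.integral = 0.0

    def step(self, occupancy, target=0.5):
        # ASSUMPTION: a positive error (line more crowded than the target)
        # should raise the commanded speed.
        error = occupancy - target
        self.integral += error * self.dt
        speed = self.kp * error + self.ki * self.integral
        return min(max(speed, 0.0), self.v_max)


def occupancy_rate(presence_flags):
    """Fraction of units that currently hold a workpiece."""
    return sum(presence_flags) / len(presence_flags)


# Hypothetical snapshot with 9 of 13 units occupied.
flags = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pi = PIController(kp=1.0, ki=0.5, dt=1.0, v_max=0.5)
speed = pi.step(occupancy_rate(flags))
print(speed)
```

Unlike the learned agent, this controller sees only a single scalar, the occupancy rate, rather than the distribution of workpieces along the line.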

Even with PI control, workpiece drops could be eliminated completely, but the average maximum speed was 0.270 m/s. With the agent trained by deep reinforcement learning, the average maximum speed was 0.257 m/s, about 5% lower, so deep reinforcement learning is superior in terms of transportation speed.

Fig. 6　Block diagram of PI control on conveyor line

Table 2 compares the performance of this method and PI control in an environment different from that used for training. This is intended to examine how well the two controllers can cope with an unknown environment. With deep reinforcement learning, the average number of drops was reduced to 1/4.5 of that with PI control, at a lower average maximum speed. This result shows a difference in robustness against parameter fluctuations between deep reinforcement learning and PI control.

Table 2　Performance comparison between this method and PI control with different parameters from those used for training*1

4. Conclusion

To solve the congestion control problem on conveyor lines, which is difficult to handle with classical control theories, we developed a control logic that minimizes both the number of drops and the operation speed by using deep reinforcement learning and Bayesian optimization.

By adopting a method called PPO as the algorithm for deep reinforcement learning and using Bayesian optimization to adjust the parameters, we achieved stable agent training without human intervention. In a simulated conveyor line, the trained agent eliminated drops completely and achieved higher energy efficiency than PI control. The simulation also found that the controller obtained by deep reinforcement learning is more robust against changes in the environment. This suggests that with this method, it is easier to readjust parameters when the same logic is reused.

Judging from these results, this method is expected to offer customers added value such as reducing person-hours and lead time and improving energy efficiency of their equipment.

The framework used in this study, which combines deep reinforcement learning and Bayesian optimization, can be applied to problems other than conveyor line problems, and could offer an optimal control logic especially for problems that cannot be handled with classical control theories. We aim to implement the results obtained in this study in actual equipment as early as possible and to expand the applications of deep reinforcement learning and Bayesian optimization, focusing on maximizing customers’ value.

Acknowledgments

We would like to express our gratitude to Professor Katsuhiro Nishinari of the Research Center for Advanced Science and Technology, The University of Tokyo, for his advice.

REFERENCES

(1) E. Nakai : Introduction to Reinforcement Learning Theory for IT Engineers, Gijutsu-Hyoron Co., Ltd., 2020
(2) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis : Human-level control through deep reinforcement learning, Nature, Vol. 518, Iss. 7540, 2015, pp. 529-533
(3) J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov : Proximal Policy Optimization Algorithms, https://arxiv.org/abs/1707.06347, accessed 2021-8-23
(4) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams and N. de Freitas : Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proceedings of the IEEE, Vol. 104, Iss. 1, 2016, pp. 148-175
(5) K. Nishinari : Studies of Congestion, Shinchosha, 2006

Vol.55 No.2
