Copyright © 1999 IFAC
14th Triennial World Congress, Beijing, P.R. China

DERIVATIVE ESTIMATES PARALLEL SIMULATION ALGORITHM BASED ON PERFORMANCE POTENTIALS THEORY

Chang-Chen Zou, Hong-Sheng Xi, Bao-Qun Yin, Ya-Ping Zhou, De-Min Sun

Department of Automation, University of Science and Technology of China,
Hefei, Anhui, 230027, P.R. China. Email: xihs@ustc.edu.cn

Abstract: An efficient derivative estimates parallel simulation algorithm is presented based on the new performance potentials theory (Cao and Chen, 1997). Two main ideas are introduced. First, a new processor-partitioning pattern, screwy partitioning, which achieves complete load balance on a time-consuming part of the simulation. Second, a modified Common Random Number scheme, which removes the large broadcasting cost of sample path data at the price of only a small additional workload. Simulation experiments on an SPMD parallel computer show that this algorithm can achieve near-linear speedup. Copyright © 1999 IFAC

Keywords: Parallel algorithms, Discrete-event dynamic systems, Performance analysis, Closed queuing networks.

1. INTRODUCTION

The stochastic theory of discrete event dynamic systems (DEDS) can successfully describe the structure and performance of systems such as communication networks, manufacturing systems, traffic networks, military C3I systems, etc. Most successful results established in DEDS performance analysis currently depend on the infinitesimal perturbation analysis (IPA) technique, which was first proposed by Ho in the early 80's and developed by others later (Ho et al., 1983; Ho and Cao, 1991; Cassandras, 1993). Although the sample derivatives that IPA yields are unbiased or strongly consistent for many systems, this is not true for many others (Cao, 1987; Heidelberger et al., 1988). Recently, Cao and Chen (1997) developed a simple approach in this direction, the Markov performance potentials theory.
It provides new derivative formulas for the steady-state performance measures of Markov processes with respect to their infinitesimal generators. It is shown that the quantities involved in the derivative formulas can be easily estimated by analyzing a single sample path of a Markov process. The derivatives obtained by using these formulas can easily be proved to be unbiased and to converge to their true values with probability one. Since the Markov process is the most fundamental model for many discrete event systems, the potentials theory is widely applicable to sensitivity analysis of DEDS. Given that DEDS simulation always requires a great amount of computational effort and time, it is particularly important and promising to use parallel computers to speed up DEDS simulation (Righter and Walrand, 1989; Fujimoto, 1990). In this paper, for the new DEDS analysis theory, the potentials theory, a parallel simulation algorithm that promises high parallel efficiency is presented through an analysis of the single-processor closed queuing network simulation algorithm. The rest of the paper is organized as follows. Section 2 briefly describes the closed queuing network problem of Yin et al. (1997). In Section 3, the main formulas used in the simulation are first presented, followed by a brief description of the single-processor simulation algorithm, leaving the proofs and details of the performance potentials theory of Cao and Chen (1997) to the original paper. In Section 4, through an analysis of the single-processor simulation algorithm and serial experimental results, an insight into its main part is derived. Two ideas are then derived from the features of the serial simulation algorithm and the characteristics of its communication. By applying these two policies, the overall parallel algorithm can be made very efficient.
The parallel experimental results on an SPMD parallel computer, which show that this algorithm can achieve near-linear speedup, are given in Section 5. Finally, a conclusion and some discussions are given in Section 6.

2. BRIEF INTRODUCTION OF CLOSED QUEUING NETWORKS

Consider a closed queuing network consisting of M servers and N customers. It is supposed that each server is equipped with a buffer of infinite capacity and that the service discipline is FCFS. The service requirement of a customer at each server is assumed to be exponentially distributed with a mean value of 1. Once a server i completes its service of a customer, the customer immediately moves from server i to any one of the M servers, including server i itself. The destination server to which the customer moves is determined by a transition probability matrix Q. This is the simple description of the closed queuing networks (Yin et al., 1997) discussed in this paper.

3. SERIAL SIMULATION ALGORITHM OF DERIVATIVE ESTIMATES

From the performance potentials theory, the derivative estimates can be obtained by analyzing a single sample path of the closed queuing network. Because of page limits, for the proofs and details of the performance potentials theory, please refer to the original paper (Cao and Chen, 1997). All the formulas and symbols in this paper are consistent with Cao and Chen (1997) and Yin et al. (1997). Let J denote the performance measure. π is the steady-state probability vector of the state-transition process, while K is the size of the state space. g is the performance potential vector, and f denotes the performance function vector. μ_{i,j} is the service rate of server i at state j. A is the infinitesimal generator. The superscript ^ (hat) denotes the simulated estimate of a quantity.
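To make the setting of Section 2 concrete, the closed-network dynamics can be sketched as follows. This is a minimal illustration, not the authors' code: the state representation (a tuple of queue lengths), the routing matrix and the service rates are illustrative assumptions.

```python
import random

def one_transition(state, Q, rates, rng):
    """Simulate one state transition of a closed queuing network.
    `state` is a tuple (n_1, ..., n_M) of queue lengths summing to N.
    Among the busy servers, server i finishes first with probability
    proportional to rates[i] (competing exponential clocks); the finished
    customer is then routed according to row i of the matrix Q."""
    busy = [i for i, n in enumerate(state) if n > 0]
    total = sum(rates[i] for i in busy)
    u = rng.random() * total
    src = busy[-1]
    for i in busy:
        u -= rates[i]
        if u < 0:
            src = i
            break
    # route the departing customer (it may return to the same server)
    dst = rng.choices(range(len(Q)), weights=Q[src])[0]
    nxt = list(state)
    nxt[src] -= 1
    nxt[dst] += 1
    return tuple(nxt)

rng = random.Random(42)
state = (2, 1, 0)                       # M=3 servers, N=3 customers
Q = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
for _ in range(1000):
    state = one_transition(state, Q, (1.0, 1.0, 1.0), rng)
assert sum(state) == 3                  # customers are conserved
```

The derivative estimation below records statistics while such a chain is simulated.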
Then by applying the performance potentials theory, from a single state-transition sample path, the unbiased estimate of the partial derivative of J with respect to μ_{i,j} is (Cao and Chen, 1997):

    ∂Ĵ/∂μ_{i,j} = π̂ (∂A/∂μ_{i,j}) ĝ        (1)

So the unbiased estimated gradient of J is:

    ∇Ĵ = { ∂Ĵ/∂μ_{i,j} }  (i = 1, ..., M; j = 1, ..., K)        (2)

The most important work in the simulation is to record the data of the events "the process starts from state i and first transits to state j", where i and j are arbitrary states. In the serial simulation program, three square K × K dimensional matrices, called parameter matrices, are used to record these data. The corresponding matrix entry (i, j) records, respectively, the overall number, the time summation and the cost summation of the events "the process starts from state i and first transits to state j" during the simulation. These parameter matrices are used to calculate the estimated value of the performance potential g. The serial simulation program based on performance potentials is given below (N_t denotes the length of the sample path in the simulation, i.e. the number of state transitions):

Algorithm 1. Serial simulation program based on Markov performance potentials

Step 1. Use the pseudo-random number generator to simulate one step of state transition.
Step 2. Update the three K × K dimensional parameter matrices. Record every state residence time and the overall simulation time.
Step 3. Return to Step 1 to begin the next state transition until the number of state transitions reaches N_t.
Step 4. Use the three parameter matrices to calculate ĝ. Use every state residence time and the overall simulation time to calculate π̂.
Step 5. Calculate every ∂Ĵ/∂μ_{i,j} (i = 1, 2, ..., M; j = 1, 2, ..., K) to compose the gradient matrix ∇Ĵ.

4. PARALLEL ANALYSIS AND IMPLEMENTATION

4.1. Serial simulation program analysis

It is well known that in DEDS simulation the simulation sample path is often required to be very long.
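Steps 1-3 of Algorithm 1 can be sketched in code roughly as follows. This is a hedged illustration over a generic K-state Markov chain with an assumed transition matrix P and a placeholder performance function f; the actual estimation of ĝ and π̂ in Steps 4-5 follows Cao and Chen (1997) and is omitted here.

```python
import random

def serial_simulation(P, f, N_t, seed=0):
    """Steps 1-3 of Algorithm 1: walk the chain for N_t transitions and
    accumulate the three K x K parameter matrices that record, for each
    event "start in state i, first transit to state j":
    the event count, the residence-time sum, and the cost sum."""
    K = len(P)
    count = [[0.0] * K for _ in range(K)]
    time_sum = [[0.0] * K for _ in range(K)]
    cost_sum = [[0.0] * K for _ in range(K)]
    residence = [0.0] * K      # per-state residence time (for pi-hat)
    total_time = 0.0
    rng = random.Random(seed)
    state = 0
    for _ in range(N_t):
        dwell = rng.expovariate(1.0)          # residence time in `state`
        nxt = rng.choices(range(K), weights=P[state])[0]
        count[state][nxt] += 1
        time_sum[state][nxt] += dwell
        cost_sum[state][nxt] += dwell * f[state]
        residence[state] += dwell
        total_time += dwell
        state = nxt
    # Steps 4-5 (g-hat, pi-hat and the gradient) would use these matrices.
    return count, time_sum, cost_sum, residence, total_time

P = [[0.1, 0.9], [0.6, 0.4]]
count, _, _, residence, total_time = serial_simulation(P, [1.0, 2.0], 500)
assert sum(sum(row) for row in count) == 500   # one event per transition
```

Note that the per-transition work is dominated by updating the K × K matrices, which is exactly what the profiling below confirms.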
The number of state transitions N_t often reaches hundreds of thousands, even millions, which is one of the reasons why parallel simulation plays an important role in DEDS research. In order to determine which part of the simulation is pivotal for the parallel implementation, many numerical experiments with the serial simulation program were made on a PC. Typical results are listed in Table 1 and Table 2.

Table 1. Serial experimental results (M=3, N=5, K=21)

  N_t        State Trans   Matrices Compute   g, π̂    ∇Ĵ
  10,000     2.6%          37.3%              1.5%    58.6%
  100,000    5.0%          82.0%              0.33%   12.6%
  1,000,000  5.7%          92.7%              0.04%   1.5%

Table 2. Serial experimental results (N_t=100,000)

  M   N   K    State Trans   Matrices Compute   g, π̂    ∇Ĵ
  2   3   4    29.2%         69.5%              0.71%   0.60%
  3   5   21   5.0%          82.0%              0.33%   12.6%
  3   8   45   1.2%          72.8%              0.30%   25.3%

Remark: each column of Table 1 and Table 2 is the percentage ratio of computational time to overall simulation time.

From Table 1 and Table 2 it is easy to see that for middle- and large-scale simulation problems, more than 90% of the overall CPU time is concentrated in two parts of matrix computation:

- Computation of the parameter matrices in Step 2.
- Computation of the gradient matrix ∇Ĵ in Step 5.

It can be seen that for large-scale simulation, the parameter-matrix computation occupies most of the CPU time, even more than 90%. Therefore, the parallel implementation of Step 2 and Step 5, especially of the three parameter matrices in Step 2, directly determines the overall parallel simulation program's efficiency, and these two parts are the focus of the parallel implementation. By observing these two parts, it can be found that they have good parallelizable features:

- In the parameter matrix computation of Step 2, there are only matrix-scalar products and entrywise multiplication of two matrices (i.e., the row-i, column-j element of one matrix multiplies the row-i, column-j element of another matrix).
- In Step 5, each entry of the matrix ∇Ĵ is computed independently.

If the parameter matrices are partitioned with the same partitioning pattern (i.e.,
the row-i, column-j elements of these matrices are assigned to the same processor) and the gradient matrix ∇Ĵ is partitioned evenly with an arbitrary partitioning pattern, there is no need to exchange matrix data between different processors in the matrix computation. This is a very important feature for the parallel implementation, because in general parallel programs data communication occupies a considerable part of the run time, and in order to reduce the communication cost, different parallel programs normally must be written for parallel machines with different structures. It can therefore be concluded that, because of this special feature of the simulation algorithm, the parallel simulation program here can achieve high parallel efficiency and can be widely applied to parallel machines with different structures. Nowadays Multiple-Instruction Multiple-Data (MIMD) parallel machines are playing an increasingly important role in the parallel computing area, and it seems that the near-term future of parallel computing will be dominated by MIMD machines. The major disadvantages of MIMD machines such as workstation clusters are high communication latency and limited communication bandwidth, so the very low communication cost of the simulation program here makes it well suited to MIMD computing and gives it practical application value.

4.2. Screwy partitioning algorithm

Because the state space size K is generally large in simulation, for the practicality of this parallel simulation program, the processor-partitioning pattern over matrix entries must be considered (Chen, 1994). Since there is no matrix data exchange between different processors, only the load balance problem needs to be considered. For the gradient matrix in Step 5, because each of its entries is computed independently, simply dividing its M × K entries equally among the processors achieves complete load balance. The essential work is for the parameter matrices in Step 2.
From the serial simulation program, it can be seen that there are three different types of matrix computation in Step 2: some matrix entries are computed one by one, some are computed by row, and others are computed by column. Generally speaking, there are three types of partitioning patterns in parallel matrix computing: block partitioning, row partitioning and column partitioning (Chen, 1994). However, for the parallel simulation algorithm here, none of them achieves complete load balance. Here a new partitioning method, screwy partitioning, is introduced. By using it, complete load balance of the parameter matrices in Step 2 can be achieved. The principle of this method is illustrated in Fig. 1.

[Fig. 1. Illustration of screwy partitioning, using a 4 × 4 square matrix as an example. P_i represents the i-th processor; the small circles represent matrix entries, and the circles on one (wrapped diagonal) line are partitioned into the same processor.]

Thus, the screwy partitioning formula of the matrix [a_{i,j}] (i, j = 1, 2, ..., K) is

    (a_{1,i}, a_{2,i+1}, ..., a_{K-i+1,K}, a_{K-i+2,1}, ..., a_{K,i-1}) → P_i        (3)

The formula above requires that the square matrix's dimension K equal the number of processors. Since the state space size K is often very large, in general, if K/l processors are used for the simulation (with l dividing K), the matrix entries that would be partitioned into processors P_{kl+1}, P_{kl+2}, ..., P_{(k+1)l} (k = 0, 1, ..., K/l - 1) according to (3) should be partitioned into processor P_{k+1}. It should be noted that the screwy partitioning method presented here can serve as a general partitioning algorithm. Generally speaking, if the matrix computation in a parallel program involves both row-wise and column-wise calculation, the screwy partitioning method may achieve a better load balance than the ordinary partitioning methods.
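A minimal sketch of the entry-to-processor assignment implied by formula (3), written in 0-based indices (the function name and interface are illustrative, not from the paper). Entry (i, j) lies on wrapped diagonal (j - i) mod K; with P processors and P dividing K, each processor owns K/P consecutive wrapped diagonals and therefore exactly K²/P entries:

```python
def screwy_owner(i, j, K, P):
    """Processor (0-based) owning entry (i, j) of a K x K matrix under
    screwy (wrapped-diagonal) partitioning with P processors, P | K.
    With P == K this reduces to formula (3): processor d owns the
    wrapped diagonal that starts at entry (0, d)."""
    diag = (j - i) % K          # wrapped diagonal index, 0..K-1
    return diag // (K // P)     # K/P consecutive diagonals per processor

# Every processor receives exactly K*K/P entries (complete load balance):
K, P = 8, 4
loads = [0] * P
for i in range(K):
    for j in range(K):
        loads[screwy_owner(i, j, K, P)] += 1
assert loads == [16, 16, 16, 16]
```

Because each wrapped diagonal contains exactly one entry from every row and every column, the balance holds for row-wise, column-wise and entry-by-entry computation alike.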
4.3. Common Random Number (CRN)

By using the screwy partitioning method, complete load balance has been achieved for the two matrix computation parts that occupy more than 90% of the CPU time of the serial simulation program. Since no matrix entries need to be exchanged between processors in these parts, it seems that an efficient parallel algorithm has been derived. However, things are not so simple. When developing a parallel program, it is common to let one processor generate the single sample path and then broadcast the path data to all other processors, which (possibly including the generating processor) then do the parallel matrix computation together. But here, in DEDS simulation, since the simulated sample path is often very long, even though only three numbers need to be broadcast at every state transition (old state, new state, old state residence time), the overall broadcast communication cost of the sample path will be very large. This large communication cost dramatically deteriorates the parallel efficiency of the program, so an alternative method must be found to get rid of it. In "Ordinal Optimization" (Ho, 1994), the Common Random Number (CRN) scheme (Heidelberger and Iglehart, 1979) was presented to generate N different sample paths for different parameters with the same random number sequence (Dai, 1996). Enlightened by it, here the CRN series can be used in all processors, letting all of them do the sample path calculation. In this way, since the random numbers are identical, every processor obtains the same sample path and does not need to transmit the path data to the others. Since the sample path computation takes less than 5% of the total CPU time for most problems, the large sample path communication cost can be removed at the price of adding only a little workload.
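The CRN idea can be demonstrated with a toy sketch (the names here are illustrative; any pseudo-random generator with explicit seeding would do): if every processor seeds its own generator identically, each reproduces the same sample path locally, so no path data needs to be broadcast.

```python
import random

def local_sample_path(seed, P, n_steps, start=0):
    """Generate the state-transition sample path locally from a common
    seed (CRN). P is the K x K transition matrix of the chain."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(n_steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path

P = [[0.2, 0.8], [0.7, 0.3]]
# Four "processors", one common seed: identical paths, zero communication.
paths = [local_sample_path(1999, P, 200) for _ in range(4)]
assert all(p == paths[0] for p in paths)
```

The trade-off is that the path-generation work is duplicated on every processor, which is acceptable here precisely because that work is under 5% of the total CPU time.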
The experimental results in Section 5 below show that this modified CRN scheme is the most important contribution to the high efficiency of the overall parallel program.

4.4. Parallel simulation program

An efficient parallel implementation of the main part of the serial simulation program has now been derived. In order to minimize the communication cost, which would dramatically deteriorate parallel performance, and to simplify the parallel program, the other parts of the serial simulation program are left unchanged and used directly in the parallel program. The overall parallel simulation program is listed below:

Algorithm 2. Parallel simulation program based on Markov performance potentials

Step 1. In all processors, use the same random "seed" to initialize the pseudo-random number generator of each processor. In this way, the random number series in all processors are identical to each other (CRN).
Step 2. In all processors, use the pseudo-random numbers to simulate one step of state transition. Since the CRN scheme is used, each processor gets the same state transition result, i.e. the same sample path.
Step 3. All processors compute in parallel the parameter matrix entries that are partitioned to them (using the screwy partitioning method).
Step 4. Return to Step 2 to begin the next state transition until the number of state transitions reaches N_t.
Step 5. Calculate ĝ, π̂.
Step 6. All processors compute in parallel the corresponding entries of the matrix ∇Ĵ. (Partition the entries of ∇Ĵ equally among all processors using any partitioning method.)

5. EXPERIMENTAL RESULTS

The computation experiments were performed on an SPMD parallel computer at the University of Science and Technology of China, which is a loosely coupled message-passing parallel processing computer. To compare the performance of the CRN scheme with the common approach without CRN, a small-scale problem is investigated by parallel simulation: M=3, N=3, K=10, with sample path length N_t=100,000.
The results are listed in Table 3 below:

Table 3. Common Random Number (CRN) comparison

  Nodes   With CRN (sec)   Without CRN (sec)
  1       17.0             -
  2       11.8             46.5
  5       7.6              41.6
  10      6.2              40.2

Obviously, if the Common Random Number scheme is not used in the parallel program, the large sample path broadcasting cost deteriorates performance drastically, even making the parallel program much more time-consuming than the serial program. It can be seen that the CRN scheme is the most important contributing factor to the parallel program. In order to see how well the screwy partitioning method works, a medium-scale problem (M=3, N=8, K=45, N_t=100,000) was tested with four different partitioning methods: screwy, block, row and column partitioning. The results are listed in Table 4 (time unit: second).

Table 4. Comparison of partitioning methods

  Nodes   Screwy   Block   Row    Column
  1       261.1    -       -      -
  5       54.9     56.2    55.6   56.8
  15      20.1     20.4    21.5   22.0

It can be seen from Table 4 that in practice the screwy partitioning method does not play an important role in the parallel program. This is reasonable after analyzing the parallel program. The screwy partitioning method is used for the parallel processing of the parameter matrix computation in Step 2 of the serial program. In Step 2, the row and column computations are additions and subtractions, while the entry-by-entry computation is multiplication. So the row and column computation workload is much smaller than that of the entry-by-entry matrix computation, for which the different partitioning methods make no difference. Since in most cases the parallel simulation program is only used for large-scale problems, it is necessary to run experiments on large-scale problems to see the parallel program's performance. For a large-scale case, M=5, N=8, K=495, together with the above two cases, the experimental results are summarized in Table 5 (using the screwy partitioning method; the sample path length in all cases is N_t=100,000).
Table 5. The experimental results of the parallel program

  Case scale   Nodes   Time (sec)   Speedup   Efficiency
  K=10         1       17.0         -         -
               2       11.8         1.44      72.0%
               5       7.6          2.24      44.8%
               10      6.2          2.74      27.4%
  K=45         1       261.1        -         -
               5       54.9         4.76      95.2%
               15      20.8         12.55     83.7%
  K=495        1       32546        -         -
               3       11172        2.91      97.1%
               5       6687         4.87      97.3%
               9       3755         8.67      96.3%

Obviously, the parallel simulation program exhibits excellent speedup and scalability. Furthermore, the larger the scale of the case, the better its performance, which makes it suitable and effective for practical use.

6. CONCLUSION

Based on a new DEDS analysis theory, the Markov performance potentials (Cao and Chen, 1997), an efficient parallel simulation algorithm is provided by applying the potentials theory to a class of closed queuing networks. The experimental results show that the parallel simulation algorithm of this paper exhibits excellent speedup and scalability. In this paper, two ideas are presented: the screwy partitioning method and the modified Common Random Number scheme. Although the screwy partitioning method does not play a very important role in the parallel program, it is a kind of general partitioning method that can be used in many other parallel implementation areas. There are still many topics for future research. In this parallel simulation program, based on the characteristics of the potentials theory and of closed queuing networks, the matrix computation parts of the counterpart serial program are implemented in parallel with high efficiency. But for general DEDS parallel simulation, researchers are mainly working on "sample path" parallel simulation (Righter and Walrand, 1989; Fujimoto, 1990) and are now studying the simulation of many different DEDS simultaneously on parallel computers
(Vakili, 1991, 1992; Hu, 1995). Since the recently developed concept of ordinal optimization (Ho, 1996) has become popular and promising, a prospective research direction is to combine the derivative estimates parallel algorithm presented here with many-sample-path DEDS parallel simulation and ordinal optimization.

ACKNOWLEDGEMENTS

The authors thank the China High Performance Computing Center at Hefei for providing the SPMD parallel computer for the experiments.

REFERENCES

Cao, X.R. (1987). First-order perturbation analysis of a single multi-class finite source queue, Performance Evaluation, 7, 31-41.
Cao, X.R. (1994). Realization Probabilities: the Dynamics of Queuing Systems, Springer-Verlag, New York.
Cao, X.R. and H.F. Chen (1997). Perturbation realization, potentials and sensitivity analysis of Markov processes, IEEE Transactions on Automatic Control, 42, 1382-1393.
Cassandras, C.G. (1993). Discrete Event Systems: Modeling and Performance Analysis, Aksen Associates, Inc.
Chen, G.L. (1994). Design and Analysis of Parallel Algorithms, High Educational Publishing House, P.R. China.
Dai, L.Y. (1996). Convergence properties of ordinal comparison in the simulation of discrete event dynamic systems, Journal of Optimization Theory and Applications, 91, 363-388.
Fujimoto, R.M. (1990). Parallel discrete event simulation, Commun. ACM, 33, 31-53.
Heidelberger, P. and D.L. Iglehart (1979). Comparing stochastic systems using regenerative simulation with common random numbers, Adv. Appl. Prob., 11, 804-819.
Heidelberger, P., X.R. Cao, M. Zazanis and R. Suri (1988). Convergence properties of infinitesimal perturbation analysis estimates, Management Science, 34, 1281-1302.
Ho, Y.C., X.R. Cao and C.G. Cassandras (1983). Infinitesimal and finite perturbation analysis for queueing networks, Automatica, 19, 439-445.
Ho, Y.C. and X.R. Cao (1991). Perturbation Analysis of Discrete Event Dynamic Systems, Kluwer Academic Publishers, Boston.
Ho, Y.C. (1994).
Overview of ordinal optimization, Proceedings of the 33rd Conference on Decision and Control, Lake Buena Vista, Florida, 1975-1977.
Ho, Y.C. (1996). Soft Optimization for Hard Problems, computerized lecture via private communication.
Hu, J.Q. (1995). Parallel simulation of DEDS via event synchronization, DEDS: Theory and Applications, 5, 167-186.
Righter, R. and J.C. Walrand (1989). Distributed simulation of discrete event systems, Proceedings of the IEEE, 77, 99-113.
Ross, S.M. (1983). Stochastic Processes, Wiley.
Vakili, P. (1991). Using a standard clock technique for efficient simulation, Operations Research Letters, 10, 445-452.
Vakili, P. (1992). Massively parallel and distributed simulation of a class of discrete event systems: a different perspective, ACM Trans. on Modeling and Simulation, 2, 214-238.
Whitt, W. (1980). Continuity of generalized semi-Markov processes, Mathematics of Operations Research, 5, 494-501.
Yin, B.Q., Y.P. Zhou, H.S. Xi and D.M. Sun (1997). Sensitivity formulas of performance and realization factors with parameter-dependent performance functions in closed queuing networks, In: Proceedings of CWCICM, 1384-1889, Xi'an Jiaotong Univ. Press, P.R. China.