This preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full Document
Unformatted text preview: Copyrights;l [999 IFAC
14th 'l‘ricnniul World Congress, Beijing, PR. China 49 130023 DERIVATIVE ESTIMATES PARALLEL SIMULATION ALGORITHM BASED
ON PERFORMANCE POTENTIALS THEORY ChangChen Zou, HongvSheng Xi, Bao—Qun Yin, Ya—Ping Zhou, , De—Min Sun Department ofA alternation, University of Science and TQChROIOgl’ ofChina Hefei, Anhur‘, 23002?. P. R. China. Email: xihs @ustc. Bdh'. en Abstract: An efﬁcieut derivative estimates parallel simulation algorithm is presented
based on the new Performance Potentials theory (Cao and Chen, 1997). Two main ideas
are introduced: First, a new processorpartitioning pattern. Scrcwy Partitioning, which
can make complete load balance on a time—costing simulation part; Second, modiﬁed
Common Random Number, which can remove the large amount of broadcasting cost of
sample path data at the price of only adding a very little workload. The simulation
experiments on an SPMD parallel computer Show that this algorithm can achieve near linear speedup. Comrt'ghr CF 1999 IFAC Keywords: Parallel algorithms, Discrete~evcnt dynamic systems. Performance analysis, Closed queuing networks. 1 . INTRODUCTION The stochastic theory of discrete event dynamic
system (DEBS) can successfully describe the
structure and performance of systems such as
communication networks, manufacturing systems,
trafﬁc networks, military C3! systems, etc. Most
suecassful results established in DEDS perfonnance
analysis are currently dependent on inﬁnitesimal
perturbation analysis (IPA) technique, which was
first proposed by Ho ln early 30’s and was developed
by others later (Ho or at. 1983, Ho and Coo l99l,
Cassandras [993). Although the sample derivatives
that lPA yields are unbiased or strongly consistent
for many systems. it is not true for many others (Cao
l98'l'; Ilcidelberger et at, 1988). Recently, Cao and Chen (1997) developed a simple
approach in this direction, Markov Performance
Potential theory. It provides new derivative formulas
of the steadystate performance measure of Markov
processes with respect to their inﬁnitesimal
generators. It is shown that the quantities involved in
the derivative formulas can be easily estimated by analyzing a single sample path of a Markov process.
The derivatives obtained by using these formulas can
be easily proved to be unbiased and converge to their
true values with probability one. Since Markov
process is the most fundamental model for many
discrete event systems, this potentials theory is
widely applicable for sensitivity analysis of BEDS. With respect to the fact that BEDS simulation always
needs a great amount of computational effort and
spends a lot of time, it is particularly important and
promising to use parallel computer to speed up
DEDS simulation {Righter and Walrand, I989;
Fujimoto, 1990). in this paper, for the new BEDS
analysis theory: potentials theory, through analyzing
the closed queuing networks simulation algorithm on
singleprocessor, a parallel simulation algorithm is
presented that promises a high parallelefﬁciency. The rest of the paper is organized as following.
Section 2 brieﬂy describes the closed queuing
networks problem in Yin er al. (1997). In Section 3,
the main formulas used in simulation are ﬁrst
presented, then is the simple depiction of single— processor simulation algorithm, leaving proofs and
details of performance potential theory of Cao and
Chen (l997) to the original paper. In Section 4,
through analyzing the singleprocessor simulation
algorithm and serial experimental results, an insight
of the main part of it is derived. Then two ideas are
derived with respect to the features of the serial
simulation algorithm and the characteristics of its
communication. By applying these two policies, the
overall parallel algorithm can be very efﬁcient. The
parallel experimental results on an SPMD parallel
computer, which show that this algorithm can
achieve near linear speedup, is shown in section 5.
Finally, a conclusion and some discussions are given
in Section 6. 2. BRIEF INTRODUCTION OF CLOSED
QUEUING NETWORKS Consider a close queuing network consisting of M
servers and N customers. It is supposed that each
server is equipped with a buffer that has inﬁnite
capacity and the service discipline is FCFS. The
service requirement of a customer at each server is
assumed to be exponentially distributed with a mean
value of 1. Once a server i completes its service of a
customer, this customer will immediately move from
server i to any one of these M servers, including
server i itself. The destination sever which the
customer will move to is determined by a transition
probability matrix Q ~ This is the simple depiction
of closed queuing networks (Yin et at, I997) that are
going to be discussed in this paper. 3. SERIAL SIMULATION ALGORITHM OF
DERIVATIVE ESTIMATES the theory, the derivatives estimate’s values can be got by analyzing potentials From performance a single sample path of closed queuing network.
Because of the limits of pages, for the proofs and
details of performance potentials theory, please refer
to the original paper (Cao and Chen, 1997). All the
formulas and symbols in this paper are consistent
with Cao and Chen (1997) and Yin et a]. (1997). Let J donate the performance measure. 7r is the
steadystate probability vector of the statetransition
process, while K is the state space number. g is the
performance potential. f donates the performance
ﬁmction vector. p,” is the service rate of the sever i" at state j. .4 is the inﬁnitesimal generator. The 50 superscript 73 represents the simulation estimation
value of it. Then by applying the performance
potentials theory, from a single state—transition sample path, the unbiased partial derivative estimated
value of .1 with respect to Jaw is (Can and Chen, 1997):
if =72 5A ism—Ag (l)
aﬂrh/ [3111,] 0”qu
So the unbiased estimated gradient of J is:
,. M. K
w” ={jJ } (2)
[uh] 1:1, 1:1 The most important work in the simulation is to
record the data of events “process starts ﬁ'om state i,
first transits to state j ”, where i and j are arbitrary
states. In the serial simulation program, three square
K ><K dimensional matrices called parameter
matrices are used to record these kinds of data. The
accordingly matrix entry (m, n) records the overall
number, time summation, cost summation of the
events “process starts from state i, ﬁrst transits to
state j " on simulation, respectively. These parameter
matrices will be used to calculate the estimated value
of performance potential g . The serial simulation program based on performance
potentials is given below: (Use N, donates the length
of sample path in simulation, i.e. the state~transition number) Algorithm 1. Serial simulation program based on
Markov performance potentials Step 1. Use the pseudorandom number to simulate
one step of state transition. Step 2. Update the three KxK dimensional
parameter matrices. Record every state residence
time and overall simulation time. Step 3. Return to step 1 to begin next state transition
until the state transition number reaches N,. Step 4. Use the three parameter matrices to calculate g. Use every state residence time and overall
simulation time to calculate fr. {531
[2111,]
gradient matrix VJ (i=l,2...,M;j=l,2,...,K) Step 5. Calculate every to compose the 4. PARALLEL ANALYSIS AND
IMPLEMENTATION if. t . Serial simulation program anchors It is well known that in DEDS simulation, the
simulation sample path is often required to be very
long. The number of state transition N, often reaches
hundreds of thousands, cvcn millions, which is one
of the reasons why parallel simulation plays an
important role in DEBS research. ln order to determine which part of simulation is the
pivotal part in parallel implement, many numerical
experiments of serial simulation program are made
on PC. The typical results are listed in table 1 and
table 2. MW
M=3,N=S._K=£ N J. State Matrices Q, )3" VJ
Trans Compute 10,000 2.6% 37.3% 1.5% 58.6%
100,000 5.0% 82.0% 0.33% l2.6%
1,000,000 5.7% 92.75 0.04% 1.5% TH I H a] ,ZEC::: M N K State Matrices g,e Va}
Trans C0131th 2 3 4 29.2% 69.5% 0.71% 0.60%
3 5 2] 5.0% 82.0% 0.33% 12.6%
3 3 45 1.2% 72.8% 0.30% 253% Remark: each column of table 1’ and table 2 is
computational time percentage ratio to overall
Simulation time. From table I and table 2. it is easy to know that for
middle and large Scale simulation problems, there are
more than 90% of Overall CPU time concentrating on
two parts of matrix computation: " Computation of parameter matrices in step 2.
' Computation of gadicnt matrix V} in step 5. It can be seen that for largescale simulation,
parametermam): computation will occupy most of
CPU time, even more than 90%. Therefore, the
parallel implementation of step 2 and Sle 5,
especially of the threeparameter matrices in step 2,
will directly determine the overall parallel
simulation program’s efficiency. Therefore, these
two parts are the focuses of parallel implementation. By observing these two parts, It can be found that they have some good parallelizable feature: « ln parameter matrix computation of step 2, there
are only matrixnumeral product and two
matrices multiplying corresponding entries. (in,
icolumn j row element in a matrix multiplies the teolumn j row clementofanotherrnatrix)  in step 5, each entry in matrix VJ is computed
independently. If the parameter matrices are partitioned with the
same partitioning pattern (in. the icolumn j
rowis elements of these matrices are partitioned into
one procesSor} and the gradient matrix V} are
partitioned evenly with arbitrary partitioning pattern, there will be no need to exchange matrices data
between different processor in matrix computation. This is a very important feature in parallel
implementation, because in general parallel program,
data communication cost occupies considerable run
time. In order to reduce the communication cost,
different parallel programs must be made for
different structural parallel machines. Therefore, it
can be concluded that because of the special feature
of the simulation algorithm of this paper, the
parallel simulation program here can achieve high
parallel efﬁciency and can be widely applied to
different structural parallel machines. Nowadays the MultipleInstructions Multiple—Data
(MIMD) parallel machines are gaining increasingly
important role in parallel area and it seems that the
near—term future of parallel computing will be
dominated by MIMI) machines. The major
disadvantages of MIMD machines such as
workstation clusters are high communication
latency and limited communication bandwidth. So
the very low communication cost of the simulation
program here makes it ﬁtted M IMD computing well
and has practical application value. 4.2. Serewy partitioning algorithm Because the state space number K is generally
large in simulation, for practicality of this parallel
simulation program, the processorpartitioning
pattern on matrix entries must be considered (Chen,
1994}. Since there is no matrix data exchanging
between different processors, only the load balance
problem needs to be considered. For the gradient
matrix in step 5, because each entry of it is
computed independently. just dividing these
M X K entries of it equally among processors will
achieve a complete load balance. The essential work
is for parameter matrices in step 2. From the serial
simulation program, it can be seen that there are three different types of matrix computing in step 2:
some mattix’s entries are computed one by one;
some are computed by row and others are computed
by column. General speaking, there are three
different types of partitioning patterns in matrix
parallel computing: block partitioning, row
partitioning and column partitioning (Chen, 1994).
However, for the parallel simulation algorithm here,
neither of them can get a complete load balance. Here a new partitioning method: screwy partitioning,
is introduced. By using it, a complete load balance
of these parameter matrices in step 2 can be
achieved. Principle of this method is illustrated in
ﬁgl. P, P, P, P,
P, o
P, \
P3
Pi x \a Fig. 1. Illustration of screwy partitioning. Use a 4stcp
square matrix as example. P, represents the ith
processor. The little circles represent entries of
matrix. Those circles on one line indicate that
they are partitioned in one processor. Thus, the screwy partitioning formula of matrix
[afvj](lit‘fll=1)2!‘l'tK} is (“Lt van1.2 s” ' a “NorHI a “Lit—1+2: ' " e all.K 3" Pt (3) The above formula requires that the square matrix's
dimension K is equal to the number of processors.
Since the state number K is often very large, in
general, if using K/l processors to do simulation, the
matrix entries that will partitioned into processors
Pint ‘Pkl+11"'1 Pint t (k : Oil; ": Kl! — 1) according to (3) should be partitioned into processor PM. it should be noted that the screwy partitioning
method presented here could be a general partitioning
algorithm, General speaking, if the matrix
computation in parallel program has both row
calculation and column—calculation, maybe by using
the screwy partitioning method, a better load balance
can be achieved than those ordinarypartitioning
methods. 4.3. Common Random Number (CRN) By using the screwy partitioning method, complete
load balance have been achieved on two matrix
computation Parts that necupy more than 90% CPU
time of serial simulation program. Since there will be tJt ix) no matrix‘s entries to exchange between processors
in these parts, it seems that an efﬁcient parallel
algorithm has been derived. However, things are not
so simple. When developing the parallel program, it is common
to let one processor to generate the single sample
path, then this processor broadcasts the path's data to
all other processors, who (may include the former
one or not) will do parallel matrix computation
together. But here in DEDS simulation, since the
simulated sample path is often very long, even
though there are only three number to broadcast in
every state transition (old state, new state, old state
residence time}, the overall broadcast communication
cost of sample path will be very large. This large
communication cost will dramatically deteriorate the
parallel efﬁciency of the parallel program. So an
alternative method must be bound to get rid of it. In “Ordinal Optimiration” (Ho, 1994), the Common
Random Number (CRN) (Heidelberger and lglehart,
1979} theme was presented to generate N’s different
sample paths for different parameters with the same
random sequence number (Dal, 1996). Enlightened
by it, here the CRN series can be used in all
processors and let all of them to do the calculation of
sample path. In this way, since the random number
are identical, every processor will get the same
sample path and need not to translate the path data to
others. Since the sample path’s computation time is
less than 5% of all CPU time for most problems, the
large amount of sample patlt’s communication cost
can be removed at the price of adding only a little
workload. The experimental results in the following
(section 5) show that this kind of modiﬁed CRN
theme is the most important contribution to the high
efﬁciency of overall parallel program. 4. 4. Parallel stmttlation program Now efﬁcient parallel implementation of the main
part of serial simulation program has been derived. In
order to minimize the communication cost, which
will dramatically deteriorate parallel performance,
and simplify the parallel program, other pans of
serial simulation program are not changed and just
used in parallel program. The overall parallel
simulation program is listed below: Algorithm 2. Parollel simulation program based or:
Markov pcrﬁrrmance potentials Step 1. In all processors, use a some random “seed”
to initialize the pseudorandom number generator in
each processor. in this way, the random number
series in all processors will be identical to each other
(CRN). Step 2. In all processors, use the pseudo—random
number to simulate one step of state transition. Since
the CRN theme is used, each processor will get the
same state transition result, i.e. the same sample path. Step 3. All processors parallel compute
corresponding parameter matrices’ entries that are
partitioned in them (use screwy partitioning method). Step 4. Return to step 2 to begin next state transition
until the state transition number reaches M. Step 5. Calculate g , 7?. Step 6. All processors parallel the corresponding entries of matrix V3. (Partition compute entries of matrix VJ equally among all processors
by using any partitioning method.) 5. EXPERIMENTAL RESULTS The computation experiments are performed on an
SPMD parallel computer in Univ. Sci. & Tech. China,
which is a loosely coupled message passing parallel
processing computer. To compare the performance between CRN theme
and the common one without the CRN, a small—scale
problem is investigated by parallel simulation; M=3, N:3, K=10, the sample path length is Nv=100,000.
The results are listed in table 3 below: IahieiﬂoinmmtRaudomﬂumheLLCRN)
Nodes Use CRN Not use CRN
number theme sec theme sec
1 17.0
2 ll .8 46.5
5 7.6 41.6
10 6.2 40.2 Obviously, if Common Random Number is not used
in parallel program, the large amount of sample
path’s broadcasting cost will deteriorate performance
drastically, even making the parallel program much
more time consuming than serial program. It can be
seen from it that the CRN theme is the most
important contribution factor to the parallel program. In order to see how well the screwy partitioning
method works, a mediumscale problem (M:3, N:4,
K=45, N _, =100,000) is tested based on four different partitioning methods: screwy partitioning, block, row
and column partitioning. The results are listed in
table 4 (time unit: second). I l I E . . . l 1
Nodes Screwy Block Row Column
number 1 261.1 5 54.9 56.2 55.6 56.8 15 20.1 20.4 21.5 22.0 It can be seen from table 4 that in practice the screwy
partitioning method does not play an important role
in parallel program. This is reasonable aﬁer
analyzing the parallel program, The screwy
partitioning method is used for parallel processing
the parameter metrics computation in step 2 of serial
program. In step 2, the row and column computation
is addition and subtraction, while the one by one
computation is multiplication. So the row and
column computation workload is much smaller than
that of matrix‘s entries one by one computation, for
which part these different partitioning methods make
no differences. Since in most cases, the parallel simulation program
is only used for largescale problems, it is necessary
to do experiments for large—scale problems to see the
parallel program’s performance. For a largescale
case: M=S, N:8, K:495, together with the above two
cases, the experiment results are summarized in table
5 (Use screwy partitioning method. The sample path
lengths in all cases are N,=100,000). IahleiIheexpcrimcntaJLesulimﬂpamlch
Cases Nodes Time Speedup Efﬁciency
scale number (sec) K:10 i 17.0
2 11.8 1.44 72.0%
5 7.6 2.24 44.8%
10 6.2 2.74 27.4%
K=45 1 261.1
5 54.9 4.76 95.2%
15 20.8 12.55 83.7%
K: 1 32546
495 3 11172 2.91 97.1%
5 6687 4.87 97.3%
9 3755 8.67 96.3% Obviously, the parallel simulation program here
exhibits excellent speedup and scalability.
Furthermore, the larger the case’s scale is, the better
performance it behaves, this makes it suitable and
effective to practical usage. CONCLUSION Based on a new DEDS analysis theory: Markov performance potentials (Cao and Chen, 1997), an
efﬁcient parallel simulation algorithm is provided
when applying the potentials theory on a class of
closed queuing networks. The experimental results
show that the parallel simulation algorithm of this
paper exhibits excellent speedups and scalability. In this paper, two ideas are presented, Screwy
Partitioning method and modiﬁed Common Random
Number theme. Although the screwy partitioning
method do...
View
Full Document
 Spring '08
 Staff

Click to edit the document details