Home arrow Technology arrow Parralel Computing Turorial

Language Translator

Hacking Zone

Hacking Tools
Attacking

Configure Windows

Windows Configuration

Novels

Mix Novels

Human Personality

Body Language
Parralel Computing Turorial PDF Print E-mail
Written by Hemanshu Patel   
Sunday, 23 December 2007
Article Index
Parralel Computing Turorial
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Page 14
Page 15
Page 16
Page 17
Page 18

Designing Parallel Programs

Limits and Costs of Parallel Programming

Amdahl's Law:

* Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:


1
speedup = --------
1 - P


* If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).

If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.

* Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:


1
speedup = ------------
P + S
---
N


where P = parallel fraction, N = number of processors and S = serial fraction.

* It soon becomes obvious that there are limits to the scalability of parallelism. For example, at P = .50, .90 and .99 (50%, 90% and 99% of the code is parallelizable):


speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02


* However, certain problems demonstrate increased performance by increasing the problem size. For example:


2D Grid Calculations 85 seconds 85%
Serial fraction 15 seconds 15%

We can increase the problem size by doubling the grid dimensions and halving the time step. This results in four times the number of grid points and twice the number of time steps. The timings then look like:


2D Grid Calculations 680 seconds 97.84%
Serial fraction 15 seconds 2.16%

* Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time.

Complexity:

* In general, parallel applications are much more complex than corresponding serial applications, perhaps an order of magnitude. Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them.

* The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
o Design
o Coding
o Debugging
o Tuning
o Maintenance

* Adhering to "good" software development practices is essential when when working with parallel applications - especially if somebody besides you will have to work with the software.

Portability:

* Thanks to standardization in several APIs, such as MPI, POSIX threads, HPF and OpenMP, portability issues with parallel programs are not as serious as in years past. However...

* All of the usual portability issues associated with serial programs apply to parallel programs. For example, if you use vendor "enhancements" to Fortran, C or C++, portability will be a problem.

* Even though standards exist for several APIs, implementations will differ in a number of details, sometimes to the point of requiring code modifications in order to effect portability.

* Operating systems can play a key role in code portability issues.

* Hardware architectures are characteristically highly variable and can affect portability.

Resource Requirements:

* The primary intent of parallel programming is to decrease execution wall clock time, however in order to accomplish this, more CPU time is required. For example, a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.

* The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems.

* For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation. The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs.

Scalability:

* The ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more machines is rarely the answer.

* The algorithm may have inherent limits to scalability. At some point, adding more resources causes performance to decrease. Most parallel solutions demonstrate this characteristic at some point.

* Hardware factors play a significant role in scalability. Examples:
o Memory-cpu bus bandwidth on an SMP machine
o Communications network bandwidth
o Amount of memory available on any given machine or set of machines
o Processor clock speed

* Parallel support libraries and subsystems software can limit scalability independent of your application.


Designing Parallel Programs

Performance Analysis and Tuning


* As with debugging, monitoring and analyzing parallel program execution is significantly more of a challenge than for serial programs.

* A number of parallel tools for execution monitoring and program analysis are available.

* Some are quite useful; some are cross-platform also.

* One starting point: Performance Analysis Tools Tutorial

* Work remains to be done, particularly in the area of scalability.



Parallel Examples
Array Processing

* This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent from other array elements.

* The serial program calculates one element at a time in sequential order.

* Serial code could be of the form:


do j = 1,n
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do


* The calculation of elements is independent of one another - leads to an embarrassingly parallel situation.

* The problem should be computationally intensive.

Embarrassingly parallel array calculation

Array Processing
Parallel Solution 1

* Arrays elements are distributed so that each processor owns a portion of an array (subarray).

* Independent calculation of array elements insures there is no need for communication between tasks.

* Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.

* Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language. See the Block - Cyclic Distributions Diagram for the options.

* After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution:


do j = mystart, myend
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do


* Notice that only the outer loop variables are different from the serial solution.

Embarrassingly parallel array calculation data decomposition

One Possible Solution:

* Implement as SPMD model.

* Master process initializes array, sends info to worker processes and receives results.

* Worker process receives info, performs its share of computation and sends results to master.

* Using the Fortran storage scheme, perform block distribution of the array.

* Pseudo code solution: red highlights changes for parallelism.


find out if I am MASTER or WORKER

if I am MASTER

initialize the array
send each WORKER info on part of array it owns
send each WORKER its portion of initial array

receive from each WORKER results

else if I am WORKER
receive from MASTER info on part of array I own
receive from MASTER my portion of initial array

# calculate my portion of array
do j = my first column,my last column
do i = 1,n
a(i,j) = fcn(i,j)
end do
end do

send MASTER results

endif

Array Processing

Parallel Solution 2: Pool of Tasks


* The previous array solution demonstrated static load balancing:
o Each task has a fixed amount of work to do
o May be significant idle time for faster or more lightly loaded processors - slowest tasks determines overall performance.

* Static load balancing is not usually a major concern if all tasks are performing the same amount of work on identical machines.

* If you have a load balance problem (some tasks work faster than others), you may benefit by using a "pool of tasks" scheme.

Pool of Tasks Scheme:

* Two processes are employed

Master Process:
o Holds pool of tasks for worker processes to do
o Sends worker a task when requested
o Collects results from workers

Worker Process: repeatedly does the following
o Gets task from master process
o Performs computation
o Sends results to master

* Worker processes do not know before runtime which portion of array they will handle or how many tasks they will perform.

* Dynamic load balancing occurs at run time: the faster tasks will get more work to do.

* Pseudo code solution: red highlights changes for parallelism.


find out if I am MASTER or WORKER

if I am MASTER

do until no more jobs
send to WORKER next job
receive results from WORKER
end do

tell WORKER no more jobs

else if I am WORKER

do until no more jobs
receive from MASTER next job

calculate array element: a(i,j) = fcn(i,j)

send results to MASTER
end do

endif

Discussion:

* In the above pool of tasks example, each task calculated an individual array element as a job. The computation to communication ratio is finely granular.

* Finely granular solutions incur more communication overhead in order to reduce task idle time.

* A more optimal solution might be to distribute more work with each job. The "right" amount of work is problem dependent.

Parallel Examples

PI Calculation


* The value of PI can be calculated in a number of ways. Consider the following method of approximating PI
1. Inscribe a circle in a square
2. Randomly generate points in the square
3. Determine the number of points in the square that are also in the circle
4. Let r be the number of points in the circle divided by the number of points in the square
5. PI ~ 4 r
6. Note that the more points generated, the better the approximation

* Serial pseudo code for this procedure:


npoints = 10000
circle_count = 0

do j = 1,npoints
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do

PI = 4.0*circle_count/npoints


* Note that most of the time in running this program would be spent executing the loop

* Leads to an embarrassingly parallel solution
o Computationally intensive
o Minimal communication
o Minimal I/O

One method of determining PI

PI Calculation

Parallel Solution


* Parallel strategy: break the loop into portions that can be executed by the tasks.

* For the task of approximating PI:
o Each task executes its portion of the loop a number of times.
o Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
o Uses the SPMD model. One task acts as master and collects the results.

* Pseudo code solution: red highlights changes for parallelism.


npoints = 10000
circle_count = 0

p = number of tasks
num = npoints/p

find out if I am MASTER or WORKER

do j = 1,num
generate 2 random numbers between 0 and 1
xcoordinate = random1 ; ycoordinate = random2
if (xcoordinate, ycoordinate) inside circle
then circle_count = circle_count + 1
end do

if I am MASTER

receive from WORKERS their circle_counts
compute PI (use MASTER and WORKER calculations)

else if I am WORKER

send to MASTER circle_count

endif

One method of determining PI

Parallel Examples

Simple Heat Equation


* Most problems in parallel computing require communication among the tasks. A number of common problems require communication with "neighbor" tasks.

* The heat equation describes the temperature change over time, given initial temperature distribution and boundary conditions.

* A finite differencing scheme is employed to solve the heat equation numerically on a square region.

* The initial temperature is zero on the boundaries and high in the middle.

* The boundary temperature is held at zero.

* For the fully explicit problem, a time stepping algorithm is used. The elements of a 2-dimensional array represent the temperature at points on the square.

* The calculation of an element is dependent upon neighbor element values. Heat equation

* A serial program would contain code like:


do iy = 2, ny - 1
do ix = 2, nx - 1
u2(ix, iy) =
u1(ix, iy) +
cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
end do
end do


Initial heat conditions Heat equation

Simple Heat Equation

Parallel Solution 1


Heat equation - partitioned data

* Implement as an SPMD model

* The entire array is partitioned and distributed as subarrays to all tasks. Each task owns a portion of the total array.

* Determine data dependencies
o interior elements belonging to a task are independent of other tasks
o border elements are dependent upon a neighbor task's data, necessitating communication.

* Master process sends initial info to workers, checks for convergence and collects results

* Worker process calculates solution, communicating as necessary with neighbor processes

* Pseudo code solution: red highlights changes for parallelism.


find out if I am MASTER or WORKER

if I am MASTER
initialize array
send each WORKER starting info and subarray

do until all WORKERS converge
gather from all WORKERS convergence data
broadcast to all WORKERS convergence signal
end do

receive results from each WORKER

else if I am WORKER
receive from MASTER starting info and subarray

do until solution converged
update time
send neighbors my border info
receive from neighbors their border info

update my portion of solution array

determine if my solution has converged
send MASTER convergence data
receive from MASTER convergence signal
end do

send MASTER results

endif

Simple Heat Equation
Parallel Solution 2: Overlapping Communication and Computation

* In the previous solution, it was assumed that blocking communications were used by the worker tasks. Blocking communications wait for the communication process to complete before continuing to the next program instruction.

* In the previous solution, neighbor tasks communicated border data, then each process updated its portion of the array.

* Computing times can often be reduced by using non-blocking communication. Non-blocking communications allow work to be performed while communication is in progress.

* Each task could update the interior of its part of the solution array while the communication of border data is occurring, and update its border after communication has completed.

* Pseudo code for the second solution: red highlights changes for non-blocking communications.


find out if I am MASTER or WORKER

if I am MASTER
initialize array
send each WORKER starting info and subarray

do until all WORKERS converge
gather from all WORKERS convergence data
broadcast to all WORKERS convergence signal
end do

receive results from each WORKER

else if I am WORKER
receive from MASTER starting info and subarray

do until solution converged
update time

non-blocking send neighbors my border info
non-blocking receive neighbors border info

update interior of my portion of solution array
wait for non-blocking communication complete
update border of my portion of solution array

determine if my solution has converged
send MASTER convergence data
receive from MASTER convergence signal
end do

send MASTER results

endif



Last Updated ( Sunday, 23 December 2007 )
 
< Prev
Your Ad Here

Donate us!!

Enter Amount:

RSS socialnet

Add to MyYahoo!
Subscribe in NewsGator Online
Add to Newsburst
Add to Google
Add to My AOL
Add to Pluck
Subscribe in FeedLounge
Add to Windows Live
Add to NetVibes
Subscribe in Rojo
Subscribe in Bloglines
Add to MyMSN
Add to Plusmo for your cellphone
Add to PageFlakes
Add to Technorati
Add to BlinkBits