[Info-vax] OpenMP Performance Problem

Sun Aug 30 08:52:11 EDT 2009

Dear all,
I'm using OpenMP and am running into serious performance problems.

I have a simple Fortran program with 4 parallel sections.
The relevant source code is below.

I'm using Microsoft Visual Studio 2005 Professional Edition.
The operating system is Microsoft Windows Server 2003, Standard x64
Edition, Service Pack 2.
It's running on an AMD Opteron machine with 4 CPUs.

In Microsoft Visual Studio 2005 I have set the following under
"Configuration->Fortran->Linker":
SubSystem: Console
Heap Reserve Size: 256000000
Heap Commit Size: 128000000
Stack Reserve Size: 256000000
Stack Commit Size: 128000000
Enable Large Addresses: Support Addresses Larger Than 2 GB
Terminal Server: Default

If I parallelize the 4 sections as shown below, the calculation
for all 4 sections takes about 80 seconds.
If I run it serially (without OpenMP) it is much quicker: in that case
it needs only about 50 seconds.
I don't understand this difference. I had hoped the parallel
OpenMP version would be much quicker, but it isn't :-(.
Do you know why?
Have I done something wrong?
Or should I use another OpenMP directive instead of "SECTIONS"?
Or do you have experience with other technologies besides OpenMP
for parallelizing Fortran programs?

Thanks a lot for your help!
Many greetings,
Alex

Source code start: _____________________
...
...

     INTEGER N, I, J
     PARAMETER (N=1000000)
     REAL A1(N), A2(N), A3(N)
     REAL B1(N), B2(N), B3(N)
     REAL C1(N), C2(N), C3(N)
     REAL D1(N), D2(N), D3(N)
     REAL parallel_time_begin, parallel_time_end
     REAL section1_time_begin, section1_time_end
     REAL section2_time_begin, section2_time_end
     REAL section3_time_begin, section3_time_end
     REAL section4_time_begin, section4_time_end

...
...

!     Some initializations

     DO I = 1, N
       A1(I) = I + 1.5
       A2(I) = I + 22.35
       B1(I) = I + 1.5
       B2(I) = I + 22.35
       C1(I) = I + 1.5
       C2(I) = I + 22.35
       D1(I) = I + 1.5
       D2(I) = I + 22.35
     ENDDO

...
...

     PRINT *, '***** parallel dcal start *****'
     CALL CPU_TIME ( parallel_time_begin )

C$OMP PARALLEL PRIVATE(A1, A2, A3,
C$OMP+B1, B2, B3,
C$OMP+C1, C2, C3,
C$OMP+D1, D2, D3,
C$OMP+I, J)
C$OMP SECTIONS

C$OMP SECTION
     PRINT *, '***** 1. Section Start'
     CALL CPU_TIME ( section1_time_begin )
     DO J = 1, 1000
       DO I = 1, N
         A3(I) = A1(I) + A2(I)
       ENDDO
     ENDDO
     CALL CPU_TIME ( section1_time_end )
     PRINT *, '====> time of section1 was              ',
    1section1_time_end - section1_time_begin, ' seconds <===='
     PRINT *, '***** 1. Section End'

C$OMP SECTION
     PRINT *, '***** 2. Section Start'
     CALL CPU_TIME ( section2_time_begin )
     DO J = 1, 2000
       DO I = 1, N
         B3(I) = B1(I) + B2(I)
       ENDDO
     ENDDO
     CALL CPU_TIME ( section2_time_end )
     PRINT *, '====> time of section2 was              ',
    1section2_time_end - section2_time_begin, ' seconds <===='
     PRINT *, '***** 2. Section End'

C$OMP SECTION
     PRINT *, '***** 3. Section Start'
     CALL CPU_TIME ( section3_time_begin )
     DO J = 1, 3000
       DO I = 1, N
         C3(I) = C1(I) + C2(I)
       ENDDO
     ENDDO
     CALL CPU_TIME ( section3_time_end )
     PRINT *, '====> time of section3 was              ',
    1section3_time_end - section3_time_begin, ' seconds <===='
     PRINT *, '***** 3. Section End'

C$OMP SECTION
     PRINT *, '***** 4. Section Start'
     CALL CPU_TIME ( section4_time_begin )
     DO J = 1, 4000
       DO I = 1, N
         D3(I) = D1(I) + D2(I)
       ENDDO
     ENDDO
     CALL CPU_TIME ( section4_time_end )
     PRINT *, '====> time of section4 was              ',
    1section4_time_end - section4_time_begin, ' seconds <===='
     PRINT *, '***** 4. Section End'

C$OMP END SECTIONS NOWAIT
C$OMP END PARALLEL

...
...

Source code end: _____________________
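
One variation that may be worth testing, offered only as a sketch and
not as a confirmed fix. Two things in the code above stand out. First,
the PRIVATE clause gives every thread its own uninitialized copy of all
twelve arrays (about 4 MB each, assuming 4-byte REALs), so the values
set in the initialization loop are not what the sections read. Second,
CPU_TIME is processor-dependent, and with several compilers it reports
CPU time summed over all threads rather than elapsed time. The sketch
below keeps the arrays shared, makes only I and J private, and times the
region with OMP_GET_WTIME. The program name SECDEMO, the variables T0
and T1, the ALLOCATABLE declarations, and the shortened loop counts
(100 and 200 passes, two sections instead of four) are assumptions made
for this illustration and are not taken from the original program.

```fortran
      PROGRAM SECDEMO
!     Sketch only: two of the four sections, with shared arrays
!     and wall-clock timing. SECDEMO, T0 and T1 are hypothetical
!     names; the loop counts are shortened for a quick test.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER N, I, J
      PARAMETER (N=1000000)
      REAL, ALLOCATABLE :: A1(:), A2(:), A3(:)
      REAL, ALLOCATABLE :: B1(:), B2(:), B3(:)
      DOUBLE PRECISION T0, T1

!     Heap allocation avoids relying on a very large stack.
      ALLOCATE (A1(N), A2(N), A3(N), B1(N), B2(N), B3(N))
      DO I = 1, N
        A1(I) = I + 1.5
        A2(I) = I + 22.35
        B1(I) = I + 1.5
        B2(I) = I + 22.35
      ENDDO

!     Wall-clock time; CPU_TIME may add up the time of all threads.
      T0 = OMP_GET_WTIME()

!     The arrays stay shared (the default). Only the loop counters
!     are private, so no per-thread array copies are created.
!$OMP PARALLEL PRIVATE(I, J)
!$OMP SECTIONS
!$OMP SECTION
      DO J = 1, 100
        DO I = 1, N
          A3(I) = A1(I) + A2(I)
        ENDDO
      ENDDO
!$OMP SECTION
      DO J = 1, 200
        DO I = 1, N
          B3(I) = B1(I) + B2(I)
        ENDDO
      ENDDO
!$OMP END SECTIONS
!$OMP END PARALLEL

      T1 = OMP_GET_WTIME()
      PRINT *, 'wall time of parallel region: ', T1 - T0, ' s'
      PRINT *, 'A3(1) = ', A3(1), '  B3(N) = ', B3(N)
      END PROGRAM SECDEMO
```

The sketch needs OpenMP enabled at compile time, otherwise the module
OMP_LIB is not available. Independent of timing, the four sections in
the original do 1000, 2000, 3000 and 4000 passes, so the parallel region
cannot finish before the fourth section does. That is 4000 of 10000
passes in total, which means that even with perfect scaling (and
ignoring memory bandwidth limits) the best case is roughly 2.5 times
faster than the serial run, not 4 times.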