This feels like you're getting a small memory/cache bandwidth
increase for the rkf45_apply level-1-BLAS-like operations by using
multiple cores, but the cores are otherwise not being used
effectively. I say this because a state vector 1e6 doubles long
(about 8 MB) will not generally fit in cache. Adding more cores
increases the amount of cache available.
Hmm... I tentatively take this back on re-thinking how you've added
the #pragma omp lines to the rkf45.c file you attached elsewhere in
this thread. Try using a single
#pragma omp parallel
and then individual lines like
#pragma omp for
at each for loop. Using
#pragma omp parallel for
repeatedly as you've done can introduce excess overhead, depending on
your compiler, because each one opens and closes its own parallel
region instead of reusing a single one.
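
To make the suggestion concrete, here is a minimal sketch of the
pattern. The function two_updates and its arguments are hypothetical
stand-ins for a pair of level-1-BLAS-like loops; they are not taken
from your rkf45.c, so treat this as an illustration only.

```c
#include <stddef.h>

/* Hypothetical example, not code from rkf45.c: two vector updates
 * sharing ONE parallel region rather than two "parallel for" regions. */
void two_updates(long n, const double *y, const double *k1,
                 double h, double *ytmp, double *yerr)
{
    #pragma omp parallel
    {
        /* First worksharing loop: ytmp = y + h*k1 */
        #pragma omp for
        for (long i = 0; i < n; i++)
            ytmp[i] = y[i] + h * k1[i];

        /* The implicit barrier ending the loop above guarantees ytmp
         * is fully written before any thread starts this loop. */
        #pragma omp for
        for (long i = 0; i < n; i++)
            yerr[i] = h * (ytmp[i] - y[i]);
    }
}
```

With gcc this would be built with -fopenmp; without that flag the
pragmas are ignored and the loops simply run serially. Whether the
single region helps measurably will depend on your compiler and on
how many loops there are per step, so it is worth timing both
variants.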
- Rhys