Intel® Math Kernel Library Cookbook



Intel® MKL

Document Number: 330244-005US

Legal Information

Contents


Legal Information
Getting Help and Support
Notational Conventions
Related Information
Intel® Math Kernel Library Recipes

Chapter 1: Finding an approximate solution to a stationary nonlinear heat equation

Chapter 2: Factoring general block tridiagonal matrices

Chapter 3: Solving a system of linear equations with an LU-factored block tridiagonal coefficient matrix

Chapter 4: Factoring block tridiagonal symmetric positive definite matrices

Chapter 5: Solving a system of linear equations with a block tridiagonal symmetric positive definite coefficient matrix

Chapter 6: Computing principal angles between two subspaces

Chapter 7: Computing principal angles between invariant subspaces of block triangular matrices

Chapter 8: Evaluating a Fourier integral

Chapter 9: Using Fast Fourier Transforms for computer tomography image reconstruction

Chapter 10: Noise filtering in financial market data streams

Chapter 11: Using the Monte Carlo method for simulating European options pricing

Chapter 12: Using the Black-Scholes formula for European options pricing

Chapter 13: Multiple simple random sampling without replacement

Bibliography

Legal Information

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors which may cause deviations from published specifications.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Cilk, Intel, the Intel logo, Intel Atom, Intel Core, Intel Inside, Intel NetBurst, Intel SpeedStep, Intel vPro, Intel Xeon Phi, Intel XScale, Itanium, MMX, Pentium, Thunderbolt, Ultrabook, VTune and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Java is a registered trademark of Oracle and/or its affiliates.

Third Party Content

Intel® Math Kernel Library (Intel® MKL) includes content from several 3rd party sources that was originally governed by the licenses referenced below:

Portions © Copyright 2001 Hewlett-Packard Development Company, L.P.

Sections on the Linear Algebra PACKage (LAPACK) routines include derivative work portions that have been copyrighted:

© 1991, 1992, and 1998 by The Numerical Algorithms Group, Ltd.

Intel MKL supports the LAPACK 3.5 set of computational, driver, auxiliary and utility routines under the following license:

Copyright © 1992-2011 The University of Tennessee and The University of Tennessee Research Foundation. All rights reserved.

Copyright © 2000-2011 The University of California Berkeley. All rights reserved.

Copyright © 2006-2012 The University of Colorado Denver. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.

Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.


The copyright holders provide no reassurances that the source code provided does not infringe any patent, copyright, or any other intellectual property rights of third parties. The copyright holders disclaim any liability to any recipient for claims brought against the recipient by any third party for infringement of that party's intellectual property rights.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.

The original versions of the Basic Linear Algebra Subprograms (BLAS) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/blas/index.html.

XBLAS is distributed under the following copyright:

Copyright © 2008-2009 The University of California Berkeley. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.

- Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The original versions of the Basic Linear Algebra Communication Subprograms (BLACS) from which the respective part of Intel MKL was derived can be obtained from http://www.netlib.org/blacs/index.html.

The authors of BLACS are Jack Dongarra and R. Clint Whaley.

The original versions of Scalable LAPACK (ScaLAPACK) from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.

The original versions of the Parallel Basic Linear Algebra Subprograms (PBLAS) routines from which the respective part of Intel® MKL was derived can be obtained from http://www.netlib.org/scalapack/html/pblas_qref.html.

PARDISO (PARallel DIrect SOlver)* in Intel® MKL was originally developed by the Department of Computer Science at the University of Basel (http://www.unibas.ch). It can be obtained at http://www.pardiso-project.org.

The Extended Eigensolver functionality is based on the Feast solver package and is distributed under the following license:


Copyright © 2009, The Regents of the University of Massachusetts, Amherst.
Developed by E. Polizzi
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Some Fast Fourier Transform (FFT) functions in this release of Intel® MKL have been generated by the SPIRAL software generation system (http://www.spiral.net/) under license from Carnegie Mellon University. The authors of SPIRAL are Markus Puschel, Jose Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo.

Open MPI is distributed under the New BSD license, listed below.

Most files in this release are marked with the copyrights of the organizations who have edited them. The copyrights below are in no particular order and generally reflect members of the Open MPI core team who have contributed code to this release. The copyrights for code used under license from other parties are included in the corresponding files.

Copyright © 2004-2010 The Trustees of Indiana University and Indiana University Research and Technology Corporation. All rights reserved.
Copyright © 2004-2010 The University of Tennessee and The University of Tennessee Research Foundation. All rights reserved.
Copyright © 2004-2010 High Performance Computing Center Stuttgart, University of Stuttgart. All rights reserved.
Copyright © 2004-2008 The Regents of the University of California. All rights reserved.
Copyright © 2006-2010 Los Alamos National Security, LLC. All rights reserved.
Copyright © 2006-2010 Cisco Systems, Inc. All rights reserved.
Copyright © 2006-2010 Voltaire, Inc. All rights reserved.
Copyright © 2006-2011 Sandia National Laboratories. All rights reserved.
Copyright © 2006-2010 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
Copyright © 2006-2010 The University of Houston. All rights reserved.
Copyright © 2006-2009 Myricom, Inc. All rights reserved.
Copyright © 2007-2008 UT-Battelle, LLC. All rights reserved.
Copyright © 2007-2010 IBM Corporation. All rights reserved.
Copyright © 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing Centre, Federal Republic of Germany
Copyright © 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
Copyright © 2007 Evergrid, Inc. All rights reserved.
Copyright © 2008 Chelsio, Inc. All rights reserved.
Copyright © 2008-2009 Institut National de Recherche en Informatique. All rights reserved.
Copyright © 2007 Lawrence Livermore National Security, LLC. All rights reserved.
Copyright © 2007-2009 Mellanox Technologies. All rights reserved.
Copyright © 2006-2010 QLogic Corporation. All rights reserved.
Copyright © 2008-2010 Oak Ridge National Labs. All rights reserved.
Copyright © 2006-2010 Oracle and/or its affiliates. All rights reserved.
Copyright © 2009 Bull SAS. All rights reserved.
Copyright © 2010 ARM ltd. All rights reserved.
Copyright © 2010-2011 Alex Brick. All rights reserved.
Copyright © 2012 The University of Wisconsin-La Crosse. All rights reserved.
Copyright © 2013-2014 Intel, Inc. All rights reserved.
Copyright © 2011-2014 NVIDIA Corporation. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.

- Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

The copyright holders provide no reassurances that the source code provided does not infringe any patent, copyright, or any other intellectual property rights of third parties. The copyright holders disclaim any liability to any recipient for claims brought against the recipient by any third party for infringement of that party's intellectual property rights.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The Safe C Library is distributed under the following copyright:

Copyright (c)


Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

HPL Copyright Notice and Licensing Terms

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories.
4. The name of the University, the name of the Laboratory, or the names of its contributors may not be used to endorse or promote products derived from this software without specific written permission.

Copyright © 2015, Intel Corporation. All rights reserved.

Getting Help and Support

Intel MKL provides a product web site that offers timely and comprehensive product information, including product features, white papers, and technical articles. For the latest information, check: http://www.intel.com/software/products/support.

Intel also provides a support web site that contains a rich repository of self-help information, including getting started tips, known product issues, product errata, license information, user forums, and more (visit http://www.intel.com/software/products/).

Registering your product entitles you to one year of technical support and product updates through Intel® Premier Support. Intel Premier Support is an interactive issue management and communication web site providing these services:

• Submit issues and review their status.
• Download product updates any time of the day.

To register your product, contact Intel, or seek product support, please visit http://www.intel.com/software/products/support.


Notational Conventions

This manual uses the following terms to refer to operating systems:

Windows* OS: This term refers to information that is valid on all supported Windows* operating systems.

Linux* OS: This term refers to information that is valid on all supported Linux* operating systems.

OS X*: This term refers to information that is valid on Intel®-based systems running the OS X* operating system.

This manual uses the following notational conventions:

• Routine name shorthand (for example, ?ungqr instead of cungqr/zungqr).
• Font conventions used for distinction between the text and the code.

Routine Name Shorthand

For shorthand, names that contain a question mark "?" represent groups of routines with similar functionality. Each group typically consists of routines used with four basic data types: single-precision real, double-precision real, single-precision complex, and double-precision complex. The question mark is used to indicate any or all possible varieties of a function; for example:

?swap: Refers to all four data types of the vector-vector ?swap routine: sswap, dswap, cswap, and zswap.
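For example, a minimal sketch of the shorthand in practice (this fragment is illustrative only and is not part of the cookbook samples; it assumes the standard BLAS interface):

      PROGRAM SWAP_DEMO
! dswap is the double-precision real instance of the ?swap group
      REAL*8 X(3), Y(3)
      DATA X /1D0, 2D0, 3D0/
      DATA Y /4D0, 5D0, 6D0/
! Exchange X and Y (N = 3, unit strides)
      CALL DSWAP(3, X, 1, Y, 1)
! X now holds (4, 5, 6) and Y holds (1, 2, 3)
      END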

Font Conventions

The following font conventions are used:

UPPERCASE COURIER: Data types used in the description of input and output parameters for the Fortran interface. For example, CHARACTER*1.

lowercase courier: Code examples, such as a(k+i,j) = matrix(i,j), and data types for the C interface, for example, const float*.

lowercase courier mixed with UpperCase courier: Function names for the C interface; for example, vmlSetMode.

lowercase courier italic: Variables in argument and parameter descriptions. For example, incx.

*: Used as a multiplication symbol in code examples and equations and where required by the programming language syntax.


Related Information

To reference how to use the library in your application, use this guide in conjunction with the following documents:

• The Intel® Math Kernel Library Reference Manual, which provides reference information on routine functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library User's Guide.

Web versions of these documents are available in the Intel® Software Documentation Library at http://software.intel.com/en-us/intel-software-technical-documentation.


Intel® Math Kernel Library Recipes

The Intel® Math Kernel Library (Intel® MKL) contains many routines to help you solve various numerical problems, such as multiplying matrices, solving a system of equations, and performing a Fourier transform. While many problems do not have dedicated Intel MKL routines, you can solve them by assembling the building blocks provided by Intel MKL.

The Intel Math Kernel Library Cookbook includes these recipes to help you to assemble Intel MKL routines for solving some more complex problems:

Matrix recipes using Intel MKL PARDISO, BLAS, Sparse BLAS, and LAPACK routines

• Finding an approximate solution to a nonlinear equation demonstrates a method of finding a solution to a nonlinear equation using Intel MKL PARDISO, BLAS, and Sparse BLAS routines.
• Factoring a block tridiagonal matrix uses Intel MKL implementations of BLAS and LAPACK routines.
• Solving a system of linear equations with an LU-factored block tridiagonal coefficient matrix extends the factoring recipe to solving a system of equations.
• Factoring block tridiagonal symmetric positive definite matrices using BLAS and LAPACK routines demonstrates Cholesky factorization of a symmetric positive definite block tridiagonal matrix using BLAS and LAPACK routines.
• Solving a system of linear equations with a block tridiagonal symmetric positive definite coefficient matrix extends the factoring recipe to solving a system of equations using BLAS and LAPACK routines.
• Computing principal angles between two subspaces uses LAPACK SVD to calculate the principal angles.
• Computing principal angles between invariant subspaces of block triangular matrices extends the use of LAPACK SVD to the case where the subspaces are invariant subspaces of a block triangular matrix and are complementary to each other.

Fast Fourier Transform recipes

• Evaluating a Fourier integral uses the Intel MKL Fast Fourier Transform (FFT) interface to evaluate a continuous Fourier transform integral.
• Using Fast Fourier Transforms for computer tomography image reconstruction uses the Intel MKL FFT interface to reconstruct an image from computer tomography data.

Numerics recipes

• Noise filtering in financial market data streams uses Intel MKL summary statistics routines for computing a correlation matrix for streaming data.
• Using the Monte Carlo method for simulating European options pricing computes call and put European option prices with an Intel MKL basic random number generator (BRNG).
• Using the Black-Scholes formula for European options pricing speeds up Black-Scholes computation of European options pricing with Intel MKL vector math functions.
• Multiple simple random sampling without replacement generates K simple random length-M samples without replacement from a population of size N for a large K.

NOTE Code examples in the cookbook are provided in Fortran for some recipes and in C for other recipes.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Chapter 1: Finding an approximate solution to a stationary nonlinear heat equation

Goal

Obtain a solution to a boundary value problem for the thermal equation, with thermal coefficients that depend on the solution.

Solution

Use a fixed-point iteration approach [Amos10], utilizing Intel MKL PARDISO for solving linear problems on each external iteration.

1. Set up the matrix structure in CSR format.
2. Perform fixed-point iteration until the residual norm becomes lower than the tolerance:
   a. Use the pardiso routine to solve the linearized system for the current iteration.
   b. Set the solution of the system to the next approximation of the main equation using the dcopy routine.
   c. Based on the new approximation, calculate the new elements of the matrix.
   d. Calculate the residual of the current solution using the mkl_cspblas_dcsrgemv routine.
   e. Calculate the norm of the residual using the dnrm2 routine and compare it with the tolerance.
3. Free the internal memory of the solver.

Source code: see the sparse folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Finding an approximate solution using Intel MKL PARDISO, Sparse BLAS, and BLAS

CONSTRUCT_MATRIX_STRUCTURE (nx, ny, nz, &ia, &ja, &a, &error);
CONSTRUCT_MATRIX_VALUES (nx, ny, nz, us, ia, ja, a, &error);
DO WHILE res > tolerance
    phase = 13;
    PARDISO (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
             &nrhs, iparm, &msglvl, f, u, &error);
    DCOPY (&n, u, &one, u_next, &one);
    CONSTRUCT_MATRIX_VALUES (nx, ny, nz, u_next, ia, ja, a, &error);
    MKL_CSPBLAS_DCSRGEMV (uplo, &n, a, ia, ja, u, temp);
    DAXPY (&n, &minus_one, f, &one, temp, &one);
    res = DNRM2 (&n, temp, &one);
END DO
phase = -1;
PARDISO (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
         &nrhs, iparm, &msglvl, f, u, &error);


Routines Used

Task: Solve the linearized system on the current iteration; free internal memory of the solver.
Routine: PARDISO. Calculates the solution of a set of sparse linear equations with multiple right-hand sides.

Task: Set the solution found as the next approximation of the main equation.
Routine: DCOPY. Copies a vector to another vector.

Task: Calculate the residual of the current nonlinear iteration.
Routines: MKL_CSPBLAS_DCSRGEMV (computes the matrix-vector product of a sparse general matrix stored in CSR format, 3-array variation, with zero-based indexing) and DAXPY (computes a vector-scalar product and adds the result to a vector).

Task: Calculate the norm of the residual to compare it with the stopping criteria.
Routine: DNRM2. Computes the Euclidean norm of a vector.

Discussion

The stationary nonlinear heat equation can be described as a boundary value problem for a nonlinear partial differential equation:

$$-\frac{\partial}{\partial x}\left(\mu(v)\frac{\partial v}{\partial x}\right) - \frac{\partial}{\partial y}\left(\mu(v)\frac{\partial v}{\partial y}\right) - \frac{\partial}{\partial z}\left(\mu(v)\frac{\partial v}{\partial z}\right) = 1,\quad (x, y, z)\in D,\qquad v|_{\partial D} = 0,$$

where the domain D is assumed to be a cube, D = (0, 1)^3, and v(x, y, z) is an unknown function of temperature.

For the purpose of demonstration, the problem is restricted to linear dependence of the thermal coefficient on the solution:

$$\mu(v) = 1 + 10v$$

To obtain a numerical solution, an equidistant grid with grid step h in the domain D is chosen, and the partial differential equation is approximated using finite differences. This procedure [Smith86] yields a system of nonlinear algebraic equations:

$$\begin{aligned}
&u_{i,j,k}\,\frac{m_{i,j,k-1}+2m_{i,j,k}+m_{i,j,k+1}}{2}
+ u_{i,j,k}\,\frac{m_{i,j-1,k}+2m_{i,j,k}+m_{i,j+1,k}}{2}
+ u_{i,j,k}\,\frac{m_{i-1,j,k}+2m_{i,j,k}+m_{i+1,j,k}}{2}\\
&\quad - u_{i,j,k-1}\,\frac{m_{i,j,k-1}+m_{i,j,k}}{2}
- u_{i,j-1,k}\,\frac{m_{i,j-1,k}+m_{i,j,k}}{2}
- u_{i-1,j,k}\,\frac{m_{i-1,j,k}+m_{i,j,k}}{2}\\
&\quad - u_{i,j,k+1}\,\frac{m_{i,j,k}+m_{i,j,k+1}}{2}
- u_{i,j+1,k}\,\frac{m_{i,j,k}+m_{i,j+1,k}}{2}
- u_{i+1,j,k}\,\frac{m_{i,j,k}+m_{i+1,j,k}}{2} = h^2
\end{aligned}$$

$$u_{0,j,k} = u_{n,j,k} = u_{i,0,k} = u_{i,n,k} = u_{i,j,0} = u_{i,j,n} = 0,\qquad m_{i,j,k} = 1 + 10\,u_{i,j,k},$$

where the equations are taken for 1 ≤ i, j, k ≤ n - 1 with n = 1/h.

Each equation ties together the value of the unknown grid function u and the value of the respective right hand side at seven grid points. The left hand sides of the equations can be represented as linear combinations of the grid function values with coefficients which depend on the solution itself. Introducing a matrix composed of these coefficients, the equations can be rewritten in vector-matrix form:


$$A(\tilde{u})\tilde{u} = g$$

Since the coefficient matrix A is sparse (it has only seven nonzero elements in each row), it is suitable to store it in a CSR-format array (see Sparse Matrix Storage Formats in the Intel Math Kernel Library Reference Manual; a small illustration of this storage follows the algorithm below) and to use the Intel MKL PARDISO* solver with this iterative algorithm:

1. Set u to the initial value u0.
2. Calculate the residual r = A(u)u - g.
3. Do while ||r|| > tolerance:
   a. Solve the system A(u)w = g for w.
   b. Set u = w.
   c. Calculate the residual r = A(u)u - g.
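As a small illustration of the CSR storage used here (the 4-by-4 matrix and its data are hypothetical, chosen only to show the zero-based, 3-array layout expected by mkl_cspblas_dcsrgemv):

! Matrix:
!     ( 1 0 2 0 )
!     ( 0 3 0 4 )
!     ( 5 0 6 0 )
!     ( 0 7 0 8 )
      INTEGER IA(5), JA(8)
      REAL*8  A(8)
! IA(i) is the zero-based position of the first nonzero of row i;
! the last entry, IA(5), equals the total number of nonzeros
      DATA IA /0, 2, 4, 6, 8/
! zero-based column index of each nonzero, listed row by row
      DATA JA /0, 2, 1, 3, 0, 2, 1, 3/
! the nonzero values in the same order
      DATA A  /1D0, 2D0, 3D0, 4D0, 5D0, 6D0, 7D0, 8D0/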


Chapter 2: Factoring general block tridiagonal matrices

Goal

Perform LU factorization of a general block tridiagonal matrix.

Solution

Intel MKL LAPACK provides a wide range of subroutines for LU factorization of general matrices, including dense matrices, band matrices, and tridiagonal matrices. This recipe extends the range of functionality to general block tridiagonal matrices, subject to the condition that all the blocks are square and have the same order.

To perform LU factorization of a block tridiagonal matrix with square blocks of size NB by NB:

1. Sequentially apply partial LU factorization to rectangular blocks of size M by N formed by the first two block rows and first three block columns of the matrix (where M = 2NB, N = 3NB, and K = NB), moving down along the diagonal until the last-but-one block row is processed.

   Partial LU factorization: for LU factorization of a general block tridiagonal matrix it is useful to have separate functionality for partial LU factorization of a rectangular M-by-N matrix. The partial LU factorization algorithm with parameter K, where K ≤ min(M, N), consists of:
   a. Perform LU factorization of the M-by-K submatrix.
   b. Solve the system with a triangular coefficient matrix.
   c. Update the lower right (M - K)-by-(N - K) block.

   The resulting matrix is A = P(LU + A1), where L is a lower trapezoidal M-by-K matrix, U is an upper trapezoidal matrix, P is a permutation (pivoting) matrix, and A1 is a matrix with nonzero elements only in the intersection of the last M - K rows and N - K columns.

2. Apply general LU factorization to the last 2NB-by-2NB block.

Source code: see the BlockTDS_GE/source/dgeblttrf.f file in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Performing partial LU factorization

SUBROUTINE PTLDGETRF(M, N, K, A, LDA, IPIV, INFO)
C     LU factorization of the leading M-by-K panel
      CALL DGETRF( M, K, A, LDA, IPIV, INFO )
C     Apply the row interchanges to the trailing N-K columns
      DO I=1,K
         IF(IPIV(I).NE.I)THEN
            CALL DSWAP(N-K, A(I,K+1), LDA, A(IPIV(I), K+1), LDA)
         END IF
      END DO
C     Triangular solve: compute the K-by-(N-K) block of U
      CALL DTRSM('L','L','N','U',K,N-K,1D0, A, LDA, A(1,K+1), LDA)
C     Update the lower right (M-K)-by-(N-K) block
      CALL DGEMM('N', 'N', M-K, N-K, K, -1D0, A(K+1,1), LDA,
     &           A(1,K+1), LDA, 1D0, A(K+1,K+1), LDA)

Factoring a block tridiagonal matrix

      DO K=1,N-2
C        Form a 2*NB by 3*NB submatrix A with block structure
C        (D_K  C_K    0    )
C        (B_K  D_K+1  C_K+1)
C        Partial factorization of the submatrix
         CALL PTLDGETRF(2*NB, 3*NB, NB, A, 2*NB, IPIV(1,K), INFO)
C        Factorization results are copied back to the arrays storing blocks of the tridiagonal matrix
      END DO
C     Out-of-loop factorization of the last 2*NB by 2*NB submatrix
      CALL DGETRF(2*NB, 2*NB, A, 2*NB, IPIV(1,N-1), INFO)
C     Copy the last result back to the arrays storing blocks of the tridiagonal matrix

Routines Used

Task: LU factorization of the M-by-K submatrix.
Routine: DGETRF. Computes the LU factorization of a general m-by-n matrix.

Task: Permute rows of the matrix.
Routine: DSWAP. Swaps two vectors.

Task: Solve a system with a triangular coefficient matrix.
Routine: DTRSM. Solves a triangular matrix equation.

Task: Update the lower right (M - K)-by-(N - K) block.
Routine: DGEMM. Computes a matrix-matrix product with general matrices.

Discussion

For partial LU factorization, let A be a rectangular m-by-n matrix.

NOTE For ease of reading, lower-case indexes such as m, n, k, and nb are used in this discussion. These correspond to the upper-case indexes used in the Fortran solution and code samples.

The matrix can be decomposed using LU factorization of the m-by-k submatrix, where 0 < k ≤ n. For this application k < min(m, n), because ?getrf can be used directly to factor the matrix if m = k ≤ n or n = k ≤ m.

A can be represented as a block matrix:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

where A11 is a k-by-k submatrix, A12 is a k-by-(n - k) submatrix, A21 is an (m - k)-by-k submatrix, and A22 is an (m - k)-by-(n - k) submatrix.


The m-by-k panel A1 can be defined as

$$A_1 = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}$$

A1 can be LU factored (using ?getrf) as A1 = PLU, where P is a permutation (pivoting) matrix, L is lower trapezoidal with unit elements on the diagonal, and U is upper triangular:

$$A_1 = P \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} U_{11}$$

NOTE Since the diagonal elements of L do not need to be stored, the array used to store A1 can be used to store the elements of L and U.

Applying P^T to the second panel of A gives:

$$P^T \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} = \begin{pmatrix} A'_{12} \\ A'_{22} \end{pmatrix}$$

This yields the equation:

$$P^T A = \begin{pmatrix} L_{11}U_{11} & A'_{12} \\ L_{21}U_{11} & A'_{22} \end{pmatrix}$$

Introducing the term A''12 defined as A''12 = L11^{-1}A'12 and substituting it into the equation for P^T A yields:

$$P^T A = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} U_{11} & A''_{12} \\ 0 & A'_{22} - L_{21}A''_{12} \end{pmatrix}$$


Multiplying the previous equation by P gives:

$$A = P \begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} U_{11} & A''_{12} \\ 0 & A'_{22} - L_{21}A''_{12} \end{pmatrix}$$

This can be considered a partial LU factorization of the initial matrix.

The product L11^{-1}A'12 can be computed by calling ?trsm and can be stored in place of the array used for A12. The update A'22 - L21(L11^{-1}A'12) can be computed by calling ?gemm and can be stored in place of the array used for A22.

If the submatrices do not have full rank, this method cannot be applied because LU factorization would fail.

Unlike LU factorization of general matrices, for general block tridiagonal matrices the factorization A = LU described below cannot be written in the form A = PLU (where P is a permutation matrix). Because of pivoting, the structure of the left factor, L, includes permutations. Pivoting also complicates the right factor, U, which has three block diagonals instead of two.

For LU factorization of a block tridiagonal matrix, let A be a block tridiagonal matrix where all blocks are square and of the same order nb:

$$A = \begin{pmatrix} D_1 & C_1 & & & \\ B_1 & D_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & B_{N-2} & D_{N-1} & C_{N-1} \\ & & & B_{N-1} & D_N \end{pmatrix}$$

The matrix is to be factored as A = LU.

First, consider the 2-by-3 block submatrix formed by the first two block rows and the first three block columns. Applying the partial LU factorization described previously decomposes it as P1(LU + A1), where L is lower trapezoidal of size 2nb-by-nb, U is upper trapezoidal of size nb-by-3nb, and A1 has nonzero elements only in its last nb rows and 2nb columns. Here P1^T is a product of nb elementary permutations, which can be represented as a 2nb-by-2nb matrix.

Introducing an N-by-N block matrix Π1 where all blocks are of size nb-by-nb, with P1 at the intersection of the first two block rows and columns and identity blocks on the rest of the diagonal, allows the previous decomposition to be rewritten as a decomposition of the whole matrix A in which the first block row is factored and the trailing block rows are updated.

Next, factor the 2-by-3 block matrix formed by the second and third block rows of the matrix on the right-hand side of that equation in the same way, where P2^T is defined analogously to P1^T, and continue the decomposition along the diagonal.

Introducing this notation for the pivoting matrix simplifies the equations: Π_j^T contains P_j^T, a 2nb-by-2nb block located at the intersection of the j-th and (j+1)-st block rows and columns, with identity blocks on the rest of the diagonal. This allows the decomposition above to be written more compactly, one block row factored per step.

At step N - 2 the local factorization is again a partial LU factorization of a 2nb-by-3nb submatrix. After this step, multiplying by the pivoting matrix leaves only the trailing 2nb-by-2nb submatrix unfactored.

At the last, (N - 1)-st, step the remaining matrix is square and the factorization is completed by a general LU factorization.

The last step differs from the previous ones in the structure of the pivoting as well: all previous P_j^T for j = 1, 2, ..., N - 2 were products of nb elementary permutations (they depend on nb integer parameters), whereas P_{N-1}^T is applied to a square matrix of order 2nb (it depends on 2nb parameters). So in order to store all of the pivoting indices, an integer array of length (N - 2)nb + 2nb = Nnb is necessary.

Multiplying the previous decomposition from the left by Π_{N-1}^T gives the final decomposition

$$\Pi_{N-1}^T \Pi_{N-2}^T \cdots \Pi_1^T A = LU,$$

and multiplying this decomposition by Π1Π2⋯Π_{N-1} allows it to be written in LU factorization form:

$$A = \Pi_1 \Pi_2 \cdots \Pi_{N-1}\, L\, U.$$

While applying this formula it should be taken into account that Π_j for j = 1, 2, ..., N - 2 are products of nb elementary transpositions applied to block rows with indices j and j + 1, but Π_{N-1} is the product of 2nb transpositions applied to the last two block rows, N - 1 and N.

Chapter 3: Solving a system of linear equations with an LU-factored block tridiagonal coefficient matrix

Goal

Use Intel MKL LAPACK routines to craft a solution to a system of equations involving a block tridiagonal matrix, since LAPACK does not have routines that directly solve systems with block tridiagonal matrices.

Solution

Intel MKL LAPACK provides a wide range of subroutines for solving systems of linear equations with an LU-factored coefficient matrix. It covers dense matrices, band matrices, and tridiagonal matrices. This recipe extends this set to block tridiagonal matrices, subject to the condition that all the blocks are square and have the same order. A block tridiagonal matrix A has the form

$$A = \begin{pmatrix} D_1 & C_1 & & & \\ B_1 & D_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & B_{N-2} & D_{N-1} & C_{N-1} \\ & & & B_{N-1} & D_N \end{pmatrix}$$

Solving a system AX = F with an LU-factored matrix A = LU and multiple right-hand sides (RHS) consists of two stages (see Factoring Block Tridiagonal Matrices for LU factorization):

1. Forward substitution, which consists of solving a system of equations LY = F with pivoting, where L is a lower triangular coefficient matrix. For factored block tridiagonal matrices, all blocks of Y except the last one can be found in a loop which consists of:
   a. Applying pivoting permutations locally to the right-hand side.
   b. Solving the local system of NB linear equations with a lower triangular coefficient matrix, where NB is the order of the blocks.
   c. Updating the right-hand side for the next step.

   The last two block components are found outside of the loop because of the structure of the final pivoting (two block permutations need to be applied consecutively) and the structure of the coefficient matrix.

2. Backward substitution, which consists of solving the system UX = Y. This step is simpler because it does not involve pivoting. The procedure is similar to the first step:
   a. Solving systems with triangular coefficient matrices.
   b. Updating right-hand side blocks.

   The difference from the previous step is that the coefficient matrix is upper, not lower, triangular, and the direction of the loop is reversed.

Source code: see the BlockTDS_GE/source/dgeblttrs.f file in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.


Forward Substitution

! Forward substitution

! In the loop compute components Y_K stored in array F

DO K = 1, N-2

DO I = 1, NB

IF (IPIV(I,K) .NE. I) THEN

CALL DSWAP(NRHS, F((K-1)*NB+I,1), LDF, F((K-1)*NB+IPIV(I,K),1), LDF)

END IF

END DO

CALL DTRSM('L', 'L', 'N', 'U', NB, NRHS, 1D0, D(1,(K-1)*NB+1), NB, F((K-1)*NB+1,1), LDF)

CALL DGEMM('N', 'N', NB, NRHS, NB, -1D0, DL(1,(K-1)*NB+1), NB, F((K-1)*NB+1,1), LDF, 1D0,

+ F(K*NB+1,1), LDF)

END DO

! Apply two last pivots

DO I = 1, NB

IF (IPIV(I,N-1) .NE. I) THEN

CALL DSWAP(NRHS, F((N-2)*NB+I,1), LDF, F((N-2)*NB+IPIV(I,N-1),1), LDF)

END IF

END DO

DO I = 1, NB

IF(IPIV(I,N)-NB.NE.I)THEN

CALL DSWAP(NRHS, F((N-1)*NB+I,1), LDF, F((N-2)*NB+IPIV(I,N),1), LDF)

END IF

END DO

! Computing Y_N-1 and Y_N out of loop and store in array F

CALL DTRSM('L', 'L', 'N', 'U', NB, NRHS, 1D0, D(1,(N-2)*NB+1), NB, F((N-2)*NB+1,1), LDF)

CALL DGEMM('N', 'N', NB, NRHS, NB, -1D0, DL(1,(N-2)*NB +1), NB, F((N-2)*NB+1,1), LDF, 1D0,

+ F((N-1)*NB+1,1), LDF)

Backward Substitution

! Backward substitution

! Computing X_N out of loop and store in array F

CALL DTRSM('L', 'U', 'N', 'N', NB, NRHS, 1D0, D(1,(N-1)*NB+1), NB, F((N-1)*NB+1,1), LDF)

! Computing X_N-1 out of loop and store in array F

CALL DGEMM('N', 'N', NB, NRHS, NB, -1D0, DU1(1,(N-2)*NB +1), NB, F((N-1)*NB+1,1), LDF, 1D0,

+ F((N-2)*NB+1,1), LDF)

CALL DTRSM('L', 'U', 'N', 'N', NB, NRHS, 1D0, D(1,(N-2)*NB+1), NB, F((N-2)*NB+1,1), LDF)

! In the loop computing components X_K stored in array F

DO K = N-2, 1, -1

CALL DGEMM('N','N',NB, NRHS, NB, -1D0, DU1(1,(K-1)*NB +1), NB, F(K*NB+1,1), LDF, 1D0,

+ F((K-1)*NB+1,1), LDF)

CALL DGEMM('N','N',NB, NRHS, NB, -1D0, DU2(1,(K-1)*NB +1), NB, F((K+1)*NB+1,1), LDF, 1D0,

+ F((K-1)*NB+1,1), LDF)

CALL DTRSM('L', 'U', 'N', 'N', NB, NRHS, 1D0, D(1,(K-1)*NB+1), NB, F((K-1)*NB+1,1), LDF)

END DO

Routines Used

Task: Apply pivoting permutations.
Routine: dswap. Swaps a vector with another vector.

Task: Solve a system of linear equations with lower and upper triangular coefficient matrices.
Routine: dtrsm. Solves a triangular matrix equation.

Task: Update the right-hand side blocks.
Routine: dgemm. Computes a matrix-matrix product with general matrices.

Discussion

NOTE A general block tridiagonal matrix with blocks of size NB by NB can be treated as a band matrix with bandwidth 4*NB - 1 and solved by calling the Intel MKL LAPACK subroutines for factoring and solving band matrices (?gbtrf and ?gbtrs). But using the approach described in this recipe requires fewer floating point computations: if the block matrix is stored as a band matrix, many zero elements are treated as nonzeros in the band and are processed during computations, and the effect increases for bigger NB. Analogously, band matrices can also be treated as block tridiagonal matrices, but that storage scheme is likewise inefficient because the blocks would contain many zeros treated as nonzeros. So band storage schemes and block tridiagonal storage schemes, with their respective solvers, should be considered complementary to each other. A sketch of the band-storage alternative follows.
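For comparison, a minimal sketch of the band-storage alternative mentioned in the note (the sizes below are hypothetical; the DGBTRF call itself is the standard LAPACK band factorization interface):

! Treat a block tridiagonal matrix with NBLK blocks of order NB as a
! band matrix of order NBLK*NB with KL = KU = 2*NB - 1
      INTEGER NB, NBLK, NTOT, KL, KU, LDAB, INFO
      PARAMETER (NB = 4, NBLK = 100)
      PARAMETER (NTOT = NB*NBLK, KL = 2*NB - 1, KU = 2*NB - 1)
      PARAMETER (LDAB = 2*KL + KU + 1)
      REAL*8 AB(LDAB, NTOT)
      INTEGER IPIV(NTOT)
! ... fill AB in LAPACK band storage: AB(KL+KU+1+I-J, J) = A(I,J) ...
      CALL DGBTRF(NTOT, NTOT, KL, KU, AB, LDAB, IPIV, INFO)
! Many entries inside the band are structural zeros of the block
! tridiagonal matrix; the block approach of this recipe avoids
! processing them, which is why it needs fewer floating point operations.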

Given a system of linear equations AX = F, the block tridiagonal coefficient matrix A is assumed to be factored as A = Π1Π2⋯Π_{N-1}LU (see Factoring Block Tridiagonal Matrices for a definition of the terms used). The system is decomposed into two systems of linear equations: Π1Π2⋯Π_{N-1}LY = F and UX = Y.


The second equation can be expanded by writing out the block structure of the factors. In order to find Y1, first the permutation Π1^T must be applied; this permutation only changes the first two blocks of the right-hand side. Applying the permutation locally, Y1 can then be found by solving the lower triangular system in the first block row and updating the next block of the right-hand side. After finding Y1, similar computations can be repeated to find Y2, Y3, ..., and Y_{N-2}.

NOTE The different structure of Π_{N-1} (see Factoring Block Tridiagonal Matrices) means that the same equations cannot be used to compute Y_{N-1} and Y_N, and that they must be computed outside of the loop.

The algorithm for finding Y from these equations corresponds to the forward substitution code shown above. The UX = Y equations can similarly be represented in block form, where U has a block diagonal and two block superdiagonals, and the algorithm for solving them corresponds to the backward substitution code shown above.

Chapter 4: Factoring block tridiagonal symmetric positive definite matrices

Goal

Perform Cholesky factorization of a symmetric positive definite block tridiagonal matrix.

Solution

To perform Cholesky factorization of a symmetric positive definite block tridiagonal matrix with N square blocks of size NB by NB:

1. Perform Cholesky factorization of the first diagonal block.
2. Repeat N - 1 times, moving down along the diagonal:
   a. Compute the off-diagonal block of the triangular factor.
   b. Update the diagonal block with the newly computed off-diagonal block.
   c. Perform Cholesky factorization of the diagonal block.

Source code: see the BlockTDS_SPD/source/dpbltrf.f file in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Cholesky factorization of a symmetric positive definite block tridiagonal matrix

CALL DPOTRF('L', NB, D, LDD, INFO)

DO K=1,N-1

CALL DTRSM('R', 'L', 'T', 'N', NB, NB, 1D0, D(1,(K-1)*NB+1), LDD, B(1,(K-1)*NB+1), LDB)

CALL DSYRK('L', 'N', NB, NB, -1D0, B(1,(K-1)*NB+1), LDB, 1D0, D(1,K*NB+1), LDD)

CALL DPOTRF('L', NB, D(1,K*NB+1), LDD, INFO)

END DO

Routines Used

Task: Perform Cholesky factorization of diagonal blocks.
Routine: DPOTRF. Computes the Cholesky factorization of a symmetric (Hermitian) positive-definite matrix.

Task: Compute off-diagonal blocks of the triangular factor.
Routine: DTRSM. Solves a triangular matrix equation.

Task: Update the diagonal blocks.
Routine: DSYRK. Performs a symmetric rank-k update.

Discussion

A symmetric positive definite block tridiagonal matrix with N diagonal blocks D_i and N - 1 sub-diagonal blocks B_i of size NB by NB is factored as:

$$\begin{pmatrix} D_1 & B_1^T & & \\ B_1 & D_2 & B_2^T & \\ & \ddots & \ddots & \ddots \\ & & B_{N-1} & D_N \end{pmatrix} = \begin{pmatrix} L_1 & & & \\ C_1 & L_2 & & \\ & \ddots & \ddots & \\ & & C_{N-1} & L_N \end{pmatrix} \begin{pmatrix} L_1^T & C_1^T & & \\ & L_2^T & \ddots & \\ & & \ddots & C_{N-1}^T \\ & & & L_N^T \end{pmatrix}$$

Multiplying the blocks of the matrices on the right and equating the elements of the original block tridiagonal matrix to the elements of the multiplied factors yields:

$$D_1 = L_1 L_1^T,\qquad B_i = C_i L_i^T,\qquad D_{i+1} = L_{i+1} L_{i+1}^T + C_i C_i^T \quad (i = 1, \ldots, N-1).$$

Solving for C_i and L_{i+1}L_{i+1}^T:

$$C_i = B_i L_i^{-T},\qquad L_{i+1} L_{i+1}^T = D_{i+1} - C_i C_i^T.$$

Note that the right-hand side of the equation for L_{i+1}L_{i+1}^T is itself in the form of a Cholesky factorization. Therefore a routine chol() for performing Cholesky factorization can be applied to this problem using code such as:

L_1 = chol(D_1)
do i = 1, N-1
    C_i = B_i ∙ L_i^{-T}              //trsm()
    D_{i+1} := D_{i+1} - C_i ∙ C_i^T  //syrk()
    L_{i+1} = chol(D_{i+1})
end do


Chapter 5: Solving a system of linear equations with a block tridiagonal symmetric positive definite coefficient matrix

Goal

Solve a system of linear equations with a Cholesky-factored symmetric positive definite block tridiagonal coefficient matrix.

Solution

Given that the coefficient matrix, a symmetric positive definite block tridiagonal matrix with square blocks of the same NB-by-NB size, has been LL^T factored, the solving stage consists of:

1. Solve the system of linear equations with a lower bidiagonal coefficient matrix, which is composed of N by N blocks of size NB by NB and whose diagonal blocks are lower triangular matrices:
   a. Solve the N local systems of equations with the lower triangular diagonal blocks of size NB by NB used as coefficient matrices and the respective parts of the right-hand side vectors.
   b. Update the local right-hand sides.
2. Solve the system of linear equations with an upper bidiagonal coefficient matrix, which is composed of N by N blocks of size NB by NB and whose diagonal blocks are upper triangular matrices:
   a. Solve the local systems of equations.
   b. Update the local right-hand sides.

Source code: see the BlockTDS_SPD/source/dpbltrs.f file in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Solving a system of linear equations with a Cholesky-factored block tridiagonal coefficient matrix

CALL DTRSM('L', 'L', 'N', 'N', NB, NRHS, 1D0, D, LDD, F, LDF)
DO K = 2, N
CALL DGEMM('N', 'N', NB, NRHS, NB, -1D0, B(1,(K-2)*NB+1), LDB, F((K-2)*NB+1,1), LDF, 1D0,
+            F((K-1)*NB+1,1), LDF)
CALL DTRSM('L', 'L', 'N', 'N', NB, NRHS, 1D0, D(1,(K-1)*NB+1), LDD, F((K-1)*NB+1,1), LDF)
END DO
CALL DTRSM('L', 'L', 'T', 'N', NB, NRHS, 1D0, D(1,(N-1)*NB+1), LDD, F((N-1)*NB+1,1), LDF)
DO K = N-1, 1, -1
CALL DGEMM('T', 'N', NB, NRHS, NB, -1D0, B(1,(K-1)*NB+1), LDB, F(K*NB+1,1), LDF, 1D0,
+            F((K-1)*NB+1,1), LDF)
CALL DTRSM('L', 'L', 'T', 'N', NB, NRHS, 1D0, D(1,(K-1)*NB+1), LDD, F((K-1)*NB+1,1), LDF)
END DO


Routines Used

Task: Solve a local system of linear equations.
Routine: DTRSM. Solves a triangular matrix equation.

Task: Update the local right-hand sides.
Routine: DGEMM. Computes a matrix-matrix product with general matrices.

Discussion

Consider a system of linear equations AX = F.

Assume that matrix A is a symmetric positive definite block tridiagonal coefficient matrix with all blocks of size NB by NB. A can be factored as described in Factoring block tridiagonal symmetric positive definite matrices using BLAS and LAPACK routines to give A = LL^T.

Then the algorithm to solve the system of equations is:

1. Solve the system of linear equations LY = F with a lower bidiagonal coefficient matrix in which the diagonal blocks are lower triangular matrices:

Y_1 = L_1^{-1} F_1                  //trsm()
do i = 2, N
    G_i = F_i - C_{i-1} ∙ Y_{i-1}   //gemm()
    Y_i = L_i^{-1} G_i              //trsm()
end do

2. Solve the system of linear equations L^T X = Y with an upper bidiagonal coefficient matrix in which the diagonal blocks are upper triangular matrices:

X_N = L_N^{-T} Y_N                  //trsm()
do i = N-1, 1, -1
    Z_i = Y_i - C_i^T ∙ X_{i+1}     //gemm()
    X_i = L_i^{-T} Z_i              //trsm()
end do

Chapter 6: Computing principal angles between two subspaces

Goal

Get information about the relative position of two subspaces in an inner product space.

Solution

Assuming the subspaces are represented as spans of some vectors, the relative position of the subspaces can be obtained by calculating the set of principal angles between the subspaces. To calculate the angles:

1. Build orthonormal bases in each subspace and determine the dimensions of the subspaces:
   a. Call an appropriate subroutine to perform a QR factorization with pivoting of the matrices whose columns span the subspaces.
   b. Using the threshold, determine the dimensions of the subspaces.
   c. Form the orthonormal bases.
2. Form a matrix of inner products of the basis vectors from one subspace and the basis vectors of the other subspace.
3. Compute the singular value decomposition (SVD) of the matrix.

Source code: see the ANGLES/definition/main.f file in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Building orthonormal bases and determining subspace dimensions

REAL*8 Y(LDY,K),TAU(N),WORK(3*N)

!

! Apply QR factorization with pivoting

CALL DGEQPF(N, K, Y, LDY, JPVT, TAU, WORK, INFO)

! Process info returned by DGEQPF

! Compute the rank

K1=0

DO WHILE((K1 .LT. K) .AND. (ABS(Y(K1+1,K1+1)) .GT. THRESH))

K1 = K1 + 1

END DO

!

! Form K1 orthonormal vectors via call DORGQR

CALL DORGQR(N, K1, K1, Y, LDY, TAU, WORK, LWORK, INFO)

! Process info returned by DORGQR

Forming matrix of inner products and computing SVD

      REAL*8 U(N,KU), V(N,KV), W(KU,KV), VECL(KU,KMIN)
      REAL*8 VECRT(KMIN,KV), S(KMIN), WORK(5*KU)
! Form W = U^t*V
      CALL DGEMM('T', 'N', KU, KV, N, 1D0, U, N, V, N, 0D0, W, KU)
! SVD of W = U^t*V
      CALL DGESVD('S', 'S', KU, KV, W, KU, S, VECL, KU, VECRT, KMIN, WORK, LWORK, INFO)
! Process info returned by DGESVD

Discussion

Routines Used

Task: QR factorization with pivoting of matrices.
Routine: dgeqpf. Computes the QR factorization of a general m-by-n matrix with pivoting.

Task: Form orthonormal bases.
Routine: dorgqr. Generates the real orthogonal matrix Q of the QR factorization formed by ?geqpf or ?geqp3.

Task: Form a matrix of inner products of the basis vectors from one subspace and the basis vectors of the other subspace.
Routine: dgemm. Computes a matrix-matrix product with general matrices.

Task: Compute the singular value decomposition of the matrix.
Routine: dgesvd. Computes the singular value decomposition of a general rectangular matrix.

The first step is to build orthonormal bases in each subspace and determine the dimensions of the subspaces. Let U be an N-by-k matrix (N ≥ k) with columns representing vectors in some inner product linear space. To construct an orthonormal basis in this space you can use QR factorization of the matrix U, which with pivoting can be represented as UP = QR. If the dimension of the space is l (l ≤ k), then in the absence of rounding errors this yields an orthogonal (unitary for complex-valued matrices) N-by-N matrix Q and an upper triangular N-by-k matrix R whose nonzero entries lie in its first l rows.

The equation UP = QR means that all columns of U are linear combinations of the first l columns of Q. Due to pivoting, the diagonal elements r_{j,j} of R are ordered in non-increasing order of absolute values. In fact, pivoting provides even stronger inequalities:

$$r_{j,j}^2 \ge \sum_{i=j}^{m} r_{i,m}^2 \quad \text{for } j \le m \le k.$$

In actual computations with rounding errors, the elements of the lower right (k - l)-by-(k - l) triangle of R are small but nonzero, so a threshold is used to determine the rank: |r_{l,l}| > threshold > |r_{l+1,l+1}|.

Now you can determine the set of angles between the subspaces. Let 𝒰 and 𝒲 be two subspaces in the same N-dimensional Euclidean space with dim(𝒰) = k, dim(𝒲) = l, and k ≤ l. To find out the relative position of these subspaces you can use the principal angles 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_k ≤ π/2, which are defined as follows.

The first angle is defined as:

$$\cos\theta_1 = \max\{(u, w) : u \in \mathcal{U},\ w \in \mathcal{W},\ \|u\| = \|w\| = 1\}.$$

The maximizing unit vectors u_1 and w_1 are called principal vectors. The other principal angles and vectors are defined recursively:

$$\cos\theta_i = \max\{(u, w) : u \in \mathcal{U},\ w \in \mathcal{W},\ \|u\| = \|w\| = 1,\ (u, u_j) = (w, w_j) = 0 \text{ for all } j < i\}.$$

The principal vectors from the same subspace are pairwise orthogonal:

$$(u_i, u_j) = (w_i, w_j) = \delta_{ij}.$$

To compute the principal angles you can use the singular value decomposition of matrices. Let U and W be matrices of sizes N-by-k and N-by-l respectively, with columns being orthonormal bases in 𝒰 and 𝒲 respectively. Compute the SVD of the k-by-l matrix U^T W:

$$U^T W = P \Sigma Q^T,\qquad P^T P = I_k,\qquad Q Q^T = I_l.$$

It can be proven that the diagonal elements of Σ are the cosines of the principal angles: σ_i = cos θ_i. The respective pairs of principal vectors are Up_i and Wq_i, where p_i and q_i are the i-th columns of P and Q.
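As a minimal worked example (hypothetical, not taken from the sample code), take two lines in R^3:

$$\mathcal{U} = \operatorname{span}\{(1, 0, 0)^T\},\qquad \mathcal{W} = \operatorname{span}\{(\cos\alpha,\ \sin\alpha,\ 0)^T\}.$$

Both basis vectors are already orthonormal, so $U^T W = (\cos\alpha)$ is a 1-by-1 matrix whose SVD is trivially $\Sigma = (\lvert\cos\alpha\rvert)$. The single principal angle therefore satisfies $\cos\theta_1 = \lvert\cos\alpha\rvert$, and for $0 \le \alpha \le \pi/2$ this gives $\theta_1 = \alpha$, as expected.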


Chapter 7: Computing principal angles between invariant subspaces of block triangular matrices

Goal

Get information about the relative position of two invariant subspaces of a block triangular matrix.

Solution

Assuming the subspaces are represented as spans of some vectors, the relative position of the subspaces can be obtained via calculating the set of principal angles between the subspaces (see Computing principal angles between two subspaces). Additionally, for invariant subspaces of block triangular matrices the Sylvester matrix equation must be solved. The solver used depends on the matrix characteristics:

• If both diagonal blocks of the triangular matrix are upper triangular, use the LAPACK ?trsyl routine.
• If both diagonal blocks of the triangular matrix are not large and not upper triangular, use LAPACK linear solvers.
• If both diagonal blocks of the triangular matrix are large and not upper triangular, so that the coefficient matrix of the corresponding linear system is large and sparse, use the Intel MKL PARDISO solver.

Source code: see these files in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples:

ANGLES/uep_subspace1/main.f
ANGLES/uep_subspace2/main.f
ANGLES/uep_subspace3/main.f

Solving Sylvester matrix equation using LAPACK ?trsyl

! Solve A*X - X*B = SCALE*C for X, where A = AA(1:K,1:K),
! B = AA(K+1:N,K+1:N), and C = AA(1:K,K+1:N) is overwritten by X
! (ISGN = -1 selects the minus sign in op(A)*X + ISGN*X*op(B) = SCALE*C)
      CALL DTRSYL('N', 'N', -1, K, N-K, AA, N, AA(K+1,K+1), N,
     &            AA(1,K+1), N, ALPHA, INFO)
      IF (INFO.EQ.0) THEN
         PRINT *, "DTRSYL completed, SCALE=", ALPHA
      ELSE IF (INFO.EQ.1) THEN
         PRINT *, "DTRSYL solved perturbed equations"
      ELSE
         PRINT *, "DTRSYL failed. INFO=", INFO
         STOP
      END IF

Solving Sylvester matrix equation using LAPACK linear solvers

      REAL*8 AA(N,N), FF(K*(N-K)), AAA(K*(N-K),K*(N-K))
      INTEGER IPIV(K*(N-K))
! Forming dense coefficient matrix for Sylvester equation
      CALL SYLMAT(K, AA, N, N-K, AA(K+1,K+1), N, -1D0, 1D0, AAA, NK,
     &            INFO)
! Processing INFO returned by SYLMAT
! Forming the right-hand side for the system of linear equations that
! corresponds to the Sylvester equation
      DO I = 1, K
         DO J = 1, N-K
            FF((J-1)*K+I) = AA(I,J+K)
         END DO
      END DO
! Solve the system of linear equations (NK = K*(N-K))
      CALL DGESV(NK, 1, AAA, NK, IPIV, FF, NK, INFO)
! Processing INFO returned by DGESV

Solving Sylvester matrix equation using Intel MKL PARDISO

      REAL*8 AA(N,N), FF(K*(N-K)), VAL(K*(N-K)*(N-1))
      INTEGER ROWINDEX(K*(N-K)+1), COLS(K*(N-K)*(N-1))
! Forming sparse coefficient matrix for Sylvester equation
      CALL FSYLVOP(K, AA, N, N-K, AA(K+1,K+1), N, -1D0, 1D0, COLS,
     &             ROWINDEX, VAL, INFO)
! Processing INFO returned by FSYLVOP
! Form the right-hand side of the Sylvester equation
      DO I = 1, K
         DO J = 1, N-K
            FF((J-1)*K+I) = AA(I,J+K)
         END DO
      END DO
      CALL PARDISOINIT (PT, 1, IPARM)
      CALL PARDISO (PT, 1, 1, 11, 13, NK, VAL, ROWINDEX,
     &              COLS, PERM, 1, IPARM, 1, FF, X, IERR)
! Processing IERR returned by PARDISO
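For reference (from the standard Intel MKL PARDISO interface, not specific to this sample): in the PARDISO call above, the two leading 1 arguments are maxfct and mnum, 11 selects a real nonsymmetric matrix type, 13 requests the analysis, factorization, and solve phases in one call, NK = K*(N-K) is the matrix order, the 1 after PERM is the number of right-hand sides, and the 1 after IPARM is the message level enabling solver statistics output.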

Discussion

Routines Used

dtrsyl — Solve the Sylvester matrix equation for a matrix with upper triangular diagonal blocks. Solves the Sylvester equation for real quasi-triangular or complex triangular matrices.

dgesv — Solve the Sylvester matrix equation for a matrix which is small and not upper triangular. Computes the solution to a system of linear equations with a square matrix A and multiple right-hand sides.

pardiso — Solve the Sylvester matrix equation for a matrix which is not small and not upper triangular. Calculates the solution of a set of sparse linear equations with single or multiple right-hand sides.

In order to determine the principal angles between invariant subspaces of the matrix, first let an N-by-N matrix be represented in block triangular form:

[ A  F ]
[ 0  B ]

Here the diagonal blocks A and B are square matrices of order k and N-k, respectively. If I_k denotes the unit matrix of order k, the equality

[ A  F ] [ I_k ]   [ I_k ]
[ 0  B ] [ 0   ] = [ 0   ] A

means the span of the first k vectors of the standard basis is invariant under the matrix, which acts on this subspace as A.

Another invariant subspace can be found as the span of columns of the compound matrix

[ X       ]
[ I_{N-k} ]

Here X is some rectangular k-by-(N-k) matrix which should be found. Compute the product:

[ A  F ] [ X       ]   [ AX + F ]
[ 0  B ] [ I_{N-k} ] = [ B      ]

If X is a solution of the Sylvester equation XB - AX = F, the result in the last equation is

[ XB ]   [ X       ]
[ B  ] = [ I_{N-k} ] B

This demonstrates invariance of the subspace spanned by the columns of the compound matrix.

QR factorization can be used to orthogonalize the basis in the second invariant subspace:

[ X       ]   [ C ]
[ I_{N-k} ] = [ S ] R

Here C is a k-by-(N-k) matrix and S is an (N-k)-by-(N-k) matrix; C and S satisfy the equation CᵀC + SᵀS = I_{N-k}, and R is an upper triangular square matrix of order N-k. Compute the principal angles between these two invariant subspaces using the SVD of C: C = PΣQᵀ. The diagonal elements of Σ are the cosines of the principal angles.

Matrix of Sylvester equations

Consider the Sylvester equation αAX + βXB = F. Here the square matrices A and B have orders M and N, respectively, and α and β are scalars. F is a given M-by-N matrix, and X is the M-by-N matrix to be found.


This matrix equation can be considered as a system of MN linear equations for the MN unknown components of the vector x with right-hand side vector f:

x = (x_11, x_21, ..., x_M1, x_12, x_22, ..., x_M2, ..., x_1N, x_2N, ..., x_MN)ᵀ
f = (f_11, f_21, ..., f_M1, f_12, f_22, ..., f_M2, ..., f_1N, f_2N, ..., f_MN)ᵀ

The matrix of order MN can be represented as a sum of two matrices. One corresponds to multiplication of matrix X from the left by matrix A; represented in block form with blocks of size M-by-M, it forms a block diagonal matrix with N copies of αA on the diagonal, that is, α(I_N ⊗ A). The other matrix in the sum corresponds to multiplication of matrix X from the right by matrix B. Using the same block form representation yields β(Bᵀ ⊗ I_M), where I_M represents the unit matrix of order M. Thus the coefficient matrix is

α(I_N ⊗ A) + β(Bᵀ ⊗ I_M)

This matrix is sparse, with M + N - 1 nonzero elements in each MN-element row. Therefore the Intel MKL PARDISO sparse solver can be used effectively (see the code ANGLES/source/fsylvop.f, which forms the coefficient matrix in CSR format for Intel MKL PARDISO). However, for comparatively small M and N the Intel MKL LAPACK linear solvers are more efficient (see the code ANGLES/source/sylmat.f, which forms the coefficient matrix as a dense matrix for use with dgesv).
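A minimal sketch of how such a dense coefficient matrix can be formed — an illustration of the Kronecker structure above, not the recipe's sylmat.f; the function name and column-major storage convention are assumptions:

/* Form the dense MN-by-MN matrix alpha*(I_N (x) A) + beta*(B^T (x) I_M)
 * for the Sylvester equation alpha*A*X + beta*X*B = F.
 * A is M-by-M, B is N-by-N, AAA is MN-by-MN; all column-major. */
void sylvester_dense(int M, int N, double alpha, const double *A,
                     double beta, const double *B, double *AAA)
{
    int MN = M * N;
    for (long idx = 0; idx < (long)MN * MN; ++idx) AAA[idx] = 0.0;

    /* alpha*(I_N (x) A): N diagonal blocks, each equal to alpha*A */
    for (int nb = 0; nb < N; ++nb)
        for (int j = 0; j < M; ++j)
            for (int i = 0; i < M; ++i)
                AAA[(nb*M + i) + (long)(nb*M + j) * MN] += alpha * A[i + j*M];

    /* beta*(B^T (x) I_M): block (p,q) is beta*B(q,p) times the unit matrix */
    for (int p = 0; p < N; ++p)
        for (int q = 0; q < N; ++q)
            for (int i = 0; i < M; ++i)
                AAA[(p*M + i) + (long)(q*M + i) * MN] += beta * B[q + p*N];
}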


Evaluating a Fourier integral

Goal

Use a fast Fourier transform (FFT) to numerically evaluate the continuous Fourier transform integral

F(ξ) = ∫_a^b f(x) exp(-iξx) dx

Solution

Let's assume that the real-valued function f(x) is zero outside the interval [a, b] and is sampled at N equidistant points x_n = a + nT/N, where T = |b - a| and n = 0, 1, ..., N-1. An FFT will be used to evaluate the integral at the points ξ_k = 2πk/T, where k = 0, 1, ..., N/2.

Using Intel® Math Kernel Library FFT Interface in C/C++

float *f;   // input:  f[n] = f(a + n*T/N), n=0...N-1
complex *F; // output: F[k] = F(2*k*PI/T),  k=0...N/2

DFTI_DESCRIPTOR_HANDLE h = NULL;
DftiCreateDescriptor(&h,DFTI_SINGLE,DFTI_REAL,1,(MKL_LONG)N);
DftiSetValue(h,DFTI_CONJUGATE_EVEN_STORAGE,DFTI_COMPLEX_COMPLEX);
DftiSetValue(h,DFTI_PLACEMENT,DFTI_NOT_INPLACE);
DftiCommitDescriptor(h);
DftiComputeForward(h,f,F);
for (int k = 0; k <= N/2; ++k)
{
    F[k] *= (T/N)*complex( cos(2*PI*a*k/T), -sin(2*PI*a*k/T) );
}

Discussion

The evaluation follows this derivation, based on a step-function approximation of the integral:

F(ξ_k) ≈ (T/N) Σ_{n=0..N-1} f(x_n) exp(-iξ_k x_n)
       = (T/N) exp(-2πiak/T) Σ_{n=0..N-1} f(a + nT/N) exp(-2πikn/N)

The sum in the last line is an FFT by definition. When the support of the function f extends symmetrically around zero, that is, [a, b] = [-T/2, T/2], the factor before the sum turns into (T/N)(-1)^k, since exp(-2πiak/T) = exp(πik) = (-1)^k for a = -T/2.

When the function f is real-valued, F(ξ_k) = conj(F(ξ_{N-k})). The first N/2 + 1 complex values of the real-to-complex FFT occupy approximately the same memory as the real input, and they suffice to compute the whole result by conjugation. If the FFT computation is configured to perform a real-to-complex transform, it also takes approximately half as much time as a complex-to-complex FFT.
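For example, if the full spectrum is needed, the remaining values can be recovered from the stored half by conjugation; this is a sketch assuming the computed values have been copied into a hypothetical N-element array F_full:

// Unfold the conjugate-even half into the full length-N spectrum
for (int k = N/2 + 1; k < N; ++k)
    F_full[k] = conj(F_full[N - k]);  // F(xi_k) = conj(F(xi_{N-k}))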


Using Fast Fourier Transforms for computer tomography image reconstruction


Goal

Reconstruct the original image from the Computer Tomography (CT) data using fast Fourier transform (FFT) functions.

Solution

Notation:

Specification of index ranges adopts the notation used in MATLAB*. For example, k = -q : q means k = -q, -q+1, -q+2, ..., q-1, q. While f(x) means the value of the function f at point x, f[n] means the value of the nth element of the discrete data set f.

Assumptions:

The density f(x, y) of a two-dimensional (2D) image vanishes outside the unit circle: f = 0 when x² + y² > 1.

The CT data consists of p projections of the image taken at angles θ_j = jπ/p, where j = 0 : p - 1. Each projection contains 2q + 1 density values g[j, l] = g(θ_j, s_l) approximating the integral of the image along the line (x, y) = (-t sinθ_j + s_l cosθ_j, t cosθ_j + s_l sinθ_j), where l = -q : q, s_l = l/q, and t is the integration parameter.

The discrete image reconstruction algorithm consists of the following steps:

1. Evaluate p one-dimensional (1D) Fourier transforms of the projection data, producing g_1[j, r] (for j = 0 : p - 1 and r = -q : q).

2. Interpolate g_1, obtaining f_2: from the radial grid (πr/q)(cosθ_j, sinθ_j) onto the Cartesian grid (πξ/q, πη/q), where (ξ, η) = (-q : q, -q : q).

3. Evaluate one inverse two-dimensional complex-to-complex FFT to obtain a complex-valued reconstruction f_1 of the image, where f(m/q, n/q) ≈ f_1[m, n] for m = -q : q and n = -q : q.


Computations in steps 1 and 3 call Intel MKL FFT interfaces. Computations in step 2 implement a simple version of interpolation tailored to the data layout used by Intel MKL FFT.

Reconstructing the original CT image in C/C++

// Declarations
int Nq = 2*(q+1); // space for in-place r2c FFT
void *gmem = mkl_malloc( sizeof(float)*p*Nq, 64 );
float *g = gmem;    // g[j*Nq + ell+q]
complex *g1 = gmem; // g1[j*Nq/2 + r+q]

// Initialize g with the CT data
for (int j = 0; j < p; ++j)
    for (int ell = -q; ell <= q; ++ell) {
        g[j*Nq + ell+q] = get_g(theta_j, s_ell);
    }

// Step 1: Configure and compute 1D real-to-complex FFTs
DFTI_DESCRIPTOR_HANDLE h1 = NULL;
DftiCreateDescriptor(&h1,DFTI_SINGLE,DFTI_REAL,1,(MKL_LONG)2*q);
DftiSetValue(h1,DFTI_CONJUGATE_EVEN_STORAGE,DFTI_COMPLEX_COMPLEX);
DftiSetValue(h1,DFTI_NUMBER_OF_TRANSFORMS,(MKL_LONG)p);
DftiSetValue(h1,DFTI_INPUT_DISTANCE,(MKL_LONG)Nq);
DftiSetValue(h1,DFTI_OUTPUT_DISTANCE,(MKL_LONG)Nq/2);
DftiSetValue(h1,DFTI_FORWARD_SCALE,fscale);
DftiCommitDescriptor(h1);
DftiComputeForward(h1,g); // now gmem contains g1

// Step 2: Interpolate g1 to f2 - omitted here
complex *f = mkl_malloc( sizeof(complex) * 2*q * 2*q, 64 );

// Step 3: Configure and compute 2D complex-to-complex FFT
DFTI_DESCRIPTOR_HANDLE h3 = NULL;
MKL_LONG sizes[2] = {2*q, 2*q};
DftiCreateDescriptor(&h3,DFTI_SINGLE,DFTI_COMPLEX,2,sizes);
DftiCommitDescriptor(h3);
DftiComputeBackward(h3,f); // now f is complex-valued reconstruction

Source code, image file, and makefiles: see the fft-ct folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Discussion

The code first configures the Intel MKL FFT descriptor for computing a batch of one-dimensional Fourier transforms in a single call to the DftiComputeForward function and then computes the batch transform. The distance for the multiple transforms is set in terms of elements of the corresponding domain (real on input and complex on output). The transforms are in-place by default.

To have a smaller memory footprint, the FFT is computed in place, that is, the result of the computation overwrites the input data. With an in-place real-to-complex FFT, the input array reserves extra space because the result of the FFT takes slightly more memory than the input.

On input to step 1, array g contains p x (2q+1) real-valued data elements g(θ_j, s_l). The same memory on output of this step contains p x (q+1) complex-valued output elements g_1(θ_j, πr/q). The complex-conjugate part of the result is not stored, and therefore array g1 refers to only q + 1 values of r.

To interpolate from g_1 to f_2, an additional array f is allocated to store the complex-valued data f_2(ξ, η) and the complex-valued output f_1(x, y) of the inverse FFT in step 3. The interpolation step does not call Intel MKL functions, but you can find its C++ implementation in the function step2_interpolation of the source code for this recipe (file main.cpp). The simplest implementation of interpolation is:

For every (ξ, η) inside the unit circle, find the closest (θ_j, πr/q) and use the value of g_1(θ_j, πr/q) for f_2.


For every (ξ, η) outside the unit circle, set f_2 to 0.

In the case of (ξ, η) corresponding to the interval -π < θ_j < 0, use the conjugate-even property of the result of a real-to-complex transform: g_1(θ, ω) = conj(g_1(-θ, -ω)).
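A minimal sketch of this nearest-neighbor scheme, illustrative only: the recipe's actual step2_interpolation in main.cpp also applies the phase factors discussed below, and the helpers get_g1 and set_f2 are assumptions, not from the sample:

// For each Cartesian grid point, pick the nearest radial sample of g1.
for (int eta = -q; eta < q; ++eta)
    for (int xi = -q; xi < q; ++xi) {
        complex v = 0;
        double rad = sqrt((double)(xi*xi + eta*eta));
        if (rad <= q) {                                 // inside the unit circle
            double th = atan2((double)eta, (double)xi); // range (-pi, pi]
            int neg = 0;
            if (th < 0) { th += PI; neg = 1; }          // fold -pi < theta < 0
            int j = (int)(th*p/PI + 0.5) % p;           // nearest angle theta_j
            int r = (int)(rad + 0.5);                   // nearest radius index
            v = get_g1(j, r);                           // stored half: r >= 0
            if (neg) v = conj(v);                       // conjugate-even unfold
        }
        set_f2(xi, eta, v);                             // zero outside circle
    }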

Notice that the FFT in step 1 is applied to the data offset by half the representation interval, which causes the computed output to be multiplied by exp(i(πr/q)q) = (-1)^r. Instead of correcting this in a separate pass, the interpolation takes the multiplier into account.

Similarly, the 2D FFT in step 3 produces an output that shifts the center of the image to the corner, and step 2 prevents this by phase shifting the input to step 3.

Step 3 computes the two-dimensional (2q) x (2q) complex-to-complex FFT on the interpolated data contained in array f. This computation is followed by a conversion of the complex-valued image f_1 to a visual picture. You can find a complete C++ program that implements the CT image reconstruction in the source code for this recipe (file main.cpp).


Noise filtering in financial market data streams


Goal

Detect how the price movements of some stocks influence the price movements of others in a large stock portfolio.

Solution

Split a correlation matrix representing the overall dependencies in the data into two components, a signal matrix and a noise matrix. The signal matrix gives an accurate estimate of the dependencies between stocks. The algorithm ([Zhang12], [Kargupta02]) relies on an eigenstate-based approach that separates noise from useful information by considering the eigenvalues of the correlation matrix for the accumulated data.

Intel MKL Summary Statistics provides functions to calculate the correlation matrix for streaming data. Intel MKL LAPACK contains a set of computational routines to compute eigenvalues and eigenvectors for symmetric matrices of various properties and storage formats.

The online noise filtering algorithm is:

1. Compute λ_min and λ_max, the boundaries of the interval of the noise eigenstates (see the boundary formulas after this list).

2. Get a new block of data from the data stream.

3. Update the correlation matrix using the latest data block.

4. Compute the eigenvalues and eigenvectors that define the noise component, by searching the eigenvalues of the correlation matrix belonging to the interval [λ_min, λ_max].

5. Compute the correlation matrix of the noise component by combining the eigenvalues and eigenvectors computed in Step 4.

6. Compute the correlation matrix of the signal component by subtracting the noise component from the overall correlation matrix. If there is more data, go back to Step 2.
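For a task dimension n and block size m, the boundaries used by the sample (as implemented in the Computation section below) are:

λ_min = (1 - √(n/m))²
λ_max = (1 + √(n/m))²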

Source code: see the nf folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Initialization

Initialize a correlation analysis task and its parameters.

VSLSSTaskPtr task;
double *x, *mean, *cor;
double W[2];
MKL_INT x_storage, cor_storage;
...
scanf("%d", &m); // number of observations in block
scanf("%d", &n); // number of stocks (task dimension)
...
/* Allocate memory */
nfAllocate(m, n, &x, &mean, &cor, ...);
/* Initialize Summary Statistics task structure */
nfInitSSTask(&m, &n, &task, x, &x_storage, mean, cor, &cor_storage, W);


...
/* Allocate memory */
void nfAllocate(MKL_INT m, MKL_INT n, double **x, double **mean, double **cor, ...)
{
    *x = (double *)mkl_malloc(m*n*sizeof(double), ALIGN);
    CheckMalloc(*x);
    *mean = (double *)mkl_malloc(n*sizeof(double), ALIGN);
    CheckMalloc(*mean);
    *cor = (double *)mkl_malloc(n*n*sizeof(double), ALIGN);
    CheckMalloc(*cor);
    ...
}

/* Initialize Summary Statistics task structure */
void nfInitSSTask(MKL_INT *m, MKL_INT *n, VSLSSTaskPtr *task, double *x,
                  MKL_INT *x_storage, double *mean, double *cor,
                  MKL_INT *cor_storage, double *W)
{
    int status;

    /* Create VSL Summary Statistics task */
    *x_storage = VSL_SS_MATRIX_STORAGE_COLS;
    status = vsldSSNewTask(task, n, m, x_storage, x, 0, 0);
    CheckSSError(status);

    /* Register array of weights in the task */
    W[0] = 0.0;
    W[1] = 0.0;
    status = vsldSSEditTask(*task, VSL_SS_ED_ACCUM_WEIGHT, W);
    CheckSSError(status);

    /* Initialization of the task parameters using full storage
       for correlation matrix computation */
    *cor_storage = VSL_SS_MATRIX_STORAGE_FULL;
    status = vsldSSEditCovCor(*task, mean, 0, 0, cor, cor_storage);
    CheckSSError(status);
}

Computation

Perform noise filtering steps for each block of data.

/* Set thresholds that define the noise component */
sqrt_n_m   = sqrt((double)n / (double)m);
lambda_min = (1.0 - sqrt_n_m) * (1.0 - sqrt_n_m);
lambda_max = (1.0 + sqrt_n_m) * (1.0 + sqrt_n_m);
...

/* Loop over data blocks */
for (i = 0; i < n_block; i++)
{
    /* Read next portion of data */
    nfReadDataBlock(m, n, x, fh);

    /* Update "signal" and "noise" covariance estimates */
    nfKernel(m, n, lambda_min, lambda_max, x, cor, cor_copy,
             task, eval, evect, work, iwork, isuppz,
             cov_signal, cov_noise);
}
...

void nfKernel(…)
{
    ...
    /* Update correlation matrix estimate using FAST method */
    errcode = vsldSSCompute(task, VSL_SS_COR, VSL_SS_METHOD_FAST);
    CheckSSError(errcode);
    ...
    /* Compute eigenvalues and eigenvectors of the correlation matrix */
    dsyevr(&jobz, &range, &uplo, &n, cor_copy, &n, &lmin, &lmax,
           &imin, &imax, &abstol, &n_noise, eval, evect, &n, isuppz,
           work, &lwork, iwork, &liwork, &info);

    /* Calculate "signal" and "noise" part of covariance matrix */
    nfCalculateSignalNoiseCov(n, n_signal, n_noise,
                              eval, evect, cor, cov_signal, cov_noise);
}
...

static int nfCalculateSignalNoiseCov(int n, int n_signal, int n_noise,
    double *eval, double *evect, double *cor, double *cov_signal,
    double *cov_noise)
{
    int i, j, nn;
    /* SYRK parameters */
    char uplo, trans;
    double alpha, beta;

    /* Calculate "noise" part of covariance matrix: scale each noise
       eigenvector by the square root of its eigenvalue, then form
       A*Diag*A^T with a single rank-k update */
    for (j = 0; j < n_noise; j++) eval[j] = sqrt(eval[j]);
    for (i = 0; i < n_noise; i++)
        for (j = 0; j < n; j++)
            evect[i*n + j] *= eval[i];

    uplo  = 'U';
    trans = 'N';
    alpha = 1.0;
    beta  = 0.0;
    nn = n;
    if (n_noise > 0)
    {
        dsyrk(&uplo, &trans, &nn, &n_noise, &alpha, evect, &nn,
              &beta, cov_noise, &nn);
    }
    else
    {
        for (i = 0; i < n*n; i++) cov_noise[i] = 0.0;
    }

    /* Calculate "signal" part of covariance matrix. */
    if (n_signal > 0)
    {
        for (i = 0; i < n; i++)
            for (j = 0; j <= i; j++)
                cov_signal[i*n + j] = cor[i*n + j] - cov_noise[i*n + j];
    }
    else
    {
        for (i = 0; i < n*n; i++) cov_signal[i] = 0.0;
    }
    return 0;
}

Deinitialization

Delete the task and release associated resources.

errcode = vslSSDeleteTask(&task);
CheckSSError(errcode);
MKL_Free_Buffers();

Routines Used

vsldSSNewTask — Initialize a summary statistics task and define the objects for analysis: the dataset, its sizes (number of variables and number of observations), and the storage format. Creates and initializes a new summary statistics task descriptor.

vsldSSEditCovCor — Specify the memory to hold the correlation matrix. Modifies the pointers to covariance/correlation/cross-product parameters.

vsldSSEditTask — Specify the two-element array intended to hold accumulated weights of observations processed so far (necessary for correct computation of estimates for data streams). Modifies the address of an input/output parameter in the task descriptor.

vsldSSCompute — Call the major compute driver by specifying the computation type, VSL_SS_COR, and computation method, VSL_SS_METHOD_FAST. Computes Summary Statistics estimates.

vslSSDeleteTask — De-allocate resources associated with the task. Destroys the task object and releases the memory.

dsyevr — Compute eigenvalues and eigenvectors of the correlation matrix. Computes selected eigenvalues and, optionally, eigenvectors of a real symmetric matrix using Relatively Robust Representations.

dsyrk — Perform a symmetric rank-k update. Performs a symmetric rank-k update.


Discussion

Step 4 of the algorithm involves solving an eigenvalue problem for a symmetric matrix. The online noise filtration algorithm requires computation of the eigenvalues that belong to the predefined interval [λ_min, λ_max], which define the noise in the data. The LAPACK driver routine ?syevr is the default routine for solving this type of problem. The ?syevr interface allows the caller to specify a pair of values, in this case corresponding to λ_min and λ_max, as the lower and upper bounds of the interval to be searched for eigenvalues.

The eigenvectors found are returned as columns of an array containing an orthogonal matrix A, and the eigenvalues are returned in an array containing the elements of the diagonal matrix Diag. The correlation matrix for the noise component can be obtained by computing A·Diag·Aᵀ. However, instead of constructing the noise correlation matrix using two general matrix multiplications, it can be computed more efficiently with one diagonal matrix multiplication and one rank-n update operation:

A·Diag·Aᵀ = (A·Diag^{1/2})·(A·Diag^{1/2})ᵀ

For the rank-n update operation, Intel MKL provides the BLAS function ?syrk.


Using the Monte Carlo method for simulating European options pricing


Goal

Compute nopt call and put European option prices based on nsamp independent samples.

Solution

Use Monte Carlo simulation to compute European option pricing. The computation for a pair of call and put options can be described as:

1. Initialize.
2. Compute option prices in parallel.
3. Divide computation of the call and put price pair into blocks.
4. Perform block computation.
5. Deinitialize.

NOTE On OS X*, this solution requires Intel MKL version 11.2 update 3 or higher.


Source code: see the mc folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Initialize in OpenMP section

Create an OpenMP parallel section and initialize the MT2203 random number generator.

#pragma omp parallel
{
    ...
    VSLStreamStatePtr stream;
    j = omp_get_thread_num();
    /* Initialize RNG */
    vslNewStream( &stream, VSL_BRNG_MT2203 + j, SEED );
    ...
}

This initialization model ensures independent random number streams in each thread.


Compute option prices in parallel

Distribute options across available threads.

#pragma omp parallel
{
    ...
    /* Price options */
    #pragma omp for
    for(i=0;i<nopt;i++)
    {
        MonteCarloEuroOptKernel( ... );
    }
    ...
}

Divide computation of the call and put price pair into blocks

Divide generation of paths into blocks to maintain data locality for optimal performance.

const int nbuf = 1024;
nblocks = nsamp/nbuf;
...
/* Blocked computations */
for ( i = 0; i < nblocks; i++ )
{
    /* Make sure that the tail is correctly computed */
    int block_size = (i != nblocks-1)?(nbuf):(nsamp - (nblocks-1)*nbuf);
    ...
}

Perform block computation

In the main computation, generate random numbers and perform reduction.

/* Blocked computations */
for ( i = 0; i < nblocks; i++ )
{
    ...
    /* Generate a block of random numbers */
    vdRngLognormal( VSL_RNG_METHOD_LOGNORMAL_ICDF, stream,
                    block_size, rn, a, nu, 0.0, 1.0 );
    /* Reduction */
    #pragma vector aligned
    #pragma simd
    for ( j=0; j<block_size; j++ )
    {
        st = s0*rn[j];
        vc = MAX( st-x, 0.0 );
        vp = MAX( x-st, 0.0 );
        sc += vc;
        sp += vp;
    }
}
*vcall = sc/nsamp * exp(-r*t);
*vput  = sp/nsamp * exp(-r*t);


Deinitialize

Delete the RNG stream.

#pragma omp parallel
{
    ...
    VSLStreamStatePtr stream;
    ...
    /* Deinitialize RNG */
    vslDeleteStream( &stream );
}

Routines Used

vslNewStream — Creates and initializes a random stream.

vdRngLognormal — Generates lognormally distributed random numbers.

vslDeleteStream — Deletes a random stream.

Discussion

Monte Carlo simulation is a widely used technique based on repeated random sampling to determine the properties of some model. The Monte Carlo simulation of European options pricing is a simple financial benchmark which can be used as a starting point for real-life Monte Carlo applications.

Let S_t represent the stock price at a given moment t that follows the stochastic process described by:

dS_t = μS_t dt + σS_t dW_t

where μ is the drift and σ is the volatility, which are assumed to be constants, W = (W_t)_{t≥0} is the Wiener process, dt is a time step, and S_0 (the stock price at t = 0) does not depend on W.

By definition the expected value is E(S_t) = S_0 exp(rt), where r is the risk-neutral rate. The previous definition of S_t gives E(S_t) = S_0 exp((μ + σ²/2)t), and combining them yields μ = r - σ²/2.

The value of a European option V(t, S_t), defined for 0 ≤ t ≤ T, depends on the price S_t of the underlying stock. The option is issued at t = 0 and exercised at the time t = T called the maturity. For European call and put options, the value of the option at maturity V(T, S_T) is defined as:

Call option: V(T, S_T) = max(S_T - X, 0)
Put option:  V(T, S_T) = max(X - S_T, 0)

where X is the strike price. The problem is to estimate V(0, S_0).

The Monte Carlo approach to the solution of this problem is a simulation of n possible realizations of S_T, followed by averaging V(T, S_T), and then discounting the average by the factor exp(-rT) to get the present value of the option V(0, S_0). From the first equation, S_T follows the log-normal distribution:

S_T = S_0 exp((r - σ²/2)T + σ√T ξ)

where ξ is a random variable of the standard normal distribution.
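A natural parameterization of the vdRngLognormal call shown earlier follows from this formula; this is an assumption consistent with the derivation above, since the sample's own values for a and nu are elided in the snippet:

/* ln(rn) is normal with mean a and deviation nu, so that rn ~ S_T / S_0
 * (assumed parameterization, with t denoting the maturity T) */
double a  = (r - 0.5*sig*sig)*t;  /* (r - sigma^2/2) * T */
double nu = sig*sqrt(t);          /*  sigma * sqrt(T)    */
vdRngLognormal( VSL_RNG_METHOD_LOGNORMAL_ICDF, stream,
                block_size, rn, a, nu, 0.0, 1.0 );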

Intel MKL provides a set of basic random number generators (BRNGs) which support different models for parallel computations, such as using different parameter sets, block-splitting, and leapfrogging.

This example illustrates the MT2203 BRNG, which supports 6024 independent parameter sets. In the stream initialization function, a set is selected by adding j to the BRNG identifier VSL_BRNG_MT2203:

vslNewStream( &stream, VSL_BRNG_MT2203 + j, SEED );


See the Notes for Intel® MKL Vector Statistical Library for a list of the parallelization models supported by Intel MKL VSL BRNG implementations.

The choice of size for computational blocks, which is 1024 in this example, depends on the amount of memory accessed within a block and is typically chosen so that all the memory accessed within a block fits in the target processor cache.


Using the Black-Scholes formula for European options pricing


Goal

Speed up Black-Scholes computation of European options pricing.

Solution

Use Intel MKL vector math functions to speed up computation.

The Black-Scholes model describes the market behavior as a system of stochastic differential equations [Black73]. Call and put European options issued in this market are then priced according to the Black-Scholes formulae:

V_call = S_0·CDF(d_1) - X·exp(-rT)·CDF(d_2)
V_put  = X·exp(-rT)·CDF(-d_2) - S_0·CDF(-d_1)
d_1    = (ln(S_0/X) + (r + σ²/2)T) / (σ√T)
d_2    = (ln(S_0/X) + (r - σ²/2)T) / (σ√T)

where V_call/V_put are the present values of the call/put options, S_0 is the present price of the stock, X is the strike price, r is the risk-neutral rate, σ is the volatility, T is the maturity, and CDF is the cumulative distribution function of the standard normal distribution.

Alternatively, you can use the error function ERF, which has a simple relationship with the cumulative normal distribution function:

CDF(x) = 1/2 + 1/2·ERF(x/√2)

Source code: see the black-scholes folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Straightforward implementation of Black-Scholes

This code implements the closed form solution for pricing call and put options.

void BlackScholesFormula( int nopt,
    tfloat r, tfloat sig, const tfloat s0[], const tfloat x[],
    const tfloat t[], tfloat vcall[], tfloat vput[] )
{
    tfloat d1, d2;
    int i;

    for ( i=0; i<nopt; i++ )
    {
        d1 = ( LOG(s0[i]/x[i]) + (r + HALF*sig*sig)*t[i] ) /
             ( sig*SQRT(t[i]) );
        d2 = ( LOG(s0[i]/x[i]) + (r - HALF*sig*sig)*t[i] ) /
             ( sig*SQRT(t[i]) );
        vcall[i] = s0[i]*CDFNORM(d1) - EXP(-r*t[i])*x[i]*CDFNORM(d2);
        vput[i]  = EXP(-r*t[i])*x[i]*CDFNORM(-d2) - s0[i]*CDFNORM(-d1);
    }
}

The number of options is specified as the nopt parameter. The tfloat type is either float or double depending on the precision you want to use. Similarly LOG, EXP, SQRT, and CDFNORM map to single or double precision versions of the respective math functions. The constant HALF is either 0.5f or 0.5 for single and double precision, respectively.

In addition to nopt, the input parameters for the algorithm are s0 (present stock price), r (risk-neutral rate), sig (volatility), t (maturity), and x (strike price). The result is returned in vcall and vput (the present values of the call and put options, respectively).

It is assumed that r and sig are constant for all options being priced; the other parameters are arrays of floating-point values. The vcall and vput parameters are output arrays.

Discussion

Transcendental functions are at the core of the Black-Scholes formula benchmark. However, each option value depends on five parameters, and as the math is computed faster, the memory effects become more pronounced. Thus the number of array parameters is an important factor that can change the computation from being compute limited to memory bandwidth limited. Additionally, memory size constraints should be considered when pricing hundreds of millions of options.

The Intel C++ Compiler provides vectorization and parallelization controls that might help uncover the SIMD and multi-core potential of Intel Architecture with respect to the Black-Scholes formula. Optimized vectorized math functionality is available with the Short Vector Math Library (SVML) runtime library.

The Intel MKL Vector Mathematical Functions Library (VML) component provides highly tuned transcendental math functionality that can be used to further speed up formula computations.

There are several opportunities for optimization of the straightforward code: hoisting common subexpressions out of loops, replacing the CDFNORM function with the ERF function (which is usually faster), exploiting the relationship ERF(-x) = -ERF(x), and replacing the division by SQRT with multiplication by the reciprocal square root INVSQRT (which is usually faster).

Optimized implementation of Black-Scholes

void BlackScholesFormula( int nopt,
    tfloat r, tfloat sig, tfloat s0[], tfloat x[],
    tfloat t[], tfloat vcall[], tfloat vput[] )
{
    int i;
    tfloat a, b, c, y, z, e;
    tfloat d1, d2, w1, w2;
    tfloat mr = -r;
    tfloat sig_sig_two = sig * sig * TWO;

    for ( i = 0; i < nopt; i++ )
    {
        a = LOG( s0[i] / x[i] );
        b = t[i] * mr;
        z = t[i] * sig_sig_two;
        c = QUARTER * z;
        e = EXP ( b );
        y = INVSQRT( z );
        w1 = ( a - b + c ) * y;
        w2 = ( a - b - c ) * y;
        d1 = ERF( w1 );
        d2 = ERF( w2 );
        d1 = HALF + HALF*d1;
        d2 = HALF + HALF*d2;
        vcall[i] = s0[i]*d1 - x[i]*e*d2;
        vput[i]  = vcall[i] - s0[i] + x[i]*e;
    }
}

In this code INVSQRT(x) is either 1.0/sqrt(x) or 1.0f/sqrtf(x) depending on precision; TWO and QUARTER are the floating-point constants 2 and 0.25, respectively. Note that the put price is obtained from the call price through put-call parity: V_put = V_call - S_0 + X·exp(-rT).

Discussion

A few optimizations help generate effective code using the Intel® C/C++ compiler. See the User and Reference Guide for the Intel® C++ Compiler for more details about the compiler pragmas and switches suggested in this section.

Apply #pragma simd to tell the compiler to vectorize the loop and #pragma vector aligned to notify the compiler that the arrays are aligned (you need to properly align vectors at the memory allocation stage) and that it is safe to rely on aligned load and store instructions. Efficient vectorization, such as that available with SVML, can achieve a speedup of several times versus scalar code.

#pragma simd
#pragma vector aligned
for ( i = 0; i < nopt; i++ )

With these changes the code can take advantage of all available CPU cores. The simplest way is to add the autopar switch to the compilation line so that the compiler attempts to parallelize the code automatically. Another option is to use the standard OpenMP* pragma:

#pragma omp parallel for
for ( i = 0; i < nopt; i++ )

Further performance improvements are possible if you can relax the accuracy of math functions using Intel C++ Compiler options such as -fp-model fast, -no-prec-div, -ftz, -fimf-precision, -fimf-maxerror, and -fimf-domain-exclusion.

NOTE Linux* OS specific syntax is given for the compiler switches. See the User and Reference Guide for the Intel® C++ Compiler for more detail.

In massively parallel cases, the compute time of math functions can be low enough for memory bandwidth to emerge as the limiting factor for loop performance. This can impair the otherwise linear speedup from parallelism. In such cases memory bandwidth friendly non-temporal load/store instructions can help:

#pragma vector nontemporal
for ( i = 0; i < nopt; i++ )


The Intel MKL VML component provides highly tuned transcendental math functions that can help further improving performance. However, using them requires refactoring of the code to accommodate for the vector nature of the VML APIs. In the following code example, non-trivial math functions are taken from VML, while remaining basic arithmetic is left to the compiler.

A temporary buffer is allocated on the stack of the function to hold intermediate results of the vector math computations. It is important for the buffer to be aligned on the maximum applicable SIMD register size. The buffer size is chosen to be large enough for the VML functions to achieve their best performance (compensating for vector function startup cost), yet small enough to maximize cache residence of the data. You can experiment with the buffer size; a suggested starting point is NBUF=1024.

Intel MKL VML implementation of Black-Scholes

void BlackScholesFormula_MKL( int nopt,
    tfloat r, tfloat sig, tfloat * s0, tfloat * x,
    tfloat * t, tfloat * vcall, tfloat * vput )
{
    int i;
    tfloat mr = -r;
    tfloat sig_sig_two = sig * sig * TWO;

    #pragma omp parallel for \
        shared(s0, x, t, vcall, vput, mr, sig_sig_two, nopt) \
        default(none)
    for ( i = 0; i < nopt; i+= NBUF )
    {
        int j;
        tfloat *a, *b, *c, *y, *z, *e;
        tfloat *d1, *d2, *w1, *w2;
        __declspec(align(ALIGN_FACTOR)) tfloat Buffer[NBUF*4];

        // This computes vector length for the last iteration of the loop
        // in case nopt is not exact multiple of NBUF
        #define MY_MIN(x, y) ((x) < (y)) ? (x) : (y)
        int nbuf = MY_MIN(NBUF, nopt - i);

        a = Buffer + NBUF*0; w1 = a; d1 = w1;
        c = Buffer + NBUF*1; w2 = c; d2 = w2;
        b = Buffer + NBUF*2; e = b;
        z = Buffer + NBUF*3; y = z;

        // Must set VML accuracy in each thread
        vmlSetMode( VML_ACC );

        VDIV(nbuf, s0 + i, x + i, a);
        VLOG(nbuf, a, a);

        #pragma simd
        for ( j = 0; j < nbuf; j++ )
        {
            b[j] = t[i + j] * mr;
            a[j] = a[j] - b[j];
            z[j] = t[i + j] * sig_sig_two;
            c[j] = QUARTER * z[j];
        }

        VINVSQRT(nbuf, z, y);
        VEXP(nbuf, b, e);

        #pragma simd
        for ( j = 0; j < nbuf; j++ )
        {
            tfloat aj = a[j];
            tfloat cj = c[j];
            w1[j] = ( aj + cj ) * y[j];
            w2[j] = ( aj - cj ) * y[j];
        }

        VERF(nbuf, w1, d1);
        VERF(nbuf, w2, d2);

        #pragma simd
        for ( j = 0; j < nbuf; j++ )
        {
            d1[j] = HALF + HALF*d1[j];
            d2[j] = HALF + HALF*d2[j];
            vcall[i+j] = s0[i+j]*d1[j] - x[i+j]*e[j]*d2[j];
            vput[i+j]  = vcall[i+j] - s0[i+j] + x[i+j]*e[j];
        }
    }
}
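Note that VML_ACC above is presumably a sample-defined macro standing for one of the standard VML accuracy modes (VML_HA for high accuracy, VML_LA for low accuracy, or VML_EP for enhanced performance). The accuracy mode is thread-local, which is why vmlSetMode must be called inside the parallel region.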

For comparable precisions, Intel MKL VML can deliver 30-50% better performance versus Intel Compiler and SVML-based solutions if the problem is compute bound (the data fits in the L2 cache). In this case the latency of cache read/write operations is masked by computations. Once memory bandwidth emerges as a factor with the growth of the problem size, it becomes more important to optimize memory usage, and the Intel VML solution based on intermediate buffers can lose its advantage to the no-buffering one-pass solution with SVML.

Routines Used

vmlSetMode — Sets a new accuracy mode for VML functions according to the mode parameter and stores the previous VML mode to oldmode.

vsdiv/vddiv — Performs element-by-element division of vector a by vector b.

vsln/vdln — Computes the natural logarithm of vector elements.

vsinvsqrt/vdinvsqrt — Computes an inverse square root of vector elements.

vsexp/vdexp — Computes an exponential of vector elements.

vserf/vderf — Computes the error function value of vector elements.


Multiple simple random sampling without replacement


Goal

Generate K >> 1 simple random length-M samples without replacement from a population of size N (1 ≤ M ≤ N).

Solution

For exact definitions and more details of the problem, see [SRSWOR]. Use the following implementation of a partial Fisher-Yates Shuffle algorithm [KnuthV2] and Intel MKL random number generators (RNG) to generate each sample:

Partial Fisher-Yates Shuffle algorithm

A2.1: (Initialization step) let PERMUT_BUF contain the natural numbers 1, 2, ..., N
A2.2: for i from 1 to M do:
A2.3:     generate random integer X uniform on {i,...,N}
A2.4:     interchange PERMUT_BUF[i] and PERMUT_BUF[X]
A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i] = PERMUT_BUF[i]
End.

The program that implements the algorithm conducts 11 969 664 experiments. Each experiment, which generates a sequence of M unique random natural numbers from 1 to N, is actually a partial length-M random shuffle of the whole population of N elements. Because the main loop of the algorithm works as a real lottery, each experiment is called "lottery M of N" in the program.

The program uses M=6 and N=49, stores the result samples (sequences of length M) in a single array RESULTS_ARRAY, and uses all available parallel threads.

Source code: see the lottery6of49 folder in the samples archive available at http://software.intel.com/en-us/mkl_cookbook_samples.

Parallelization

#pragma omp parallel
{
    thr = omp_get_thread_num(); /* the thread index */
    VSLStreamStatePtr stream;
    /* RNG stream initialization in this thread */
    vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );
    ... /* Generation of experiment samples (in thread number thr) */
    vslDeleteStream( &stream );
}

The code exploits all CPUs and all available processor cores by using the OpenMP* #pragma omp parallel directive. The array of experiment results RESULTS_ARRAY is broken down into THREADS_NUM portions, where THREADS_NUM is the number of available CPU threads, and each thread (parallel region) processes its own portion of the array.


Intel MKL basic random number generators with the VSL_BRNG_MT2203 parameter easily support an independent parallel stream in each thread.

Generation of experiment samples

/* A2.1: Initialization step */
/* Let PERMUT_BUF contain natural numbers 1, 2, ..., N */
for( i=0; i<N; i++ ) PERMUT_BUF[i]=i+1; /* using the set {1,...,N} */

for( sample_num=0; sample_num<EXPERIM_NUM/THREADS_NUM; sample_num++ ){
    /* Generate next lottery sample (steps A2.2, A2.3, and A2.4): */
    Fisher_Yates_shuffle(...);
    /* A2.5: Copy step */
    for(i=0; i<M; i++)
        RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];
}

This code implements the partial Fisher-Yates Shuffle algorithm in each thread.

In the case of simulating many experiments, the Initialization step is only needed once because, at the beginning of each experiment, the order of the natural numbers 1...N in the PERMUT_BUF array does not matter (as in a real lottery).

Fisher_Yates_shuffle function

Fisher_Yates_shuffle (...)
{
    /* A2.2: for i from 0 to M-1 do */
    for(i=0; i<M; i++) {
        /* A2.3: generate random natural number X from {i,...,N-1} */
        j = Next_Uniform_Int(...);
        /* A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X] */
        tmp = PERMUT_BUF[i];
        PERMUT_BUF[i] = PERMUT_BUF[j];
        PERMUT_BUF[j] = tmp;
    }
}

Each iteration of the loop A2.2 works as a real lottery step: it extracts a random item X from the bin with the remaining items PERMUT_BUF[i], ..., PERMUT_BUF[N] and puts the item X at the end of the results row PERMUT_BUF[1], ..., PERMUT_BUF[i]. The algorithm is partial because it does not generate the full permutation of length N, but only a part of length M.

NOTE Unlike the pseudocode that describes the algorithm, the program uses zero-based arrays.

Discussion

In step A2.3, the program calls the Next_Uniform_Int function to generate the next random integer X, uniform on {i, ..., N-1} (see the source code for details). To exploit the full power of vectorized RNGs from Intel MKL, but to minimize vectorization overheads, the generator must generate a sufficiently large vector D_UNIFORM01_BUF of size RNGBUFSIZE that fits in the L1 cache. Each thread uses its own buffer D_UNIFORM01_BUF and an index D_UNIFORM01_IDX pointing just after the last used random number from that buffer. In the first call to the Next_Uniform_Int function (or when all random numbers from the buffer have been used), the full buffer of random numbers is regenerated by calling the vdRngUniform function with the length RNGBUFSIZE, and the index D_UNIFORM01_IDX is set to zero (earlier in the program):

vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );
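A minimal sketch of this buffering scheme, illustrative only; the sample's actual Next_Uniform_Int differs in details such as argument passing and where the scaling is performed:

/* Return the next pre-scaled random integer, refilling the per-thread
 * buffer when it is exhausted (thread-local variables are assumed). */
static unsigned int Next_Uniform_Int(VSLStreamStatePtr stream)
{
    if (D_UNIFORM01_IDX == RNGBUFSIZE) {
        vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream,
                      RNGBUFSIZE, D_UNIFORM01_BUF, 0.0, 1.0 );
        /* Integer scaling step goes here (see below) */
        D_UNIFORM01_IDX = 0;
    }
    return I_RNG_BUF[D_UNIFORM01_IDX++];
}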


Because Intel MKL only provides generators of random values with the same distribution, while step A2.3 requires random integers on different intervals, the buffer is filled with double-precision random numbers uniformly distributed on [0;1) and then, in the Integer scaling step, these double-precision values are converted to fit the needed integer intervals:

number 0   distributed on {0,...,N-1}   = 0   + {0,...,N-1}
number 1   distributed on {1,...,N-1}   = 1   + {0,...,N-2}
...
number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}
(then repeat the previous M steps)
number M   distributed on: see (0)
number M+1 distributed on: see (1)
...
number 2*M-1 distributed on: see (M-1)
(then again repeat the previous M steps)
... and so on

Integer scaling

/* Integer scaling step */
for(i=0;i<RNGBUFSIZE/M;i++)
    for(k=0;k<M;k++)
        I_RNG_BUF[i*M+k] =
            k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-k));

Here RNGBUFSIZE is a multiple of M.

See [SRSWOR] for performance notes related to this code.

Routines Used

vslNewStream — Creates and initializes an RNG stream.

vdRngUniform — Generates double-precision numbers uniformly distributed over the interval [0;1).

vslDeleteStream — Deletes an RNG stream.

mkl_malloc — Allocates memory buffers aligned on 64-byte boundaries for the results and population.

mkl_free — Frees memory allocated by mkl_malloc.


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Bibliography

Sparse Solvers

[Amos10] Ron Amos. Lecture 3: Solving Equations Using Fixed Point Iterations, University of Wisconsin CS412: Introduction to Numerical Analysis, 2010 (available at http://pages.cs.wisc.edu/~holzer/cs412/lecture03.pdf).

[Smith86] G.D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods, Oxford Applied Mathematics & Computing Science Series, Oxford University Press, USA, 1986.

Numerics

[Black73] Fischer Black and Myron S. Scholes. The Pricing of Options and Corporate Liabilities. Journal of Political Economy, v81 issue 3, 637-654, 1973.

[Kargupta02] Hillol Kargupta, Krishnamoorthy Sivakumar, and Samiran Ghost. A Random Matrix-Based Approach for Dependency Detection from Data Streams, Proceedings of Principles of Data Mining and Knowledge Discovery, PKDD 2002: 250-262. Springer, New York, 2002.

[KnuthV2] D. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 3.4.2: Random Sampling and Shuffling, Algorithm P, 3rd Edition, Addison-Wesley Professional, 2014.

[SRSWOR] Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement, available at https://software.intel.com/en-us/articles/using-intel-c-composer-xe-for-multiple-simple-random-sampling-without-replacement.

[Zhang12] Zhang Zhang, Andrey Nikolaev, and Victoriya Kardakova. Optimizing Correlation Analysis of Financial Market Data Streams Using Intel® Math Kernel Library, Intel Parallel Universe Magazine, 12: 42-48, 2012.

