Case Study
Intel® Xeon® Processor E5 Family, Intel® Cluster Studio XE
High-Performance Computing
Delivering High-Speed Supercomputer Services
Kyoto University builds a 1,202-socket cluster supercomputer to deliver advanced information services
to research institutes throughout Japan using the Intel® Xeon® processor E5 family
• Delivery of high-speed supercomputer services to researchers and research institutions
inside and outside academia
• Installation of cluster supercomputer with superior node and network performance
• Lower heat generation and improved power efficiency
• Cluster supercomputer using Intel® Xeon® processor E5 family
• Intel® Cluster Studio XE suite of tools for MPI developers
Highly Versatile Supercomputer for Wide Variety of Research Applications

The Academic Center for Computing and Media Studies at Kyoto University (ACCMS) undertakes research and development aimed at advanced applications of IT platforms and media. It feeds the results of this research back into enhancements to the educational environment. As one of Japan’s national centers for shared access to IT infrastructure, it also provides sophisticated computing services to researchers and other users at other universities and research institutions throughout Japan. Director of ACCMS, Professor Hiroshi Nakashima, PhD, described the center by saying, “Our two key roles are to provide IT infrastructure across Kyoto University and to conduct academic research into the building of advanced IT platforms for the future. In terms of joint research, we have an important mission as part of a network of eight computing centers at major national universities.” ACCMS is actively involved in joint research and other collaborations with both private-sector companies and research institutions, with research and development organized into five departments. Besides the existing Department of Network Research, Department of Computing Research, Department of Educational Support, and Department of Digital Content Research, the center has also added a new department called the Collaborative Research Laboratories to work on creating new IT platforms and services.

At the Department of Computing Research, to which Professor Nakashima belongs, work on supercomputers extends beyond research and development of hardware and software, and includes system operation and user support by technical staff in the IT Infrastructure Group of the IT Section. The shared access service supplies computing services, not only to researchers at Kyoto University, but also to academic researchers, private-sector businesses, and other users throughout Japan. Applications include large scientific and technical computations, computational chemistry, structural analysis, statistical processing, and visualization. Professor Nakashima commented that, “The supercomputers made available by ACCMS can be used by all sorts of different research institutions in a variety of fields and disciplines. This means, rather than being mission-oriented, our service must maintain a highly versatile approach. The majority of use comes from science and engineering research, with examples of large simulations on which the center has collaborated including simulations of the cycle of major earthquake events (Figure 1), and particle simulations of plasma environments in space (Figure 2).”

Academic Center for Computing and Media Studies, Kyoto University
Yoshida-Honmachi, Sakyo-ku, Kyoto
Established April 1964
Research and development aimed at advanced applications of IT platforms and media and providing, operating, and administering IT infrastructure within the university

Figure 1. Computed image from simulation of major

Intel Xeon Processor Family
“Adoption of the Intel Xeon processor family provides users with a high-speed computing environment, with network performance that improves in proportion with node performance. It also reduces power consumption and dramatically enhances IT service quality.”
Professor Hiroshi Nakashima, Ph.D.
Director of Academic Center for Computing and Media Studies, Kyoto University

Extracting Maximum Performance from Large Cluster Supercomputers

ACCMS installed its first supercomputer in 1985. While the early machines were vector supercomputers, they have switched to more versatile scalar machines since 2004. Discussing their policies for supercomputer installation and operation, Professor Nakashima said, “To keep up with other academic institutions around the world, it is important that we progress in line with the latest developments. In order to do this, we configure systems as Linux* clusters, an architecture that is in widespread use internationally. Along with its versatility, other major advantages of this approach include performance, cost, and application development efficiency.”

Evaluation of High-Speed I/O Bus Using PCI Express* 3.0 and Adoption of Intel® Xeon® Processor E5 Family, and Implementation of Compilers and Other Developer Tools

To provide its university and other users with an advanced computing environment, ACCMS has regularly updated its supercomputers every few years. In 2012, with the HPC server installed in 2008 coming up for replacement, the center undertook evaluation work to select its latest supercomputer system. After comparing and testing machines from a number of vendors, they configured a large cluster system using an HPC server fitted with the Intel Xeon processor E5 family. Explaining their reasons for choosing Intel processor-based CPUs, Professor Nakashima said, “An essential requirement when selecting the processor was that it use the highly versatile x86 64-bit architecture. This led us to the Intel Xeon processor E5 family, which allows high-speed exchange of data between nodes using a network I/O bus that supports PCI Express 3.0.”

PCI Express 3.0 is a high-speed bus transmission standard able to achieve transfer speeds up to 8GT/s (giga transfers per second) per lane. At the time of writing, the Intel Xeon processor E5 family was the only processor that supported PCI Express 3.0. Since it allows faster network access than the PCI Express 2.1 standard that currently is the mainstream (maximum transfer speed: 5GT/s), the advantages of PCI Express 3.0 cannot be overestimated.

This system upgrade also included the installation of the Intel Cluster Studio XE suite of tools for message-passing interface (MPI) developers. Intel Cluster Studio XE is a package of tools for HPC clusters, including C/C++ and Fortran compilers, performance analysis tools, an MPI library, and MPI application performance analysis tools. Professor Nakashima explained the reasons for selecting Intel Cluster Studio XE by saying, “We were impressed by the software’s high degree of affinity with Intel processors and its compatibility with other x86 processors. Benchmark testing of the various tools also gave favorable results that were at a level we found satisfactory. The usability of the tools was also attractive, as was the extensive range of software included in the suite.”

Figure 2. Large-scale particle simulation of interaction
between ion engine and plasma
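
The PCI Express transfer rates quoted in this section (8GT/s per lane for 3.0, 5GT/s for the earlier generation) are raw signalling rates. As a rough sketch of what they mean for usable bandwidth, the Python snippet below converts them to gigabytes per second. The line encodings (8b/10b for the 5GT/s generation, 128b/130b for 3.0) come from the PCI Express specifications rather than from this case study, and the x8 link width is a hypothetical value chosen only for illustration.

```python
# Rough per-lane bandwidth estimate from the signalling rates quoted above.
# Assumptions (not stated in the case study): 8b/10b encoding at 5 GT/s,
# 128b/130b encoding at 8 GT/s, and a hypothetical x8 link width.

def lane_bandwidth_gbytes(gt_per_s, payload_bits, encoded_bits):
    """Usable one-direction bandwidth of a single lane in GB/s."""
    usable_gbits = gt_per_s * payload_bits / encoded_bits
    return usable_gbits / 8  # 8 bits per byte


gen2_lane = lane_bandwidth_gbytes(5.0, 8, 10)      # 5 GT/s generation
gen3_lane = lane_bandwidth_gbytes(8.0, 128, 130)   # PCI Express 3.0

LANES = 8  # hypothetical link width, for illustration only
speedup = gen3_lane / gen2_lane

print(f"5 GT/s lane: {gen2_lane:.3f} GB/s, x{LANES} link: {gen2_lane * LANES:.2f} GB/s")
print(f"8 GT/s lane: {gen3_lane:.3f} GB/s, x{LANES} link: {gen3_lane * LANES:.2f} GB/s")
print(f"PCI Express 3.0 advantage per lane: {speedup:.2f}x")
```

Under these assumptions the gain per lane is close to a factor of two, larger than the 8:5 ratio of the raw rates, because the newer encoding wastes fewer bits.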
The new supercomputer system entered full operation. The system was made up of one massively parallel processor (MPP) system and two cluster systems, each comprising an InfiniBand* network and HPC server fitted with the Intel Xeon processor E5 family. The peak computational performance of the overall system was 553.9 TFlops.

The Laurel subsystem has a high degree of compatibility with PC clusters, comprising 601 nodes with 64 GB of memory and 16 cores per node. It has a peak computational performance of 242.5 TFlops and total memory capacity of 38 TB (Figure 3). The Cinnamon subsystem has 16 nodes with 1.5 TB of memory and 32 cores per node. Despite the small number of nodes, the large amount of memory per node means it will be used for applications that demand a large memory capacity. It has a theoretical peak computational performance of 10.6 TFlops and total memory capacity of 24 TB (Figure 4).

Improvements in Node Performance and Power Efficiency Deliver High-Speed Analysis at Low Cost

With 7.9 times the computational performance, 6.1 times the memory capacity, and 5.7 times the overall physical capacity of its predecessor, the new system represents a major step up in scale. The two clusters fitted with the Intel Xeon processor E5 family also deliver significant improvements in performance and power consumption. The 601-node Laurel subsystem more than doubles node performance while reducing power consumption by more than half. This corresponds to a roughly six-fold improvement in power efficiency. Referring to another major success, Professor Nakashima said, “Of particular significance is how the benefits of using PCI Express 3.0 have seen network performance improve roughly in proportion with node performance.”

While the Cinnamon subsystem, with its large memory capacity, has only improved node performance by about 20 to 30 percent, power consumption has been cut to one-tenth that of the system it is replacing. Also, the ability of the entire system to fit into a single rack means it takes up only one-tenth as much space as the previous system. This cut installation costs significantly.

The new supercomputer system also brings benefits for research. The improvement in underlying performance means users can obtain the results of even large and complex calculations quickly and cost-efficiently. In an academic context, the faster speed will prove valuable because it allows calculations to be executed for more parameters than would otherwise be possible in the limited time available. The new system also has benefits for ACCMS in its role as a service provider. In particular, Professor Nakashima notes that, “Under current operating practices in which usage fees are calculated based on the amount of power that users consume, a major benefit is the near six-fold improvement in power efficiency, which means that we can provide users with roughly six times as much computing capacity within the allocated budget.”

Regarding their future plans, ACCMS has already decided to install an additional new supercomputer fitted with the next generation of Intel Xeon processors in 2014. The new subsystem is expected to have a peak computational performance of 400 TFlops. Combined with the existing system, this will result in a supercomputer with performance approaching the 1 PFlops range. Looking to the future, the center is currently at the stage of testing the next generation of technology, with a presumption that the processors used will be made by Intel. “I have been entirely satisfied with the Intel Xeon processor E5 family and Intel Cluster Studio XE products used in our new system,” said Professor Nakashima. “I look forward to Intel’s ongoing technological innovation and its development of fascinating products.”

For its part, through ongoing technological innovations in its development tools and the Intel Xeon processor, Intel intends to contribute to further enhancements to the IT infrastructure that ACCMS is seeking to

Intel Cluster Studio XE
Intel Cluster Studio XE is a suite of development tools for MPI applications. Combining a number of highly reliable tools, including Intel’s cluster software, advanced threading/memory consistency detection, and performance profiling, the software delivers significant improvements in the performance and scalability of cluster applications.

Products Included in Intel Cluster Studio XE
• Intel® Composer XE
Includes C/C++ and Fortran compilers, performance libraries for numerical calculation (Intel® MKL) and graphics processing (Intel® IPP), and a multithreading library (Intel® TBB).
• Intel® VTune™ Amplifier XE
An analysis tool for the rapid diagnosis of performance bottlenecks. Use of templates lets you retrieve the information you need with a few mouse clicks. The intuitive user interface keeps operation simple.
• Intel® Inspector XE
A utility with advanced functions for detecting memory and threading errors. Supports the dynamic detection of memory problems such as memory leaks or corruption, and multithreading errors such as data races or deadlocks.
• Intel® MPI Library
An MPI library with the scalability to handle more than 90,000 processes. Enhances the execution of applications on Intel® platform clusters.
• Intel® Trace Analyzer/Collector
A performance analysis tool for MPI applications. Supports event-based tracing of applications executing in parallel. Collected trace data are displayed graphically to simplify the identification of performance bottlenecks.

Figure 3. Configuration of Laurel Subsystem with 601 nodes
Per node: Intel Xeon processor E5-2670, 8C x 2 = 16C; memory 8GB DDR3-1600 x 8 = 64GB; 102.4GB/s
• No. of nodes: 601
• No. of processors (cores): 16 (2 × 8)
• 16C x 601 = 9616C (+64 x NVIDIA M2090 = 42.6TFlops)
• Theoretical peak computational performance: 242.5 TFlops
• Total memory capacity: 38 TB

Figure 4. Configuration of Cinnamon Subsystem with 16 nodes, each with 1.5 TB of memory
Per node: Intel Xeon processor E5-4650L, 8C x 4 = 32C; memory 32GB DDR3-1066 x 48 = 1.5TB; 136.4GB/s
• No. of nodes: 16
• No. of processors (cores): 32 (4 × 8)
• Theoretical peak computational performance: 10.6 TFlops
• Total memory capacity: 24 TB
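
The figures quoted for the Laurel and Cinnamon subsystems can be cross-checked with simple arithmetic. The Python sketch below is a back-of-envelope check, not Intel's or ACCMS's own calculation. It assumes a 2.6 GHz base clock and 8 double-precision floating-point operations per cycle per core for both processor models; neither value is stated in this case study. The 42.6 TFlops GPU contribution is taken from the Laurel configuration figure. The performance and power ratios passed to `efficiency_gain` for Laurel are hypothetical values, chosen only to show how "more than double" the performance at "less than half" the power could combine into a roughly six-fold gain.

```python
# Back-of-envelope check of the quoted system figures.
# Assumptions (not stated in the case study): 2.6 GHz base clock and
# 8 double-precision flops per cycle per core for both CPU models.

CLOCK_GHZ = 2.6        # assumed base clock
FLOPS_PER_CYCLE = 8    # assumed double-precision flops per cycle per core


def cpu_peak_tflops(nodes, cores_per_node):
    """Theoretical CPU peak in TFlops: cores x clock x flops per cycle."""
    gflops = nodes * cores_per_node * CLOCK_GHZ * FLOPS_PER_CYCLE
    return gflops / 1000.0


def efficiency_gain(performance_ratio, power_ratio):
    """Performance-per-watt of the new system relative to the old one."""
    return performance_ratio / power_ratio


# Laurel: 601 nodes x 16 cores, plus the GPU contribution quoted in the figure.
laurel_cpu = cpu_peak_tflops(601, 16)
laurel_total = laurel_cpu + 42.6

# Cinnamon: 16 nodes x 32 cores.
cinnamon = cpu_peak_tflops(16, 32)

# Total memory in TB (decimal units).
laurel_memory_tb = 601 * 64 / 1000.0   # 64 GB per node
cinnamon_memory_tb = 16 * 1.5          # 1.5 TB per node

# Hypothetical ratios: 2.4x node performance at 0.4x the power.
laurel_gain = efficiency_gain(2.4, 0.4)
# Quoted ratios: 20 to 30 percent more performance at one-tenth the power.
cinnamon_gain_low = efficiency_gain(1.2, 0.1)
cinnamon_gain_high = efficiency_gain(1.3, 0.1)

print(f"Laurel peak:   {laurel_total:.1f} TFlops (CPU part {laurel_cpu:.1f})")
print(f"Cinnamon peak: {cinnamon:.1f} TFlops")
print(f"Memory: Laurel {laurel_memory_tb:.1f} TB, Cinnamon {cinnamon_memory_tb:.1f} TB")
print(f"Laurel efficiency gain (hypothetical ratios): {laurel_gain:.1f}x")
print(f"Cinnamon efficiency gain: {cinnamon_gain_low:.0f}x to {cinnamon_gain_high:.0f}x")
```

Under these assumptions the computed peaks land close to the quoted 242.5 TFlops and 10.6 TFlops. Exactly doubling performance while exactly halving power would give only a four-fold efficiency gain, so the quoted six-fold figure implies that both ratios are comfortably better than 2 and one-half.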
Find the solution that’s right for your organization. Contact your Intel representative,
visit Intel’s Business Success Stories for IT Managers,
or explore the IT Center.
For more information on the Intel Xeon processor, visit
Performance tests and ratings contained within this document are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance
of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations. Intel
does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others
where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
When used on compatible microprocessors, Intel® compilers will not necessarily achieve the same level of optimization as achieved on Intel microprocessors. This includes optimization for the Intel®
Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Intel® Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction sets, as well other optimization.
Intel assumes no responsibility for the provision, functions, or effects of optimization on microprocessors not made by Intel. The microprocessor-specific optimization performed by this product is
intended solely for Intel microprocessors. Certain optimization that is not specific to the Intel® microarchitecture is reserved for use with Intel microprocessors. For more information about the specific
instruction sets to which this disclaimer applies, please refer to the user reference guides for the respective products.
any proprietary rights, relating to use of information in this specification. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein.
Intel, the Intel logo, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
Microsoft and Windows are trademarks of Microsoft Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
2012 Intel Corporation. All rights reserved
327412 -001US