arXiv:1108.5124v1 [astro-ph.IM] 25 Aug 2011 - Last modified

Challenges for LSST scale data sets
Michael J. Way
Abstract The Large Synoptic Survey Telescope (LSST) simulator being built
by Andy Connolly and collaborators is an impressive undertaking and should
make working with LSST in the beginning stages far easier than it was
initially with the Sloan Digital Sky Survey (SDSS). However, I would like to
focus on an equally important problem that has not yet been discussed here,
but that the community will need to address in the coming years: can we deal
with the flood of data from LSST, and will we need to rethink the way we
work?
1 Changing the way we work: From 2MASS to SDSS
Perhaps the best way to start things is to compare two large area sky surveys
implementing the “standard way” of distributing data in their own time: The
1990s era Two Micron All Sky Survey (2MASS; Skrutskie et al., 2006) and
the 2000s era Sloan Digital Sky Survey (SDSS; York et al., 2000).
Initially if a researcher wanted to access the 2MASS1 survey data one
could obtain a 5 DVD set (double-sided) of the catalog data. The data were
bar-delimited ascii text which could be easily read by everything from legacy
scripting programs like awk to SQL databases like MySQL or Postgres. The
ascii source catalog was about 43GB in size if copied from the DVDs to local
disk. The full-fidelity Atlas Images (∼10 terabytes in size and not available
via DVD) were later accessible via the 2MASS Image Services website2.
Michael J. Way
NASA Goddard Institute for Space Studies, 2880 Broadway, New York, NY, USA,
e-mail: michael.j.way@nasa.gov
1 http://www.ipac.caltech.edu/2mass
2 http://irsa.ipac.caltech.edu/applications/2MASS/IM
In essence, the average astronomer had to change almost nothing about
the way that they or their Ph.D. advisor had worked over the previous 30
years. For example, instead of ordering 9-track (1/2 inch = 12.7mm), exabyte
(8mm), or DDS/DAT (4mm) tapes from the observatory (or bringing them
along after an observing run) one simply ordered the 2MASS DVDs. This
was possible due to increases in computer CPU speed and memory capacity
combined with modest input/output (IO) improvements over the previous
3-4 decades.
All of this changed with the SDSS. It may have been possible to distribute
DVD copies of the SDSS in a similar way to that of the 2MASS, but the scale
had moved from gigabytes to terabytes. Having a few terabytes on a local
computer in the early 2000s was not common, so the SDSS took a different
route. Working with top-notch computer scientists such as Jim Gray of Microsoft, they decided that much of the SDSS should be accessible via SQL
query. There was certainly some anxiety amongst much of the community
when they realized that their mode of obtaining data from the SDSS was
going to be markedly different from that of the past. Hence, it took the community some time to learn this new way of working, and certainly some early
publications with SDSS data not published by the SDSS team were problematic because, for example, their authors did not realize that they could select the
quality of the photometry at a fine level, unlike with 2MASS, where this
was relatively straightforward.
The SDSS is probably the existing data set most similar to what
the LSST will look like and how we will interact with it. Currently most users
of SDSS use the casjobs3 interface to obtain their data. The back end is a
SQL database tied to a front end presented to the user as a web interface
where SQL queries are entered. The database comes with a Schema Browser4
that allows one to explore items of possible interest. There are also a host of
on-line tutorials that one can go through to understand how to make the queries,
and many authors also publish their SDSS queries in the appendices of their
publications. However, not enough authors do the latter in my opinion, and
hence it is often impossible to replicate the data that people are using if the
original author does not publish or cannot recall their original query.
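As a rough illustration of this query-then-analyse workflow, the sketch below runs a SQL selection against a tiny in-memory SQLite database and keeps the exact query text next to the returned rows, so the selection could be repeated or printed in an appendix. The table `photo_objects`, its columns, and the helper names are hypothetical and are not the real SDSS schema or the casjobs interface.

```python
import sqlite3

# Hypothetical miniature catalog; table and column names are invented
# for illustration and are NOT the real SDSS schema.
QUERY = """
SELECT obj_id, ra_deg, dec_deg, r_mag
FROM photo_objects
WHERE r_mag < ? AND clean = 1
ORDER BY obj_id
"""

def make_demo_db():
    """Build a four-row in-memory catalog to query against."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photo_objects "
        "(obj_id INTEGER, ra_deg REAL, dec_deg REAL, r_mag REAL, clean INTEGER)"
    )
    conn.executemany(
        "INSERT INTO photo_objects VALUES (?, ?, ?, ?, ?)",
        [
            (1, 150.10, 2.20, 17.5, 1),
            (2, 150.20, 2.30, 21.0, 1),  # fainter than the magnitude cut
            (3, 150.30, 2.40, 16.9, 0),  # flagged as not clean
            (4, 150.40, 2.50, 18.2, 1),
        ],
    )
    return conn

def run_query(conn, r_mag_limit):
    """Run the selection and return the rows together with the query used."""
    rows = conn.execute(QUERY, (r_mag_limit,)).fetchall()
    # Storing the SQL and its parameters with the result makes the
    # selection reproducible later.
    return {"sql": QUERY.strip(), "params": (r_mag_limit,), "rows": rows}

result = run_query(make_demo_db(), 19.0)
print(result["sql"])
print(result["rows"])
```

The returned rows can then be handed to whatever local tools one already uses, which is the essence of what is later called the SDSS mode.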
3 http://casjobs.sdss.org
4 http://casjobs.sdss.org/dr7/en/help/browser/browser.asp
In the ten years since the creation of the SDSS, disk data storage density
has continued its inexorable rise (see Sect. 2) and today one could in fact
host the entire SDSS database relatively easily and cheaply on a modestly
sized desktop computer (e.g. an off-the-shelf workstation with 4 × 2TB disks
would do the trick). Again, one could (in theory) dump the entire casjobs
catalog to a giant ascii file akin to that of the 2MASS and use awk or your
favorite Fortran program on it. One would need a system that can use 64-bit
addressing, but that is fairly standard today (2011). Of course the IO will
make things slow (∼4 hours to read a 1TB disk sequentially), but nonetheless
it is in theory quite possible. The question then arises: will one be able to
work with LSST in the same manner as the SDSS given the inexorable rise
of faster CPUs, memory, and IO?
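The "∼4 hours per terabyte" figure can be checked with a one-line estimate. The sketch below assumes a sustained sequential read rate of 70 MB/s, a value chosen here only because it reproduces the quoted time; the text does not state the rate that was used.

```python
TB = 1e12  # bytes, decimal terabyte
MB = 1e6   # bytes, decimal megabyte

def sequential_read_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to stream size_bytes once at a constant read rate."""
    return size_bytes / throughput_bytes_per_s / 3600.0

# Assumed sustained rate (not given in the text): 70 MB/s.
hours_1tb = sequential_read_hours(1 * TB, 70 * MB)
print(f"1 TB at 70 MB/s: {hours_1tb:.1f} hours")
```

Under the same assumption, a single pass over the 4 × 2TB workstation mentioned above would take roughly eight times as long if the disks are read one after another.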
2 Changing the way we work: From SDSS to LSST
As we consider the move from SDSS sized data sets to LSST the questions
that people like us might ask at this stage are fairly straightforward:
1. Will one be able to have a copy of the LSST data-set on a local desktop?
This will allow researchers to continue their pre-SDSS era methods of data
interrogation. This is what we like to call the 2MASS mode.
2. Will one still utilize a casjobs type web-query interface to obtain LSST
data of interest and then use legacy home-grown tools to work on the
data? This is what we call the SDSS mode.
3. Must one completely change the way one works with LSST scale data sets?
Will “data locality” be required – will one have to do all of the operations
to obtain a project’s scientific goals on the database directly? This may
be called the LSST mode.
To attempt to answer these questions we have to look at several other
factors discussed in the following subsections.
2.1 LSST Database Size and possible architecture
We heard from Andy Connolly and Kirk Borne at this conference that the
LSST query database will be of order 10 petabytes (PB) in size, while there
will be around 60PB of images available after 10 years of operation. It turns
out that query time scales almost linearly with the size of the database. Given
historical trends in CPU, memory, storage and IO this means one should be
able to derive catalogs and do joins on tables in a future LSST database as
we do today with SDSS casjobs. However, there are caveats related to IO
that will be discussed later.
While query time scales linearly with database size, the same cannot be said
of the kinds of operations that scientists would prefer to do on the data. For
example, classification, clustering, and density estimation are all normally O(N^2)
or worse. However, earlier today Alex Gray showed us that his group has
managed to make a host of algorithms O(N) that are normally considered to
be O(N^2).
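To make the gap between O(N) and O(N^2) concrete, the following sketch estimates run times for a few catalog sizes. The object counts and the rate of 10^9 elementary operations per second are illustrative assumptions, not LSST numbers, and each "operation" is idealised as a single step.

```python
def seconds_needed(n_objects, exponent, ops_per_second=1e9):
    """Idealised run time for an algorithm costing n_objects**exponent steps."""
    return n_objects ** exponent / ops_per_second

# Hypothetical catalog sizes, chosen only to show the trend.
for n in (1e6, 1e8, 1e10):
    linear = seconds_needed(n, 1)
    quadratic = seconds_needed(n, 2)
    print(f"N={n:.0e}: O(N) {linear:.1e} s, O(N^2) {quadratic:.1e} s")
```

At N = 10^10 the quadratic estimate is 10^11 seconds, which is thousands of years, while the linear one is ten seconds; this is why linear-time reformulations matter so much at this scale.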
Regardless, this points to some interesting issues. Assuming petabyte
database sizes, the need to do operations that are O(N^2), and the need
to look at a large fraction of the stored data (that will not fit in RAM), how
are researchers going to do these things on the LSST database of tomorrow?
Let’s touch on the possible need for “data locality”. Normally one should
only consider moving the data from the source of the data if one needs more
than 100,000 CPU cycles per byte of data (Bell et al., 2006). The kinds of
applications this brings to mind are SETI@home or cryptography. Thankfully most science applications are more information intensive with CPU to
IO ratios less than 10,000 to 1. This means that we may have two reasons to
consider the possibility that we will not actually download the data to our
local machine/data-center: the size of the database is too large (petabyte
scale) and we have CPU to IO ratios of less than 100,000 to 1. We will address these issues in some detail in the next section, but for now let’s assume
we will need to do some calculations at the site of the database itself.
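The rule of thumb quoted from Bell et al. (2006) can be written as a simple test. This is only a sketch: the function name and the example workloads are hypothetical, and the only number taken from the text is the threshold of 100,000 CPU cycles per byte.

```python
# Threshold quoted in the text from Bell et al. (2006).
MOVE_THRESHOLD_CYCLES_PER_BYTE = 100_000

def worth_moving_data(cpu_cycles, bytes_touched):
    """True if the job is compute-heavy enough to justify moving its data."""
    return cpu_cycles / bytes_touched > MOVE_THRESHOLD_CYCLES_PER_BYTE

# Hypothetical workloads, for illustration only.
science_job = worth_moving_data(cpu_cycles=1e4 * 1e12, bytes_touched=1e12)
crypto_like_job = worth_moving_data(cpu_cycles=1e6 * 1e9, bytes_touched=1e9)
print(science_job, crypto_like_job)
```

A job at 10,000 cycles per byte falls below the threshold, so by this rule the computation should go to the data rather than the other way round.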
The LSST has teamed up with several industry partners to develop a new
database to host the LSST called SciDB5. The current plan is to host this
database in several different geographic locations (obviously to avoid a single
point of failure and to handle the anticipated load), and they also currently
plan to have an R interface for “expert users”. As I mentioned during my
talk, I think this is an excellent idea, but I hope the designers will consider
adding additional languages such as Python, which currently has wrappers to
support a host of useful tools commonly used by the current generation of
younger astronomers6.
2.2 Moving the data around – can I have a local copy and make use of it?
Will one be able to download and store the LSST database to a desktop
computer in 10 years time? If one wishes to download 1 petabyte over a
dedicated 1 gigabit/second line (in common use today) it will take ∼100 days.
In 9 years let’s assume everyone will have 10 gigabit/second connections
(desktop network speed has not grown at the same accelerated rate
as storage or CPU), so that means it will only take about 10 days. That
doesn’t sound unreasonable. Now one has to ask, how much will it cost to
own 1 petabyte of storage? One can look at historical trends documented in
several places on the internet to get some idea7. In Table 1 you can see what
disk costs were 10 years ago, what they are today, and by extrapolation what they will be in 10 years time.
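The download times quoted above follow from simple arithmetic, sketched here under the assumptions of a decimal petabyte (10^15 bytes), a fully saturated link, and no protocol overhead.

```python
PB = 1e15  # bytes, decimal petabyte

def download_days(size_bytes, link_bits_per_s):
    """Days to transfer size_bytes over a link running at full line rate."""
    return size_bytes * 8 / link_bits_per_s / 86400.0

days_1g = download_days(PB, 1e9)     # 1 gigabit/second
days_10g = download_days(PB, 1e10)   # 10 gigabit/second
print(f"1 Gbit/s: {days_1g:.1f} days, 10 Gbit/s: {days_10g:.1f} days")
```

This gives about 93 and 9 days respectively, consistent with the rounded figures of ∼100 and about 10 days; real transfers would be somewhat slower.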
5 http://www.scidb.org
6 For example, numpy, scipy, Rpy (R interface), mlabwrap (MatLab), etc.
7 e.g. http://www.mkomo.com/cost-per-gigabyte

Table 1 Storage cost historical trends (a)

Year    Cost/GB    Cost/TB    Cost/PB
2000    $19.00     $19,000    $19 million
2010    $0.06      $62        $62,000
2020    $0.0002    $0.2       $200

(a) Extrapolated from http://www.mkomo.com/cost-per-gigabyte

Looking at Table 1 one comes to the conclusion that if one wants to keep
a copy of the LSST data locally it should not be a problem given the drop in
price over time. After all, who would have imagined 15 years ago that
they would be able to purchase a 1 terabyte drive for their desktop computer
for under $100?
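The text does not say how the 2020 row of Table 1 was extrapolated. One simple possibility, assumed here, is that the price keeps falling by the same factor per decade as it did between 2000 and 2010. The sketch uses $19/GB for 2000 and $0.062/GB for 2010 (the latter implied by the $62/TB entry).

```python
def extrapolate_cost(cost_2000, cost_2010):
    """Project the 2020 cost assuming the 2000-2010 decade ratio repeats."""
    decade_ratio = cost_2010 / cost_2000
    return cost_2010 * decade_ratio

cost_gb_2020 = extrapolate_cost(19.0, 0.062)  # dollars per GB
cost_tb_2020 = cost_gb_2020 * 1e3             # dollars per TB
cost_pb_2020 = cost_gb_2020 * 1e6             # dollars per PB
print(f"${cost_gb_2020:.5f}/GB, ${cost_tb_2020:.2f}/TB, ${cost_pb_2020:.0f}/PB")
```

Under this assumption the projection comes out at roughly $0.0002/GB, which agrees with the $0.2/TB and $200/PB entries of Table 1.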
Unfortunately things are not this simple. To illustrate my point I want to
recall some more of Amdahl’s rules of thumb for a balanced system.
1. Bandwidth (BW): One bit of sequential IO/second per instruction/second
2. Memory: α = 1 MB/MIPS8: one byte of memory per instruction/second
3. One IO operation per 50,000 instructions
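The three rules above can be turned into a small calculation. The sketch below applies them literally to a given instruction rate and then counts how many disks are needed at 100 MB/s per disk, the per-disk rate used in Table 2. The function and dictionary key names are illustrative choices, not from the source.

```python
import math

DISK_BYTES_PER_S = 100e6  # 100 MB/s per disk, the figure used in Table 2

def balanced_system(instructions_per_s):
    """Apply the three rules of thumb to a given instruction rate."""
    io_bytes_per_s = instructions_per_s / 8  # rule 1: one bit of IO per instruction
    return {
        "io_bytes_per_s": io_bytes_per_s,
        "memory_bytes": instructions_per_s,           # rule 2: one byte per instruction/s
        "io_ops_per_s": instructions_per_s / 50_000,  # rule 3: one IO op per 50,000 instructions
        "disks": math.ceil(io_bytes_per_s / DISK_BYTES_PER_S),
    }

for label, rate in (("Giga", 1e9), ("Tera", 1e12), ("Peta", 1e15)):
    row = balanced_system(rate)
    print(label, f"{row['io_bytes_per_s']:.2e} bytes/s", row["disks"], "disks")
```

Taking one bit as exactly 1/8 byte gives 1.25 × 10^8, 1.25 × 10^11 and 1.25 × 10^14 bytes/s, which Table 2 rounds to the nearest power of ten; the disk counts computed here are therefore about 25% higher than the rounded values in the table.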
Looking at Table 2 in the context of Amdahl’s rules of thumb, perhaps the biggest
problem with high performance computer centers today and into the near
future is that they are CPU rich, but IO poor. The CPUs may spend a lot of
time sitting idle while waiting for IO to send them more to work on because
not everything can be stored in RAM. This problem is not going to go away,
but there are (thankfully) people aware of the issue who believe that it is
currently possible to keep power consumption low while increasing sequential
read IO throughput by an order of magnitude using what are called Amdahl
blades (Szalay et al., 2010). Note that power consumption is becoming an
issue for mid-level data centers found at universities and some government research labs. Most of these don’t have Google’s electricity budget for powering
them and in fact many government data centers are even being shut down to
save money9.
8 Million Instructions Per Second
9 http://www.nytimes.com/2011/07/20/technology/us-to-close-800-computer-datacenters.html

Table 2 Conclusions from Amdahl’s rules of thumb for a balanced system today

Operations    RAM    Disk I/O         No. of Disks for that
per second           bytes/s          BW at 100MB/s/disk
Giga/10^9     GB     10^8       →     1
Tera/10^12    TB     10^11      →     1000
Peta/10^15    PB     10^14      →     1,000,000

Table 2 tells one a couple of other interesting things.10 First, for a Petascale balanced system 100TB/s of IO bandwidth (last row of column three =
10^14 bytes/s) would be required. It will take approximately 1,000,000 disks to deliver
this bandwidth today assuming they are capable of 100MB/s/disk. Note that
the rate of disk IO growth has not been remarkable in the past 10 years (see
Szalay et al., 2010).
3 Conclusions
In the beginning of Section 2 I posed three questions and I would like to pose
some answers:
1. Will one be able to have a copy of the LSST data-set on a local desktop?
Yes, I think the average researcher will be able to have a copy of the data
on their local system assuming disk storage density continues its historical
trend (note that there are a number of technical arguments against this, as
there are against continuing Moore’s law into the future (Esmaeilzadeh et al.,
2011)11). However, even if one has a copy of the LSST data it is unlikely one
will be able to make much use of it using traditional 2MASS mode tools
given the issues with sequential IO that were outlined above.
2. Will one still utilize a casjobs type web-query interface to obtain LSST
data of interest and then use legacy home-grown tools to work on the data
(the SDSS mode)? Yes, the LSST team seems interested in continuing the
use of a casjobs type interface with an SQL backend. Whether a researcher
will then be able to use their traditional home-grown tools will depend on
the data sizes they download as discussed above.
3. Must one completely change the way one works with LSST scale data
sets? I believe that many of the scientific goals will only be achievable by
running on the database locally as an “expert user”. This points to the
need to have a multitude of robust data/computational centers hosting the
LSST data. Today the best place for these (in the United States) would
probably be at the national level supercomputing centers such as PSC12,
NCSA13 or NAS14, to name a few.
10 This table is a modified version of one given in a talk by Alex Szalay that the
author became aware of recently.
11 http://www.nytimes.com/2011/08/01/science/01chips.html
12 Pittsburgh Supercomputing Center
13 National Center for Supercomputing Applications in Illinois
14 NASA Advanced Supercomputing center at NASA/Ames in California
Acknowledgements Thanks to Andy Connolly for taking the time to discuss his
LSST simulator with me prior to the conference and for encouraging me in my belief that a commentary focused on computational challenges would be appropriate. I
would also like to thank the Astrophysics Department at Uppsala University in Sweden for their gracious hospitality while part of this manuscript was being completed.
References
Bell, G., Gray, J., & Szalay, A. (2006) Petascale Computational Systems. In:
Computer, vol 39, no 1, pp. 110-112, doi:10.1109/MC.2006.29
Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K. &
Burger, D. (2011) Dark Silicon and the end of Multicore Scaling. In:
Proc. of the 38th International Symposium on Computer Architecture,
doi:10.1145/2000064.2000108
Skrutskie, M.F. et al. (2006) The Two Micron All Sky Survey (2MASS). In:
The Astronomical Journal, 131, 1163, doi:10.1086/498708
York, D.G., et al. (2000) The Sloan Digital Sky Survey: Technical Summary.
In: The Astronomical Journal, 120, 1579, doi:10.1086/301513
Szalay, A., Bell, G.C., Huang, H.H., Terzis, A. & White, A. (2010) Low-Power
Amdahl-Balanced Blades for Data Intensive Computing. In: ACM SIGOPS
Operating Systems Review, vol 44, issue 1, ACM New York, NY,
USA, doi:10.1145/1740390.1740407