The HPC Challenge
High-performance computing technology generates and processes vast quantities of information, particularly in reservoir simulations and similar procedures used in the energy industry. The growth of high-performance computing, with today’s data sets ranging from tens of terabytes to petabytes, has created two significant challenges in storage and data management. Organizations must find simple and economical ways to store and manage their data, even though only a small portion of that data may be in use at any point in time, and they must achieve high I/O performance as well.

To reap the biggest benefits from the power of HPC, you need to do more than process large amounts of data: you also need to store that data securely and file it properly so that it is available on demand. Some data needs to be accessed constantly; other files can be kept in short-term storage, and still other information can be archived for the long run. The data must be moved in and out of the HPC cluster at high rates to prevent the compute cluster from sitting idle. Maximum resource utilization is a crucial principle of the IT infrastructure’s architecture, particularly for high-performance computing. In a best-in-class solution, the data handling and compute functions are decoupled; however, they remain highly complementary and must scale together for balance, with an architecture that allows millions, or even billions, of files to be stored, managed, and secured.
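The three access patterns described above (constant access, short-term storage, and long-term archive) can be illustrated with a simple placement policy. The following is a minimal sketch, not a description of any particular product: the tier names, the 30-day and 365-day thresholds, and the file names are all assumptions chosen for illustration, and a real policy would likely weigh business priority as well as access age.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
ACTIVE_MAX_DAYS = 30        # accessed within the last month: keep on fast storage
SHORT_TERM_MAX_DAYS = 365   # accessed within the last year: keep in short-term storage


def choose_tier(days_since_last_access: int) -> str:
    """Return a storage tier name based on how recently the data was accessed."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access <= ACTIVE_MAX_DAYS:
        return "active"
    if days_since_last_access <= SHORT_TERM_MAX_DAYS:
        return "short-term"
    return "archive"


def plan_placement(files: Dict[str, int]) -> Dict[str, List[str]]:
    """Group file names by tier, given days since each file was last accessed."""
    plan: Dict[str, List[str]] = {"active": [], "short-term": [], "archive": []}
    for name, age_days in files.items():
        plan[choose_tier(age_days)].append(name)
    return plan


if __name__ == "__main__":
    # Hypothetical file names and access ages.
    example = {
        "reservoir_run_current.dat": 2,
        "seismic_survey_last_quarter.dat": 120,
        "compliance_records_old.tar": 1500,
    }
    for tier, names in plan_placement(example).items():
        print(tier, names)
```

Keeping the policy separate from the compute jobs mirrors the decoupling of data handling and compute functions: the thresholds can change as business requirements change without touching the applications that produce the data.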
Data storage can be complex, due to continuous changes in application and business requirements, as well as the evolving capabilities of the storage infrastructure. And as Chris Gahagan of the EMC Corporation puts it, “All data is not created equal. Data importance continually varies, according to when and where it is created and needed.” Charlie Garry, senior program director at Meta Group, concurs: “Data is growing at 125 percent a year, yet up to 80 percent of this data remains inactive in production systems where it cripples performance.” Storage complexity rises as different kinds of data are assigned different importance and different priorities. A number of open-source file management systems have appeared on the scene to help address these problems.
Data storage can be expensive. According to Gartner, a significant component of IT budgets aligned with data management and integration initiatives goes toward software licensing and maintenance. Open-source data storage and management helps greatly reduce this cost by lowering support costs and often eliminating licensing costs altogether.
Data storage I/O is critical, to varying degrees, depending on the changing importance of data and the shifting needs of an organization. How readily a given set of data must be accessible varies, but the ability to access stored data is always crucial. For example, a financial organization has a constant need to access key transactional data, a petrochemical company requires geological data on a project-by-project basis, and a governmental body may need to maintain archived information on millions of individuals for years at a time. Every kind of data requires accessibility, and bottlenecks in I/O can limit the availability of important data at crucial times. A number of factors play a part in the growing demands of storage I/O:
- Larger data sets
- More complex analysis
- More jobs run
- Simulations with more iterations
- Greater numbers of systems in a grid, powered by more processors
An increasing number of open-source communities are addressing the question of preventing I/O problems and promoting smoother, faster exchanges of stored data.
According to IDC, the HPC disk storage market is now well over $4 billion in size and expanding rapidly. The volume of data involved in interactions multiplies 24/7, on a global basis, and has catapulted data storage into a central position of importance for organizations. Large data sets must be processed quickly and backed up securely with equal speed. Building an efficient and effective information storage and archiving system is a vital adjunct to an HPC solution.
Without such a system, important records and crucial data can be difficult to process, regulatory compliance can be jeopardized, and risk might be unnecessarily high. To keep information stored and archived correctly, the IT infrastructure should be designed with dependable and scalable storage systems, careful backups, and an efficient file management system. And while the HPC user wants the best possible system to run every test as fast as possible, for the best time to result, there are significant budgetary constraints. Open data storage technology is one key to resolving these challenges.
The Open Source Answer
As increasingly powerful computing systems emerge, data is generated at an astronomical rate. A combination of grid technology and associated storage resources, including network-attached storage (NAS)/network file system (NFS) products, new software, and parallel storage, is the standard answer for these petabyte-scale data management challenges. This solution requires handling and storing very large data sets accessed simultaneously by thousands of users, and both the growing volume of data and the growing need for accessibility demand scalability and manageability.
According to Anurag Shankar, manager of distributed storage services at Indiana University, the desired goal for HPC storage is to “[I]mplement a bottomless, deep data store with ubiquitous, native, and secure access to data, [an] excellent file sharing mechanism, interoperability with current and future IT services, and a long-term view of data storage and the needs of the masses, not just a select few.” This is the opposite of proprietary solutions, which tend to lock users into a specific vendor, reduce interoperability, and limit accessibility.
A Gartner report, The State of Open Source 2008, predicts that, “[B]y 2012, more than 90 percent of enterprises will use open source in direct or embedded forms.” The value of open source for any enterprise-class application lies in its cost effectiveness and flexibility, but the greatest need in the field of high-performance computing is for handling larger and larger data sets. How scalable is open-source technology in general? According to Deb Shinder’s recent analysis of open-source technology for small and large enterprises at Open Source Academy, “Properly chosen and deployed, open-source operating systems and applications can scale to meet almost any need in both the server and desktop space.” And Mark Taylor, president of promotion group the Open Source Consortium, states that, “Open source gives massive scalability at no transaction cost — for whatever you are doing.”
This is a significant factor for organizations undertaking massive projects, such as the Large Hadron Collider. The purely academic research conducted with the LHC will generate enormous volumes of data which, even when recorded, “[W]ill take years of careful sifting and sorting, which will require massive amounts of computing power to extract the final scientific results,” according to Frederick Luehring, a senior research scientist at Indiana University. Similarly large quantities of data are gathered and generated by commercial and academic projects, ranging from seismic exploration and pharmaceutical research to social networking and security information gathering. And on a more universal level, regulatory compliance for businesses may require the retrieval of many kinds of information (financial records, environmental or safety reports, email, legal memoranda, contracts, and more) that can be extremely difficult to locate among vast numbers of similar files.
Open-source technology is readily acknowledged as being capable of answering these vital needs for organizations. According to a Gartner survey, 49.7% of open-source technology usage is for mission-critical applications, a growing share that compares favorably with a 59% share for proprietary software and 58.5% for internally developed solutions. Major technology players, such as IBM and Sun, have made large portions of their HPC IP open source, as well. In Sun’s case, an innovative new system, the Sun Fire X4500 server, the first data server, has blazed new trails by running the OpenSolaris OS and open Solaris ZFS. For more details, go to: sun.com/servers/x64/x4500.
It is generally understood that complete control of, and access to, the underlying infrastructure is essential for getting the most performance out of a compute cluster, which is vital to HPC and to the storage it requires. Open-source communities, including OpenSolaris storage and the Storage Networking Industry Association (SNIA), are crucial to sharing the information that helps maximize HPC storage performance. Another open-source community that offers pertinent information is the OpenSPARC community. To learn more, go to: http://opensolaris.org/os/community/hpcdev.
According to integration firms such as Network Resource Group (NRG), organizations can realize substantial savings by choosing open-source storage software such as Lustre, Amanda, OpenAFS, DBAN, Hypertable, SAMBA, and Sun Open Archive solutions. There are also open-source data recovery tools such as SystemRescueCD, dd, Partedmagic, and BackTrack. Each solution is designed for different data storage purposes, but all share the benefits of long-term protection through source code availability and the opportunity to sidestep proprietary products and their related costs.
Sharing and Storing Data With Open-Source Technology
Highly resilient, easy-to-manage, open-source, file-based data storage and management solutions capably address the growing need to digitize and preserve business images, records, consumer- and corporate-created digital content, e-science work, and other HPC data. They provide a cost-effective and efficient alternative to closed, proprietary offerings. An open-source solution can revolutionize the economics of storing, managing, and archiving data.





























