DeepAligner

From HP-SEE Wiki

Revision as of 13:39, 27 March 2012

General Information

  • Application's name: Deep sequencing for short fragment alignment
  • Application's acronym: DeepAligner
  • Virtual Research Community: Life Sciences
  • Scientific contact: Kozlovszky Miklos, Windisch Gergely; m.kozlovszky at sztaki.hu
  • Technical contact: Kozlovszky Miklos, Windisch Gergely; m.kozlovszky at sztaki.hu
  • Developers: Windisch Gergely, Biotech Group, Obuda University – John von Neumann Faculty of Informatics
  • Web site: http://ls-hpsee.nik.uni-obuda.hu:8080

Short Description

Mapping short fragment reads to open-access eukaryotic genomes can be done with BLAST, BWA, and other sequence alignment tools. BLAST is one of the most frequently used tools in bioinformatics, and BWA is a relatively new, fast, lightweight tool that aligns short sequences. Local installations of these algorithms typically cannot handle problems of this size, so the procedure runs slowly, while web-based implementations cannot accept a high number of queries. The HP-SEE infrastructure gives access to massively parallel architectures, and the sequence alignment code is distributed free of charge for academic use. Because of the response time and service reliability requirements, a grid is not an option for the DeepAligner application.

Problems Solved

Recently adopted deep sequencing techniques present a new data processing challenge: mapping short fragment reads to open-access eukaryotic genomes (animal genomes, focusing on mouse and rat) at a scale of several hundred thousand reads.

Scientific and Social Impact

The aim of the task is threefold: first, to port the BLAST/BWA algorithms to the massively parallel HP-SEE infrastructure; second, to create a BLAST/BWA service capable of serving the short fragment sequence alignment demand of the regional bioinformatics communities; and third, to perform sequence analysis with high-throughput short fragment sequence alignments against eukaryotic genomes in order to search for regulatory mechanisms controlled by short fragments.

Collaborations

  • Ongoing collaborations so far: Hungarian Bioinformatics Association, Semmelweis University
  • Planned collaboration: MoSGrid consortium (D-GRID based project, Germany)

Beneficiaries

The service addresses the short fragment sequence alignment demand of the regional bioinformatics communities. Researchers who use short fragment alignments will benefit directly from its availability. The service will be freely available to the Life Sciences (LS) community. We estimate that 5-15 scientific groups worldwide will use the service.

Number of users

5

Development Plan

  • Concept: Done before the project started.
  • Start of alpha stage: Done before the project started.
  • Start of beta stage: M9
  • Start of testing stage: M13
  • Start of deployment stage: M16
  • Start of production stage: M18

Resource Requirements

  • Number of cores required for a single run: 128-256
  • Minimum RAM/core required: 4-8 GB
  • Storage space during a single run: 2-5 GB
  • Long-term data storage: 1-2 TB
  • Total core hours required: 1 500 000
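
As a rough orientation, the figures above can be turned into an equivalent wall-clock time. The sketch below assumes, purely for illustration, that the whole core-hour budget is consumed by runs that keep every requested core busy; the function name is hypothetical and not part of DeepAligner.

```python
# Illustrative arithmetic only: assumes perfect utilisation of all cores.
TOTAL_CORE_HOURS = 1_500_000   # "Total core hours required" from the list above
CORES_PER_RUN = (128, 256)     # "Number of cores required for a single run"


def wall_clock_hours(total_core_hours: float, cores: int) -> float:
    """Return the wall-clock hours needed to spend a core-hour budget on `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return total_core_hours / cores


if __name__ == "__main__":
    for cores in CORES_PER_RUN:
        hours = wall_clock_hours(TOTAL_CORE_HOURS, cores)
        print(f"{cores} cores: {hours:.1f} h (about {hours / 24:.0f} days)")
```

Under this assumption the budget corresponds to 11718.75 hours of continuous use on 128 cores, or 5859.375 hours on 256 cores.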

Technical Features and HP-SEE Implementation

  • Primary programming language: C, Perl
  • Parallel programming paradigm: Master-slave, MPI, and multiple serial jobs (data splitting, parametric studies)
  • Main parallel code: WS-PGRADE/gUSE and C/C++
  • Pre/post processing code: Perl/BioPerl (in-house development)
  • Application tools and libraries: Perl/BioPerl (in-house development)
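
The "multiple serial jobs (data splitting)" paradigm listed above can be illustrated with a minimal sketch. It is not the project's Perl/BioPerl code: it assumes the query reads arrive in FASTA format and that a simple round-robin split is acceptable, and the function names `parse_fasta` and `split_round_robin` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_round_robin(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Distribute records over n_chunks lists, one list per serial job."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be a positive integer")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    reads = [">r1", "ACGT", "TT", ">r2", "GGCC", ">r3", "A"]
    for job_id, chunk in enumerate(split_round_robin(parse_fasta(reads), 2)):
        print(job_id, chunk)
```

Each resulting chunk could then be written to its own file and passed to an independent alignment job; round-robin assignment keeps the number of reads per job balanced to within one.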

Usage Example

1. HP-SEE’S BIOINFORMATICS ESCIENCE GATEWAY

The Bioinformatics eScience Gateway is based on gUSE and operates within the Life Science VO of the HP-SEE infrastructure. It provides a unified GUI for different bioinformatics applications (such as BLAST, BWA, or gene mapper applications) and gives end users indirect access to some open European bioinformatics databases. gUSE is essentially a virtualization environment that provides a large set of high-level DCI services, through which interoperation among classical service and desktop grids, clouds and clusters, unique web services, and user communities can be achieved in a scalable way. gUSE has a graphical user interface called WS-PGRADE. All parts of gUSE are implemented as a set of Web services. WS-PGRADE uses the client APIs of the gUSE services to turn user requests into sequences of gUSE-specific Web service calls. Our bioinformaticians need application-specific portlets so that the portal is better tailored to their work. To support the development of such application-specific UIs we used the Application Specific Module (ASM) API of gUSE, which makes this customization quick and easy. Some remaining features were taken over from WS-PGRADE. Our GUI is built from JSR168-compliant portlets and can be accessed with ordinary Web browsers (see Fig. 1).

Login screen of the HP-SEE Bioinformatics eScience Gateway

2. IMPLEMENTATION OF THE GENERIC BLAST WORKFLOW

Normal applications must first be ported before they can be used with gUSE/WS-PGRADE. Our porting methodology has two main steps: workflow development, and development of a user-specific web interface based on gUSE’s ASM (see Fig. 2). gUSE uses a DAG (directed acyclic graph) based workflow concept. In a generic workflow, nodes represent jobs, which are essentially batch programs to be executed on one of the DCI’s computing elements. Ports represent the input and output files that the jobs receive or produce. Arcs between ports represent file transfer operations. gUSE supports Parameter Study (PS) type high-level parallelization. In a workflow, special Generator ports can be used to generate the input files for all parallel jobs automatically, while Collector jobs run after all parallel executions have finished and collect the parallel outputs. During the BLAST porting, we exploited all the PS capabilities of gUSE.
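
The DAG concept described above (jobs as nodes, files as ports, file transfers as arcs) can be sketched in a few lines. This is not gUSE code and does not use any gUSE API: the job and port names are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) stands in for the workflow engine's scheduling.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Each arc is (producer job, output port, consumer job, input port).
# All names below are made up for illustration.
Arc = Tuple[str, str, str, str]

arcs: List[Arc] = [
    ("generator", "queries_out", "blast", "query_in"),
    ("blast", "hits_out", "collector", "hits_in"),
]


def dependencies(arcs: List[Arc]) -> Dict[str, Set[str]]:
    """Map every job to the set of jobs whose output files it consumes."""
    deps: Dict[str, Set[str]] = {}
    for producer, _out_port, consumer, _in_port in arcs:
        deps.setdefault(producer, set())
        deps.setdefault(consumer, set()).add(producer)
    return deps


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which the jobs can run.

    graphlib raises CycleError if the arcs do not form an acyclic graph.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(execution_order(dependencies(arcs)))
```

A job becomes runnable only once every job feeding its input ports has finished, which is why an acyclic graph is required.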

Porting steps of the application

Parallel job submission into the DCI environment requires the generated parameters to be assigned to the jobs. gUSE’s PS workflow components were used to create a DCI-aware parallel BLAST application and to realize a complex DCI workflow as a proof of concept. Later, the web-based DCI user interface was created using the Application Specific Module (ASM) of gUSE. On this web GUI, end users can configure input parameters such as the “e” value or the number of MPI tasks, and they can submit the alignment into the DCI environment with arbitrarily large parameter fields.

During the development of the workflow structure, we aimed to construct a workflow able to handle the main properties of the parallel BLAST application. To exploit the Parameter Study mechanism of gUSE, the workflow was developed as a Parameter Study workflow using an autogenerator port (the second small box around the top left box in Fig. 5) and a collector job (the bottom right box in Fig. 5). The preprocessor job generates a set of input files from some pre-adjusted parameters. The second job (the middle box in Fig. 5) is then executed as many times as there are input files. The last job of the workflow is a Collector, which collects several files and then processes them as a single input. A Collector delays job execution until the last file of the input file set has arrived at the Collector job. The workflow engine computes the expected number of input files at run time. When all the expected inputs have arrived, the Collector starts to process the incoming input files as a single input set. Finally, the output files are generated and stored on a Storage Element of the DCI, shown as a small box around the Collector.
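
The generator, parallel job, and collector pattern described above can be sketched as follows. This is an illustration only, not the gUSE workflow itself: a thread pool stands in for DCI job submission, `align_chunk` is a placeholder that does no real alignment, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def generate_inputs(queries: List[str], chunk_size: int) -> List[List[str]]:
    """Generator step: cut the query list into one input set per parallel job."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [queries[i:i + chunk_size] for i in range(0, len(queries), chunk_size)]


def align_chunk(chunk: List[str]) -> List[Tuple[str, int]]:
    """Placeholder for one alignment job; it only reports each query's length."""
    return [(query, len(query)) for query in chunk]


def collect(outputs: List[List[Tuple[str, int]]], expected: int) -> List[Tuple[str, int]]:
    """Collector step: merge the outputs once all expected ones are present."""
    if len(outputs) != expected:
        raise RuntimeError(f"expected {expected} outputs, got {len(outputs)}")
    return [hit for output in outputs for hit in output]


def run_parameter_study(queries: List[str], chunk_size: int, max_workers: int = 4):
    inputs = generate_inputs(queries, chunk_size)
    expected = len(inputs)  # number of parallel jobs, known only at run time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(pool.map(...)) returns only after every job has finished,
        # which mirrors the Collector waiting for its last input file.
        outputs = list(pool.map(align_chunk, inputs))
    return collect(outputs, expected)


if __name__ == "__main__":
    print(run_parameter_study(["ACGT", "AC", "GGG", "T", "CC"], chunk_size=2))
```

The number of parallel jobs is not fixed in advance; it follows from how many input sets the generator step produces, just as the workflow engine computes the expected number of Collector inputs at run time.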

Internal architecture of the generic blast workflow

Due to the strict HPC security constraints, end users must possess a valid certificate to use the HP-SEE Bioinformatics eScience Gateway. Users can run the developed workflows seamlessly on ARC-based infrastructure (such as NIIF’s Hungarian supercomputing infrastructure) or on gLite/EMI-based infrastructure (Service Grids such as SEE-GRID-SCI or SHIWA). After login, users create their own workflow-based application instances, which are derived from pre-developed and well-tested workflows.

Infrastructure Usage

  • Home system: OE cluster/HU
    • Applied for access on: 08.2010
    • Access granted on: 08.2010
    • Achieved scalability: 4 nodes 8 cores
  • Accessed production systems: NIIF's infrastructure/HU
    • Applied for access on: 09.2010
    • Access granted on: 10.2010
    • Achieved scalability: 96 cores
  • Porting activities: The application has been successfully ported, the core workflow was successfully created, and the GUI portlet was designed and created.
  • Scalability studies: Tests on 32, 59 and 96 cores

Running on Several HP-SEE Centres

  • Benchmarking activities and results: In the initial phase the application was benchmarked and optimized on the OE cluster. After successful deployment on 32 cores, benchmarking was initiated on 59 cores.
  • Other issues: Authentication problems and access issues with the supercomputing infrastructure's local storage caused significant difficulties during porting. Some optimisation of input parameter assignment and further study of scaling to higher core counts are still required.

Achieved Results

The DeepAligner application was successfully tested with parallel short DNA sequence searches. Publications so far mainly address the porting of the application; publication of further scientific results is planned.

Publications

  • M. Kozlovszky, G. Windisch, Á. Balaskó; Short fragment sequence alignment on the HP-SEE infrastructure; MIPRO 2012, accepted
  • M. Kozlovszky, G. Windisch; Supported bioinformatics applications of the HP-SEE project’s infrastructure; Networkshop 2012, accepted

Foreseen Activities

Optimisation of parameter assignment in the GUI; more scientific publications about short sequence alignment.
