DiseaseGene

From HP-SEE Wiki

Revision as of 13:43, 27 March 2012

General Information

  • Application's name: In-silico Disease Gene Mapper
  • Application's acronym: DiseaseGene
  • Virtual Research Community: Life Sciences
  • Scientific contact: Kozlovszky Miklos, Windisch Gergely; m.kozlovszky at sztaki.hu
  • Technical contact: Kozlovszky Miklos, Windisch Gergely; m.kozlovszky at sztaki.hu
  • Developers: Windisch Gergely, Biotech Group, Obuda University – John von Neumann Faculty of Informatics
  • Web site: http://ls-hpsee.nik.uni-obuda.hu:8080

Short Description

DiseaseGene is a complex data mining and data processing tool that uses large-scale external open-access databases. The aim of the task is to port a data mining tool to the SEE-HPC infrastructure that helps researchers carry out comparative analysis and identify candidate genes for further research into polygenic diseases. The implemented solution can target candidate genes for various diseases, such as asthma, diabetes, epilepsy, hypertension, or schizophrenia, using external online open-access eukaryotic (animal: mouse, rat, B. rerio, etc.) databases. The application performs an in-silico mapping between the genes of the different model animals and searches for unexplored potential target genes. With a small modification, the application can also be used to target human genes.
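The cross-species mapping described above can be illustrated with a minimal sketch. This is not the application's actual algorithm or code. It assumes a hypothetical ortholog table that assigns each (species, gene) pair to an ortholog group, and it reports genes whose orthologs are already linked to the disease in another species but that are not yet linked themselves. All names (candidate_genes, the toy gene identifiers, the group ids) are invented for illustration.

```python
from collections import defaultdict


def candidate_genes(disease_genes, orthologs):
    """Find genes whose orthologs are disease-linked in another species.

    disease_genes: dict mapping species -> set of genes linked to the disease
    orthologs:     dict mapping (species, gene) -> ortholog group id
                   (hypothetical table, assumed one gene per species per group)
    Returns a dict: group id -> {species: gene} for the not-yet-linked members.
    """
    # Record, per ortholog group, the species in which the group is disease-linked.
    species_hits = defaultdict(set)
    for species, genes in disease_genes.items():
        for gene in genes:
            group = orthologs.get((species, gene))
            if group is not None:
                species_hits[group].add(species)

    # Index the members of every ortholog group by species.
    members = defaultdict(dict)
    for (species, gene), group in orthologs.items():
        members[group][species] = gene

    # A group member is a candidate if its species has no disease link yet.
    candidates = {}
    for group, hit_species in species_hits.items():
        unexplored = {
            sp: gene for sp, gene in members[group].items() if sp not in hit_species
        }
        if unexplored:
            candidates[group] = unexplored
    return candidates


# Toy input with made-up identifiers (not real gene names or database records).
toy_disease_genes = {"mouse": {"m_g1", "m_g2"}, "rat": {"r_g2"}}
toy_orthologs = {
    ("mouse", "m_g1"): "grp1",
    ("rat", "r_g1"): "grp1",
    ("mouse", "m_g2"): "grp2",
    ("rat", "r_g2"): "grp2",
}
print(candidate_genes(toy_disease_genes, toy_orthologs))
```

In this toy input, group grp1 is linked to the disease only in mouse, so its rat member is reported as an unexplored candidate, while grp2 is already linked in both species and is skipped.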

Problems Solved

The implemented solution can target candidate genes for various diseases, such as asthma, diabetes, epilepsy, hypertension, or schizophrenia, using external online open-access eukaryotic (animal: mouse, rat, B. rerio, etc.) databases. The application performs an in-silico mapping between the genes of the different model animals and searches for unexplored potential target genes. With a small modification, the application can also be used to target human genes. The reliability parameters and response time (1-5 min) of the Grid are not suitable for such a service.

Scientific and Social Impact

Researchers in the region will be able to target candidate genes for further research into polygenic diseases. The project creates a data mining service on the SEE-HPC infrastructure that helps researchers carry out comparative analysis.

Collaborations

Ongoing collaborations so far: Hungarian Bioinformatics Association, Semmelweis University

Beneficiaries

People who are interested in using short fragment alignments will greatly benefit from the availability of this service. The service will be freely available to the LS community. We estimate that 2-5 scientific groups (5-15 researchers) worldwide will use our service.

Number of users

6

Development Plan

  • Concept: Done before the project started.
  • Start of alpha stage: Done before the project started.
  • Start of beta stage: M9
  • Start of testing stage: M13
  • Start of deployment stage: M16
  • Start of production stage: M19 (delayed due to storage access issues)

Resource Requirements

  • Number of cores required for a single run: 128 – 256
  • Minimum RAM/core required: 4 - 8 GB
  • Storage space during a single run: 2-5 GB
  • Long-term data storage: 5-10 TB
  • Total core hours required: 1 300 000
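To put the figures above in perspective, the following sketch converts them into aggregate wall-clock time and aggregate memory per run. It only performs arithmetic on the numbers listed in this section. The assumptions that the full core-hour budget is spent at a single fixed core count and that RAM scales linearly with the number of cores are simplifications made for the example.

```python
TOTAL_CORE_HOURS = 1_300_000   # from the list above
CORES_PER_RUN = (128, 256)     # lower and upper bound from the list above
RAM_PER_CORE_GB = (4, 8)       # lower and upper bound from the list above


def wall_clock_hours(core_hours, cores):
    """Aggregate wall-clock hours if every run uses `cores` cores (assumption)."""
    return core_hours / cores


def aggregate_ram_gb(cores, ram_per_core_gb):
    """Total RAM needed by one run, assuming linear scaling with core count."""
    return cores * ram_per_core_gb


for cores in CORES_PER_RUN:
    hours = wall_clock_hours(TOTAL_CORE_HOURS, cores)
    print(f"{cores} cores: {hours} wall-clock hours in total")

low = aggregate_ram_gb(CORES_PER_RUN[0], RAM_PER_CORE_GB[0])
high = aggregate_ram_gb(CORES_PER_RUN[1], RAM_PER_CORE_GB[1])
print(f"Aggregate RAM per run: {low}-{high} GB")
```

Under these assumptions the budget corresponds to roughly 5,000 to 10,000 wall-clock hours in total, and a single run needs between 512 GB and 2048 GB of aggregate memory.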

Technical Features and HP-SEE Implementation

  • Primary programming language: C/C++
  • Parallel programming paradigm: Clustered multiprocessing (e.g. using MPI) + Multiple serial jobs (data-splitting, parametric studies)
  • Main parallel code: WS-PGRADE/gUSE and C/C++
  • Pre/post processing code: Perl/BioPerl (in-house development)
  • Application tools and libraries: Perl/BioPerl (in-house development)
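The "multiple serial jobs (data-splitting)" paradigm listed above can be sketched as follows. This is a generic illustration of splitting a work list into near-equal chunks, one per serial job. It is not taken from the application's C/C++ or Perl code, and the function name split_into_chunks is hypothetical.

```python
def split_into_chunks(items, n_jobs):
    """Split `items` into `n_jobs` contiguous chunks whose sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(items), n_jobs)
    chunks = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` jobs take one additional item each.
        size = base + (1 if job < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Ten work items (for example, query records) distributed over four serial jobs.
print(split_into_chunks(list(range(10)), 4))
```

Each chunk can then be submitted as an independent serial job, and the partial results can be merged in a post-processing step.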

Usage Example

1. HP-SEE’S BIOINFORMATICS ESCIENCE GATEWAY

The Bioinformatics eScience Gateway is based on gUSE and operates within the Life Science VO of the HP-SEE infrastructure. It provides a unified GUI for different bioinformatics applications (such as gene mapper and sequence alignment applications) and gives end users indirect access to some open European bioinformatics databases. gUSE is essentially a virtualization environment that provides a large set of high-level DCI services, through which classical service and desktop grids, clouds and clusters, unique web services, and user communities can interoperate in a scalable way. The graphical user interface of gUSE is called WS-PGRADE. All parts of gUSE are implemented as a set of Web services. WS-PGRADE uses the client APIs of the gUSE services to turn user requests into sequences of gUSE-specific Web service calls.

Our bioinformaticians need application-specific portlets to tailor the portal to their work. To support the development of such application-specific UIs, we used the Application Specific Module (ASM) API of gUSE, which allows this customization to be done easily and quickly. Some remaining features were taken over from WS-PGRADE. Our GUI is built from JSR168-compliant portlets and can be accessed via ordinary Web browsers (shown in Fig. 1).

Fig. 1. Login screen of the HP-SEE Bioinformatics eScience Gateway

Infrastructure Usage

  • Home system: OE cluster/HU
    • Applied for access on: 08.2010
    • Access granted on: 08.2010
    • Achieved scalability: 8 cores
  • Accessed production systems:
  1. NIIF's infrastructure/HU
    • Applied for access on: 09.2010
    • Access granted on: 10.2010
    • Achieved scalability: 16 cores, then 96 cores
  • Porting activities: The application has been successfully ported, the core workflow has been created, and the GUI portlet has been designed and implemented.
  • Scalability studies: Tests on 8, 16, and 96 cores
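A scalability study over several core counts is usually summarized as speedup and parallel efficiency relative to the smallest run. The sketch below shows that calculation. The timing values in it are placeholders chosen to represent ideal scaling and are NOT measurements from this project, since this page does not report run times. The function name scaling_table is hypothetical.

```python
def scaling_table(timings, baseline_cores):
    """Compute (cores, speedup, efficiency) relative to the baseline core count.

    timings: dict mapping core count -> wall-clock time of the same workload
    """
    t_base = timings[baseline_cores]
    rows = []
    for cores in sorted(timings):
        speedup = t_base / timings[cores]      # how much faster than the baseline
        ideal = cores / baseline_cores         # speedup under perfect scaling
        rows.append((cores, speedup, speedup / ideal))
    return rows


# Placeholder wall-clock seconds (ideal scaling, not measured values).
placeholder_timings = {8: 120.0, 16: 60.0, 96: 10.0}
for cores, speedup, efficiency in scaling_table(placeholder_timings, baseline_cores=8):
    print(f"{cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

With real measurements, an efficiency that drops well below 1.0 at 96 cores would indicate that communication or storage access is limiting further scaling.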

Running on Several HP-SEE Centres

  • Benchmarking activities and results: In the initial phase the application was benchmarked and optimized on the OE cluster. After successful deployment on 8 cores, benchmarking was started on 16 and 96 cores; scaling to a higher number of cores is planned.
  • Other issues: During porting there were serious (ARC) authentication and access problems, as well as problems with the local storage of the supercomputing infrastructure. Further study is still required for higher scaling.

Achieved Results

In-silico Disease Gene Mapper was tested successfully with some polygenic diseases (e.g. asthma). So far, publications have mainly addressed the porting of the application; publication of further scientific results is planned.

Publications

  • M. Kozlovszky, G. Windisch; Supported bioinformatics applications of the HP-SEE project’s infrastructure; Networkshop 2012, accepted

Foreseen Activities

More scientific publications about the porting of the data mining tool, showing results of comparative data analyses that target polygenic diseases.
