MorphCheck MC

From HP-SEE Wiki

== Achieved Results ==
We have successfully ported the modified WND-CHARM image classification software to work as a service on distributed computing infrastructure (the HP-SEE supercomputing infrastructure). We have created the workflow structure for the MC-WND image classification service. The service has been tested with HE-stained tissue images and can automatically separate healthy and malignant tissue images with high accuracy. The service and its internal workflow were developed at Obuda University and are hosted on the HP-SEE Life Science/Bioinformatics eScience Gateway. The service can be used to classify tissue images of the colonic region against our large tissue training databases in a short time using the HP-SEE supercomputing infrastructure at NIIF, Hungary.
== Publications ==
* M. Kozlovszky, K. Hegedűs, G. Windisch, L. Kovács, G. Pintér; Image Classification Optimization of High Resolution Tissue Images
* M. Kozlovszky, K. Hegedűs, S. Szénási, G. Kiszler, B. Wichmann, I. Bándi, L. Kovács, Z. Garaguly, V. Jónás, G. Kiss, G. Valcz, B. Molnár; Parameter assisted HE colored tissue image classification
== Foreseen Activities ==
More scientific publications about the porting of the data mining tool (show results of some comparative data analysis targeting polygene type diseases).

As future work we plan to include other image classifier solutions in our MC-WND service portfolio in a plug-in like manner. We are planning to open up our image classification service to the wider research community. So far the tissue image classification service can only be executed via the LifeScience Portal as a standalone workflow or from the MorphCheck software. We are planning to enhance our portlet-based web user interface for the service to let pathologists manually upload and evaluate tissue images.

Revision as of 11:10, 21 August 2013


General Information

  • Application's name: Medical high resolution tissue image classifier
  • Application's acronym: MorphcheckMC
  • Virtual Research Community: Life Sciences
  • Scientific contact: Miklos Kozlovszky, Gergely Windisch, Gabor Kiss; kozlovszky.miklos at nik.uni-obuda.hu
  • Technical contact: Miklos Kozlovszky, Gergely Windisch, Gabor Kiss; kozlovszky.miklos at nik.uni-obuda.hu
  • Developers: Gergely Windisch, Biotech Group, Obuda University – John von Neumann Faculty of Informatics
  • Web site:

http://ls-hpsee.nik.uni-obuda.hu:8080/liferay-portal-6.0.5 http://ls-hpsee.nik.uni-obuda.hu

Short Description

Generic image classification methods do not perform well on tissue images. Such software solutions produce a high number of false negative and false positive results, which prevents their clinical use. We have created the MorphCheck high resolution tissue image processing framework, which enables us to collect morphological and morphometrical parameter values of the examined tissues. Such tissue images can easily reach 100 MB - 1 GB in size, so image processing speed and effectiveness are important factors. Our main goal is to accurately evaluate high resolution H-E (hematoxylin-eosin) stained colon tissue sample images and, based on the parameters, classify the images into differentiated sets according to the structure and the surface manifestation of the tissues. We have interfaced our MorphCheck tissue image measurement software framework with the WND-CHARM general purpose image classifier and used this combined software solution to classify high resolution tissue images. By default the classification is initiated with a large training set and three main classes (healthy, adenoma, carcinoma). The processing time depends on the size/resolution of the image and the size of the training set. Thanks to the tissue-specific image parameters the classification effectiveness was promising, but the wall-clock time of the new image classification process was intolerably high on a single-core PC. We therefore started a development process to decrease the processing time and further increase the accuracy of the classification. We have developed a workflow-based parallel version of the MorphCheck+WND-CHARM classifier software. In collaboration with the MTA SZTAKI Application Porting Centre, WND-CHARM has been ported to the HP-SEE project's distributed high performance computing infrastructure.

Problems Solved

High resolution tissue image analysis and classification is a hot research topic nowadays. High-resolution image processing using large image databases/training sets is a highly data- and compute-intensive challenge. Parallelization of such applications can decrease the computational time significantly.

Scientific and Social Impact

Researchers in the region will be able to differentiate automatically between malignant and healthy tissue images with high accuracy. The service can be used to do tissue image classification of the colonic region against our large tissue training databases in a short time using the HP-SEE supercomputing infrastructure (mainly resources from NIIF, Hungary).

Collaborations

Ongoing collaborations so far: 2nd Department of Internal Medicine/Semmelweis University, 3DHistech Ltd.

Beneficiaries

Researchers targeting colonic cancer, and pathologists and doctors doing manual tissue image evaluation. The service will be freely available to the LS community. We estimate that 3-5 scientific groups (5-15 researchers) worldwide will use our service. The service will also be accessible directly from our MorphCheck software solution.

Number of users

10

Development Plan

  • Concept: Done before the project started.
  • Start of alpha stage: Done before the project started.
  • Start of beta stage: 2013 Q1
  • Start of testing stage: 2013 Q1
  • Start of deployment stage: 2013 Q2 (May)
  • Start of production stage: 2013 Q2 (June)

Resource Requirements

  • Number of cores required for a single run: 128 – 256
  • Minimum RAM/core required: 4 - 8 GB
  • Storage space during a single run: 2-5 GB
  • Long-term data storage: none
  • Total core hours required: 1 000 000

Technical Features and HP-SEE Implementation

  • Primary programming language: C++/C#
  • Parallel programming paradigm: Multiple serial jobs (data-splitting, parametric studies)
  • Main parallel code: WS-PGRADE/gUSE and C++/C#
  • Pre/post processing code: BASH script (in-house development)
  • Application tools and libraries: BASH script (in-house development)
  • External required software: LIBTIFF, FFTW

Usage Example

1.1 HP-SEE’S BIOINFORMATICS ESCIENCE GATEWAY

The Bioinformatics eScience Gateway is based on gUSE and operates within the Life Science VO of the HP-SEE infrastructure. It provides a unified GUI for different bioinformatics applications (such as gene mapper and sequence alignment applications) and gives end users indirect access to some open European bioinformatics databases. gUSE is essentially a virtualization environment providing a large set of high-level DCI services through which interoperation among classical service and desktop grids, clouds and clusters, unique web services and user communities can be achieved in a scalable way. gUSE has a graphical user interface called WS-PGRADE. All parts of gUSE are implemented as a set of Web services. WS-PGRADE uses the client APIs of the gUSE services to turn user requests into sequences of gUSE-specific Web service calls. Our bioinformaticians need application-specific portlets to tailor the portal to their work. To support the development of such application-specific UIs we have used the Application Specific Module (ASM) API of gUSE, with which such customization can be done easily and quickly. Some remaining features were taken over from WS-PGRADE. Our GUI is built from JSR168 compliant portlets and can be accessed via normal Web browsers (shown in Fig. 1.).

Login screen of the HP-SEE Bioinformatics eScience Gateway


1.2. MorphCheck

MorphCheck (MC) is a tissue image analyzer framework that processes high resolution digital tissue images. With its extendable algorithm repository, the MorphCheck software framework can effectively recognize a large number of differentiated tissue structures (such as surface epithelium, gland structures, lamina muscularis, submucosa etc.) and measure their morphological and morphometrical properties. The software supports both some vendor-specific tissue scanner image formats and regular image standards (such as Tagged Image File Format /tiff/ and Joint Photographic Experts Group /jpg/). It supports various staining schemes, such as HE (Hematoxylin-Eosin), DAB (3,3'-Diaminobenzidine) and multi-color FISH. It contains various texture-based algorithms as well as intensity- and structure-based algorithms (such as K-means, region growth, etc.).

MorphCheck tissue image analyzer framework GUI

1.3. WND-CHARM

WND-CHARM is an acclaimed open source image classifier application developed at the National Institute on Aging (NIA, NIH) that supports generic image analysis methods. We currently use v1.30.227 as the baseline application for our MC-WND image classification service and for performance measurements as well. We have modified the original WND-CHARM software to include our MC tissue parameters. We have modified both the training and the classification parts of the software.

2.1 The MC-WND tissue image classification service

We have extended our generic MorphCheck medical (tissue) image analysis framework to accurately measure our newly defined tissue image parameters. The objective numerical values of the pre-defined tissue parameters calculated by MorphCheck enable us to integrate and adapt a generic image classifier software solution, which can classify tissue images automatically and effectively based on our parameter set. The two software solutions (MorphCheck and WND-CHARM) have been loosely coupled to realize a single tissue image classification service (MC-WND). Data exchange between the two software solutions is realized with simple file exchange mechanisms. The MC-WND (Figure 3) tissue image classification service allows researchers to process and categorize medical high resolution tissue images quickly and easily using HPC infrastructure.

MC-WND tissue image classification service schematic overview


2.2 The MC-WND tissue image classification workflow

We have defined the image classification tasks within a single workflow.

Inputs of the MC-WND workflow are:

  • Fragment of the high resolution tissue image:
    • size: 512x512, in TIFF format,
    • resolution/zoom level: the same as defined during (automatic/manual) ROI definition/annotation,
  • MC calculated parameter results exported into a csv file.

Outputs of the MC-WND workflow are:

  • WND-CHARM calculated image parameter results stored in a csv file,
  • classification process results, stored in an html file (contains all the calculated statistical results).

2.3. MC-WND Training set

The modified WND-CHARM application uses our tissue image specific parameters as classifier parameters during both the training phase and the image classification phase. Our tissue image training set contains more than 90 annotated HE (Hematoxylin-Eosin) colon tissue image samples with the following main categories: healthy, malignant (adenoma and carcinoma). All tissue image annotations are done by expert pathologists at the 2nd Department of Internal Medicine, Semmelweis University. The training phase has to be re-launched each time the annotated training image set or the parameter set is extended. Fortunately this is a rare event, since a single training phase normally lasts about 10 hours. In our current implementation the service generates a report in html format, which contains all the calculated statistical results of the classification process (accuracy, prediction, interpolation). We use stdout to monitor the process and receive status information.

2.4. Workflow implementation

MC-WND workflow in WS-PGrade/gUSE

As shown in Figure 5, the defined workflow consists of two consecutive jobs implemented in the WS-PGRADE workflow language. The first job is a preprocessor; the second job runs WND-CHARM in a parametric manner. The second job is launched in parallel once for each tissue image the service receives from outside. WND-CHARM is installed and launched in user space, which was hard to realize. We use LibTIFF and FFTW as external software packages inside our service. We collect the results from all the WND-CHARM instances both from stdout (as a file) and from the generated html files. We have carried out a performance evaluation to see how WND-CHARM runs on HPC infrastructure.
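The parametric second job described above follows a fan-out/collect pattern: one task per incoming tissue image, with all results gathered at the end. The following is only a minimal sketch of that pattern on a single machine, not the WS-PGRADE implementation. The function `classify_tile` is a hypothetical stand-in for one WND-CHARM job instance, and the tile names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_tile(tile_name: str) -> dict:
    """Hypothetical stand-in for one WND-CHARM job instance.

    A real job would run the classifier on the tile; here we only
    return a record so the fan-out/collect pattern can be shown.
    """
    return {"tile": tile_name, "status": "done"}


def run_parametric_job(tile_names, max_workers: int = 4):
    """Launch one classification task per tile and collect all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with tile_names.
        return list(pool.map(classify_tile, tile_names))


if __name__ == "__main__":
    tiles = [f"tile_{i:03d}.tiff" for i in range(5)]
    for record in run_parametric_job(tiles):
        print(record)
```

On the real infrastructure the workflow engine and the cluster scheduler take the place of the thread pool; the sketch only shows that the tasks are independent of each other.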

Infrastructure Usage

  • Home system: OE cluster/HU
    • Applied for access on: 08.2010
    • Access granted on: 08.2010
  • Accessed production systems:
  1. NIIF's infrastructure/HU
    • Applied for access on: 09.2010
    • Access granted on: 10.2010

The workflow has been executed on one of the HPC centers operated by NIIF, called “Budapest”. It is an HP fat-node cluster built from CP4000BL blades, consisting of 32 nodes with 24 Magny-Cours CPU cores each (768 CPU cores in total). It has a mesh-like topology with an InfiniBand internal network, 1.96 TB of memory, and a total performance of about 5.48 TFlops. Each measurement was executed 10 times, and the average of the 10 executions was taken as the final result.

  • Porting activities: The application has been successfully ported, the core workflow was successfully created, and the GUI portlet was designed and created.
  • Scalability studies: Tests on 32 nodes with 24 cores each (768 cores)


Scalability

A single run of the image classification process takes about 10 minutes for a 512x512 tissue image. A typical whole high resolution tissue image today is about 4096x4096, which is about 64 times larger than our 512x512 unit size. We have submitted 991 tissue image units to the HPC infrastructure. The following graphs show the results of the multiple executions of WND-CHARM on the HPC infrastructure.
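The factor of 64 follows from the tile arithmetic: a 4096x4096 image splits into 8 x 8 units of 512x512. The sketch below works this out in Python. The helper `tile_boxes` is a hypothetical illustration and is not part of the MC-WND service, and it assumes the image dimensions are exact multiples of the unit size.

```python
def tile_boxes(width, height, tile=512):
    """Return (left, top, right, bottom) boxes that cover the image in tile x tile units."""
    if width % tile or height % tile:
        raise ValueError("this sketch assumes dimensions are multiples of the tile size")
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


boxes = tile_boxes(4096, 4096)
print(len(boxes))        # 64 units per whole image

# At about 10 minutes per unit, processing one whole image unit by unit
# on a single CPU would take roughly this many minutes.
print(len(boxes) * 10)   # 640
```

At roughly 10 minutes per unit, one whole image would need on the order of 640 minutes on a single CPU, which is the motivation for running the units in parallel.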

Fig. 5. Average MC-WND service execution time using HPC infrastructure (in min)

The average queuing overhead was ~20 minutes. The average execution time was 10 minutes per image. The average total wall clock time for the whole image set was 3.02 hours; the average processing time for the whole image set was 11,503 sec.
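As a rough, back-of-the-envelope check of these figures, the sketch below estimates the speedup over a purely serial run. The serial baseline is an assumption, not a measurement: it is derived as 991 units multiplied by the 10-minute average, and it is compared against the 3.02-hour wall clock figure quoted above.

```python
IMAGES = 991                 # image units submitted (from the text)
MINUTES_PER_IMAGE = 10       # average execution time per unit (from the text)
MEASURED_WALL_HOURS = 3.02   # average total wall clock time on HPC (from the text)


def serial_hours(images, minutes_per_image):
    """Estimated time to process all units one after another on a single CPU."""
    return images * minutes_per_image / 60


def speedup(serial, parallel):
    """Ratio of serial time to parallel wall clock time."""
    return serial / parallel


estimated_serial = serial_hours(IMAGES, MINUTES_PER_IMAGE)
estimated_speedup = speedup(estimated_serial, MEASURED_WALL_HOURS)

print(f"estimated serial time: {estimated_serial:.1f} h")
print(f"estimated speedup:     {estimated_speedup:.1f}x")
```

Under these assumptions the serial run would take roughly 165 hours, giving a speedup of about 55. The estimate says nothing about how many of the 768 cores were actually allocated to the run.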

Fig. 6. MC-WND service execution time (HPC vs. single PC)

Figure 6 shows the WND-CHARM wall clock (execution) time compared to a single CPU. We gain a significant speedup from the parallel execution of WND-CHARM, even though queuing and result collection add some overhead.

2.1 Classification accuracy

To create a usable classification procedure for colonic tissue images based on an automatic pre-filtering solution, we have introduced a cut-off number that defines a tolerance level for the classification certainty. Below this value, the image is automatically marked as malignant and forwarded to manual evaluation. In our large-scale tissue image classification tests, with more than 200 tissue images and with the cut-off value below 60%, we were able to distinguish between healthy and malignant colonic tissue images with 100% accuracy in the healthy category (that is, the healthy category contains only healthy images).
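The cut-off rule can be sketched as a small Python function. This is an illustration under stated assumptions, not the service's implementation: the name `prefilter` is hypothetical, the certainty is assumed to be the classifier's confidence in the healthy class expressed as a value between 0 and 1, and the default cut-off of 0.60 is taken from the 60% mentioned in the text.

```python
def prefilter(healthy_certainty, cutoff=0.60):
    """Label an image as healthy only when the certainty reaches the cut-off.

    Anything below the cut-off is treated as malignant and sent on to
    manual evaluation, so uncertain images never land in the healthy class.
    """
    if not 0.0 <= healthy_certainty <= 1.0:
        raise ValueError("certainty must be between 0 and 1")
    if healthy_certainty >= cutoff:
        return "healthy"
    return "malignant (forward to manual evaluation)"


for certainty in (0.95, 0.60, 0.59, 0.10):
    print(certainty, "->", prefilter(certainty))
```

The rule is deliberately asymmetric: it accepts false alarms that a pathologist then reviews in exchange for keeping the healthy category free of malignant images.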

Achieved Results

We have successfully ported the modified WND-CHARM image classification software to run as a service on distributed computing infrastructure (the HP-SEE supercomputing infrastructure). We have created the workflow structure for the MC-WND image classification service. The service has been tested with HE stained tissue images and is capable of separating healthy and malignant tissue images automatically with high accuracy. The service and its internal workflow were developed at Obuda University and are hosted on the HP-SEE Life Science/Bioinformatics eScience Gateway. The service can be used to classify tissue images of the colonic region against our large tissue training databases in a short time using the HP-SEE supercomputing infrastructure at NIIF, Hungary.

Publications

  • M. Kozlovszky, K. Hegedűs, G. Windisch, L. Kovács, G. Pintér; Image Classification Optimization of High Resolution Tissue Images
  • M. Kozlovszky, K. Hegedűs, S. Szénási, G. Kiszler, B. Wichmann, I. Bándi, L. Kovács, Z. Garaguly, V. Jónás, G. Kiss, G. Valcz, B. Molnár; Parameter assisted HE colored tissue image classification


Foreseen Activities

Further scientific publications about the porting of the data mining tool are foreseen (showing results of some comparative data analysis targeting polygene type diseases). As future work, we plan to include other image classifier solutions in our MC-WND service portfolio in a plug-in like manner. We are planning to open up our image classification service to the wider research community. So far, the tissue image classification service can only be executed via the LifeScience Portal as a standalone workflow or from the MorphCheck software. We are planning to enhance our portlet-based web user interface for the service to let pathologists manually upload and evaluate tissue images.
