Automatic subarray detection in microarray images

A. Mastrogianni1, G. Doukas1, E. Dermatas1 and A. Bezerianos2

1 Dept. of Electrical Engineering and Computer Technology, University of Patras, HELLAS
2 Dept. of Medical Physics, School of Medicine, University of Patras, HELLAS

Abstract. This work describes and evaluates a novel algorithm for automatic subarray detection in microarray images, a step on which the accuracy of the subsequent microarray image analysis depends. Initially, within the detected microarray area, a novel projection-profile-based method derives the subarray grid. During grid detection, the spot spacing is estimated and used to locate the subarrays exactly. The accuracy and efficiency of the approach are validated on three different databases of real, distorted microarray images, giving 58.82%, 100% and 83.33% error-free detection of subarray positions, respectively.

Keywords: microarray image analysis, gridding, subarray detection.

1 Introduction

DNA microarray technology has enabled the scientific community to study the fundamental mechanisms underlying the growth and development of life, and to explore the genetic causes of anomalies in the functioning of the human body, by offering a powerful tool for measuring gene expression activity. In a typical microarray setting, thousands of cDNA clones are robotically spotted onto coated glass slides in a highly condensed array. The information extracted in a single microarray experiment is derived almost exclusively from the spot intensities of a digital image. Due to the nature of the acquisition process, microarray images contain noise caused by dust, fingerprints, small particles, distortions of the optical components and the electronics. Furthermore, rotations, misalignment and local deformations of the ideally rectangular grid often occur and consequently affect the accuracy of the subsequent data analysis.
The whole image is composed of a matrix of equally spaced blocks called subarrays. Each subarray consists of a certain number of rows and columns, not necessarily the same. As shown in Fig.1, a typical microarray image contains several equal-size subarrays. The spots in a subarray are arranged with relatively uniform spacing. They are roughly circular, though some deviate significantly from this shape due to experimental variation in the spotting procedure. In general, the shape and the size of the spots may fluctuate significantly across the array. Typical values of the spot radius in a real microarray image are 2, 5, 9 and 12 pixels, while the spacing along the rows varies from 17 to 22 pixels and the spacing along the columns from 18 to 24 pixels.

The ideal microarray image (Fig.1) has the following properties:
1. the size of the subarrays is identical,
2. the spacing between subarrays is regular,
3. the spots are centered on the intersections of the lines of the subarray,
4. the spots are circular and identical in size and shape,
5. the location of the grids is fixed in images for a given type of slide,
6. there is no dust or contamination on the slide, and finally,
7. the background brightness is minimal and uniform across the image.
In typical microarray images, none of these properties is satisfied. The aim of image pre-processing methods is to restore the properties of the ideal microarray image in distorted images.

Fig.1. Illustration of an ideal microarray image with constant shape. Labelled elements: spot, horizontal spacing, vertical spacing, subarray spacing, subarray/subgrid.

Microarray image processing consists of five tasks that are carried out sequentially: gridding (or addressing), segmentation, quantification, normalization and data mining [4-6].
The first critical stage of the image analysis process is the identification of the spot centers in a microarray image, or more usually in a subarray (subgrid) image, to facilitate the addressing procedure. In the last decade, many approaches to the gridding problem have been proposed in the field of bioinformatics, but the pre-processing step of subarray detection is rarely taken into consideration [7-12]. A considerable number of the proposed gridding methods arbitrarily assume that the subarrays have already been identified, either manually or automatically, and the whole approach is evaluated on a single subarray image specified by the researcher. A few researchers [13-16] have nevertheless made a remarkable effort to overcome this assumption and to solve the subarray detection problem. The disadvantage of those approaches is that they have not been tested on microarray images with a high level of noise.

In the next section, a detailed outline of the subarray detection problem is presented. The novel algorithm is applied to real microarray images regardless of whether the noise level is high or low. In section 3, the experimental databases and a number of representative examples are shown. Finally, a short discussion of the results is given.

2 Subarray detection

The proposed method is subdivided into five main steps, as shown in Fig.2, and requires only four explicitly defined parameters: the numbers of microarray and subarray grid-rows and grid-columns. Fig.3 lists the microarray image parameters and the symbols used for them in the rest of this document. The subarray spot spacing, although preferably known a priori, can easily be estimated and evaluated during step (d) of Fig.2. An early estimate of the spot spacing can be obtained by dividing the width of the image in pixels by the expected number of spot columns, i.e. the product of the number of grid columns and subarray grid columns (Ncol * ncol).

Another parameter that proves valuable in detecting the microarray grid is the spacing of the subarrays along the x and y axes, as shown in Fig.1. Generally, the space between subarrays is not constant, but its variance in a typical microarray image is very small. Depending on the experimental image database, this parameter may not be fixed, but at least a bounded estimate should be available, either as an absolute pixel value or as a function of the spot spacing. If this parameter is not known a priori it can be estimated, when needed, as described in step (e).

(a) Noise reduction
(b) Detection of microarray area
(c) Align the microarray image grid with the horizontal and vertical axis
(d) Locate spot centers and evaluate spot spacing
(e) Subarray and spot detection
Fig.2. Flowchart of the proposed subarray detection method.

In the first step, the microarray image is converted into a gray-scale image and a median filter is applied to reduce “salt&pepper”-like noise. Next, a binary version of the image is constructed using a threshold level produced by Otsu's method [17].

Parameter                             Symbol
Number of microarray grid columns     Ncol
Number of microarray grid rows        Nrow
Number of subarray grid columns       ncol
Number of subarray grid rows          nrow
Spot spacing                          d
Subarray spacing in X axis            Dx
Subarray spacing in Y axis            Dy
Subarray width                        Wsa = ncol * d
Subarray height                       Hsa = nrow * d
Fig.3. Description of the set of image parameters used in the proposed method.

The second step determines the rectangular region that envelops the important data of the microarray image. This is done by locating the first and the last row and column of the binary image in which the brightness sum exceeds a specified threshold. The threshold may vary, depending on the geometry of the grid and the noise of the image.
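The row and column scan of the second step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the binary image is assumed to be a list of rows of 0/1 values, so that the brightness sum of a row or column is simply its count of white pixels. The function name `locate_microarray_area` and the toy image are hypothetical.

```python
from typing import List, Optional, Tuple

Binary = List[List[int]]  # rows of 0/1 pixels; 1 marks a white pixel (assumed encoding)


def locate_microarray_area(
    binary: Binary, threshold: int
) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, bottom, left, right), all inclusive, of the first and last
    row and column whose white-pixel sum exceeds `threshold`.

    Returns None when no row or no column exceeds the threshold.
    """
    row_sums = [sum(row) for row in binary]
    col_sums = [sum(col) for col in zip(*binary)]  # zip(*...) iterates over columns

    rows = [i for i, s in enumerate(row_sums) if s > threshold]
    cols = [j for j, s in enumerate(col_sums) if s > threshold]
    if not rows or not cols:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


# Toy 6 x 8 image: one isolated noise pixel and a 3 x 4 block of "spots".
image = [[0] * 8 for _ in range(6)]
image[0][0] = 1                      # isolated noise pixel
for r in range(2, 5):
    for c in range(3, 7):
        image[r][c] = 1              # block standing in for the spot area

print(locate_microarray_area(image, threshold=1))  # (2, 4, 3, 6)
```

With a threshold of 1 the isolated pixel in the top-left corner is ignored, because its row and column each sum to 1, which does not exceed the threshold.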
In the microarray images of the three tested databases, a threshold equal to the estimated spot spacing proved to be a sufficient choice. The actual cropping is performed on a rectangle larger than the one delimited by the located rows and columns, to avoid losing important data. Expanding the original rectangle in every dimension by a value equal to the previously derived threshold has proved sufficient.

The next step determines the alignment angle of the grid with respect to the horizontal and vertical axes. To find this rotation, an alignment evaluation function is required, which computes the following:
1. the sum of white pixels of each row r of the image: S(r),
2. the maximum sum M over all rows: M = max(S(r)),
3. the number of rows Chigh satisfying the condition S(r) > M - Th, where Th is a threshold estimated automatically as a function of the spot spacing,
4. the number of rows Clow satisfying the condition S(r) < Tl, where the threshold Tl is also estimated automatically as a function of the spot spacing,
5. the value of the alignment evaluation function, given by the sum of Chigh and Clow.
An alternative implementation of the proposed evaluation function derives the corresponding parameters column-wise. In both implementations the evaluation function counts the rows (columns) with very high or very low accumulated brightness. The greater the value of the evaluation function, the better the grid alignment. The evaluation function is applied to a set of images produced by rotating the cropped binary image (created at step (b)) over a range of application-specific angles. The rotated image with the greatest evaluation value defines the desired alignment. In practice, only relatively small angles (less than 5 degrees) need to be examined, so the third step achieves high accuracy at low computational complexity.
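The row-wise evaluation function described above can be sketched as follows. This is a minimal illustration, not the authors' code: binary images are assumed to be lists of rows of 0/1 values, the rotation itself is left out (the candidate images are assumed to have been produced beforehand by rotating the cropped binary image with any image library), and the names `alignment_score`, `best_alignment`, `t_high` and `t_low` are hypothetical stand-ins for the evaluation function, the selection step, Th and Tl.

```python
from typing import List, Sequence

Binary = List[List[int]]  # rows of 0/1 pixels; 1 marks a white pixel (assumed encoding)


def alignment_score(binary: Binary, t_high: int, t_low: int) -> int:
    """Row-wise alignment evaluation: Chigh + Clow."""
    row_sums = [sum(row) for row in binary]                # S(r)
    m = max(row_sums)                                      # M = max(S(r))
    c_high = sum(1 for s in row_sums if s > m - t_high)    # rows with S(r) > M - Th
    c_low = sum(1 for s in row_sums if s < t_low)          # rows with S(r) < Tl
    return c_high + c_low


def best_alignment(candidates: Sequence[Binary], t_high: int, t_low: int) -> int:
    """Index of the candidate (pre-rotated) image with the greatest score."""
    scores = [alignment_score(img, t_high, t_low) for img in candidates]
    return scores.index(max(scores))


# Toy data: in the aligned image, rows are either full of spots or empty.
spot_row = [1, 1, 0, 1, 1, 0]
empty_row = [0, 0, 0, 0, 0, 0]
aligned = [spot_row, spot_row, empty_row, spot_row, spot_row, empty_row]

# In the tilted image, white pixels are smeared across all rows.
tilted = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]

print(alignment_score(aligned, t_high=1, t_low=1))        # 6
print(alignment_score(tilted, t_high=1, t_low=1))         # 2
print(best_alignment([tilted, aligned], t_high=1, t_low=1))  # 1
```

The toy thresholds are chosen only to suit the 6 x 6 example. The column-wise variant mentioned in the text would apply the same function to the transposed image, e.g. `[list(col) for col in zip(*binary)]`.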
During the fourth step, individual spots on the aligned binary image are identified and labeled by detecting the isolated white areas [18]. The center of each spot is located by estimating the mean pixel position of each isolated area. Finally, the spot spacing, if not known a priori, can be estimated as the most frequent pixel distance between successive spot centers.

In the last step, grid detection is performed on an artificial binary image, generated by drawing filled circles centered on the previously detected spot centers with radius equal to 70% of the spot spacing (derived or known a priori). In the artificial image, the subarray detection algorithm locates “empty” regions along the x and y axes, where the sum of the columns or rows, respectively, is less than a specific threshold, indicating the possible existence of a grid line. This procedure is executed repeatedly, starting from a relatively small threshold and increasing it until the following criteria are met for each axis separately:
1. The distance between successive grid lines must lie within specific limits.
2. The number of detected grid lines must equal the expected number of grid columns or rows plus one.
The limits on the grid-line distance in the first criterion depend on the geometrical features of the input microarray image. Specifically, the detected grid-line distance may vary from Wsa to Wsa + Dx horizontally, and from Hsa to Hsa + Dy vertically. If the parameters Dx and Dy are not known a priori, they can be estimated from the width of the detected “empty” regions.

3 Experimental results

The proposed method was evaluated on three databases of gif-formatted files. The “Human Sarcomas” database contains 34 microarray images of 32 subarrays, each consisting of 6 x 8 spots [19]. The “Young_vs_Old_Transgenic” database consists of 14 microarray images containing 48 subarrays of 29 x 30 spots [20].
A total of 36 cDNA microarray images from the “Lymphoma/Leukemia Molecular Profiling Project” (LLMPP) constitute the third database, in which each image contains 16 subarrays of 24 x 24 spots [21].

The automatically defined thresholds are the same for all databases: Th = 2 * d, Tl = 2 * d. A good choice for the crop threshold is twice the estimated spot spacing. Only in the LLMPP database does the microarray alignment process derive significant projection angles. Therefore, the search range of the microarray alignment is extended to [-3, 3] degrees.

For the “Human Sarcomas” database, the proposed method detected all subarrays in 58.82% of the available images, as in the example of Fig.4. The algorithm proved ineffective for a number of images that were strongly distorted during the microarray experiment. Fig.5 (a) depicts a portion of a microarray image (76168.gif) in which a number of low-brightness spots can be observed while strong “salt&pepper” noise is also present. This results in the loss of valuable information during binary image generation and consequently in the misestimation of the spot spacing and the misalignment of the detected grid. Fig.5 (b) illustrates another example of an extremely distorted image (71824.gif) in which, although the spot intensity is adequate, the noise level is too high. Note that both images in Fig.5 were inverted and contrast-adjusted so that the reader can perceive the severity of the problem. Note also that in the “Human Sarcomas” database the number of spots per subarray is considerably small and the spacing between adjacent spots is greater than the typical distance met in other databases. These properties cause the proposed method, during step (d), to wrongly assign “salt&pepper”-like noise as spot centers.

Fig.4. Original microarray image from the “Human Sarcomas” database, the corresponding artificially generated image and the detected subarrays.

Fig.5. (a) Portion of a microarray image from the “Human Sarcomas” database with low-brightness spots, (b) example of a heavily noisy image from the “Human Sarcomas” database.

For the “Young_vs_Old_Transgenic” database, 100% of the subarrays are detected: the method works perfectly on all microarray images (Fig.6), in spite of the presence of noise. For the LLMPP database, successful detection of the subarrays reached 83.33% (Fig.7). Although the microarray grid was tilted in most of these images, the method corrected the alignment and detected the subarrays properly. In contrast with the “Human Sarcomas” database, these images have a sufficient spot density, which allows the algorithm to estimate the spot spacing more accurately. Grid detection failed in very heavily distorted and noisy images.

4 Conclusions

Image analysis is an essential aspect of microarray experiments. Until recently, the pre-processing step of subarray detection required human intervention, which affected the whole procedure, since manual detection takes considerable time to achieve good results. The proposed method detects precisely a respectable number of subarrays in three different databases of scanned microarray images, which differ in the amount of noise, the number of subarrays and the number of rows and columns in each subarray. Among the most important advantages of the proposed method is the accurate detection of subarrays even in relatively noisy and misaligned images.

Fig.6. Original microarray image drawn from the “Young_vs_Old_Transgenic” database and its corresponding artificially generated image.
Fig.7. (a) Original image, (b) gray-scale rotated version, (c) artificially generated image with the detected subarrays, (d) detection of the subarrays for the original image.

Acknowledgments. This paper is part of the 03ED013 research project, implemented within the framework of the “Reinforcement Programme of Human Research Manpower” (PENED) and co-financed by National and Community Funds (20% from the Greek Ministry of Development-General Secretariat of Research and Technology and 80% from E.U.-European Social Fund).

References

1. Stefano Lonardi, Yu Luo, ‘Gridding and Compression of Microarray Images’, Proc. of Computational Systems Bioinformatics, CSB-2004.
2. Peter Bajcsy, http://algdocs.ncsa.uiuc.edu/PR-20050204-1.pdf
3. A. Kuklin, ‘Laboratory automation in microarray image processing’, American Laboratory, pp. 64-67, May 2000.
4. Gerda Kamberova, Shishir Shah, ‘DNA Array, Image Analysis, Nuts & Bolts’, DNA Press LLC, 2002.
5. S. Draghici, ‘Data Analysis Tools for DNA Microarrays’, CRC Mathematical Biology and Medicine Series, Chapman & Hall, London, UK, 2003.
6. J. Han and M. Kamber, ‘Data Mining: Concepts and Techniques’, Morgan Kaufmann, San Francisco, Calif, USA, 2001.
7. Yuan-Kai Wang, Cheng-Wei Huang, ‘DNA Microarray Image Analysis Using Active Contour Model’, Proc. Computational Systems Bioinformatics, CSB-2004.
8. Peter Bajcsy, ‘Gridline: Automatic Grid Alignment in DNA Microarray Scans’, IEEE Trans. on Image Processing, vol. 13, no. 1, Jan. 2004.
9. Christian Uehara, Ioannis Kakadiaris, ‘Towards Automatic Analysis of DNA Microarrays’, Sixth Workshop on Applications of Computer Vision, WACV-2002.
10. Jinn Ho, Wen-Liang Hwang, Henry Horn-Shing Lu, and D. T. Lee, ‘Gridding Spot Centers of Smoothly Distorted Microarray Images’, IEEE Trans. on Image Processing, vol. 15, no. 2, Feb. 2006.
11. A. Mastrogianni, E. Dermatas, A. Bezerianos, ‘Robust pre-processing and noise reduction in microarray images’, Proc. of the 5th IASTED International Conference Biomedical Engineering, 2007.
12. Daniel Morris, ‘Blind Microarray Gridding: A New Framework’, IEEE Trans. on Systems, Man and Cybernetics - Part C: Applications and Reviews, vol. 38, no. 1, Jan. 2008.
13. Y. Wang, M. Ma, K. Zhang, and F. Shih, ‘A Hierarchical Refinement Algorithm for Fully Automatic Gridding in Spotted DNA Microarray Image Processing’, Information Sciences, 177(4):1123-1135, 2007.
14. Y. Wang, F. Shih, and M. Ma, ‘Precise Gridding of Microarray Images by Detecting and Correcting Rotations in Subarrays’, Proc. of the 8th Joint Conference on Information Sciences, pp. 1195-1198, Salt Lake City, USA, 2005.
15. R. Fabbri, L. da F. Costa, J. Barrera, ‘Towards Non-Parametric Gridding of Microarray Images’, 14th International Conference on Digital Signal Processing, vol. 2, pp. 623-626, 1-3 July 2002.
16. Jinn Ho, Wen-Liang Hwang, Henry Horn-Shing Lu, D. T. Lee, ‘Gridding spot centers of smoothly distorted microarray images’, IEEE Trans. on Image Processing, vol. 15, no. 2, Feb. 2006.
17. N. Otsu, ‘A threshold selection method from gray-level histograms’, IEEE Trans. Sys., Man, Cyber., vol. 9, pp. 62-66, 1979.
18. Robert M. Haralick and Linda G. Shapiro, ‘Computer and Robot Vision’, vol. I, Addison-Wesley, pp. 28-48, 1992.
19. Human Sarcomas database. Available at http://smd.stanford.edu/.
20. Young_vs_Old_Transgenic database. Available at http://smd.stanford.edu/.
21. The Lymphoma/Leukemia Molecular Profiling Project database. Available at http://llmpp.nih.gov/.