preprocess.README
Institute for Systems Biology
(c) Trey Ideker, October 2000




preprocess [OPTIONS] <dappleFile> <geneKey> <processedOutput>

The 'preprocess' script converts the raw intensity data contained in the Dapple
output file <dappleFile> into a sorted list of background-subtracted, normalized
intensities for each gene on the DNA microarray. Gene names are read from the
tab-delimited text file <geneKey>, and the processed data are written to
<processedOutput>.


GENEKEY FORMAT: The first line of <geneKey> specifies the number of rows of
spots on the microarray slide with the keyword num_rows_per_slide, e.g.

num_rows_per_slide 144

Each of the following rows lists the mapping between a spot position and a
particular gene name using four columns: microarray row, column, corresponding
gene name, and gene description. Row and column numbering starts with (0,0) in
the upper-left-hand corner of the microarray and proceeds towards the lower
right.
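
As an illustration of the layout described above, a <geneKey> file could be
read along the following lines. This is a minimal Python sketch, not part of
'preprocess' itself: the function name parse_gene_key is hypothetical, and the
sketch assumes that blank lines can be skipped and that the description column
may be absent.

```python
def parse_gene_key(lines):
    """Parse geneKey lines into (num_rows_per_slide, {(row, col): (name, desc)})."""
    it = iter(lines)
    # First line: the keyword num_rows_per_slide followed by the row count.
    header = next(it).split()
    if len(header) != 2 or header[0] != "num_rows_per_slide":
        raise ValueError("first line must be 'num_rows_per_slide <n>'")
    num_rows = int(header[1])

    spots = {}
    for line in it:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines carry no information
        # Remaining lines are tab-delimited: row, column, gene name, description.
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError("expected row, column and gene name: %r" % line)
        row, col, name = int(fields[0]), int(fields[1]), fields[2]
        desc = fields[3] if len(fields) > 3 else ""  # assumption: may be missing
        spots[(row, col)] = (name, desc)
    return num_rows, spots


if __name__ == "__main__":
    sample = [
        "num_rows_per_slide 144\n",
        "141\t1\tYNL080C\tYNL080C\n",
        "141\t0\tYEL055C\tPOL5\n",
    ]
    num_rows, spots = parse_gene_key(sample)
    print(num_rows, spots[(141, 0)])
```

Keying the mapping on (row, col) mirrors the way positions are numbered from
(0,0) in the upper-left-hand corner.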


FLAGS: Each gene may be assigned one or more flags in the output file.

  X     X intensity below background threshold (given in output header)
  Y     Y intensity below background threshold (given in output header)
  A     Abnormally high local background in X
  B     Abnormally high local background in Y
  N     No spot found by Dapple at this microarray location (row, col)
  S     Spot is saturated in X or Y intensity (intensity is above the number
        set with the -sat option)
  -     No flag set
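
For downstream scripts, the flag column can be translated back into the
meanings listed above. The following Python sketch is illustrative only: the
name describe_flags is hypothetical, and it assumes that a gene carrying
several flags has them written as concatenated letters (for example "XS"),
which this README does not state explicitly.

```python
# Meanings copied from the FLAGS table; '-' (no flag) is handled separately.
FLAG_MEANINGS = {
    "X": "X intensity below background threshold",
    "Y": "Y intensity below background threshold",
    "A": "abnormally high local background in X",
    "B": "abnormally high local background in Y",
    "N": "no spot found by Dapple at this microarray location",
    "S": "spot is saturated in X or Y intensity",
}


def describe_flags(flag_field):
    """Return the list of meanings for a flag field such as '-', 'X' or 'XS'."""
    if flag_field == "-":
        return []  # no flag set
    unknown = [c for c in flag_field if c not in FLAG_MEANINGS]
    if unknown:
        raise ValueError("unrecognized flag(s): %s" % "".join(unknown))
    # Assumption: multiple flags are concatenated single letters.
    return [FLAG_MEANINGS[c] for c in flag_field]


if __name__ == "__main__":
    for field in ("-", "S", "XB"):
        print(field, describe_flags(field))
```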


COMMAND LINE OPTIONS:
  -base <num>   Output the log intensity ratio using base <num>. To obtain the
                natural logarithm, specify 'e'. The default is base 10.
  -norm {median, mean, none}
                Specifies the method for normalizing intensities between the
                two dyes X and Y. 'median' (the default) scales the X and Y
                intensities by fixed factors Mx and My so that the median X
                intensity equals the median Y intensity over the brightest 50%
                of spots on the microarray (as sorted by X+Y intensity). 'mean'
                uses the mean instead of the median. 'none' applies no
                normalization.
  -sat <num>    Specifies a saturating intensity for the microarray scanner.
                Intensities above this number are flagged with 'S'.
  -scale <num>  During median normalization, forces median(x)=median(y)=<num>.
                By default (without this option), median normalization sets
                median(x)=median(y)=average(median(x),median(y)). This option
                is useful if multiple replicate microarrays are to be analyzed,
                because it ensures that all of them have the same scale.
  -debug        Creates <processedOutput>.debug, containing diagnostic
                information.
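
The median normalization and log-ratio options described above can be sketched
in Python as follows. This is an illustration under stated assumptions, not the
script's actual code: the names median_normalize and log_ratio are
hypothetical, intensities are assumed to be positive and already
background-subtracted, and "top 50%" is taken to mean the first len // 2 spots
after sorting by X+Y.

```python
import math
from statistics import median


def median_normalize(x, y, scale=None):
    """Scale x and y by fixed factors so their medians agree over the
    brightest half of the spots. Returns (x_norm, y_norm, Mx, My)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    # Rank spots by total intensity X+Y and keep the brightest half.
    order = sorted(range(len(x)), key=lambda i: x[i] + y[i], reverse=True)
    top = order[: max(1, len(order) // 2)]  # assumption: round down
    med_x = median(x[i] for i in top)
    med_y = median(y[i] for i in top)
    # Default target is the average of the two medians; a given scale
    # (compare the -scale option) overrides it.
    target = (med_x + med_y) / 2.0 if scale is None else float(scale)
    mx, my = target / med_x, target / med_y
    return [v * mx for v in x], [v * my for v in y], mx, my


def log_ratio(x, y, base=10):
    """Log of x/y in the given base; base may be a number or the string 'e'."""
    b = math.e if base == "e" else float(base)
    return math.log(x / y, b)


if __name__ == "__main__":
    x = [100.0, 200.0, 400.0, 800.0]
    y = [200.0, 400.0, 800.0, 1600.0]
    x_norm, y_norm, mx, my = median_normalize(x, y)
    print(mx, my)
    print([log_ratio(a, b) for a, b in zip(x_norm, y_norm)])
```

Passing the same scale value for every replicate array puts all of them on a
common intensity scale, which is the use case given for -scale.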


OUTPUT FORMAT:
# Output file contains a header containing general
# information about the preprocessing run. Each line of
# the header starts with the '#' character.  Information
# about the total number of genes, distributions of
# x and y, and average background intensity comes first...
#
# ...followed by information pertaining to the
# normalization process...
# 
# ...then by a histogram of normalized log ratios
#
#
# Normalized, background-subtracted data starts after the
# header, one line per gene.  For example...
#
#  GENE      GENE       RATIO   LOG                            -SLIDE-
#  NAME    DESCRIPT      X/Y    RATIO   X INT   Y INT   FLAG   ROW COL
#--------  ---------  --------  ------  ------  ------  -----  --- ---
  YNL080C    YNL080C    0.1893  -0.723    2017   10655      -  141   1
  YEL055C       POL5    0.3165  -0.500    1001    3818      X  141   0
  YDL081C      RPP1A    0.5217  -0.283   33393   64009      S   58   0
...
  YNL330C       RPD3    0.5625  -0.250    5142    9140      B    9   3
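
A processed output file in the format above could be loaded along these lines.
The Python sketch below is illustrative only: the names GeneRecord and
parse_processed_output are hypothetical, columns are assumed to be separated by
whitespace, and the last seven fields of each data line are assumed to be the
numeric, flag and position columns, so that a description containing spaces
would still parse.

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    name: str
    description: str
    ratio: float      # RATIO X/Y
    log_ratio: float  # LOG RATIO
    x_int: float
    y_int: float
    flags: str        # '-' when no flag is set
    row: int
    col: int


def parse_processed_output(lines):
    """Split lines into (header_lines, list of GeneRecord)."""
    header, records = [], []
    for line in lines:
        if line.lstrip().startswith("#"):
            header.append(line.rstrip("\n"))  # header lines start with '#'
            continue
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or elided lines
        # Read the seven trailing columns from the right-hand side.
        ratio, log_r, x_int, y_int, flags, row, col = fields[-7:]
        records.append(GeneRecord(
            name=fields[0],
            description=" ".join(fields[1:-7]),
            ratio=float(ratio),
            log_ratio=float(log_r),
            x_int=float(x_int),
            y_int=float(y_int),
            flags=flags,
            row=int(row),
            col=int(col),
        ))
    return header, records


if __name__ == "__main__":
    sample = [
        "#--------  ---------  --------  ------  ------  ------  -----  --- ---\n",
        "  YNL080C    YNL080C    0.1893  -0.723    2017   10655      -  141   1\n",
        "  YDL081C      RPP1A    0.5217  -0.283   33393   64009      S   58   0\n",
    ]
    header, records = parse_processed_output(sample)
    unflagged = [r.name for r in records if r.flags == "-"]
    print(len(header), unflagged)
```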