preprocess

preprocess.README
Institute for Systems Biology
(c) Trey Ideker, October 2000

preprocess [OPTIONS] <dappleFile> <geneKey> <processedOutput>

The 'preprocess' script converts the raw intensity data contained in the Dapple
output file <dappleFile> into a sorted list of background-subtracted, normalized
intensities for each gene on the DNA microarray. Gene names are read from the
tab-delimited text file <geneKey>, and the processed data are written to
<processedOutput>.

GENEKEY FORMAT: The first line of <geneKey> specifies the number of rows of
spots on the microarray slide with the keyword num_rows_per_slide, e.g.

num_rows_per_slide 144

Each of the following rows lists the mapping between a spot position and a
particular gene name using four columns: microarray row, column, corresponding
gene name, and gene description. Row and column numbering starts with (0,0) in
the upper-left-hand corner of the microarray and proceeds towards the lower
right.
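
As an illustration of the format described above, a reader for <geneKey> might
be sketched in Python as follows. This is not the code used by preprocess. The
function name read_gene_key is hypothetical, and the sketch assumes that the
keyword and its value on the first line are separated by whitespace, that
blank lines can be skipped, and that the description column may be missing.
The two sample entries reuse gene names and slide positions from the example
output at the end of this file.

```python
import io


def read_gene_key(handle):
    """Parse a geneKey stream.

    Returns (num_rows_per_slide, {(row, col): (name, description)}).
    Hypothetical helper for illustration; not part of preprocess.
    """
    first = handle.readline().split()
    if len(first) != 2 or first[0] != "num_rows_per_slide":
        raise ValueError("first line must be 'num_rows_per_slide <n>'")
    num_rows = int(first[1])

    spots = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected row, column, name[, description]: {line!r}")
        row, col, name = int(fields[0]), int(fields[1]), fields[2]
        description = fields[3] if len(fields) > 3 else ""
        spots[(row, col)] = (name, description)
    return num_rows, spots


# Small in-memory example; positions are 0-based, as described above.
example = io.StringIO(
    "num_rows_per_slide\t144\n"
    "141\t0\tYEL055C\tPOL5\n"
    "141\t1\tYNL080C\tYNL080C\n"
)
num_rows, spots = read_gene_key(example)
print(num_rows, spots[(141, 0)])
```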

FLAGS: Each gene may be assigned one or more flags in the output file.

  X     X intensity below background threshold (given in output header)
  Y     Y intensity below background threshold (given in output header)
  A     Abnormally high local background in X
  B     Abnormally high local background in Y
  N     No spot found by Dapple at this microarray location (row, col)
  S     Spot is saturated in X or Y intensity (intensity is above the number
        set with the -sat option)
  -     No flag set
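
The threshold-based flags can be pictured with a short Python sketch. It is
only an approximation under stated assumptions: the function assign_flags is
hypothetical, flags are assumed to be concatenated into one string, 'N' is
assumed to exclude all other flags, and the thresholds used in the example
calls are invented. The A and B flags are left out because this file does not
define what counts as an abnormally high local background.

```python
def assign_flags(x, y, x_threshold, y_threshold, saturation=None, spot_found=True):
    """Return a flag string for one spot (hypothetical helper, not preprocess code)."""
    if not spot_found:
        return "N"  # assumption: a missing spot carries no other flags
    flags = []
    if x < x_threshold:
        flags.append("X")  # X intensity below background threshold
    if y < y_threshold:
        flags.append("Y")  # Y intensity below background threshold
    if saturation is not None and (x > saturation or y > saturation):
        flags.append("S")  # above the value that would be given with -sat
    return "".join(flags) or "-"


# Invented thresholds (1500) and saturation level (60000), for illustration only.
print(assign_flags(2017, 10655, 1500, 1500, saturation=60000))
print(assign_flags(1001, 3818, 1500, 1500, saturation=60000))
print(assign_flags(33393, 64009, 1500, 1500, saturation=60000))
print(assign_flags(0, 0, 1500, 1500, spot_found=False))
```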

COMMAND LINE OPTIONS:

  -base <num>   Output the log intensity ratio using base <num>. To obtain the
                natural logarithm, specify 'e'. The default is base 10.

  -norm {median, mean, none}  Specifies the method for normalizing intensities
                between the two dyes X and Y. 'median' (the default) scales
                X and Y intensities by fixed factors Mx and My so that the
                median X intensity equals the median Y intensity over the
                brightest 50% of spots on the microarray (as ranked by X+Y
                intensity). 'mean' uses the mean instead of the median.
                'none' applies no normalization.

  -sat <num>    Specify a saturating intensity for the microarray scanner.
                Intensities above this number are flagged with 'S'.  

  -scale <num>  During median normalization, forces median(x)=median(y)=<num>.
                Without this option (by default), median normalization sets
                median(x)=median(y)=average(median(x),median(y)). This option
                is useful if multiple replicate microarrays are to be analyzed
                because it ensures that all of them have the same scale.

  -debug        Creates <processedOutput>.debug, containing diagnostic
                information.
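
To make the arithmetic behind -norm, -scale, and -base concrete, here is a
minimal Python sketch. It is not the implementation used by preprocess. The
names scale_factors and log_ratio are hypothetical, and several details are
assumptions: the brightest half is taken as the top floor(n/2) spots, the
'mean' method is given the same default target as 'median' (the average of the
two centers), and a -scale value is applied to either method.

```python
import math
from statistics import mean, median


def scale_factors(xs, ys, method="median", scale=None):
    """Return (Mx, My), the fixed factors applied to all X and Y intensities."""
    if method == "none":
        return 1.0, 1.0
    center = {"median": median, "mean": mean}[method]
    # Rank spots by X+Y intensity and keep the brightest half.
    order = sorted(range(len(xs)), key=lambda i: xs[i] + ys[i], reverse=True)
    top = order[: max(1, len(order) // 2)]
    cx = center([xs[i] for i in top])
    cy = center([ys[i] for i in top])
    # Default target is the average of the two centers; -scale overrides it.
    target = scale if scale is not None else (cx + cy) / 2.0
    return target / cx, target / cy


def log_ratio(x, y, base="10"):
    """Log of X/Y in the requested base; 'e' selects the natural logarithm."""
    b = math.e if base == "e" else float(base)
    return math.log(x / y, b)


xs = [100, 200, 400, 800]
ys = [200, 400, 800, 1600]
mx, my = scale_factors(xs, ys)               # brightest half: spots 3 and 2
print(mx, my)
print(scale_factors(xs, ys, scale=1000))     # forces both medians to 1000
# Intensities of YNL080C from the example output in this file.
print(round(log_ratio(2017, 10655), 3))
```

With a common -scale value, replicate arrays end up with the same median over
their brightest spots, which is what makes their intensities directly
comparable.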

OUTPUT FORMAT:

# The output file begins with a header of general
# information about the preprocessing run. Each line of
# the header starts with the '#' character.  Information
# about the total number of genes, distributions of
# x and y, and average background intensity comes first...
#
# ...followed by information pertaining to the
# normalization process...
# 
# ...then by a histogram of normalized log ratios
#
#
# Normalized, background-subtracted data starts after the
# header, one line per gene.  For example...
#
#  GENE      GENE       RATIO   LOG                            -SLIDE-
#  NAME    DESCRIPT      X/Y    RATIO   X INT   Y INT   FLAG   ROW COL
#--------  ---------  --------  ------  ------  ------  -----  --- ---
  YNL080C    YNL080C    0.1893  -0.723    2017   10655      -  141   1
  YEL055C       POL5    0.3165  -0.500    1001    3818      X  141   0
  YDL081C      RPP1A    0.5217  -0.283   33393   64009      S   58   0

  ...

  YNL330C       RPD3    0.5625  -0.250    5142    9140      B    9   3
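
For downstream use, the data lines can be read back with a few lines of
Python. The sketch below is an illustration only: the Spot type and the
read_processed function are hypothetical, and it assumes that columns are
separated by whitespace, that gene descriptions contain no whitespace, and
that multiple flags appear as one unbroken string. The sample lines are copied
from the example above.

```python
import io
from typing import NamedTuple


class Spot(NamedTuple):
    name: str
    description: str
    ratio: float
    log_ratio: float
    x: float
    y: float
    flags: str
    row: int
    col: int


def read_processed(handle):
    """Return a list of Spot records, skipping '#' header lines and blank lines."""
    spots = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        f = line.split()
        if len(f) != 9:
            raise ValueError(f"expected 9 columns: {line!r}")
        spots.append(Spot(f[0], f[1], float(f[2]), float(f[3]),
                          float(f[4]), float(f[5]), f[6], int(f[7]), int(f[8])))
    return spots


sample = io.StringIO(
    "#--------  ---------  --------  ------  ------  ------  -----  --- ---\n"
    "  YNL080C    YNL080C    0.1893  -0.723    2017   10655      -  141   1\n"
    "  YEL055C       POL5    0.3165  -0.500    1001    3818      X  141   0\n"
    "  YDL081C      RPP1A    0.5217  -0.283   33393   64009      S   58   0\n"
)
spots = read_processed(sample)
unflagged = [s.name for s in spots if s.flags == "-"]
print(unflagged)
```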