mergeReps.README
Institute for Systems Biology
(c) Trey Ideker, October 2000
mergeReps [OPTIONS] <fileTable> <mergedOutput>
This script merges data from replicate microarray experiments, contained in
separate files produced by the 'preprocess' script, into a single <mergedOutput>
file. Data files are listed by name in the <fileTable> along with other required
information according to a specific format.
FILETABLE FORMAT:
Each row contains three columns: <filename>,
<labeling_direction>, and <slide_ID>. Each <filename> pertains to a different
replicate data set output by the 'preprocess' script. <Labeling_direction> may
be f (forward) or r (reverse) and is used to group files according to which of
the two dyes (X or Y) was used to represent condition i vs. ii. For reverse-
labeled data, the program will reassign intensity measurements for each gene so
that x->y and y->x. <Slide_ID> is a unique alphanumeric identifier assigned to
each distinct microarray slide. Comments may be included in the filetable if they
are preceded by a number sign '#'. For example, the following filetable lists
four files containing intensity data from four replicate microarray experiments
that compare conditions i and ii:
#######################################
# fname             dir   id          #
#######################################
processed file 1    f     1
processed file 2    f     1
processed file 3    r     2
processed file 4    r     2
#######################################
In the first two files, dye X represents condition i and dye Y represents
condition ii, while in the second two files this mapping is reversed. The first
two files (or equivalently the second two files) contain data from replicate
microarrays printed next to each other on the same slide, so they have identical
slide IDs. The first three lines and the last line of the filetable are
comments.
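The filetable rules above can be sketched in Python as follows. This is an
illustration only, not the script's actual parser: the names FileEntry,
parse_filetable, and orient are hypothetical, and the sketch assumes that
columns are separated by whitespace, that the last two fields on a row are the
labeling direction and slide ID, and that everything from a '#' to the end of
the line is a comment.

```python
from typing import List, NamedTuple, Tuple


class FileEntry(NamedTuple):
    """One non-comment row of the filetable (hypothetical structure)."""
    filename: str
    direction: str  # 'f' (forward) or 'r' (reverse)
    slide_id: str


def parse_filetable(text: str) -> List[FileEntry]:
    """Parse filetable text into entries, skipping comments and blank rows."""
    entries = []
    for raw in text.splitlines():
        # Assumption: '#' starts a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected three columns, got: {raw!r}")
        # Assumption: the last two fields are direction and slide ID;
        # any leading fields are joined back into the filename.
        *name_parts, direction, slide_id = fields
        if direction not in ("f", "r"):
            raise ValueError(f"labeling direction must be f or r: {raw!r}")
        entries.append(FileEntry(" ".join(name_parts), direction, slide_id))
    return entries


def orient(x: float, y: float, direction: str) -> Tuple[float, float]:
    """Swap x and y for reverse-labeled data so that x->y and y->x."""
    return (y, x) if direction == "r" else (x, y)
```

Under these assumptions, the example filetable above would parse into four
entries, the last two with direction 'r', and orient() would swap the
intensity pair for every measurement read from those two files.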
OUTLIER REJECTION:
By default, replicate measurements for each gene are filtered
to reject outliers according to Dixon's test. Outlier rejection is performed
separately for the x replicates and y replicates, and is performed only if 3 or
more replicates are available. Intensity pairs in which either x or y is an
outlier are flagged with the symbol 'O' in the output file. Outlier rejection
may be disabled with '-filter off' (see COMMAND LINE OPTIONS below).
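As a rough illustration of the filtering step, the Python sketch below applies
one common form of Dixon's test (the Q test on the most extreme value) to the
x and y replicates separately. It is a sketch under stated assumptions, not the
script's implementation: this README does not say which variant of Dixon's
test, which significance level, or which intensity scale (raw or log) is used.
The critical values are the commonly tabulated two-sided 95% values for 3 to 10
replicates, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: commonly tabulated two-sided 95% critical values for Dixon's Q.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}


def dixon_outlier_index(values: List[float]) -> Optional[int]:
    """Return the index of the single rejected value, or None."""
    n = len(values)
    # The test is performed only if 3 or more replicates are available;
    # sizes beyond the table are left untested in this sketch.
    if n < 3 or n not in Q_CRIT_95:
        return None
    order = sorted(range(n), key=lambda i: values[i])
    lowest, highest = values[order[0]], values[order[-1]]
    spread = highest - lowest
    if spread == 0:
        return None
    # Q = gap between the suspect value and its nearest neighbour / range.
    q_low = (values[order[1]] - lowest) / spread
    q_high = (highest - values[order[-2]]) / spread
    if max(q_low, q_high) <= Q_CRIT_95[n]:
        return None
    return order[0] if q_low > q_high else order[-1]


def outlier_flags(pairs: List[Tuple[float, float]]) -> List[str]:
    """Flag a pair with 'O' if either its x or its y value is an outlier."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    rejected = {dixon_outlier_index(xs), dixon_outlier_index(ys)}
    return ["O" if i in rejected else "-" for i in range(len(pairs))]
```

Testing x and y independently means that a pair is flagged as soon as one of
its two intensities is rejected, which matches the flagging rule described
above.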
COMMAND LINE OPTIONS:
-opt <num>            Produce output for error model optimization using VERA.
                      Return only those genes that are represented by at
                      least <num> replicate measurements in the merged data
                      set and that are not associated with any saturated
                      intensity measurements (S flag).
-filter {on,off}      Filter replicate measurements for each gene by
                      performing a statistical test to reject outliers (see
                      above description). The default value is 'on'.
-exclude <gene file>  Do NOT output genes listed in <gene file>. Genes can be
                      specified using either the gene name or description,
                      one gene per row in <gene file>. This option is useful
                      for eliminating spots on the microarray that are no
                      longer used or that represent deprecated genes.
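The selection rules implied by -opt and -exclude can be sketched in Python as
a single predicate. This is only an illustration: the function names and the
per-gene arguments are hypothetical, and the sketch assumes that each replicate
carries one flag string in which the letter 'S' marks a saturated measurement.

```python
from typing import Iterable, List, Optional, Set


def load_excluded(lines: Iterable[str]) -> Set[str]:
    """Read a <gene file>: one gene name or description per row."""
    return {line.strip() for line in lines if line.strip()}


def keep_gene(gene: str, descript: str, flags: List[str],
              excluded: Set[str], min_reps: Optional[int] = None) -> bool:
    """Decide whether a gene is written to the merged output.

    flags holds one flag string per replicate (assumption);
    min_reps is the <num> given to -opt, or None if -opt is not used.
    """
    # -exclude: match on either the gene name or its description.
    if gene in excluded or descript in excluded:
        return False
    if min_reps is not None:
        # -opt: require at least <num> replicate measurements ...
        if len(flags) < min_reps:
            return False
        # ... and no saturated intensity measurement (S flag).
        if any("S" in flag for flag in flags):
            return False
    return True
```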
OUTPUT FORMAT EXAMPLE:
Each row summarizes the replicate information for a particular gene.
Column 'N' lists the number of available replicates, while column 'S' lists
the total number of slides these replicates were taken from (column N does not
necessarily equal S). 'RATIO' reports the average log ratio of these
replicates, and 'STD' reports the standard deviation of the log ratio.
Remaining columns list each (x,y) replicate along with that replicate's flags
(columns 'F'). In the example, three replicate measurements per gene were
analyzed.
GENE    DESCRIPT | N S RATIO STD | X0    Y0    F0  X1    Y1    F1  X2    Y2    F2
------- -------- | - - ----- --- | ----- ----- --- ----- ----- --- ----- ----- ---
YCL052C PBN1       4 2 -0.34 0.6   161   2396  -   2931  5322  -   14721 11890 -
YGR148C RPL24B     3 2 -0.36 0.5   161   1254  -   3631  2464  -   10829 17113 O
...
YIR011C STS1       3 2 -0.18 0.2   55    204   YX  685   1797  -   6571  8651  -
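The summary columns can be illustrated with a short Python sketch. Several
details here are assumptions, because this README does not specify them: the
log ratio is taken as log10(x/y), 'STD' is computed as the sample standard
deviation, and every replicate passed in is used regardless of its flags. The
function name summarize is hypothetical, and the sketch is not expected to
reproduce the values in the example table.

```python
import math
import statistics
from typing import List, Tuple


def summarize(pairs: List[Tuple[float, float]],
              slide_ids: List[str]) -> Tuple[int, int, float, float]:
    """Return (N, S, RATIO, STD) for one gene.

    pairs     -- (x, y) intensities, already oriented for labeling direction
    slide_ids -- slide ID of each replicate, in the same order as pairs
    """
    # Assumption: base-10 log of x over y.
    log_ratios = [math.log10(x / y) for x, y in pairs]
    n = len(pairs)                 # number of available replicates
    s = len(set(slide_ids))        # number of distinct slides
    ratio = statistics.mean(log_ratios)
    # Assumption: sample standard deviation, undefined for one replicate.
    std = statistics.stdev(log_ratios) if n > 1 else float("nan")
    return n, s, ratio, std
```

Counting distinct slide IDs is what allows N to differ from S: two replicates
printed on the same slide add two to N but only one to S.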