TMAP 1.1
Dustin Cartwright
dustin.cartwright@gmail.com
http://users.math.yale.edu/~dc597/tmap/

TMAP is a software package for building genetic maps.  It includes both
command-line programs and a graphical user interface written in Java.  Most
operations can be performed using either interface.  TMAP introduces two novel
features.  First, it has a self-contained function for determining (with high
certainty) the phases from phase-unknown pedigree data.  Second, its model of
recombination explicitly accounts for the possibility of genotyping error.
Thus, with TMAP, genetic distances do not tend to expand as the markers become
denser.  The algorithms are described in:

DA Cartwright, M Troggio, R Velasco, A Gutin. Genetic mapping in the presence of
genotyping errors. Genetics, 2007 Aug 176(4): 2521–7.

* BUILDING TMAP FROM SOURCE

If you downloaded the source distribution, after unpacking, you should be able
to build and install TMAP on any Unix-like computer by typing

  ./configure && make && make install

This will build and install the following command-line programs in
/usr/local/bin: phasing tmap pedmerge chisq.  If it detects the appropriate
infrastructure, it will also build the Java interface and install a script
jtmap in /usr/local/jar, tmap.jar in /usr/local/share/tmap, and a library
called libTmapJni.so or similar in /usr/local/lib.  The rest of this section
provides advice on compiling the Java interface if the one-line command above
doesn't work.

Successfully compiling the Java interface can be tricky because it depends on
the interactions between the C compiler and linker, the Java compiler, the Java
runtime, and the operating system.  As a first prerequisite, you need the Java
Development Kit, including the Java compiler (javac) and support for the Java
Native Interface (javah program and jni.h header file).  If you have these
installed, but they are not detected automatically by the configure script, they
can be specified by setting the following environment variables and rerunning
./configure:
  JAVAC - Java compiler
  JAVAH - Java header generator
  JNI_CPPFLAGS - non-standard path to jni.h file, e.g. -I/usr/java/jdk/include

Moreover, the flags required to produce the JNI library depend on the linker
and operating system.  For MinGW (Windows), Mac OS X, Linux, and Solaris, with
their respective standard linkers, the configure script will set the flags
correctly.  On other platforms, if you know the correct flags, these can be
given by editing the variables in the Makefile which begin with JNI_.  Note that
running configure will regenerate the Makefile and undo any changes you've made.

Finally, on a 64-bit computer, it is necessary that the JNI library be 64-bit if
the Java runtime is 64-bit and 32-bit if the Java runtime is 32-bit.  You may be
able to work around a mismatch by passing the -d32 or -d64 flag to the Java
runtime.

* HOW TO USE TMAP

TMAP builds a genetic map in the following 5 steps:

  phase-unknown pedigree data
              |
              | 1.  Phasing (optional: for phase-unknown data)
              V
  phase-known pedigree data
              |
              | 2.  Merge data sets (optional: if there are multiple pedigrees)
              V
  phase-known pedigree data
              |
              | 3.  Find linkage groups
              V
  phase-known pedigree data and unordered list of markers
              |
              | 4.  Build initial maps
              V
  phase-known pedigree data and ordered list of markers
              |
              | 5.  Improve marker order
              V
  phase-known pedigree data and ordered list of markers

On the command line, steps 1 and 2 are performed by the programs phasing and
pedmerge respectively.  The program tmap does steps 4 (-b option) and/or 5 (-i
option).  There is no way to do step 3 on the command line.

The programs in the Java interface (Phasing, Grouping, Builder, BuilderSplit,
and MapViewer) can perform all of the steps except 2, and differ only in the
step at which they start.  Phasing starts at step 1 with phase-unknown data.
Grouping starts at step 3 with phase-known data.  Builder and BuilderSplit both
start at step 4, but Builder assumes that male and female recombination rates
are equal whereas BuilderSplit assumes they are independent.  MapViewer presents
the distances and error rates for a given map, and gives the user the chance to
remove markers, or to perform step 5.  After performing each step, these
programs allow the user to save the intermediate results, immediately go on to
the next step, or both.

TMAP was designed to be compatible with CarthaGene.  The output of steps 1 or 2
can be directly read by CarthaGene, which has no built-in equivalent of step 1,
but can do steps 2-5.

* THE JAVA INTERFACES

The Java distribution contains each entry point (Phasing, Grouping, Builder,
BuilderSplit, MapViewer) as a single JAR file, which can be run directly, e.g.
by double clicking it.  No analysis will be possible without a platform-specific
library, which is called TmapJni.dll on Windows, libTmapJni.jnilib on Mac OS X,
and libTmapJni.so on Linux and Solaris.  This TmapJni file needs to be in the
same directory as the JAR files in order for any of these programs to work.

Note that with a command-line installation, it is also possible to launch these
Java programs with the script jtmap, which takes as arguments an entry point
(Phasing, Grouping, etc.), optionally followed by a pedigree file and a linkage
group file.

The Grouping window shows the linkage groups not just at a single LOD
threshold, but as they vary over a range of thresholds.  On the right, it
displays the groups at the maximum threshold (most restrictive).  To the left of
this, the lines indicate how these groups merge as the LOD threshold gradually
decreases to the minimum threshold (least restrictive).  You can select
individual markers by clicking on their names, or whole linkage groups (at any
level) by clicking to the left of them.  The selected markers are displayed in
red.  Once you have selected the markers you want in a particular linkage group,
you can use them to build an initial map or save them to a file for later use
by the Builder or BuilderSplit applications.

The MapViewer display shows distances and error rates for a given map.  The
marker names are listed on the left followed by the distances and the errors.
If the "Split" option is chosen (see below), then both the distances and errors
are repeated, first for the maternal distance/error rate, then for the paternal
distance/error rate.  In addition to displaying this information, any marker or
set of markers can be removed from the map by unchecking the box to the left of
the marker name, showing the effect of that marker on the map distances.  

* COMMAND-LINE PROGRAMS

The core of the command-line interface consists of the programs phasing,
pedmerge, and tmap.  The other, auxiliary programs may also be of use.

chisq    - Computes the Chi-square statistic of the distribution of the
           segregation for each marker.  Note that this is just the statistic,
           not the significance level.
compareMarkers - Compares the genotypes for a single linkage group in two
           different data sets.  In particular, this can be used to see if the
           markers have been given the same phases.
pedmerge - Merges multiple phase-known pedigrees with common markers into a
           single data set.  By using only a single input pedigree, it can also
           be used to convert a data set into "f2 intercross" format.
phasing  - Finds the (approximate) most likely marker phases for all markers in
           a single pedigree.
project  - Projects markers onto their maternal and paternal gametes in a
           pseudo-testcross.  Also doubles each marker, once in each phase.
           This is useful for the pseudo-testcross method of phase
           determination.
quasiLinkage - Performs "quasi-linkage," which is a linkage-style analysis,
           treating a single marker as the phenotype and trying it in different
           positions along the linkage group.
swapPhases - Takes a list of markers and changes them to the opposite phases,
           both maternal and paternal.
tmap     - Finds a genetic map for a given set of markers.  Can perform the
           build step, the improve step, both, or neither.
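
As a rough illustration of the kind of statistic that chisq reports, the
following Python sketch computes a Pearson chi-square statistic from observed
genotype counts and an expected segregation ratio.  It is not TMAP's code: the
function name and input layout are hypothetical, and it assumes that missing
genotypes have already been excluded from the counts.

```python
def segregation_chisq(observed, expected_ratio):
    """Pearson chi-square statistic for one marker (illustrative sketch).

    observed       -- dict mapping genotype class to observed count
    expected_ratio -- dict mapping the same classes to expected proportions
                      (they need not sum to 1; they are normalised here)
    Missing genotypes are assumed to have been dropped before counting.
    """
    total = sum(observed.values())
    ratio_sum = sum(expected_ratio.values())
    stat = 0.0
    for genotype, weight in expected_ratio.items():
        expected = total * weight / ratio_sum
        diff = observed.get(genotype, 0) - expected
        stat += diff * diff / expected
    return stat


# An <abxab> marker is expected to segregate 1:2:1.
counts = {"aa": 20, "ab": 50, "bb": 30}
print(segregation_chisq(counts, {"aa": 1, "ab": 2, "bb": 1}))
```

As with chisq itself, the result is only the statistic; turning it into a
significance level requires the chi-square distribution with the appropriate
degrees of freedom.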

* SPLITTING (separate maternal and paternal recombination rates)

For building and improving maps (steps 4 and 5 above), treating maternal and
paternal recombination rates separately is referred to as "splitting."  In the
command-line program tmap, it is enabled with the -s option and in the Java
programs, there are options to "Build" and "Build Split", as well as a "Split"
checkbox in the MapViewer display.  Note that splitting also causes the error
rates to be treated separately.

Even if there is no difference between the recombination rates in the two
parents, looking at the split distances may reveal problems in the genotype
data.

If you do use the "split" option, be aware that the order of adjacent markers
which are only informative for different parents is completely undefined.  In
other words, if you have an abxaa and an aaxab marker next to each other, then
the order is completely undefined.  If you didn't split them, then the order
would be determined by the distances to the surrounding linking markers.  For
example, suppose that without splitting you had 4 markers:
  M1 (abxcd) 0.0
  M2 (abxaa) 5.0
  M3 (aaxab) 10.0
  M4 (abxcd) 15.0
If you split the parental distances, then you might get the same order, or you
might get the order:
  M1 (abxcd) 0.0  0.0
  M3 (aaxab) 2.5  10.0
  M2 (abxaa) 5.0  12.5
  M4 (abxcd) 15.0 15.0

* GENOTYPE FILE

There are three possible formats for the genotype file: f2 backcross, f2
intercross, and outbred.  The file should begin with "data type " followed by
one of these three types.  The next line should consist of the number of progeny
followed by the number of markers.  The rest of the line is ignored for
compatibility with CarthaGene and MapMaker.  Then come the markers, one per
line.  Each marker line begins with the marker name, optionally preceded by an
asterisk, which, if present, is ignored.  Then come the genotypes in a format
which depends on the file type.
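
The layout described above can be sketched in Python as follows.  This is only
an illustration of the header and marker-line structure, not the parser TMAP
uses; the function name is hypothetical, and skipping blank lines is an
assumption.

```python
def read_genotype_file(lines):
    """Split a genotype file into (data_type, n_progeny, n_markers, markers).

    `lines` is any iterable of text lines.  Each entry of `markers` is a
    (name, rest_of_line) pair; decoding `rest_of_line` depends on the data
    type and is left to the caller.
    """
    rows = [line.strip() for line in lines if line.strip()]
    prefix = "data type "
    if len(rows) < 2 or not rows[0].startswith(prefix):
        raise ValueError('file must begin with "data type " and a count line')
    data_type = rows[0][len(prefix):].strip()
    if data_type not in ("f2 backcross", "f2 intercross", "outbred"):
        raise ValueError("unknown data type: " + data_type)

    # Only the first two fields of the second line are used; the rest of
    # the line is ignored.
    counts = rows[1].split()
    n_progeny, n_markers = int(counts[0]), int(counts[1])

    markers = []
    for row in rows[2:]:
        parts = row.split(None, 1)
        name = parts[0].lstrip("*")  # a leading asterisk is ignored
        rest = parts[1] if len(parts) > 1 else ""
        markers.append((name, rest))
    return data_type, n_progeny, n_markers, markers


# A made-up two-marker backcross file, for illustration only.
example = """data type f2 backcross
4 2 0 0 0
*M1 AHH-
M2 A H H A
"""
print(read_genotype_file(example.splitlines()))
```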

The backcross format is appropriate for backcross, testcross, and similar
pedigrees.  The genotypes are indicated by a single character, either A, H, or -
indicating homozygous, heterozygous, or unknown in a backcross.  For testcross
and similar pedigrees, one genotype should be arbitrarily assigned to H and one
to A.  The genotypes can be separated by spaces or tabs, but need not be.

The intercross format is appropriate for intercross pedigrees, but can also be
used for general outbred pedigrees.  With intercross data, the genotypes should
be coded as follows:
    A   homozygous A
    H   heterozygous
    B   homozygous B
    C   either homozygous B or heterozygous
    D   either homozygous A or heterozygous
    -   unknown
Furthermore, for outbred pedigrees, the genotypes can be coded using the
characters 1-9 and a-f, as with CarthaGene.  See CarthaGene documentation for an
explanation of these.
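
The letter codes in the table above can be read as sets of possible genotypes.
The Python sketch below makes that reading explicit.  The labels "AA", "AB",
and "BB" and the function name are chosen here for illustration, the numeric
CarthaGene-style codes are not handled, and treating spaces and tabs as
optional separators is assumed to carry over from the backcross format.

```python
# Possible true genotypes for each intercross code, following the table.
INTERCROSS_CODES = {
    "A": {"AA"},
    "H": {"AB"},
    "B": {"BB"},
    "C": {"BB", "AB"},
    "D": {"AA", "AB"},
    "-": {"AA", "AB", "BB"},
}


def decode_intercross(genotypes):
    """Decode a string of intercross codes; spaces and tabs are skipped."""
    return [INTERCROSS_CODES[code] for code in genotypes if code not in " \t"]


print(decode_intercross("AH C-"))
```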

The backcross and intercross formats are intended to be generally compatible
with CarthaGene.  The main incompatibility is that there is no support for
alias assignments on the second line.

In the outbred format, each marker name is first followed by a description of
the segregation type followed by the individuals' genotypes, which must be
separated by spaces or tabs.  The possible genotypes depend on the segregation
type, but in all cases both - and -- mean unknown.  The segregation type can be
specified in three ways: as a cross, a long cross, or explicitly.

A cross segregation type specifies the alleles in the parents enclosed in angle
brackets, such as <abxab>.  Each allele must be a single character, so there
should be exactly 5 characters between the brackets.  With this segregation type
the genotypes can be coded as aa, ab, ba, and bb, where ab and ba mean the same
thing.  Any characters can be used, so <ATxAT> works just as well.

Furthermore, there are two special characters which can be used in the genotypes
of the offspring: ? and -.  A ? means that the other allele is unknown.  So,
in a <abxab> marker, a? means either aa or ab.  A - character means either
homozygous or the null allele (represented by 0).  In a <abxa0> marker, a- means
either aa or a0, but not ab, and a? means aa, a0 or ab, i.e.  everything except
b0.  However, if there is no null allele, then - means the same thing as ?, so
in a <abxab> marker, a- means aa or ab.  In order to be handled properly, these
characters must be the second alleles, meaning you should not use genotypes like
-a or ?a.
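
The rules for ? and - can be checked with a small Python sketch that lists the
full genotypes consistent with an observed progeny genotype.  The function
names are hypothetical and this is one reading of the rules above, not TMAP's
implementation; genotypes are written with the null allele last, so a0 rather
than 0a.

```python
def _norm(a, b):
    """Order two alleles so that ab and ba compare equal; 0 goes last."""
    return "".join(sorted((a, b), key=lambda c: (c == "0", c)))


def offspring_genotypes(cross):
    """All genotypes possible in the progeny of a cross such as "<abxa0>"."""
    body = cross.strip("<>")
    if len(body) != 5 or body[2] != "x":
        raise ValueError("expected a cross like <abxab>")
    first, second = body[:2], body[3:]
    # One allele comes from each parent.
    return {_norm(p, q) for p in first for q in second}


def expand(cross, observed):
    """Set of full genotypes consistent with an observed progeny genotype."""
    possible = offspring_genotypes(cross)
    if observed in ("-", "--"):
        return possible  # completely unknown
    allele, other = observed[0], observed[1]
    if other == "?" or (other == "-" and "0" not in cross):
        # Second allele unknown: any genotype containing the first allele.
        return {g for g in possible if allele in g}
    if other == "-":
        # Null allele in the cross: homozygous, or paired with the null.
        return {g for g in possible if g in (allele + allele, allele + "0")}
    return {_norm(allele, other)} & possible


print(sorted(expand("<abxa0>", "a-")))
print(sorted(expand("<abxa0>", "a?")))
print(sorted(expand("<abxab>", "a-")))
```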

A long cross segregation type is similar to a cross, except that the alleles
need not be single characters and it is enclosed in double angle brackets.  Each
genotype consists of two alleles separated by one of the following separator
characters: '|', '/', ',', ';', or ':'.  The choice of separator character must
be consistent within each line, but not necessarily within each file.  The
alleles ? and - in the progeny and 0 in the cross have the same meaning as
above.  For example, the following two lines are equivalent:

MARKER <abxab> aa ab bb a- b- -- --
MARKER <<100:120x100:120>> 100:100 100:120 120:120 100:- 120:- -:- -
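
A marker line in the long cross form could be taken apart as in the Python
sketch below.  The function name is hypothetical, and splitting the cross on
"x" assumes that no allele name itself contains an "x"; how TMAP resolves that
case is not covered here.

```python
SEPARATORS = "|/,;:"


def parse_long_cross_line(line):
    """Parse one marker line whose segregation type is a long cross.

    Returns (name, parents, genotypes).  Each parent and each genotype is a
    pair of allele names; a bare "-" or "--" is returned as ("-", "-").
    """
    name, seg_type, *fields = line.split()
    if not (seg_type.startswith("<<") and seg_type.endswith(">>")):
        raise ValueError("not a long cross: " + seg_type)
    body = seg_type[2:-2]
    # The separator is whichever allowed character occurs in the cross; it
    # must be the same throughout the line.
    used = [s for s in SEPARATORS if s in body]
    if len(used) != 1:
        raise ValueError("expected exactly one separator character")
    sep = used[0]
    parents = tuple(tuple(p.split(sep)) for p in body.split("x"))
    genotypes = [
        ("-", "-") if f in ("-", "--") else tuple(f.split(sep))
        for f in fields
    ]
    return name, parents, genotypes


line = "MARKER <<100:120x100:120>> 100:100 100:120 120:120 100:- 120:- -:- -"
print(parse_long_cross_line(line))
```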

The final way to specify the segregation type is as an explicit list of possible
genotypes, separated by commas and surrounded by brackets.  We can arbitrarily
call one of the father's chromosomes F1 and the other F2 and call the mother's
chromosomes M1 and M2.  At each locus, each progeny has one chromosome from each
parent, for a total of 4 possibilities: F1M1, F1M2, F2M1, F2M2.  The explicit
segregation type should list the genotype corresponding to these 4 inheritance
types in the same order.  The genotypes can be single or multiple characters.
For example, the following segregation types are almost equivalent:
Cross type              Explicit type
<abxab>                 [aa,ab,ab,bb]
<abxaa>                 [aa,ab,aa,ab]
<aaxab>                 [aa,aa,ab,ab]
<a0xa0>                 [a-,a-,a-,00]
<abxac>                 [aa,ab,ac,bc]
The difference is that the explicit type does not allow any genotypes in the
progeny other than the ones listed.
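
The correspondence in the table can be reproduced with a short Python sketch.
It assumes that the first parent in the cross supplies the M chromosomes and
the second supplies the F chromosomes; this is inferred from the <abxaa> and
<aaxab> rows rather than stated explicitly.  It also assumes that, when the
cross contains a null allele, aa and a0 are both written a-.  The function
name is hypothetical.

```python
def cross_to_explicit(cross):
    """Explicit genotype list (F1M1, F1M2, F2M1, F2M2) for a cross type."""
    body = cross.strip("<>")
    if len(body) != 5 or body[2] != "x":
        raise ValueError("expected a cross like <abxab>")
    first, second = body[:2], body[3:]
    has_null = "0" in body

    def observed(x, y):
        # Sort the alleles so that ab and ba agree; the null allele goes last.
        a, b = sorted((x, y), key=lambda c: (c == "0", c))
        if has_null and a != "0" and b in (a, "0"):
            return a + "-"  # aa and a0 cannot be told apart
        return a + b

    # Assumption: `second` holds the father's alleles, `first` the mother's.
    return [observed(f, m) for f in second for m in first]


for cross in ("<abxab>", "<abxaa>", "<aaxab>", "<a0xa0>", "<abxac>"):
    print(cross, cross_to_explicit(cross))
```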

The explicit listing is less compact and does not allow partial information.
For example, in an abxab codominant marker, if one individual is only partially
typed, it is possible to enter that individual as a?, even though the rest are
aa, ab, or bb.  However, nothing equivalent is possible with an explicit type.
On the other hand, the advantage of an explicit segregation type is that the
genotypes can be anything.  For example a dominant marker could be coded as
[p,p,p,a] where p and a represent presence and absence of some trait.  Also an
abxab cross which segregates 3:1 can be recoded as an explicit marker by
changing the segregation type to [aa,ab,ab,ab] without changing the genotypes of
the offspring.

The following example illustrates the various ways in which a marker which is
heterozygous in both parents can be coded in the outbred pedigree file:
Segregation type        Offspring genotypes
<abxab>                 a-, bb
<abxab>                 a?, bb
<a0xa0>                 a-, 00
<a0xa0>                 a?, 00
<C0xC0>                 C-, 00
[ab,ab,ab,bb]           ab, bb
[a-,a-,a-,bb]           a-, bb
[D,D,D,R]               D,  R