TMAP 1.1
Dustin Cartwright
dustin.cartwright@gmail.com
http://users.math.yale.edu/~dc597/tmap/

TMAP is a software package for building genetic maps. It includes both command-line programs and a graphical user interface written in Java. Most operations can be performed using either interface.

TMAP introduces two novel features. First, it has a self-contained function for determining (with high certainty) the phases from phase-unknown pedigree data. Second, its model of recombination explicitly accounts for the possibility of genotyping error. Thus, with TMAP, genetic distances do not tend to expand as the markers become more dense.

The algorithms are described in:

DA Cartwright, M Troggio, R Velasco, A Gutin. Genetic mapping in the presence of genotyping errors. Genetics, 2007 Aug 176(4): 2521–7.

* BUILDING TMAP FROM SOURCE

If you downloaded the source distribution, after unpacking, you should be able to build and install TMAP on any Unix-like computer by typing

    ./configure && make && make install

This will build and install the following command-line programs in /usr/local/bin: phasing, tmap, pedmerge, and chisq. If it detects the appropriate infrastructure, it will also build the Java interface and install a script jtmap in /usr/local/jar, tmap.jar in /usr/local/share/tmap, and a library called libTmapJni.so or similar in /usr/local/lib.

The rest of this section provides advice on compiling the Java interface if the one-line command above doesn't work. Successfully compiling the Java interface can be tricky because it depends on the interactions between the C compiler and linker, the Java compiler, the Java runtime, and the operating system.

As a first prerequisite, you need the Java Development Kit, including the Java compiler (javac) and support for the Java Native Interface (javah program and jni.h header file).
If you have these installed, but they are not detected automatically by the configure script, they can be specified by setting the following environment variables and rerunning ./configure:

    JAVAC - Java compiler
    JAVAH - Java header generator
    JNI_CPPFLAGS - non-standard path to jni.h file, e.g. -I/usr/java/jdk/include

Moreover, the flags required to produce the JNI library depend on the linker and operating system. For MinGW (Windows), Mac OS X, Linux, and Solaris, with their respective standard linkers, the configure script will set the flags correctly. On other platforms, if you know the correct flags, these can be given by editing the variables in the Makefile which begin with JNI_. Note that running configure will regenerate the Makefile and undo any changes you've made.

Finally, on a 64-bit computer, the JNI library must be 64-bit if the Java runtime is 64-bit and 32-bit if the Java runtime is 32-bit. You may be able to work around a mismatch by passing the -d32 or -d64 flag to the Java runtime.

* HOW TO USE TMAP

TMAP builds a genetic map in the following 5 steps:

    phase-unknown pedigree data
        |
        |  1. Phasing (optional: for phase-unknown data)
        V
    phase-known pedigree data
        |
        |  2. Merge data sets (optional: if there are multiple pedigrees)
        V
    phase-known pedigree data
        |
        |  3. Find linkage groups
        V
    phase-known pedigree data and unordered list of markers
        |
        |  4. Build initial maps
        V
    phase-known pedigree data and ordered list of markers
        |
        |  5. Improve marker order
        V
    phase-known pedigree data and ordered list of markers

On the command line, steps 1 and 2 are performed by the programs phasing and pedmerge respectively. The program tmap does steps 4 (-b option) and/or 5 (-i option). There is no way to do step 3 on the command line.

The programs in the Java interface (Phasing, Grouping, Builder, BuilderSplit, and MapViewer) can perform all of the steps except 2, and differ only in the step at which they start. Phasing starts at step 1 with phase-unknown data.
Grouping starts at step 3 with phase-known data. Builder and BuilderSplit both start at step 4, but Builder assumes that male and female recombination rates are equal whereas BuilderSplit assumes they are independent. MapViewer presents the distances and error rates for a given map, and gives the user the chance to remove markers or to perform step 5. After performing each step, these programs allow the user to save the intermediate results, immediately go on to the next step, or both.

TMAP was designed to be compatible with CarthaGene. The output of steps 1 or 2 can be directly read by CarthaGene, which has no built-in equivalent of step 1, but can do steps 2-5.

* THE JAVA INTERFACES

The Java distribution contains each entry point (Phasing, Grouping, Builder, BuilderSplit, MapViewer) as a single JAR file, which can be run directly, e.g. by double-clicking it. No analysis will be possible without a platform-specific library, which is called TmapJni.dll on Windows, libTmapJni.jnilib on Mac OS X, and libTmapJni.so on Linux and Solaris. This TmapJni file needs to be in the same directory as the JAR files in order for any of these programs to work.

Note that with a command-line installation, it is also possible to launch these Java programs with the script jtmap, which takes as arguments an entry point (Phasing, Grouping, etc.), optionally followed by a pedigree file and a linkage group file.

The Grouping window allows you to look at the linkage groups not just at a single LOD threshold, but also at how they vary over a range of thresholds. On the right, it displays the groups at the maximum threshold (most restrictive). To the left of this, the lines indicate how these groups merge as the LOD threshold gradually decreases to the minimum threshold (least restrictive). You can select individual markers by clicking on their names, or whole linkage groups (at any level) by clicking to the left of them. The selected markers are displayed in red.
Once you have selected the markers you want in a particular linkage group, you can use them to build an initial map or save them to a file for later use by the Builder or BuilderSplit applications.

The MapViewer display shows distances and error rates for a given map. The marker names are listed on the left, followed by the distances and the errors. If the "Split" option is chosen (see below), then both the distances and errors are repeated, first for the maternal distance/error rate, then for the paternal distance/error rate. In addition, any marker or set of markers can be removed from the map by unchecking the box to the left of the marker name, which shows the effect of those markers on the map distances.

* COMMAND-LINE PROGRAMS

The core of the command-line interface consists of the programs phasing, pedmerge, and tmap. The other, auxiliary programs may also be of use.

chisq - Computes the Chi-square statistic of the distribution of the segregation for each marker. Note that this is just the statistic, not the significance level.
compareMarkers - Compares the genotypes for a single linkage group in two different data sets. In particular, this can be used to see if the markers have been given the same phases.
pedmerge - Merges multiple phase-known pedigrees with common markers into a single data set. By using only a single input pedigree, it can also be used to convert a data set into "f2 intercross" format.
phasing - Finds the (approximate) most likely marker phases for all markers in a single pedigree.
project - Projects markers onto their maternal and paternal gametes in a pseudo-testcross. Also doubles each marker, once in each phase. This is useful for the pseudo-testcross method of phase determination.
quasiLinkage - Performs "quasi-linkage," which is a linkage-style analysis, treating a single marker as the phenotype and trying it in different positions along the linkage group.
swapPhases - Takes a list of markers and changes them to the opposite phases, both maternal and paternal.
tmap - Finds a genetic map for a given set of markers. Can perform the build step, the improve step, both, or neither.

* SPLITTING (separate maternal and paternal recombination rates)

For building and improving maps (steps 4 and 5 above), treating maternal and paternal recombination rates separately is referred to as "splitting." In the command-line program tmap, it is enabled with the -s option. In the Java programs, there are options to "Build" and "Build Split", as well as a "Split" checkbox in the MapViewer display. Note that splitting also treats the error rates separately. Even if there is no difference between the recombination rates in the two parents, looking at the split distances may reveal problems in the genotype data.

If you do use the "split" option, be aware that the order of adjacent markers which are only informative for different parents is completely undefined. In other words, if you have an abxaa and an aaxab marker next to each other, then their relative order is undefined. If you didn't split them, then the order would be determined by the distances to the surrounding linking markers. So, for example, suppose you had 4 markers:

    M1 (abxcd)   0.0
    M2 (abxaa)   5.0
    M3 (aaxab)  10.0
    M4 (abxcd)  15.0

If you split the parental distances, then you might get the same order, or you might get the order:

    M1 (abxcd)   0.0   0.0
    M3 (aaxab)   2.5  10.0
    M2 (abxaa)   5.0  12.5
    M4 (abxcd)  15.0  15.0

* GENOTYPE FILE

There are three possible formats for the genotype file: f2 backcross, f2 intercross, and outbred. The file should begin with "data type " followed by one of these three types. The next line should consist of the number of progeny followed by the number of markers. The rest of the line is ignored for compatibility with CarthaGene and MapMaker. Then come the markers, one per line.
Each marker line begins with the marker name, optionally preceded by an asterisk, which, if present, is ignored. Then come the genotypes in a format which depends on the file type.

The backcross format is appropriate for backcross, testcross, and similar pedigrees. The genotypes are indicated by a single character, either A, H, or - indicating homozygous, heterozygous, or unknown in a backcross. For testcross and similar pedigrees, one genotype should be arbitrarily assigned to H and one to A. The genotypes can be separated by spaces or tabs, but need not be.

The intercross format is appropriate for intercross pedigrees, but can also be used for general outbred pedigrees. With intercross data, the genotypes should be coded as follows:

    A  homozygous A
    H  heterozygous
    B  homozygous B
    C  either homozygous B or heterozygous
    D  either homozygous A or heterozygous
    -  unknown

Furthermore, for outbred pedigrees, the genotypes can be coded using the characters 1-9 and a-f, as with CarthaGene. See CarthaGene documentation for an explanation of these.

The backcross and intercross formats are intended to be generally compatible with CarthaGene. The main incompatibility is that there is no support for alias assignments on the second line.

In the outbred format, each marker name is first followed by a description of the segregation type followed by the individuals' genotypes, which must be separated by spaces or tabs. The possible genotypes depend on the segregation type, but in all cases both - and -- mean unknown. The segregation type can be specified in three ways: as a cross, a long cross, or explicitly.

A cross segregation type specifies the alleles in the parents enclosed in angled brackets, such as . Each allele must be a single character, so there should be exactly 5 characters between the brackets. With this segregation type the genotypes can be coded as aa, ab, ba, and bb, where ab and ba mean the same thing. Any characters can be used, so works just as well.
Furthermore, there are two special characters which can be used in the genotypes of the offspring: ? and -. A ? means that the other allele is unknown. So, in a marker, a? means either aa or ab. A - character means either homozygous or the null allele (represented by 0). In a marker, a- means either aa or a0, but not ab, and a? means aa, a0 or ab, i.e. everything except b0. However, if there is no null allele, then - means the same thing as ?, so in a marker, a- means aa or ab. In order to be handled properly, these characters must appear as the second allele, meaning you should not use genotypes like -a or ?a.

A long cross segregation type is similar to a cross, except that the alleles need not be single characters and it is enclosed in double angle brackets. Each genotype consists of two alleles separated by one of the following separator characters: '|', '/', ',', ';', or ':'. The choice of separator character must be consistent within each line, but not necessarily within each file. The alleles ? and - in the progeny and 0 in the cross have the same meaning as above. For example, the following two lines are equivalent:

    MARKER aa ab bb a- b- -- --
    MARKER <<100:120x100:120>> 100:100 100:120 120:120 100:- 120:- -:- -

The final way to specify the segregation type is as an explicit list of possible genotypes, separated by commas and surrounded by brackets. We can arbitrarily call one of the father's chromosomes F1 and the other F2, and call the mother's chromosomes M1 and M2. At each locus, each progeny has one chromosome from each parent, for a total of 4 possibilities: F1M1, F1M2, F2M1, F2M2. The explicit segregation type should list the genotypes corresponding to these 4 inheritance types in the same order. The genotypes can be single or multiple characters.
For example, the following segregation types are almost equivalent:

    Cross type    Explicit type
                  [aa,ab,ab,bb]
                  [aa,ab,aa,ab]
                  [aa,aa,ab,ab]
                  [a-,a-,a-,00]
                  [aa,ab,ac,bc]

The difference is that the explicit type does not allow any genotypes in the progeny other than the ones listed. The explicit listing is less compact and does not allow partial information. For example, in an abxab codominant marker, if one individual is only partially typed, it is possible to enter that individual as a?, even though the rest are aa, ab, or bb. However, nothing equivalent is possible with an explicit type.

On the other hand, the advantage of an explicit segregation type is that the genotypes can be anything. For example, a dominant marker could be coded as [p,p,p,a], where p and a represent presence and absence of some trait. Also, an abxab cross which segregates 3:1 can be recoded as an explicit marker by changing the segregation type to [aa,ab,ab,ab] without changing the genotypes of the offspring.

The following example illustrates the various ways in which a marker that is heterozygous in both parents can be coded in the outbred pedigree file:

    Segregation type    Offspring genotypes
                        a-, bb
                        a?, bb
                        a-, 00
                        a?, 00
                        C-, 00
    [ab,ab,ab,bb]       ab, aa
    [a-,a-,a-,bb]       a-, aa
    [D,D,D,R]           D, R
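
The rules for the ? and - characters in a cross segregation type can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this section, not TMAP's own code. The function names are hypothetical, the cross is passed as its five characters without the brackets, and abxa0 is an example cross chosen here because its progeny can be aa, a0, ab, or b0.

```python
def normalize(genotype):
    """Order the two alleles so that 'ba' equals 'ab'; the null allele 0 goes last."""
    return "".join(sorted(genotype, key=lambda allele: (allele == "0", allele)))


def offspring_genotypes(cross):
    """Return every genotype the progeny of a five-character cross can have."""
    # Assumption: the cross has exactly 5 characters, two alleles per parent
    # with a separator in the middle, e.g. 'abxa0'.
    first_parent, second_parent = cross[0:2], cross[3:5]
    return {normalize(x + y) for x in first_parent for y in second_parent}


def compatible_genotypes(cross, code):
    """Return the set of true genotypes that an offspring code may stand for."""
    possible = offspring_genotypes(cross)
    if code in ("-", "--"):
        # Both - and -- mean unknown: nothing is ruled out.
        return possible
    known, other = code[0], code[1]
    if other == "?" or (other == "-" and "0" not in cross):
        # ? means the other allele is unknown; without a null allele in the
        # cross, - means the same thing as ?.
        return {g for g in possible if known in g}
    if other == "-":
        # With a null allele, - means homozygous or paired with the null allele.
        return possible & {normalize(known + known), normalize(known + "0")}
    # A fully typed genotype stands only for itself.
    return possible & {normalize(code)}


if __name__ == "__main__":
    for cross, code in [("abxa0", "a-"), ("abxa0", "a?"), ("abxab", "a-")]:
        print(cross, code, sorted(compatible_genotypes(cross, code)))
```

Representing each code as a set of compatible genotypes keeps partially typed individuals (a?, a-) and fully typed ones (aa, ab) in the same form, which is one simple way to think about why the explicit type cannot express partial information.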