==================================================================================
Datasets for:
Mahony, Auron & Benos, "Inferring protein-DNA dependencies using motif
alignments and mutual information"
Bioinformatics 23:i297-i304 (2007)
==================================================================================

EGR dataset (Figure 3 in paper)
-------------------------------
EGR_public.dat:
    The original collection of EGR SELEX and phage-display data
    (Benos et al., J Mol Biol (2002) 323: 701-727).

EGR_3znfs.prot & EGR_3znfs.motifs:
    The protein sequences (the 3 zinc-finger regions) and the "motifs"
    (really just pseudo-PSSMs for the bound DNA sequence) extracted from
    the EGR_public.dat file.

EGR_1znf.prot & EGR_1znf.motifs:
    As above, except that the 3 zinc fingers are separated and each has its
    own entry. Likewise, the 4 DNA bases bound by each separate zinc finger
    are entered separately. Note that this data was used for the final paper.

Homeo-domain dataset (Figures 4 & 5 in paper)
---------------------------------------------
HOMEO_subset.pfam:
    PFam alignment of a selection of homeo-domain transcription factors
    (homeo-domain regions only).

HOMEO_nodimer_ATTA_containing.pfam & HOMEO_nodimer_ATTA_containing.motifs:
    These files contain the protein sequences and binding motifs
    (JASPAR & TRANSFAC) for 25 non-dimer homeo-domain TFs whose binding
    preference contains an ATTA-like pattern. There are 25 DNA-binding
    motifs but more protein sequences, since both the mouse and the human
    sequence are sometimes included (in any case, they are usually
    identical). The protein sequences have been trimmed down to include
    only the recognition helices. The names of the protein sequences
    CONTAIN the name of the motif, but some text mining is necessary to
    match up the pairs.

Basic-region dataset (Figures 7 & 8 in paper)
---------------------------------------------
basics.pfam:
    PFam alignment of a selection of basic-region transcription factors
    (basic regions only).

basics_short.pfam:
    Trimmed down version of the above file (trimmed to region of the basic
    helices that are in proximity to the DNA during binding).

bHLH_bHLH-ZIP.motifs:
    A selection of 24 bHLH and bHLH-ZIP DNA binding motifs
    (JASPAR & TRANSFAC) used for the example in the paper. All motifs have
    a CACGTG or CAGGTG binding preference. As with the homeo-domain example
    above, the PFam sequence files contain more sequences than there are
    DNA motifs, so some text mining is necessary to pair up the sequences
    and motifs.
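
Pairing sequences with motifs (illustrative sketch)
---------------------------------------------------
The "text mining" mentioned for the homeo-domain and basic-region files
could be approached along the following lines. This is only a minimal
sketch, not the procedure used for the paper. It assumes the sequence names
and motif names have already been read into Python lists (parsing the
.pfam and .motifs files is not shown, since their exact layout is not
described here). The function name pair_sequences_to_motifs and all names
in the example are invented for illustration and are not taken from the
dataset files.

```python
import re
from collections import defaultdict


def pair_sequences_to_motifs(seq_names, motif_names):
    """Group sequence names under the motif whose name they contain.

    Matching is case-insensitive and requires the motif name to be bounded
    by non-alphanumeric characters (or the ends of the string), so that a
    short motif name does not match inside a longer identifier.
    Returns (pairs, unmatched): a dict {motif: [sequence names]} and a list
    of sequence names that matched no motif.
    """
    # Try longer motif names first, so a name like "Nkx2-5" is preferred
    # over a shorter name such as "Nkx2" that it happens to start with.
    patterns = [
        (motif,
         re.compile(r"(?<![A-Za-z0-9])" + re.escape(motif) + r"(?![A-Za-z0-9])",
                    re.IGNORECASE))
        for motif in sorted(motif_names, key=len, reverse=True)
    ]

    pairs = defaultdict(list)
    unmatched = []
    for seq in seq_names:
        for motif, pattern in patterns:
            if pattern.search(seq):
                pairs[motif].append(seq)
                break  # assign each sequence to at most one motif
        else:
            unmatched.append(seq)  # left for manual inspection
    return dict(pairs), unmatched


# Hypothetical names, invented for this example only.
example_seqs = ["En1_MOUSE", "En1_HUMAN", "Nkx2-5_MOUSE", "Unknown_RAT"]
example_motifs = ["En1", "Nkx2", "Nkx2-5"]
example_pairs, example_unmatched = pair_sequences_to_motifs(
    example_seqs, example_motifs)
```

Sequences that end up in the unmatched list, and motifs that receive no
sequence at all, would still need to be checked by hand. Since the mouse
and human entries are usually identical, keeping one sequence per motif
afterwards is a reasonable option.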