BioPerl Locations
This is a spec page: an open discussion of how locations are defined in BioPerl, drawn mainly from a recent thread on the mailing list. It may eventually be incorporated into a HOWTO or evolve into one on its own.
Any comments or changes presented here will not be incorporated into the stable (1.6) release but will possibly appear in the next developer release cycle (1.7).
Some examples are 'borrowed' from the BioRuby Locations docs.
Opinions, comments, and boos/hisses appreciated (?!?).
Locations in BioPerl
It is difficult to discuss what we mean by 'locations' without also briefly mentioning sequence features. Currently, BioPerl splits any generic features in a sequence record into two general groups:
- Information pertaining to the sequence as a whole is stored as annotation (thus implementing Bio::AnnotationI)
- Information that describes a particular region or regions ('locations') of a sequence is stored as sequence features (thus implementing Bio::SeqFeatureI).
A group of classes (Bio::Location) was devised in order to describe locations in a generic way.
How location data is handled in BioPerl
Any Bio::LocationI, at its core, is also a Bio::RangeI. In the generic sense, this means that the class must define the following methods:
TODO: Make table
- start
- end
- strand
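The three-method contract above can be illustrated with a small standalone sketch. This is Python rather than Perl and is not the BioPerl API: the `Range` class and its `length` and `overlaps` helpers are hypothetical names, and the sketch assumes 1-based inclusive coordinates with strand encoded as 1, -1, or 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Hypothetical stand-in for the start/end/strand contract."""
    start: int   # assumed 1-based, inclusive
    end: int     # assumed 1-based, inclusive
    strand: int  # 1 (forward), -1 (reverse), 0 (unknown or not applicable)

    def length(self) -> int:
        # Inclusive coordinates, so both endpoints are counted.
        return self.end - self.start + 1

    def overlaps(self, other: "Range") -> bool:
        # Two inclusive ranges overlap when neither ends before the other starts.
        return self.start <= other.end and other.start <= self.end


exon = Range(start=3207, end=4831, strand=-1)
print(exon.length())                         # 1625
print(exon.overlaps(Range(4800, 5000, 1)))   # True
```

Anything that exposes these three values can be compared, merged, or tested for overlap without knowing whether it is a simple, fuzzy, or split location.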
Factory
Bio::Factory::FTLocationFactory
Where does gap fit in here (found in CONTIG files), or do we plan on supporting it?
Coordinates
Definition of start/end coordinates, coordinate policies, etc.
Types of Locations
The following all implement Bio::LocationI.
Simple
Fuzzy
Deprecated fuzzy location types
Split
At this time, all split locations are handled by one class (Bio::Location::Split), which implements methods from Bio::Location::SplitLocationI and Bio::LocationI.
join(location,location, ... location)
The indicated elements should be joined (placed end-to-end) to form one contiguous sequence. Far and away the most common split location type. Use of this operator implies that the locations are meant to be joined together in the order listed.
There are some ambiguities in the way this location type is handled in BioPerl, particularly re: circular sequences where the location is split across the origin.
Examples: Simple subLocations
NCBI and EBI currently use differing but syntactically similar versions of a split location string on the complement strand.
From GenBank : AL137247
...
mRNA complement(join(3207..4831,5834..5902,8881..8969,
9276..9403,29535..29764))
/locus_tag="RP11-298P3.2-001"
/note="match: ESTs: AA890530 AI675036 AI701009 AI916535
AW592106 BE479269 BE645120 BE672552 BM145066 BM996077
R61510
match: cDNAs: AL049787 BC022188 BQ003364 U50527"
...
From EMBL : AL137247
...
FT mRNA join(complement(29535..29764),complement(9276..9403),
FT complement(8881..8969),complement(5834..5902),
FT complement(3207..4831))
FT /locus_tag="RP11-298P3.2-001"
FT /note="match: ESTs: AA890530 AI675036 AI701009 AI916535
FT AW592106 BE479269 BE645120 BE672552 BM145066 BM996077
FT R61510"
FT /note="match: cDNAs: AL049787 BC022188 BQ003364 U50527"
...
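The two forms above describe the same spliced sequence: reverse-complementing a join of ascending parts is equivalent to joining the reverse complements of the parts in reversed order. A minimal Python sketch on a toy sequence shows this; the helper names are hypothetical and the coordinates are assumed to be 1-based and inclusive.

```python
def revcomp(seq: str) -> str:
    """Reverse-complement a plain DNA string (no ambiguity codes)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def slice_1based(seq: str, start: int, end: int) -> str:
    """Extract start..end using 1-based inclusive coordinates."""
    return seq[start - 1:end]


def complement_of_join(seq, parts):
    """GenBank style: join the ascending parts, then reverse-complement."""
    joined = "".join(slice_1based(seq, s, e) for s, e in parts)
    return revcomp(joined)


def join_of_complements(seq, parts):
    """EMBL style: reverse-complement each part, join in the order listed."""
    return "".join(revcomp(slice_1based(seq, s, e)) for s, e in parts)


seq = "AACCGGTTACGTAGCTAGGA"
genbank_parts = [(2, 5), (9, 12), (15, 18)]    # ascending, as in complement(join(...))
embl_parts = list(reversed(genbank_parts))     # descending, as in join(complement(...),...)

assert complement_of_join(seq, genbank_parts) == join_of_complements(seq, embl_parts)
print(complement_of_join(seq, genbank_parts))  # CTAGACGTCGGT
```

A parser that normalizes both styles to one internal representation only needs to reverse the sublocation order when it moves the complement operator inside or outside the join.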
Examples: Remote subLocations
From AF130134 :
...
mRNA join(AF130124.1:<2563..2964,AF130125.1:21..157,
AF130126.1:12..174,AF130127.1:21..112,AF130128.1:21..162,
AF130128.1:281..595,AF130128.1:661..842,
AF130128.1:916..1030,AF130129.1:21..115,
AF130130.1:21..165,AF130131.1:21..125,AF130132.1:21..428,
AF130132.1:492..746,AF130133.1:21..168,
AF130133.1:232..401,AF130133.1:475..906,
AF130133.1:970..1107,AF130133.1:1176..1367,21..>128)
...
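As a rough illustration of what a parser has to recover from a string like the one above, the following Python sketch splits a flat join into sublocations and records the remote accession and the `<` / `>` fuzzy markers. It is not how Bio::Factory::FTLocationFactory works; `SubLocation`, `parse_part`, and `parse_join` are hypothetical names, and nested operators such as complement() are deliberately not handled.

```python
import re
from typing import List, NamedTuple, Optional


class SubLocation(NamedTuple):
    seq_id: Optional[str]  # accession.version for remote parts, None if local
    start: int
    end: int
    start_fuzzy: bool      # True when written as "<start"
    end_fuzzy: bool        # True when written as ">end"


# Optional "ACCESSION.VERSION:" prefix, then [<]start..[>]end
_PART = re.compile(
    r"^(?:(?P<acc>[A-Za-z0-9_]+(?:\.\d+)?):)?"
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)$"
)


def parse_part(text: str) -> SubLocation:
    m = _PART.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported sublocation: {text!r}")
    return SubLocation(
        m.group("acc"),
        int(m.group("start")),
        int(m.group("end")),
        m.group("lt") is not None,
        m.group("gt") is not None,
    )


def parse_join(text: str) -> List[SubLocation]:
    # Feature-table locations wrap across lines, so drop all whitespace first.
    flat = re.sub(r"\s+", "", text)
    if not (flat.startswith("join(") and flat.endswith(")")):
        raise ValueError("expected a flat join(...) string")
    return [parse_part(p) for p in flat[len("join("):-1].split(",")]


parts = parse_join("join(AF130124.1:<2563..2964,AF130125.1:21..157,\n 21..>128)")
for part in parts:
    print(part)
```

Note that the last sublocation has no accession prefix, so it refers to the record being described rather than to a remote sequence.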
NCBI uses EMBL-like location strings (see above) when some of the sublocations are remote and complementary.
From AL137247 :
...
mRNA join(complement(AL353665.13:2815..2862),
complement(AL353665.13:453..500),
complement(AL353665.13:112..273),complement(47990..48111),
complement(44019..45757),complement(40283..40368),
complement(37992..38036),complement(34048..34603))
/locus_tag="RP11-298P3.4-001"
/note="match: ESTs: AA593603 AA737293 AA778631 AI718289
AU137161 AV726785 AW243426 BM055268 BM887924
match: cDNAs: AL049802 AL049802.1 U50529 U50529.1"
...
Example: Split location in circular sequence [1] (S. solfataricus genome)
CDS join(2991448..2992245,1..252)
/locus_tag="SSO12256"
/note="Similar to cellulose synthase homologs ydaM and
icaA, probably involved in cell wall biogenesis or
intercellular adhesion; Cell Envelope, Surface
polysaccharides and lipopolysaccharides"
/codon_start=1
/transl_table=11
/product="Glycosyltransferase, putative"
/protein_id="NP_341577.1"
/db_xref="GI:15896972"
/db_xref="GeneID:1455270"
/translation="MIVPVKNEERVLPRLLDRLVNLEYDKSKYEIIVVEDGSTDRTFQ
ICKEYEIKYNNLIRCYSLPRANVPNGKSRALNFALRISKGEIIGIFDGDTVPRLDILE
YVEPKFEDITVGAVQGKLVPINVRESVTSRLAAIEELIYEYSIAGRAKVGLFVPIEGT
CSFIRKSIIMELGGWNEYSLTEDLDISLKIVNKGCKIVYSPTTISWREVPVSLRVLIR
QRLRWYRGHLEVQLGKLRKIDLRIIDGILIVLTPFFMVLNLVNYSLVLVYSSSLYIVA
ASLVSLASLLSLLLIILIARRHMIEYFYMIPSFVYMNFIVALNFTAIFLELIRAPRVW
VKTERSAKVTGEVMG"
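The origin-spanning case can be sketched in a few lines of Python. The helpers below are hypothetical, not BioPerl methods, and they assume a forward-strand join, 1-based inclusive coordinates, and that the first sublocation in the example ends at the last base of the genome (so the genome length is taken to be 2992245).

```python
def total_length(parts):
    """Sum of inclusive sublocation lengths."""
    return sum(end - start + 1 for start, end in parts)


def spans_origin(parts, seq_len):
    """Heuristic: some part ends at the last base and the next starts at base 1."""
    return any(
        a_end == seq_len and b_start == 1
        for (_, a_end), (b_start, _) in zip(parts, parts[1:])
    )


def extract_join(seq, parts):
    """Concatenate the parts in the order listed (forward strand only)."""
    return "".join(seq[start - 1:end] for start, end in parts)


# Coordinates from the CDS above; genome length is an assumption (see lead-in).
cds = [(2991448, 2992245), (1, 252)]
print(total_length(cds))            # 1050, a multiple of 3
print(spans_origin(cds, 2992245))   # True

# Toy circular genome: the feature runs off the end and continues at base 1.
toy = "GGGTTTAAACCC"
print(extract_join(toy, [(10, 12), (1, 3)]))  # CCCGGG
```

The point of the sketch is that the overall start and end of such a location cannot be read off as the minimum and maximum coordinate without losing the fact that the feature wraps around.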
order(location,location, ... location)
The elements can be found in the specified order (5' to 3' direction), but nothing is implied about the reasonableness of joining them.
Not as commonly used, but necessary to describe the locations of features such as repeated sequences. Note that although one might expect to assign a single strand to the split location object, the sublocations are allowed to have different strands when this operator is used.
Examples: Simple subLocations
From AF006691:
...
repeat_region order(912..1918,20410..21416)
/rpt_type=direct
...
From AF264948:
...
repeat_region order(11375..11410,15692..15727)
/note="similar to inverted repeats found in IS911"
/rpt_type=direct
...
From NC_003384:
...
repeat_region complement(order(153546..153558,154281..154293))
/rpt_type=inverted
repeat_region complement(order(159639..159653,160446..160460))
/rpt_type=inverted
repeat_region complement(order(160655..160668,161461..161474))
/rpt_type=inverted
...
Examples: Simple subLocations with mixed strands
From AF264948:
...
primer_bind order(3..26,complement(964..987))
...
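One way to think about strand on a split location with mixed-strand parts is to report a strand only when all sublocations agree. The Python sketch below illustrates that policy; `overall_strand` is a hypothetical helper and this is one possible convention, not a description of what Bio::Location::Split currently does.

```python
from typing import Iterable, Optional


def overall_strand(sub_strands: Iterable[int]) -> Optional[int]:
    """Return the shared strand, or None when the parts disagree (or there are none)."""
    unique = set(sub_strands)
    return unique.pop() if len(unique) == 1 else None


# primer_bind order(3..26,complement(964..987)): one part on each strand
print(overall_strand([1, -1]))    # None

# complement(order(153546..153558,154281..154293)): both parts on the reverse strand
print(overall_strand([-1, -1]))   # -1
```

Under this convention, callers that need a strand must ask each sublocation individually whenever the container reports none.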
Examples: Remote subLocations
From AF081826:
...
gene order(complement(1009..>1260),complement(AF081827.1:<1..177))
/gene="csgD"
...
bond(location,location, ... location)
Found in protein files. These generally are used to describe disulfide bonds. Some of these location types imply that the order of the location is important (see the example below for 1XDA_D).
Note that this split location type is not described in the GenBank/EMBL/DDBJ Feature Table definition, presumably because it pertains to protein sequences, not nucleotide sequences.
Examples: Simple subLocations
From 1TGS_Z:
...
Bond bond(115,216)
/bond_type="disulfide"
SecStr 120..126
/sec_str_type="sheet"
/note="strand 11"
Bond bond(122,189)
/bond_type="disulfide"
...
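A bond location of this simple form reduces to an ordered tuple of residue positions. The Python sketch below shows one way to read it; `parse_bond` is a hypothetical helper, not part of BioPerl, and it accepts only the flat bond(n) or bond(n,m,...) form with plain integer positions.

```python
import re

# "bond(" followed by one or more comma-separated integers, then ")"
_BOND = re.compile(r"^bond\((\d+(?:,\d+)*)\)$")


def parse_bond(text: str) -> tuple:
    """Return the bonded residue positions in the order they were written."""
    m = _BOND.match(re.sub(r"\s+", "", text))
    if m is None:
        raise ValueError(f"unsupported bond location: {text!r}")
    return tuple(int(p) for p in m.group(1).split(","))


print(parse_bond("bond(115,216)"))  # (115, 216)
print(parse_bond("bond(122,189)"))  # (122, 189)
```

Keeping the positions as an ordered tuple, rather than a set or a start/end range, preserves both the order and any repeated position, which a range-based representation would lose.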
Examples: Complex subLocations
From 1TGS_Z:
...
Het join(bond(58),bond(60),bond(60),bond(63),bond(68),
bond(68),bond(68))
/heterogen="( CA, 800 ) Calcium Ion"
...
From 1XDA_D:
...
Het join(bond(29),bond(29),bond(3),bond(3))
/heterogen="(MYR, 39 )"
Region 4..>29
/region_name="IlGF"
/note="Insulin / insulin-like growth factor / relaxin
family; smart00078"
/db_xref="CDD:47425"
Het join(bond(29),bond(29))
/heterogen="(MYR, 39 )"
...
Examples: Remote subLocations
There are no apparent remote sublocations used for bonds. If a link to another sequence is implied, it apparently is indicated in the feature tags, not in the location string.
From P67973_1, whale insulin:
...
Bond bond(7,7)
/gene="INS"
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain (between B and A chains)."
Bond bond(19,20)
/gene="INS"
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain (between B and A chains)."
...
Deprecated split locations
The following split location operators appeared in GenBank releases prior to release 122 and in older EMBL releases, but have long since been removed. They are not supported in BioPerl.
group
group(location, location, .. location): The elements are related and should be grouped together, but no order is implied.
This differs from order in that there is no 5'->3' order of the sublocations.
one-of
one-of(location, location, .. location): The element can be any one, but only one, of the items listed.
These elements were used at one point for describing split locations where the exact position of one sublocation was uncertain but could be picked from a group. These were mainly encountered with genes that were alternatively spliced but were also used for describing variation data. Now the individually spliced genes are described in separate sequence features, using regular join or order operators.
These appear to have been mainly retrofitted to fit the current specification as of October 2011; no current examples exist. One retrofitted example:
One might also encounter these in the annotation COMMENTS section: