Human genomic coordinates and sequence

From BioPerl
Jump to: navigation, search

{{#comment| test whether user set the "named" parameter }} {{#function|present||{{#not|{{#strpos|{{#1}}|{{#2}}}}}}}} {{#var|See|@=|see}} {{#var|sp|@=| }} {{#if|{{#present|{{#var|See}}|2}}||{{#var|See|@=|see{{#var|sp}}}}||{{#var|See|@=|^.}}}} ({{#if|{{#present|{{#var|See}}|2}}||{{#var|See}}||see{{#var|sp}} }}thread)

  • Note that the first script doesn't get it quite right, but it stays within BioPerl. Replace get_genomic_sequence() with the second version to get the exact solution. This requires Bio::EnsEMBL tools; see link below. --Ed.

Emanuele Osimo, after working hard with Getting Genomic Sequences HOWTO, says:

The script works fine, but gives me the wrong coordinates...there are two different sets of coordinates. The first is called "NC_000001.10 Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly", and is the one I need, and the second one is called just "NT_004610.19" and it's the one that the script prints. Do you know how to make the script print the "right" coordinates (at least,the ones I need)?

After solving this, the follow-up question:

Could you please tell me how to use the RefSeq coordinates in another script that fetches the sequence, for instance?

Below is a script that fetches chromosomal coordinates based on a geneID, and uses those to obtain chromosomal nucleotide sequence. It's hacky, but (sort of) works today (23:12, 23 July 2009 (UTC)).


use strict;
use Bio::DB::EntrezGene;
use Bio::WebAgent;
use Bio::DB::EUtilities;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Root::IO;
 
my $ua = Bio::WebAgent->new();
my $db = new Bio::DB::EntrezGene;
 
my ($chr,$start, $end, $info); 
my @geneids = (842, 843, 844);
for my $geneid ( @geneids ) {
    ($chr,$start, $end, $info) = genome_coords($geneid, $db, $ua); 
    my $seq_obj = get_genomic_sequence($chr, $start, $end);
    print $seq_obj->seq; # or whatever you want
}
 
 
sub genome_coords {
    my ($id, $db, $ua) = @_;
    my $seq = $db->get_Seq_by_id($id);
    my $ac = $seq->annotation;
     for my $ann ($ac->get_Annotations('dblink')) {
	 if ($ann->database eq 'UCSC') 
	 {
	     my $resp = $ua->get($ann->url);
	     my ($chr, $start, $end, $info) = $resp->content =~ m{position=chr([0-9]+):([0-9]+)-([0-9]+)
                                                                  \&.*\&refGene.*?\">(.*?)</A>}gx;
             # break above is a <perl ></ perl> wiki parser workaround...
	     return unless $chr;
	     return ($chr, $start, $end, $info);
	 }
     }
    return; # parse error, no UCSC link on page 
}
 
 
 
sub get_genomic_sequence {
    my ($chr, $start, $end) = @_;
    my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
					   -db    => 'nucleotide',
					   -rettype => 'gb');
 
    my $strand = $start > $end ? 2 : 1;
    my $acc = sprintf("NC_%06d", $chr); # standard locations for current hum genome build
                                        # NC_000022 is X, NC_000023 is Y. NCC_1701 is the Enterprise.
    $fetcher->set_parameters(-id => $acc,
			     -seq_start => $start,
			     -seq_stop  => $end,
			     -strand    => $strand);
    my $seq = $fetcher->get_Response->content;
    # there must be a better way than below, but oh well...
    if ($seq) {
	my ($fh, $fn) = Bio::Root::IO::tempfile();
	print $fh $seq;
	close $fh;
	my $seqio = Bio::SeqIO->new(-file=>$fn, -format=>'genbank');
	return $seqio->next_seq;
    }
    return;
}

--Ed.

Emanuele submits:

I discovered that the coupling of the two subs that Mark posted doesn't get the right results. I think this is because one gets the coordinates with RefSeq build 36.3, the other with build 37. I found that coupling the first sub, genome_coords, with the Bio::EnsEMBL::Registry fetch by region API is a lot better, and it actually generates sequences that contain the genes.

Here is a modified get_genomic_sequence, based on EO's script. This requires Bio::EnsEMBL::*; get it here, and see the core API docs here:

use Bio::EnsEMBL::Slice;
use Bio::EnsEMBL::Registry;
 
# returns a Bio::EnsEMBL::Gene
 
sub get_genomic_sequence {
  my ($chr, $start, $end) = @_;
 
  my $registry = 'Bio::EnsEMBL::Registry';
  $registry->load_registry_from_db(
      -host => 'ensembldb.ensembl.org',
      -user => 'anonymous'
  );
  my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
 
  my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end );
  return $slice;
}
Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox