Everything DAS


Last Updated 06 July 2009

The intention of this document is to bring together, and add to, all the documentation available on the WWW for the DAS system. The content on these pages draws from many sources of information and thus has many contributors. Eventually the intention is that this document will be a set of instructions that you can print out and use as reference documentation, or simply as a good read. If you find any errors on these pages, or on pages that they link to, please contact me (Jonathan Warren) to let me know. Any suggestions and contributions are also welcome. As this is now in wiki form, you can log in and edit or add things yourself.


What is DAS?

As biological databases grow ever larger with the advent of high-throughput technologies such as sequencers and microarray chips, it is becoming increasingly difficult for a research team to download all the data relevant to it. DAS gets around this by keeping the data stored with its originators and allowing users around the world to access just the parts they need at any one time. Put another way: by using DAS you can view integrated information from multiple sources without these sources needing to be aware of each other. You can also add your own DAS data source, perhaps privately within your own institution, and then view the information served from this source in the context of features from other institutions.

DAS stands for Distributed Annotation System. It was originally set up for genomic information, where annotations/features are layered on top of a reference sequence, usually a genome. The idea is that a genome browser such as Ensembl or GBrowse (both DAS clients in this scenario) can display, in the same view, annotations from data sources on the same server/machine as the browser and annotations from data sources (data served by DAS servers) that could be on the other side of the world, communicating via the WWW.

The DAS system consists of the DAS Registry (www.dasregistry.org) as well as DAS servers and clients. The Registry enables people and computers to find the DAS data sources available around the world easily, and also helps these data sources conform to the specifications. It is important that data served by DAS servers conforms to the specifications, so that different clients and servers around the world can interoperate. The current version of the specification is DAS 1.6, which adds support for non-genomic data compared to the previous version, DAS 1.53.

Current Status: DAS Specifications 1.53, 1.53E, 1.6, 1.6E, 2.0 and Future Intentions

The various versions of the DAS specification are described on the DAS_specification page.

Currently DAS 1.53 is the most widely used and supported version, together with 1.53E. DAS 2.0 is quite different and can be considered a separate project from DAS, running in parallel. After the 2009 workshop it was generally agreed that most of the useful additional features that 2.0 provides are now, or very soon will be, implemented in DAS 1.6E and its subsequent incarnations, and thus DAS 2.0 is now considered redundant. If you wish your data to be widely accessible, use the 1.6 or 1.53E specification documents as your guide.

Setting up a DAS Server

There are several options available for setting up a DAS server. All are written in either Perl or Java.

Servers available

Name | Programming language | Advantages | Disadvantages
Dazzle | Java | Standard implementation; includes support for extensions (structure, interaction, vol) | Some people say it can be hard to configure and deploy if you are not used to Java web development
Proserver | Perl | Standard implementation; includes support for extensions (structure, interaction, vol) | (none listed)
MyDAS | Java | Some people say it's easier to set up and configure than Dazzle | Doesn't currently support extensions
LDAS | Perl | Very easy to set up? | Limited support for DAS functionality and sources


Dazzle

Dazzle is currently the standard/default implementation for Java users; however, MyDas (described below) is also popular.

Dazzle Eclipse Tutorial

Dazzle_Tutorial: this tutorial takes you through setting up Dazzle in Eclipse and then shows you how to add your own plugins.

Getting Dazzle

http://biojava.org/wiki/Dazzle#Getting_Dazzle
The latest version of the cutting-edge source code is available from subversion: http://www.derkholm.net/svn/repos/dazzle/

Using ready made plugins for datasources

http://biojava.org/wiki/Dazzle:plugins
http://biojava.org/wiki/Dazzle:deployment
More examples needed here and tips for using mysql etc?

Writing your own plugin

http://biojava.org/wiki/Dazzle:writeplugin
How to write a plugin using eclipse

Deploying an Ensembl Reference Server

link to Ensembl reference server instructions

MyDas

Information about MyDas can be found here.

Proserver

Proserver Page at the Sanger Institute.
Proserver Tutorial 1
Proserver Tutorial 2
Proserver Tutorial 3
Proserver Developers' Guide

Implementing the latest specs

One of the most important additions in the recent specifications is the sources command. It returns essential information such as who maintains the DAS source and which coordinate systems it uses. A large distributed system like DAS depends on this information, and the registry and clients need it in order to use the source correctly. Dazzle simply serves a sources.xml document, which has to be written by the DAS server owner, whereas ProServer will create a sources document for you if you specify the extra information needed in the initialisation file (proserver.ini).

Proserver example of config to implement sources cmd:


coordinates = TAIR_8,Chromosome,Arabidopsis thaliana -> 1:2000,3000
properties  = key1 -> value1 ; key2 -> value2
mapmaster   = http://www.gramene.org/das/Arabidopsis_thaliana.TAIR8.reference
capabilities = features -> 1.0

The coordinates data is taken from the coordinates/registry_coordinates.xml file, which is an archived copy of the list of coordinate systems available in the DAS registry. Specifying the name (or, strictly, the URI) and a test range is enough; ProServer will pick up the rest from the XML file. If the full data is not picked up, you may need to update the coordinates XML file from the registry (http://www.dasregistry.org/das/coordinatesystem). If your coordinate system is not in the Registry, an admin can add it for you.
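
To make the "name -> value" notation used in the config above concrete, here is a minimal Java sketch that splits such a line into pairs. It is purely illustrative: the class and method names are hypothetical, and it assumes that entries are separated by ";" and that each entry uses "->" exactly as in the example. ProServer's own parsing may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IniArrowParser {

    /**
     * Splits a value such as "key1 -> value1 ; key2 -> value2" into an
     * ordered map. Entries without "->" are skipped (an assumption made
     * for this sketch, not documented ProServer behaviour).
     */
    public static Map<String, String> parsePairs(String value) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String entry : value.split(";")) {
            // Split on the first "->" only, so values may contain "->" themselves.
            String[] parts = entry.split("->", 2);
            if (parts.length == 2) {
                pairs.put(parts[0].trim(), parts[1].trim());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The two right-hand sides from the example config above.
        System.out.println(parsePairs("key1 -> value1 ; key2 -> value2"));
        System.out.println(parsePairs("TAIR_8,Chromosome,Arabidopsis thaliana -> 1:2000,3000"));
    }
}
```

For the coordinates line, the key is the coordinate system name and the value is the test range.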

Protein Annotations and Ontologies

For an explanation of ontology usage within the DAS protocol, see: ontologies

Testing your implementation

Validation and Registering of your Server

RelaxNG and other validation in the Registry

The DAS Registry uses RelaxNG to validate the XML responses from DAS servers before allowing them to register as valid DAS sources. A RelaxNG schema plays much the same role as a DTD, except that it is written in an XML syntax that is quick to learn. The registry uses the schema documents found at http://www.dasregistry.org/validation/, one for each of the DAS commands: features.rng, sources.rng, alignments.rng, structure.rng, entry_points.rng, interaction.rng, sequence.rng and types.rng. (Note that in a web browser you may need to right-click and "view source" to see anything on these pages.)

Stylesheets

Stylesheets are described in the specification, but some extra notes here may be useful. In the returned stylesheet, an undefined <BUMP> tag is treated as <BUMP>no</BUMP>. This is appropriate for features that are to be displayed on the same line; for example, when grouping a set of features in a gene, the exons should not be bumped, so that they appear on the same line as the other features in that gene. A transcript, however, probably should be bumped so that overlapping transcripts are displayed on separate lines (in this case the tag <BUMP>yes</BUMP> should be inserted).
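
As an illustration of the default described above, the following Java sketch reads a stylesheet and reports whether a given feature type should be bumped, treating a missing <BUMP> as "no". The class name, method name and the sample stylesheet are all made up for this example; the sample is hand-written to resemble a DAS stylesheet and is not taken from a real server.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BumpReader {

    // Hand-written sample: the exon type has no BUMP tag, the transcript type has BUMP=yes.
    public static final String SAMPLE =
          "<DASSTYLE><STYLESHEET version=\"1.0\"><CATEGORY id=\"default\">"
        + "<TYPE id=\"exon\"><GLYPH><BOX><FGCOLOR>black</FGCOLOR></BOX></GLYPH></TYPE>"
        + "<TYPE id=\"transcript\"><GLYPH><LINE><BUMP>yes</BUMP></LINE></GLYPH></TYPE>"
        + "</CATEGORY></STYLESHEET></DASSTYLE>";

    /** Returns true only if the TYPE with the given id contains BUMP=yes. */
    public static boolean isBumped(String stylesheetXml, String typeId) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(stylesheetXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("stylesheet is not parseable XML", e);
        }
        NodeList types = doc.getElementsByTagName("TYPE");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            if (typeId.equals(type.getAttribute("id"))) {
                NodeList bump = type.getElementsByTagName("BUMP");
                // A missing BUMP element is treated as "no".
                return bump.getLength() > 0
                        && "yes".equalsIgnoreCase(bump.item(0).getTextContent().trim());
            }
        }
        return false; // type not listed: fall back to the same default
    }

    public static void main(String[] args) {
        System.out.println("exon bumped: " + isBumped(SAMPLE, "exon"));
        System.out.println("transcript bumped: " + isBumped(SAMPLE, "transcript"));
    }
}
```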

The DAS Registry

Introduction to the DAS Registry

The DAS registry can be found at http://www.dasregistry.org and serves as a central place for discovering DAS sources from around the world and for validating the sources. There is a user interface for interrogating the sources and ways for clients to also interrogate the sources. Support for searching sources based on Ontologies is likely to be included in future releases. The number of sources registered is set to increase rapidly to accommodate the ensembl genomes project data and the general increase in numbers of sequenced genomes. The registry will thus have to be modified in order to cope with this increase in data. The user interface has a warning sign next to any of the sources that have not been valid for two days or more (if this is the case, the registry will have sent an email to the administrator for the data source informing them of this fact).

Connecting to the Registry Programmatically

There are several commands that can be used to query the registry. The sources command accepts optional parameters: label, organism, authority, capability, type and unique source_id. You can also use the organism, coordinatesystem and lastmodified commands. For examples see Scripting. An example of a Java class written using Dasobert to access the Registry is here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/ContactRegistry.java
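
As a rough sketch of how a client might assemble a filtered sources request, the Java below appends the optional parameters as URL-encoded name=value pairs to the registry's sources URL (http://www.dasregistry.org/das1/sources). Passing the filters as an ordinary query string is an assumption made for this example, and the class and method names are hypothetical; check the registry documentation for the exact syntax it accepts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class RegistryQuery {

    public static final String SOURCES_URL = "http://www.dasregistry.org/das1/sources";

    /** Builds SOURCES_URL?name=value&... from the given filters, in insertion order. */
    public static String build(Map<String, String> filters) {
        StringBuilder url = new StringBuilder(SOURCES_URL);
        char separator = '?';
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            url.append(separator)
               .append(encode(filter.getKey()))
               .append('=')
               .append(encode(filter.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("organism", "Homo sapiens");
        filters.put("capability", "features");
        System.out.println(build(filters));
    }
}
```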

Adding a large set of data sources to the registry

The registry can automatically load a large set of data sources from the sources.xml that is returned from the sources cmd. If you wish to load a large set of sources you can contact dasregistry@sanger.ac.uk and ask for your data sources to be loaded. Please note that your sources must have valid coordinate systems that are in the registry and a valid sources document. You can do an initial test at http://www.dasregistry.org/validateServer.jsp and select the sources capability for your server.

Discovering DAS sources programmatically

The registry produces its own sources.xml in response to the URL request http://www.dasregistry.org/das1/sources, and clients can use this to get information on the many DAS sources available around the world and their capabilities. Clients can also find out whether the data sources are valid, thanks to a "valid" property in the PROP elements returned for each source in the sources.xml. Example snippet from a source with a validated features command:

<PROP name="valid" value="features" />
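
A client could collect these "valid" properties with a few lines of DOM code. The Java sketch below is a minimal illustration: the class and method names are hypothetical, and the sample document is hand-written to resemble a sources response (it is not real registry output).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ValidCapabilityReader {

    // Hand-written sample, not real registry output.
    public static final String SAMPLE =
          "<SOURCES>"
        + "<SOURCE uri=\"example_1\" title=\"demo\">"
        + "<VERSION uri=\"example_1\" created=\"2009-07-06\">"
        + "<CAPABILITY type=\"das1:features\"/>"
        + "<CAPABILITY type=\"das1:types\"/>"
        + "<PROP name=\"valid\" value=\"features\"/>"
        + "</VERSION></SOURCE></SOURCES>";

    /** Maps each SOURCE uri to the values of its PROP elements named "valid". */
    public static Map<String, List<String>> validBySource(String sourcesXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(sourcesXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("sources document is not parseable XML", e);
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList sources = doc.getElementsByTagName("SOURCE");
        for (int i = 0; i < sources.getLength(); i++) {
            Element source = (Element) sources.item(i);
            List<String> valid = new ArrayList<>();
            NodeList props = source.getElementsByTagName("PROP");
            for (int j = 0; j < props.getLength(); j++) {
                Element prop = (Element) props.item(j);
                if ("valid".equals(prop.getAttribute("name"))) {
                    valid.add(prop.getAttribute("value"));
                }
            }
            result.put(source.getAttribute("uri"), valid);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(validBySource(SAMPLE));
    }
}
```

In the sample, the types capability is advertised but only features carries a "valid" property, so only features is reported as validated.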

This also means that you can find out the valid capabilities of a server using SOAP. Here is a Java example that uses SOAP to access the registry and get the names of all the sources, their coordinate systems and their validated capabilities (as considered valid by the registry, which tests them approximately every 3 hours):

import org.biojava.services.das.registry.DasRegistryAxisClient;
import org.biojava.dasobert.dasregistry.DasSource;
import org.biojava.dasobert.dasregistry.DasCoordinateSystem;
import java.net.URL;

public class testWebService {

    public static void main(String[] args) {
        testWebService t = new testWebService();
        try {
            t.runTest();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void runTest() throws Exception {

        // SOAP endpoint of the registry
        URL u = new URL("http://www.dasregistry.org/services/das:das_directory");

        DasRegistryAxisClient dasregistryaxisclient = new DasRegistryAxisClient(u);

        // fetch all registered sources
        DasSource[] sources = dasregistryaxisclient.listServices();
        System.out.println("got " + sources.length + " DAS servers");

        for (int i = 0; i < sources.length; i++) {
            DasSource s = sources[i];
            System.out.println("nickname: " + s.getNickname());

            // capabilities that passed the registry's validation
            String validCaps[] = s.getValidCapabilities();
            for (int h = 0; h < validCaps.length; h++) {
                System.out.println("validated capability=" + validCaps[h]);
            }

            // coordinate systems, printed by their test code
            DasCoordinateSystem[] coords = s.getCoordinateSystem();
            for (int j = 0; j < coords.length; j++) {
                System.out.println(coords[j].getTestCode());
            }
        }
    }

}

Setting up a DAS client

Currently Available DAS Clients

Name | Description | Programming language | Links
GBrowse | Quote from the GBrowse website: "GBrowse is the most popular viewer in GMOD. For a list of GBrowse and GMOD installations see the GMOD Users page. For a demo of its features, try the WormBase, FlyBase, or Human Genome Segmental Duplication Database web sites." Spec: DAS 1.53E, and 1.6 soon | Perl | http://gmod.org/wiki/Gbrowse
EnsEMBL | EnsEMBL is a web-based genome browser and database system which supports DAS 1.53E, and soon 1.6 | Perl | http://www.ensembl.org/
IGB | An application built upon the GenoViz SDK and Genometry for visualization and exploration of genomes and corresponding annotations from multiple data sources | Java | http://genoviz.sourceforge.net/
Jalview | A multiple sequence alignment editor and viewer | Java | http://www.jalview.org/
Dasty2 | Dasty, a protein DAS client, is implemented for visualising protein sequence feature information. The client is able to connect to a reference server and one or many DAS servers. It merges the data from all the servers and displays sequence information, as well as annotated feature information from all the available DAS servers, in a very user-friendly way. | Perl and AJAX | http://www.ebi.ac.uk/dasty/

Examples

STRAP

http://www.bioinformatics.org/strap/createStrapLinks.html#DAS

Writing your own DAS client

Remember that if you are developing your client behind a firewall you may need to configure your code to go through a proxy using your proxy settings; examples of this can be found in the code below.
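
For Java clients, one common way to do this is through the JVM's standard networking system properties (http.proxyHost and http.proxyPort), which the built-in HTTP URL handling consults. The sketch below is only an illustration: the class name, method name, proxy host and port are all placeholders, and whether a particular DAS library honours these properties is something to check for that library.

```java
public class ProxySetup {

    /** Sets the JVM-wide HTTP proxy properties; call before opening any connections. */
    public static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Placeholder values: replace with your institution's proxy settings.
        useHttpProxy("proxy.example.org", 3128);
        System.out.println(System.getProperty("http.proxyHost") + ":"
                + System.getProperty("http.proxyPort"));
    }
}
```

The same properties can also be supplied on the command line with -Dhttp.proxyHost=... and -Dhttp.proxyPort=..., which avoids hard-coding them.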

A Java DAS Client Library - Dasobert

Examples of client code written in Java using Dasobert can be found here: http://www.derkholm.net/svn/repos/dasobert/trunk/doc/examples/

There is also a tutorial for using Dasobert within Eclipse (it follows on from the Dazzle Eclipse tutorial): Dasobert_Tutorial

Example of walking a DAS source using perl

This example was kindly provided by Felix Kokocinski. You can specify a region, or let it walk through all regions if the server can supply entry points with lengths. This is done in slices of, e.g., 20 Mb (20,000,000 bases). It takes quite some time, but works nicely.


# Example script that reads genomic data from DAS server
# using a defined chunk size
# writing the data out to a gff file.
# fsk@sanger.ac.uk, 2008 
use strict;
use Bio::Das::Lite;
use Getopt::Long;

#default DAS server address
my $server = "http://das.sanger.ac.uk/das";
#default DAS source name
my $source = 'otter_das';
#proxy name
my $http_proxy = undef;
#genomic chunk size to query
my $max_len    = 20000000;

my $chromosome = undef;
my $start      = 0;
my $end        = 0;
my $gff_file   = undef;
my %transcripts = ();

my $type;

&GetOptions(
	    'file=s'                 => \$gff_file,
	    'chromosome=s'           => \$chromosome,
	    'start=s'                => \$start,
	    'end=s'                  => \$end,
            'server=s'               => \$server,
            'source=s'               => \$source,
	   );

#connect to DAS server
my $das = connect_das("$server/$source", $http_proxy);

#get entry point list/lengths
#requires the DAS server to support the entry-points function
my $chrom_lens = get_entry_points();

open(GFF, ">$gff_file") or die "Can't open file $gff_file.\n";

if($chromosome){
  #query specific region
  get_region($chromosome, $start, $end);
}
else{
  #go through all chromosomes
  foreach my $chrom (keys %$chrom_lens){
	print "getting $chrom\n";
	get_region($chrom, undef, undef);
	%transcripts = ();
  }
}


close(GFF) or die "Can't close file $gff_file.\n";


################################################

#connect to DAS server
sub connect_das {
  my ($dsn, $proxy) = @_;

  my $das = Bio::Das::Lite->new({
				 'timeout'    => 10000,
				 'dsn'        => $dsn,
				 'http_proxy' => $proxy,
				}) or die "cant connect to DAS server!\n";

  return $das;
}



#look at the region requested
sub get_region {
  my ($chromosome, $start, $end) = @_;

  my $chrom_len    = $chrom_lens->{$chromosome};
  my $region       = "";

  if( $start and $end){
    if($start > $end){
      die "Coordinates wrong: $start > $end!\n";
    }
    if( ($end - $start) <= $max_len ){
      #get entire region
      my $region = ":".$start.",".$end;
      get_transcripts($region, $chromosome);
    }
    else{
      go_through_chunks($start, $end, $chromosome, $chrom_len);
    }
  }
  elsif( $chrom_len <= $max_len ){
    #get entire chromosome
    get_transcripts($region, $chromosome);
  }
  else{
    go_through_chunks(1, $chrom_len, $chromosome, $chrom_len);
  }

}


#go through a region in chunks
sub go_through_chunks {
  my ($chunk_start, $chunk_end, $chromosome, $chrom_len) = @_;

  my ($region_start, $region_end);
  my %ids_seen;

  #loop through regions until all is covered
  #keep track of genes to avoid duplicates!
  for($region_start = $chunk_start, $region_end = $region_start + $max_len;
      $region_start < $chunk_end;
      $region_start = $region_end + 1, $region_end += $max_len){

    if($region_end > $chrom_len){
      $region_end = $chrom_len;
    }elsif($region_end > $chunk_end){
      $region_end = $chunk_end;
    }
    my $region = ":".$region_start.",".$region_end;

    #get all transcripts from chunk
    my $new_ids = get_transcripts($region, $chromosome, \%ids_seen);
    %ids_seen = (%ids_seen, %$new_ids);
  }

}



#fetch all available entry-points (chromosomes) and their lengths from server
sub get_entry_points {

  my %chrom_lens;

  my $entry_points = $das->entry_points();

  foreach my $k (keys %$entry_points){
	foreach my $l (@{$entry_points->{$k}}){
		foreach my $segment (@{ $l->{"segment"} }){
			$chrom_lens{ $segment->{"segment_id"} } = $segment->{"segment_size"};
		}
  	}
  }

  return \%chrom_lens;
}



#fetch the data and process it.
#note that this function is quite specific to the way your DAS source is set-up.
#the idea is to get together all exons, etc that belong to a transcript and all transcripts
#that belong to a gene.
sub get_transcripts {
  my ( $region, $chromosome, $previous_genes ) = @_;

  print STDERR "have chr $chromosome$region\n";

  my %genes = ();
  my %new_features = ();
  my $response = undef;
 
  #fetch DAS features
  $response = $das->features({
			      'segment' => $chromosome.$region,
			      'type'    => $type,
			     });

  while (my ($url, $features) = each %$response) {

    if(ref $features eq "ARRAY"){
      print STDERR "Received ".scalar @$features." features.\n";

    FEATURES:
      foreach my $feature (@$features) {

	my %notes = ();

	my $grouphash = $feature->{'group'}->[0];

	#get other notes
	my $i = 0;
	my $morenote_entry = '';
	while(defined($feature->{'note'}->[$i])){
	  my $morenotes = $feature->{'note'}->[$i];
	  my ($morenotes_type, $morenotes_value) = split('=', $morenotes);
	  $morenotes_value =~ s/\&\#39\;/\'/g;
	  $notes{$morenotes_type} = $morenotes_value;
	  $i++;
	}

	#remove duplicates from overlapping regions
	if(defined $previous_genes and exists($previous_genes->{$grouphash->{'group_type'}})){
	  next FEATURES;
	}

	#you could do some filtering of the response at this point 
	my %gff_element;

	#build structure for exons and general items
	#find type
	my $element_type = $feature->{'type'} || "exon";
	$element_type    =~ m/((intron)|(UTR)|(exon))/g;
	if($1){ $element_type = $1 }

	my $group_type   = $grouphash->{'group_type'};

	my $strand       = $feature->{'orientation'};
	if($feature->{'orientation'}    =~ /^(\+|\-|\.)$/) {  }
	elsif($feature->{'orientation'} ==  1){ $strand = '+' }
	elsif($feature->{'orientation'} == -1){ $strand = '-' }
	elsif($feature->{'orientation'} ==  0){ $strand = '.' }
	else{ die "INVALID STRAND SYMBOL: ".$feature->{'orientation'}."\n"; }

	my $phase        = ".";
	if($feature->{'phase'}){
	  $phase = $feature->{'phase'};
	}
	elsif($element_type eq "exon"){
	  $phase = "0";
	}

	if(!$notes{"Transcriptstatus"}){
	  die "PROBLEM: $element_type, ".$feature->{'feature_id'}."\n";
	}

	$gff_element{'seqid'}      = $chromosome;
	$gff_element{'source'}     = $notes{"Transcripttype"};
	$gff_element{'type'}       = $element_type;
	$gff_element{'start'}      = $feature->{'start'};
	$gff_element{'end'}        = $feature->{'end'};
	$gff_element{'score'}      = ".";
	$gff_element{'strand'}     = $strand;
	$gff_element{'phase'}      = $phase;

	#check for some missing values
	if(!exists $feature->{'feature_id'}){
	  print STDERR "Missing value for Parent-feature_id\n";
	  $feature->{'feature_id'} = "0";
	}
	if(!exists $notes{"Transcriptstatus"}){
	  print STDERR "Missing value for Transcriptstatus\n";
	  $notes{"Transcriptstatus"} = "-";
	}
	if(!exists $notes{"Created"}){
	  print STDERR "Missing value for Created\n";
	  $notes{"Created"} = 0;
	}
	if(!exists $notes{"Lastmod"}){
	  print STDERR "Missing value for Lastmod\n";
	  $notes{"Lastmod"} = 0;
	}
	$gff_element{'attributes'} = "Parent=".$feature->{'feature_id'}.
	                             ";Status=".$notes{"Transcriptstatus"}.
				     ";CREATED=".$notes{"Created"}.
				     ";LASTMOD=".$notes{"Lastmod"};

	if(!exists $genes{ $group_type }){
	  $genes{ $group_type } = 1;
	  my %gff_gene;

          my $gene_region = $feature->{'target'};
          my ($gs, $gene_loc) = split('\=', $gene_region);
	  my ($gene_start, $gene_end) = split('\-', $gene_loc);

	  #build structure for gene
	  $gff_gene{'seqid'}      = $chromosome;
	  $gff_gene{'source'}     = $notes{"Genetype"};
	  $gff_gene{'type'}       = "gene";
	  $gff_gene{'start'}      = $gene_start;
	  $gff_gene{'end'}        = $gene_end;
	  $gff_gene{'score'}      = ".";
	  $gff_gene{'strand'}     = $strand;
	  $gff_gene{'phase'}      = ".";

	  #get gene description
	  my $description = "";
	  foreach my $gnote (@{$grouphash->{'note'}}){
	    my ($gnote_s, $gnote_string) = split('=', $gnote);
	    if($gnote_s eq "DESCR"){
	      $description = ";Description=".$gnote_string;
	    }
	  }
	  $gff_gene{'attributes'} = "ID=".$grouphash->{'group_type'}.
	                            $description.
				    ";Status=".$notes{"Genestatus"}.
	                            ";CREATED=".$notes{"Created"}.
				    ";LASTMOD=".$notes{"Lastmod"};

	  #print entry for gene
	  print_gff_line(\%gff_gene);
	  %gff_gene = ();

	  $new_features{$grouphash->{'group_type'}} = 1;

	}

	if(!exists $transcripts{ $feature->{'feature_id'} }){
	  $transcripts{ $feature->{'feature_id'} } = 1;
	  my %gff_transcript;

	  #build structure for transcript
	  $gff_transcript{'seqid'}      = $chromosome;
	  $gff_transcript{'source'}     = $notes{"Transcripttype"};
	  $gff_transcript{'type'}       = "transcript";
	  $gff_transcript{'start'}      = $feature->{'target_start'};
	  $gff_transcript{'end'}        = $feature->{'target_stop'};
	  $gff_transcript{'score'}      = ".";
	  $gff_transcript{'strand'}     = $strand;
	  $gff_transcript{'phase'}      = ".";
	  $gff_transcript{'attributes'} = "ID=".$feature->{'feature_id'}.";Alias1=".$feature->{'target_id'}.
	                                  ";Parent=".$grouphash->{'group_type'}.
					  ";CREATED=".$notes{"Created"}.
					  ";LASTMOD=".$notes{"Lastmod"}.
					  ";Status=".$notes{"Transcriptstatus"};

	  #print entry for transcript
	  print_gff_line(\%gff_transcript);
	  %gff_transcript = ();
	}
	#else{ print STDERR "_" } 
	#print entry for exons, etc.
	if($feature->{'type_category'} =~ /error/){
	  print STDERR "Found an error feature:\n";
	  print STDERR $gff_element{'seqid'}."\t";
	  print STDERR $gff_element{'source'}."\t";
	  print STDERR $gff_element{'type'}."\t";
	  print STDERR $gff_element{'start'}."\t";
	  print STDERR $gff_element{'end'}."\t";
	  print STDERR $gff_element{'score'}."\t";
	  print STDERR $gff_element{'strand'}."\t";
	  print STDERR $gff_element{'phase'}."\t";
	  print STDERR $gff_element{'attributes'}."\n";
	} else {
	  print_gff_line(\%gff_element);
	  %gff_element = ();
	}

	$feature = undef;
       }
       @$features = ();
       $features  = undef;
     }
   }
 
   return \%new_features;
}



#print the different data types as GFF
sub print_gff_line {
  my ($element) = @_;

  print GFF $element->{'seqid'}."\t";
  print GFF $element->{'source'}."\t";
  print GFF $element->{'type'}."\t";
  print GFF $element->{'start'}."\t";
  print GFF $element->{'end'}."\t";
  print GFF $element->{'score'}."\t";
  print GFF $element->{'strand'}."\t";
  print GFF $element->{'phase'}."\t";
  print GFF $element->{'attributes'}."\n";
}
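
For readers working in Java, the chunking idea used by the script above can be sketched as follows. This is an independent illustration rather than a translation: the class and method names are hypothetical, and it produces strictly non-overlapping slices of at most maxLen bases, formatted as DAS segment strings (reference:start,stop), which differs slightly from how the Perl loop steps through a region.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionChunker {

    /**
     * Splits [start, end] on the given reference into consecutive,
     * non-overlapping segments of at most maxLen bases each.
     */
    public static List<String> chunk(String reference, long start, long end, long maxLen) {
        if (start > end) {
            throw new IllegalArgumentException("Coordinates wrong: " + start + " > " + end);
        }
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen must be at least 1");
        }
        List<String> segments = new ArrayList<>();
        for (long chunkStart = start; chunkStart <= end; chunkStart += maxLen) {
            // Clamp the last chunk to the end of the requested region.
            long chunkEnd = Math.min(chunkStart + maxLen - 1, end);
            segments.add(reference + ":" + chunkStart + "," + chunkEnd);
        }
        return segments;
    }

    public static void main(String[] args) {
        // A 50-base region walked in slices of 20 bases.
        for (String segment : chunk("1", 1, 50, 20)) {
            System.out.println(segment);
        }
    }
}
```

Because the slices do not overlap, a feature that spans a slice boundary can still be returned by two requests, so the duplicate tracking done in the Perl script remains necessary.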


Acknowledgments

(some of this document may have been cut and pasted from documentation contributed by the following people):

  • Andreas Prlic
  • Andy Jenkinson
  • Phil Jones
  • Tim Hubbard
  • Lincoln Stein
  • Thomas Down