Notes from DAS/2 code sprint #2, day one, 13 Mar 2006

$Id: das2-teleconf-2006-03-13.txt,v 1.2 2006/03/16 20:47:28 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
  Panther Informatics: Brian Gilman

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

General note: Passcode is now required to enter teleconf. This is a change in their system.

Issue: Continuation Grant
-------------------------
gh: no word yet.

Issue: Coordinate System
------------------------
ad: question of what happens when there are multiple coordinate systems for an assembly.
    auth and source,
      source: contig space, scaffold space
      auth: organization (e.g. ncbi, ucsc)
gh: not enough to get uniqueness. ncbi, genome, human is not enough, need version to uniquely id the coord system
ad: auth, source, species, version identification string
gh: use case: need to know whether uris for two versioned source refer to the same genome.
gh: ncbi version numbers are separate from organism info, e.g. v35.
ad: we could have a service for mapping strings
gh: idea - every server can say this assembly name is same as that. Clients could chain together statements from multiple servers.
    For the affy das server used by igb, we now have a synonyms file on our server which igb reads. It's a pain to maintain.
ad: type of alignment server?
gh: a synonym server. Here's a uri, give me a list of synonyms that refer to the same thing. This is something to talk more about when Andreas is on line.

[Andreas joins in.]

gh: How would a das server verify that the version info in a sources document points to the same genome assembly?
ap: You would check auth=ncbi, vers=35, taxid=human
ap: In protein structure space, you check version on every object you work with. Protein seq.
gh: so we have to map version info on sequences as well as genome assemblies.
gh: use case: two segment responses from diff servers, diff uris for the diff sequences, how do you know they are referring to the same seq? name=chromosome21 vs name=chr21?
ad: we require the same name for the same segments.
gh: going to fall apart fast. no way to enforce it. People use 1, I, chr1, chromI.
ee: can put this in the validation suite.
aday: yes.
gh: but what do you use for name: accession # for entry, string chr1, etc.
gh: important since this is the name that goes to user.
ad: could have one slot for computer to use, one for human consumption.
ad: for segments there seem to be two diff ids: url,
ad: the point of having special ids for segments is segment equivalence from different servers. Separate coordinates element that says how to merge things together. Identifiers in here that are just coordinate space ids, not necessarily for human use. Only for identifying coords.
gh: but how do we get people to use it?
sc: what about the idea of using checksums as identifiers for a seq?
ad: problem of duplicate seqs in an assembly. e.g., same seq from chr1 and chr9.
gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an exact duplicate of a region on chr9?
sc: we could create the checksum on source:sequence
gh: useful to have a central place to ask for diff names for the same coord system.
ad: uniqueness idea: coords element, has: auth, source, version, species (optional). uniqueness says these are the names you use.
gh: this can fail. What do we say happens when it fails? Should there be a way of resolving it?
ad: this is where your synonym table comes in. Publish it?
gh: maybe as part of the registry, knows
ap: there isn't a big variety in naming because there aren't many people providing assemblies.
gh: we already have 10 different synonyms for an assembly
ee: this has some performance impact on igb. should have to do it.
ap: we should say this is how naming works.
gh: will fail.
ad: is this required for this version of the spec?
gh: need something that can be used now.
aday: without hardwiring
gh: if we don't agree during the code sprint, then it won't happen for everyone else.
aday: using roman numerals for yeast since sgd uses it.
ee: trouble with chrX
ad: andreas: is there a place for naming of segments to use
ap: no, something for the reference server, not coords
ad: given these coords, here are the names that are used.
ap: same as reference server.
gh: maybe registry should provide: here's a coord system and here are the names you can use for
ap: you would get a long list for proteins
aday: a user who wants to

[Brian Gilman joins in]

gh: question for brian g: LSID, when you come across this for LSIDs, ncbi is auth for human genome assembly yet they have no lsid for their assembly, how do people refer to their lsid when there's no authority to say what it is?
bg: you can't, no one is the authority. but you can write a resolver that queries ncbi under the covers, in your resolver you make ncbi the authority of the lsid, add namespace, object id. Then everyone has to know that your resolver is hosted at some site somewhere. So there is no satisfactory answer.
    It's a problem if the authority does not host the resolver.
bg: I'm at the w3c meeting at mit, providing a webified resolver, they would host a resolver, everyone would know to go to a well-known web address.
bg: you start a convention, enforce it, give error if people don't use it.
gh: thinking we need it associated with registry.
ap: ref server + coord system, provides ids that can be used,
gh: so other ids can be used, but registry server wouldn't support it.
ad: site has ftp site for downloading chromosomes, contains names for different segments in the file. How do I go from the ids in this file to the ids that Andreas describes? To make my annotations in the same space. Mapping from file from ncbi.
bg: what are your use cases? write back to server?
ad: user publishing locally,
bg: you make a ref server.
gh: experience from das1 is that everyone makes their own reference server and refers to it from their annotation server, using different names.
ad: new tag 'coordinates'
gh: like enforcing common names at registry server. Can use their own names, they just won't be allowed to post on the registry.
ad: need documentation
ap: could point to documentation on reference server
bg: workflow1: fish researcher looking for aberrant regions in chr7, 11 and 3, singled out the abctransporter gene. How does that work in das/2? type 'abc' in web page for reference server? This is a gene name.
ad: your client browser can go to the registry to find servers that host the assemblies for your fish. Go to those reference servers, do searches there. Will go to coord system, get a segments document, get display chromosome by title.
gh: get a das features xml document saying the sequence and coordinates.
gh: our discussion here is on getting the diff.
ad: we don't have anything on coordinates saying which is the latest version.
bg: latest build may have changed their gene coordinate.
gh: mapping servers is part of our continuation grant.
    Can push an annotation on one assembly to another assembly.
bg: a hard thing.
gh: that's why we're enlisting UCSC to do it!

ad: Topic: id, url, uri, iri (see email)
gh: likes uri, not url. Some things aren't really urls (resolvable). Iri might work.
ad: multiple coord elements for same ref server.
ap: originally there was one, but some use two, zebrafish guy chrom and scaffold coordinates. or chromosomes vs. gene ids. same types, different accession codes and features.
ad: if you have graphical browser, do you get scaffolds or chromosomes.
ap: depends on your view.
gh: if you do a segments query, do you get segments and contigs?
ap: depending on the coordinate system of the request.
ad: one capabilities for scaffolds and one for chromosomes?
gh: maybe

Deliverables:
[A] gregg: by end of week, load stuff from multiple servers, compare in the same view.
[A] steve will work on getting gregg's das/2 server up and running.

gh: trouble with biopackages.net server
aday: possible power outage interference.
gh: target filters have been dropped.
aday: yay!
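
Sketch of the synonym idea from the Coordinate System discussion:
gh's suggestion was that every server can state "this name is the same
as that one" and that clients could chain such statements from
multiple servers. One possible way for a client to do the chaining is
a union-find structure, shown below in Python. Nothing here comes from
the DAS/2 spec: the CoordSystem fields follow ad's proposal (auth,
source, version, optional species), while the class name SynonymIndex,
its methods, the "chromosome" source value and the example statements
are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical key for a coordinate system, using the fields proposed in
# the discussion (auth, source, version, optional species).
CoordSystem = namedtuple("CoordSystem", "authority source version species")


class SynonymIndex:
    """Groups names that servers have declared to be the same thing.

    Each "same as" statement merges two groups (union-find), so statements
    from different servers chain together automatically.
    """

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unseen keys as their own group, then walk up to the root.
        self._parent.setdefault(key, key)
        root = key
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited key straight at the root.
        while self._parent[key] != root:
            next_key = self._parent[key]
            self._parent[key] = root
            key = next_key
        return root

    def add_statement(self, coords, name_a, name_b):
        """Record that name_a and name_b denote the same segment in coords."""
        root_a = self._find((coords, name_a))
        root_b = self._find((coords, name_b))
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, coords, name_a, name_b):
        return self._find((coords, name_a)) == self._find((coords, name_b))

    def synonyms(self, coords, name):
        root = self._find((coords, name))
        return sorted(
            other_name
            for (other_coords, other_name) in list(self._parent)
            if self._find((other_coords, other_name)) == root
        )


# Illustrative values only; "chromosome" as the source is an assumption.
ncbi35 = CoordSystem("ncbi", "chromosome", "35", "human")

index = SynonymIndex()
index.add_statement(ncbi35, "chr1", "1")      # statement from one server
index.add_statement(ncbi35, "1", "chromI")    # statement from another server

print(index.same(ncbi35, "chr1", "chromI"))   # chained through "1"
print(index.synonyms(ncbi35, "chr1"))
```

Scoping every name by its coordinate system keeps "chr1" in one
assembly version from being merged with "chr1" in another, which
matches gh's point that the version is needed for uniqueness. The
sketch does not address what happens when two servers make conflicting
statements, which was left open in the discussion.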