Notes from the biweekly DAS/2 teleconference, 5 Mar 2007

$Id: das2-teleconf-2007-03-05.txt,v 1.2 2007/03/05 19:01:59 sac Exp $

Teleconference Info:
   * Schedule:         Biweekly on Monday
   * Time of Day:      9:30 AM PST, 17:30 GMT
   * Dialin (US):      800-531-3250
   * Dialin (Intl):    303-928-2693
   * Toll-free UK:     08 00 40 49 467
   * Toll-free France: 08 00 907 839
   * Conference ID:    2879055
   * Passcode:         1365

Attendees:
   Affy:   Steve Chervitz, Ed Erwin, Gregg Helt
   CSHL:   Lincoln Stein
   Sanger: Andreas Prlic
   UCLA:   Allen Day

Note taker: Steve Chervitz

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/. Instructions on how to access this repository are at
http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

Agenda
-------
   * Review of BioSapiens DAS workshop
   * Status updates

gh: I sent my summary of the biosapiens das workshop and feature
classification workshop I attended with Ed in Hinxton:
http://lists.open-bio.org/pipermail/das2/2007-March/000982.html
"das developers workshop from a das/2 perspective" summarizes what I
took home from these meetings and how well das/2 meets the needs of
people in europe (ensembl, sanger, biosapiens -- the focus of these
meetings), plus a quick biosapiens overview: a big european project,
25 institutions, large scale genome protein annotation. decided early
on to use das to distribute annotations between organizations. can
check the stats on their das servers -- andreas' registry -- 23
servers serving up 69 das sources -- a major das investment!
In developing das/2 we haven't had much experience with the kind of
data they're dealing with (protein annotations).

das/1 clients under study:
   - dasty2, dasty1 - ajax-based viz clients
   - jalview - alignment viewer, editor
   - igb - Ed gave presentation
   - pepper and spice - das viewers, also use alignment and 3d
     structure info
   - proview - protein annotation
   - ensembl viewer

servers presented/discussed:
   - pfam, ensembl, proserver, Andreas',

Extensions to das/1 protocol discussed:
   - gene das, protein das, structure das, 3d-em das (arbitrary 3d
     volumes), interaction das for prot-prot interactions
   - Moddas - writeback in das/1
   - Alignment das (Andreas)
   - Simple das - das servers that don't impl all of das/1
     (entry_points, or types, e.g.,)

Gregg presented on das/2, will put up ppt later. Tailored it assuming
familiarity with das/1, and focussed on how big the diffs are, with an
eye towards how hard it would be to move to das/2. Conceptually, not
that big a switch, though the XML is a lot different. Also discussed
how well das/2 addresses some of the problems with das/1 that came up
at the workshop.

[A] Gregg will send out powerpoint for his talk from BioSapiens DAS
    workshop

extensions for das/1:
   - das/2 addressed some of them very well. E.g., gene das (das w/o
     specifying location of feature). this is addressed well in das/2:
     can have features w/o location, or w/o range.
   - protein das - das/2 did a good job of removing nucleotide-specific
     parts of das features (orientation, phase are not required).
     das/2 is much more agnostic about dna vs protein.
   - alignment das - pairwise or multiple - locations with features in
     das/2 address some of these issues (0, 1, or more locations for a
     feature). each location can have an optional gap attribute (cigar
     string). so if you can describe it with a cigar string, you can
     describe it in das/2. Can use multiple locations to do multiple
     alignments.
     Not dealt with in das/2: 3d-threading of an alignment through a
     structure.
     Need to look at this in the future.

[A] Look at how to handle 3D structure alignment threading in DAS/2
    spec

   - simple das stuff is handled better in das/2 - in das/1 the
     assumption is you support all things unless you say otherwise.
     but in das/2 there is a capabilities header: you must indicate
     support there, and if not stated, the default is you don't
     support it. Can also say you support feature filters, so there's
     more formal support for that.

Surprises:
   - smaller subset of das/1 is in use than expected. of 69 sources,
     64 either fail entry points or say not applicable. types query:
     49 fail/not applicable.

ls: for types query. only one type?

gh: for ensembl, this is the case.

ap: lack of consistency of types is addressed in the other workshop
related to features.

gh: in das/1, types is less necessary because all info is replicated
in each feature: type-method, category, id.

ls: use case for types query is to present user with a set of
checkboxes, to select which type to retrieve from the source. if in
practice das sources are being used for one type, or a set of types
that only make sense together, so there's no reason to turn off a part
of it, then it makes sense to not support the types query.

ls: have heard that the types query is expensive computationally. with
simple db backends with no normalization/indexing, finding all types
involves visiting each record.

gh: part of the justification for 1 type / source is that those types
are stored in separate dbs. so having a das server to integrate them
makes sense.

gh: Re: using a smaller subset of das/1 than I expected: types can be
expensive in another way. example: representing pfam in das. feat type
for each pfam domain type (9000 primary domains). Pfam b - there are
70-400K more!

ls: in das/2 create a single type 'protein domain', then use an
attribute pointing to an ontology saying which pfam domain it is.

gh: concern there is, assuming clients will do something useful for
particular attributes.
For rendering, I could do diff rendering based on diff attribs (color
diff domains differently). but for clients to really understand that
they're different, that's a more complicated issue.

gh: clients don't use types or entry_points because servers don't
support them: a feedback loop.

ap: low coverage genomes (e.g., elephant) may have several 100K entry
points.

gh: in das/2 we are more formal and say that you don't support it.
Creates a problem: how do you know what to query in the first place?
Then you have to know what you're looking for.

gh: feature hierarchies are handled in das/2 -- this is not an issue
for protein das, where annotations are completely flat. even a protein
disulfide bond is one level, just rendered differently so it doesn't
span all residues in between. But for doing non-visual things (unions,
intersections) this could be a problem.

ls: flat in terms of location or ontology?

gh: location. there is no feature ontology yet (nothing consistent and
agreed upon yet, just proposed at this meeting).

ls: they aren't creating discontinuous features because it's too hard,
or they don't care.

gh: just not needed for most protein annotations. even when it could
be needed, it's just not being used.

ls: for nucleotide, it's needed frequently.

gh: not an issue for das/2.

gh: ensembl collapses type and source into one thing. what does this
mean? das/2 could be over complicated.

ls: no doubt that it is too complicated for the biosapiens use case.
we could make it easy for them to use by providing tool kits to read
and write. could also argue that postscript is too complicated for
drawing simple rectangles on the page. You wouldn't expect them to
simplify postscript. There are tools to ease simple rendering. The
complexity of das/2 won't interfere with adoption, but not having
toolkits and middleware layers to read/write will. Not getting ensembl
buy-in to das/2 could be a problem.

gh: tim hubbard was there and was on-board to transition to das/2.
ls: would have been better to have buy-in now (i.e., Tony Cox dropping
out).

gh: we've made it more formal to say, here is the subset of das/2 that
this server supports. for other use cases, we do need the added
complexity.

gh: re: ensembl support for das/2. I mentioned andrew's das/1 - das/2
transformational proxy server. not released yet, but making progress
on it. So if you have a das/1 server, you can put a das/2 front end on
it.

ls: can you go the other way, provide a das/1 interface on das/2?

gh: want to do this for the affy public das/2 server. Andrew's doesn't
do that yet, but I'd like to do this. Another thing: integrate that
proxy into the registry, so the registry makes it into a das/2 server.
then we don't have a burden on servers to support two versions of the
protocol. got email from andrew about his proxy on that.

sc: I put a note about Andrew's proxy server on the biodas.org wiki.

gh: he needs to have a place to keep it.

sc: open-bio server would work. Just need a better mechanism to ensure
it stays up. I think it's not getting started when the machine gets
rebooted.

[A] Steve/Andrew work on stable home for the proxy server

[Correction: In my note in the teleconf, I was thinking about Andrew's
validation server, which is hosted on open-bio and has a problem with
not being up reliably. The proxy server is another issue. There's a
mention of it on the DAS FAQ page, but no pointer to any server yet.
-steve]

gh: data overload and redundancy from the user perspective. in clients
where the default for protein annotation is to go to all servers, you
have way too many tracks showing up. Lots of servers and types.
Ensembl is moving to expose even more data via das, thousands of new
tracks (organisms, type, assembly version). Concern with biosapiens is
replication of the same annotation data. E.g., pfam domains in
different biosapiens data sources may return the same thing or slight
diffs in feature ranges. how does the user decide which is
authoritative?
Which can be left out? A big concern at the biosapiens meeting --
redundant information.

gh: another issue: mirrors for the data. discussed in the early days
of das/2, but not resolved how to deal with mirrors (http redirection
mechanism). This can lead to redundant data when you hit all mirrors.

gh: feature classification and ontologies around that. My take was
that the sequence ontology is inadequate to describe protein
annotation as it stands now. PAO - protein annotation ontology.

ls: are they doing this with NCBO involved?

gh: talked to them about getting hold of lincoln and suzi and
integrating with SO as an extension.

ap: for the 3rd version of SO we will contact lincoln and suzi to
discuss.

ls: great.

gh: for biosapiens, Janet Thornton is the person to contact about
that.

gh: more about types (proliferation causing the data overload issue
mentioned above). also discussion about dag vs hierarchical tree:
pointing to multiple terms in the ontology for a particular type. in
SO, how much has multiple parents come up? may need a type that can
point to multiple ontology terms for that type. das/2 cannot do it
yet, only one term per type.

ls: the more flexible we make it, the less coherent it will be. data
overload will get even worse. to reduce data overload, need a way to
take data from servers and decide if it's the same or different. are
they reachable in the same ontology? allowing set arithmetic will
create ambiguity. biosapiens can be allowed with an attribute,
multiple attributes that point at different ontologies.

gh: combining cellular location with protein classification
ontologies.

ls: certainly, but those are separate attributes. what we created is
essentially an RDF. Actually, the terminology is 'property', not
attribute. Types property is the correct way to do this.

gh: use of subset of das/1 and what it means for das/2, data overload
for users, feature classification issues.

gh: das wish list: people wrote up what they feel das is inadequate
for. Das/2 group was aware of these.
ls: encryption, synchronous request seem like impl issues, not part of
the protocol.

gh: some people complained that das is inadequate because it relies on
http(s). you can do much more high-level things with a soap-based
system. I think this is correct, but wrong that no one in our space
needs that.

ls: no pharma that cares about this will entrust it to the public
internet with anything, soap or otherwise.

gh: at affy, we've done das/1 servers with https and no one has ever
complained.

ls: identity theft via people stealing from encrypted streams never
emerged as a problem. they steal it from your physical trash, or by
setting up phony banking sites. Not related to strength of encryption.

gh: regarding asynch request - discussed 2 years ago -- yes, it's
outside of the das/2 spec, but we say, use http as you will. redirect
and say "your request has been accepted, check back here in a while."

gh: wish list (sent out in email to the list noted above):
   - multi-level features, stylesheets
   - caching - use http caching as you will
   - features from other sources - dealt with since we use URIs. a
     problem for das/1
     ls: provenance requires people to put in effort to maintain the
     provenance, but it doesn't free you of responsibility of having
     to track it.
   - scalability and large analysis - the data overload issue. the
     answer to me is smarter clients.
   - more queries -- addressed in das/2
   - entry point support - in das/2 we have a less ambiguous way to
     say whether a server supports it or not.
   - counting number of features of each type per source -- have the
     'count' format in das/2
   - referring to id's externally (das/2 uri's)
   - errors and exception handling - we have http error codes --
     remains to be seen how well it works out. done a reasonable job
     of mapping it to http error codes
   - better stylesheets - in progress for das/2
   - mapping servers - different genome assembly versions or mapping
     from protein to nucleotide space. -- under discussion with data
     providers.
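
The asynchronous pattern gh describes above (accept the request, then
tell the client to check back later) is left to plain HTTP rather than
to the das/2 spec. A minimal sketch of how a client might handle it,
assuming the server answers with 202 Accepted plus optional Location
and Retry-After headers. That convention, and the names NextStep and
plan_next_step, are illustrative assumptions, not anything das/2
defines:

```python
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urljoin


@dataclass
class NextStep:
    action: str                # "done", "poll", or "error"
    url: Optional[str] = None  # where to check back, for "poll"
    delay_s: int = 0           # seconds to wait before checking back


def plan_next_step(request_url: str, status: int,
                   headers: Mapping[str, str],
                   default_delay_s: int = 30) -> NextStep:
    """Decide what a client should do after one HTTP response.

    Assumes header names are already normalized to the spellings
    "Location" and "Retry-After" (hypothetical convention).
    """
    if status == 200:
        # The result is in this response; nothing more to do.
        return NextStep("done")
    if status == 202:
        # Request accepted but not finished. Check back at Location,
        # or at the original URL if the server gave none.
        location = headers.get("Location", request_url)
        retry_after = headers.get("Retry-After", "")
        # Only the delay-in-seconds form of Retry-After is handled;
        # anything else falls back to the default delay.
        delay = int(retry_after) if retry_after.isdigit() else default_delay_s
        return NextStep("poll", urljoin(request_url, location), delay)
    # Any other status is treated as a failure in this sketch.
    return NextStep("error")
```

A real client would call this in a loop, sleeping for delay_s and then
re-requesting url until the action is "done" or "error".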
ap: Another thing on the wish list: people want to know stats per
server, uptime, hits, etc. (server stats).

gh: andreas' registry does a good job for das/1. biosapiens registry
is built on Andreas' registry. How many are up, which requests they
support, the data the server. Very nice.

ap: Gregg's coverage was good. Also gave a very good advertisement for
das/2!

gh: the das/1 to das/2 transformational proxy was quite popular.
doesn't take advantage of das/2 power, but gets people started.

Other Topics:
--------------

sc: biodas.org wiki is now officially up.

gh: mentioned to Tim Hubbard. He said, "I know. I already edited it."

sc: globalseqids page needs das2xml snippets for coordinates.

[A] Lincoln will add das2xml coordinate snippets to globalseqids page
    on wiki

sc: might also be good to have notice of the next teleconf on the
site. Maybe pointers to the notes as well.

gh: maybe have an automatic email sent out reminding folks?

sc: maybe not, if we have a list of the dates for upcoming meetings on
the site.

[A] Steve post list of dates of upcoming DAS/2 teleconferences on wiki

Next meeting in two weeks: 19 Mar 2007