Notes from the DAS/2 teleconference for the code sprint, 10 Feb 2006

$Id: das2-teleconf-2006-02-10.txt,v 1.1 2006/02/10 22:13:17 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Thomas Down, Andreas Prlic
  Sweden: Andrew Dalke
  UCLA: Allen Day

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

[note taker missed the first 5 minutes]

Topic: Properties
-----------------
gh: Properties are all tag-value
ad: yes
gh: don't think we need your binary thing.
ad: ok drop it
gh: href is needed. can always point it to a binary something out
    there. can the value just be a url?
ad: can make it relative to xml base
gh: do you need some property with tag value and href at same time?
ls: how would you interpret that? should be either value or href.
ad: there's nothing to say how to interpret the url.
gh: nice to have multiple links out to somewhere else and to have some
    indication what they are w/out traversing the link. e.g., this is
    the genbank ref, ensembl ref, protein, etc. if xid had an extra
    field with label, title e.g. that would suffice.
ad: sounds ok

[A] xids will have title + href, properties will have tag + value

Topic: Exercising the spec
---------------------------
gh: we need the reference server to actually exercise this part of the
    spec. xid. possibly other things like: target overlap, inside,
    cigar strings. encoding, decoding.
aday: oh no.
ls: line element. cigar string is something that no one has tested yet.
gh: if we don't have server doing it by next code sprint
aday: any impls out there we could use?
gh: bioperl has a gff3 parser.
aday: I wrote it, and I didn't impl cigar string parsing.
ls: there's a cigar processor in bioperl AlignIO. in theory not hard
    to do.
gh: lbl folks (Nomi et al) have a java one, too. I think.
gh: other parts of spec that aren't getting exercised? I doubt if
    anyone has used xml lang.
ad: added xml id. just there for other reasons, but not what we need
    it for.
gh: we talked about all ids being xml ids and combining xml id and xml
    base, can't remember why we stopped discussing.
ad: don't think we need to. style sheet has uses for this maybe.
ad: has anyone generated doc href yet?
td: can add this stuff easily now.
gh: for testing purposes, just throw a doc href everywhere it's
    allowed.
ad: are servers supporting retrieval of seq data?
aday: yes
ad: support for alt feature formats?
aday: can do old compact formats, not sure about coverage.
gh: yes, alt feat formats are handled, but server isn't up and running
    yet. igb das/2 client can handle it already.
ad: retrieval of assembly?
aday: no assembly data
ad: i don't touch assembly
gh: may be for next code sprint.

Topic: range based query
------------------------
gh: thomas and i don't like optional mins and maxes.
ls: fine as long as you can always determine the size of the
    reference. provide beginning and end.
gh: exception: if you want the whole sequence, can you just not supply
    range?
ad: yes
gh: :1 and :-1 how to interpret nothing for strand on end and 0 for
    strand at end?
ls: features that have strand +1, -1,
    features that have no strand or on both strands (0)
    features that may have a strand but you don't know (empty)
gh: when you put it in the query there's a difference between i don't
    know and i will accept anything.
    use case: transfrags from transcriptome project.
    unknown strand, but I know it *is* one or the other strand.
ls: how about this arrangement:
    empty = i don't care
    0 = has strand but i don't know
    1 = forward strand
    -1 = reverse strand
    2 = both strands
ad: could be organized by track (everything in a track has same
    strand).
gh: don't think it is good to structure a query so it's required that
    you do have strand. you could have diff strand designations on
    same track.
ls: you want to be able to distinguish things that are on both
    strands, things that are on either strand, but you don't know
    which.
gh: biggest concern: given a range based query to server 1000-2000
    means everything that overlaps, any strandedness within this range.
ad: should support stranded searches. client can filter out, as
    opposed to doing a strand request against seq to get the rev comp.
    client should be able to do this.
gh: in range attrib of features, you can add colon to indicate
    strandedness.
ad: yes
gh: if no :strand does this mean unknown or don't care?
ls: defaults to *, anything. you get fwd, rev, don't know, don't care.
gh: require things on fwd strand to be :1, not make it a default.
ad: ok. if not there, means ambiguous, unknown, or not appropriate.
    see email i sent. if you get rid of search for strand in region
    query, most of this issue goes away.
gh: don't think people would use this often (stranded query)
ad: you can make two queries to server instead of one.
gh: this is a resolution for all range-related issues.
ad: check my email to make sure it covers this.

[A] everyone review andrew's email re: range queries and strand issues.

gh: also or-ing of diff range-based queries is not useful for me. I
    mainly need intersects of overlaps and inside. or-ing is
    equivalent to using multiple queries.
td: why do you need and-ing of overlaps and inside?
gh: optimization on client side. keeps track of what it has received.
    wants to minimize re-fetching.
td: can you just use overlap and not overlap?
gh: that may be equivalent, but the way I do it, you can guarantee you
    never get the same feat twice with that combo. will require
    and-ing of two range-based queries.
ad: modifying query lang, or-ing together two. include first range and
    include second range should use multiple query keys because of the
    comma. you will have to escape any comma if it's inside of query
    string.
gh: don't like the implicit 'and' if different but 'or' if keys the
    same. it depends on the query.
ad: now all queries are and-ed, but commas mean multiple.
ls: comma syntax seems natural. the occasional query that had to have
    an escaped comma didn't cause any bother.
td: this was as it is in das/1. exons and repeat. type=exon,
    type=repeat. so the suggestion is to use the das/1 behavior.
ad: three independent segments
gh: types as well. can have any number of types= and segment= all
    or-ed together. I still need and-ing of overlaps and inside.
td: different keys are and-ed, same keys are or-ed.
ls: hoisted by my own petard here. works for me.
gh: allen?
aday: what's changed?
ls: the whole query language has changed in a fundamental way.
aday: dealing with multiple attributes with same name. fine.
gh: will server accept full urls for types?
aday: not now but will impl this.
gh: all types should be full uri's now. my client can't deal but will
    soon.

Topic: status reports
---------------------
gh: state what you hoped to accomplish and what you actually
    accomplished.
gh: hoped to get igb das client up to date with spec, working with one
    das2 server, and get affy das2 server up and going. affy das2
    server will take longer. maybe by next code sprint. igb is now
    using latest das2 spec, calling allen's server, and using registry
    as well. happy with results. not everything done, but some
    unexpected things (registry).
    wrote up progress report for grant: going out 3pm today (we got
    another day) a 2pg summary. will send out to everyone later.
    todo: get das2 server up.
    client: deal with full uri issue.
    this is a basic functionality of the client. smart handling of
    uris.
ee: igb client. big thing is to make all data sources behave in a
    similar way: das1/das2, quick load, separate files, regardless of
    the data format. want to make it all seamless. going well.
sc: streamlined pipeline for populating das server with affy exon
    array data. didn't get to pipeline for external data (UCSC
    tracks), but have basic framework in place.
ad: decided to do more writeback at next sprint. when is next sprint?
gh: march 13-17. lincoln will be in UK and can participate from there.
ad: I'm in the states next week. will come to emeryville for next
    sprint.

[A] next code sprint is 13-17 March. Mark your calendars.

ad: hoped to work on spec, resolve detailed questions, make sure it
    works with people's needs. will work on incorporating latest ideas
    into spec.
    validator: have one but is not fit for public consumption. not at
    where it was last summer on the previous version of spec.
ap: das interface for registry, can serve das1 and das2 sources w/ new
    source command. java client - not yet.
    registry: todo UI so users can upload to das registry.
td: hoping to write server. got something up for feat, types,
    segments, need to run through andrew's validator. hoped to work on
    writeback, but didn't happen (but good discussion on it). want to
    get more data included, ensembl database.
    roy has been working on zmap client, coming along fine.
aday: primary goals: to support new version of spec -- not fully done
      uri problem in query parsing.
      apache config integration is done.
      installation and rpm for server - done for FC3 i386, available
      in the next couple of days (brian o'connor).
      general documentation improvement in code for server - not done.
      Next step: post, put, delete, writeback framework (originally
      planned this but may need to rethink), impl transaction logs
      (maybe in flux). adding more unit tests.
ad: writeback spec won't happen for at least 2 weeks.
    need to write up what we've done on current spec first.
ls: will be available from 14th on. at ensembl meeting up to the 13th.
gh: allen come to emeryville?
aday: maybe.
gh: will have to explore how to fund hosting folks here for next
    codesprint.
gh: speaking for nomi - she had apollo working for parsing features
    and displaying them. some issues with higher level integration
    into apollo. making good progress.

gh: time to wrap it up. thanks for your hard work.
[applause]

[A] next teleconf will be on 20 Feb, 9:30 PST 5:30 UK (regular time)
    we're skipping 13 feb (next monday) given all our time this week.
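
Illustration: combining query keys
----------------------------------
A rough sketch of the query rule discussed under "range based query":
repeated keys (type=exon, type=repeat) are or-ed, as in das/1, while
different keys (overlaps, inside) are and-ed. This is only an
illustration under stated assumptions, not the spec and not any
server's code. The feature dicts, the start:end range syntax, the
half-open interval convention, the '&' separator, and the names
matches, overlaps, inside and parse_range are all hypothetical.

```python
from urllib.parse import parse_qs


def parse_range(text):
    """Parse 'start:end' into an (int, int) pair (assumed syntax)."""
    start, end = text.split(":")
    return int(start), int(end)


def overlaps(feat, rng):
    """True if the feature shares at least one base with the range.

    Assumes half-open coordinates: [start, end).
    """
    return feat["start"] < rng[1] and feat["end"] > rng[0]


def inside(feat, rng):
    """True if the feature lies entirely within the range."""
    return feat["start"] >= rng[0] and feat["end"] <= rng[1]


# One predicate per query key (hypothetical key names for this sketch).
FILTERS = {
    "type": lambda feat, value: feat["type"] == value,
    "overlaps": lambda feat, value: overlaps(feat, parse_range(value)),
    "inside": lambda feat, value: inside(feat, parse_range(value)),
}


def matches(feat, query_string):
    """Different keys are and-ed; repeated values of one key are or-ed."""
    query = parse_qs(query_string)  # {'type': ['exon', 'repeat'], ...}
    return all(
        any(FILTERS[key](feat, value) for value in values)
        for key, values in query.items()
    )


# Made-up example data, for illustration only.
features = [
    {"type": "exon", "start": 1200, "end": 1500},
    {"type": "repeat", "start": 900, "end": 1100},
    {"type": "intron", "start": 1500, "end": 1800},
]
query = "type=exon&type=repeat&overlaps=1000:2000&inside=1000:2000"
selected = [f for f in features if matches(f, query)]
```

With this made-up data only the exon is selected: the repeat overlaps
1000:2000 but is not inside it, and the intron fails the type test.
An unknown query key raises KeyError in this sketch; a real server
would need to decide how to report that.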