Notes from DAS/2 code sprint #3, day five, 18 Aug 2006

$Id: das2-teleconf-2006-08-18.txt,v 1.2 2006/08/18 19:14:11 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Dalke Scientific: Andrew Dalke
  UCLA: Allen Day, Brian O'Connor

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.

Topic: Spec concerns
---------------------
ad: segments doc (not 'segment') top-level element is missing three fields. one is uri (I added). second is reference (a collection corresponding to a dataset). seemed less useful since it's already mentioned in vsource document. I added id to schema, not spec yet. last thing: missing a doc_href. for each segment ok, but we can't say, here's doc for human.
gh: optional?
ad: yes.
gh: if optional doesn't change server impl. uri for segments is specified in segment capability.
gh: my only objection is spec churn.

gh: question about writeback spec: what you're supposed to do if you remove an exon from a txt, you are supposed to have a delete element in post that deletes that id.
ad: yes
gh: if you just have that delete, does that force parent to remove its child, or do you also have to have the parent in there?
ad: everything in that relation has to be sent.
gh: in that example, if you have a delete for that exon, you have to return the rooted hierarchy as well with txt not having that part element.
ad: yes
gh: what if you create a curation with three exons in it, you then decide to delete the middle exon. server gets post with same annotation, but exon is missing and parent is not pointing to it as a part. is that legal?
ad: nothing that says delete?
gh: no
ad: i think it should be illegal. if you have three generations, grandparent and grandchild with no intermediate: also illegal.
gh: server will have to catch these things.
ad: easy. just check whether all ids involved are representing something on the server, if so, you delete old, update new.
gh: allen, will your server catch this?
aday: if you modify something, it already has to check before it gets deleted, i can just reject it. now I say, you modified it, here are the things that are modified by your request.
gh: [drawing]
      d:a-----b-----c  ->  d:a----------c , b
    read this as: transcript d has exons a,b,c
    three exons attached to a txt, never indicated that anything was deleted, I just re-wrote the feature as a--------c
gh: this should throw an error, since you didn't explicitly delete b.
aday: what's wrong with leaving d dangling? ok to not mention the missing exon
ad: one is to keep it there, one is to delete automatically,
gh: if keep does it have pointer to parent? that's enough to tell db it's not connected?
aday: yes, it becomes an orphan. you should get back a message, "hey you affected all of these features." so client can see what your modification affected. you'll know from response what was affected by deletion you performed.
gh: if you now submit a new transcript named e containing a and c:
      e:a-----------c
ad: so annotation 'd' will come back as saying, "was deleted"
aday: my response tells you everything that needs to be updated. you might see things that need to be cleaned up that weren't expected.
ad: python maxim: when in doubt, refuse temptation to guess. you're guessing it makes sense to leave orphans around
ad: if it's ambiguous, should be not supported.
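
Sketch of the server-side check discussed above. This is an illustration only, not part of the spec or of either server; the function name and the dict/set data layout are hypothetical. It flags a part that disappears from a rewritten parent without being mentioned anywhere in the submitted document, which is the d:a--b--c -> d:a----c case in gh's drawing:

```python
def check_writeback(stored_parts, writes, deletes):
    """Return ids of parts silently dropped by a writeback document.

    stored_parts: dict of feature id -> set of part ids currently on the server
    writes:       dict of feature id -> set of part ids in the submitted document
    deletes:      set of ids the document explicitly deletes
    """
    dropped = set()
    for feature_id, new_parts in writes.items():
        old_parts = stored_parts.get(feature_id, set())
        for part_id in old_parts - set(new_parts):
            # A removed part is acceptable only if the document mentions it,
            # either as an explicit delete or as a rewritten feature.
            if part_id not in deletes and part_id not in writes:
                dropped.add(part_id)
    return dropped


stored = {"d": {"a", "b", "c"}, "a": set(), "b": set(), "c": set()}

# gh's drawing: d is rewritten as a----c and b is never mentioned.
silent = check_writeback(stored, {"d": {"a", "c"}}, deletes=set())

# Same rewrite, but with an explicit delete of b.
explicit = check_writeback(stored, {"d": {"a", "c"}}, deletes={"b"})
```

Under these assumptions `silent` contains b, so that post would be rejected, while `explicit` is empty and the post would be accepted.
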
gh: from allen's side, it might be hard to catch and call error.
aday: no I can catch. i track all changes caused by client request. I have to track all changes made, see if it was present in the submitted document, if not, an error. just another level of tracking. can do.
gh: if this is what you wanted to do, client would submit: write b (with no txt as parent), write d with txt as parent, and no delete, to get this
      d:a-----b-------c  ->  a-----------c + b
gh: if you really want to get rid of children, you need to specify both parent and child.

gh: approach on client. I do on client. curational model is that you are never really editing locations or parent-child relationships, you are just making successors, so I keep this version chain. not deleting old ones on server (that is the plan though).
aday: every edit does a delete and create on server. that's very transactional. can you keep track of it in memory?
gh: yes. user has to request writeback. any number of edits between one and the next. once you've committed you can rollback on client.
aday: everything is pruned off in client?
gh: no you need redo.
aday: redo is not considered saved unless you save again.
gh: if you re-edit after an undo, you can't redo. no branching.
aday: just keep track of recent save point.
gh: todo: keep modification dates. so if there were no edits since the last save then there's no need to write back to the db again.

gh: if you want something deleted, you must explicitly do it.
    if you want to delete it do this:
      * delete b
      * write d:a------------c
    if you want to orphan it do this:
      * write b with no parent
      * write d:a------------c

Topic: semantics of insides and overlaps as they relate to parent-child
-------------------------------------------------------------------------
gh: this is a continuation from yesterday's discussion we had offline. bring up spec, feature filters. see part that says, "any part of a complex feature that is one with parents... then all parts are returned".
    that's wrong. you do an insides query, you only get back things that are inside. two exons in a txt, one is inside, one is not inside.
ad:
gh: if it has no location, it's never going to be returned by a range query.
ad: by type q
gh: if multiple locations on the feature, if one of those locations is inside the range query it passes.
sc:
gh: not the same as multiple locations -- aligns to multiple places in the genome. top level parent of a feat hierarchy must have a location that passes one of the location in the range query. one of the locations has to pass the range filter. and it is at the top level of the hierarchy.
aday: think of this: locations are cols in matrix, filters are rows. in order for column to qualify, the entire row must be true.
ad: different people may have modeled it differently. may get only part of it back.
gh: if two servers model the same data differently you may get different answers back. that's the way it goes.
ad: annotation contains features. returns all annotations that match the query.
gh: don't add notion of some other object that is sort of a feature, but is really a group of feats.
aday: i call it a feature group. range filters operate on the group.
gh: we don't need to have a special designation. it's just a feature with no parents. what you're calling a feature group.
aday: all things under the parentless feature is the group.
ad: yes
aday: not identical to the root, it's the root plus all attached things.
gh: to clarify things in the spec, maybe call it annotation/feature group, maybe ok.
ad: all things connected by a parent-part relationship. return the entire feature group.
gh: change: root of the feature hierarchy matches (range filters). the root of the group has to pass all the feature filters in the range query.
ad: you want the root to be guaranteed to have locations if any sub feats have location. featureless roots.
aday: no way to retrieve based on location. weird. parent with no location.
gh: not weird.
    bounds of gene are fuzzy. they'll spell out bounds of exon but not the gene. we can say the highest level with location. we can say that if children have location, then parent has.
ad: put all children ranges in the root.
gh: ok. children should never have locations outside their parent.
ad: old conversation: is this single or multiple rooted. single is easier to understand. but there is a use case for multiple locations. now we say the single root must be union of all it contains.
gh: inclusive, not necessarily union.
ad: software check will be needed
gh: you don't want someone submitting exons that are outside bounds of a transcript. dangerous to have children outside location of parent.
aday: true for bioperl
ad: for only root, or intermediate?
aday: every intermediate
gh: only acceptable if you want to punt on location of upper level thing whose location isn't well understood (gene).
aday: feature 100-200, locationless thing attached to it..
gh: if you have locationless, they need to be locationless up to the root. maybe we should not allow that for now. if you have a locationless feature, it's locationless all the way down and all the way up. meets requirement for gene das.
ad: don't understand why this restriction needs to be there.
ee: we want it.
gh: you cannot have children outside bounds of their parents and their parents recursively. to me, that needs to happen. question: can you have children with location that have parents that are locationless?
ad: why parents that don't overlap child location?
gh: throws off our range filter mechanism. no easy answers to
ad: if any children meet criteria, then they all get returned.
gh: then you get back features that don't meet
sc: lets say you're editing an exon...
gh: forget editing. just basic reading. there was ambiguousness in old spec here that I want to kill. I've seen desire to have locationless thing above, but never the reverse: definitive location above but locationless below.
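
Sketch of the "software check" ad mentions. This is illustrative only: the Feature class, the single (segment, start, end) location per feature, and closed coordinates are assumptions, not the DAS/2 data model. It walks a feature group and reports children that extend outside a parent that has a location. Locationless parents are skipped, since the rule for those was still open at this point in the discussion:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Location = Tuple[str, int, int]  # (segment id, start, end), hypothetical layout


@dataclass
class Feature:
    id: str
    location: Optional[Location] = None
    parts: List["Feature"] = field(default_factory=list)


def out_of_bounds(parent: Feature) -> List[str]:
    """Return ids of descendants whose location falls outside their parent's."""
    bad = []
    for child in parent.parts:
        if parent.location is not None and child.location is not None:
            p_seg, p_start, p_end = parent.location
            c_seg, c_start, c_end = child.location
            if c_seg != p_seg or c_start < p_start or c_end > p_end:
                bad.append(child.id)
        # Recurse so every intermediate level is checked, not only the root.
        bad.extend(out_of_bounds(child))
    return bad


txt = Feature("txt", ("chr1", 100, 500), [
    Feature("exon1", ("chr1", 100, 200)),
    Feature("exon2", ("chr1", 450, 600)),  # runs past the transcript end
])
gene = Feature("gene", None, [txt])  # locationless root, as in the gene example
violations = out_of_bounds(gene)
```

In this example only exon2 is reported, because it ends at 600 while its transcript ends at 500.
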
gh: we hashed this out in last code sprint. let's complete it!
ad: if any feature matches, then all features match. includes the situation if parent has no location, but child matches, that implicitly matches. my proposal was to return all things in feat group if any one of the features match. same as assuming all parents have location of their children. this search will get back the parent. returning the feat group is a way to say all parents implicitly include locations of their children.
aday: not all parents, multiple roots.
gh: they all must go to a single root.
aday: if any location of the root of group matches, then the whole group matches. boils down to: are descendant feats allowed to be outside the bounds of parent.
gh: [insides query example on board]
aday: the query is on the feature group root features
ad: I don't remember range queries being allowed only on root elements. two exons that are very far apart. query hits in between them.
gh: parent meets overlap, return them all.
ad: parent has only two small locations, not one large location.
gh: modeled as multiple small locations, not child features.
sc: so it doesn't include the intervening sequence.
aday:
gh: canonical example of mult location stuff: 25mer probe that hits 4 diff locations in genome. multiple alignments, where none of the alignments align to the whole thing.
aday: two probe pair, only some of the children are in the region.
ad: example: protein structure catalytic group, three residues on different chains.
gh: mult locations of probe set, one location falls inside query, return the probe set
    why can the rule be
ad: besides range searches: when you find that a feature matches title or curator name, do you return back just the matching feats or the group?
gh: don't see why we can't add more rules.
aday: name search and exon is named, return its parents.
ad: so for any searches besides ranges, it returns all features in the feature group.
gh: different behavior for range queries. they already have different behavior than other queries.
ad: my criteria, if any feature matches, then all features in group are returned, except that in range query, only those that match the range query are returned.
gh: don't see why you have a problem with that requirement.
ad: do the search on all features, root is not special, if any feat match, get all features in group, if a range filter, then get features that pass. if a filter, then full hierarchies are not returned, only those that pass filter.
gh: don't like. do an overlaps, two exons are in, two are not. you send back only the txt and the two that are, you are depriving user of data, there's no way of knowing that it's missing, how can they get at it?
ad: i'm confused. in system you want, you return back everything?
gh: yes. everything that has a root with one location that matches all range filters. if the root of the feat group meets range criteria for at least one of its locations.
aday: and any name filter
ad: root has no location info, but one of exons overlap, whole thing returned.
ee: distinction between olap and includes, different if parent lacks location info.
aday: gregg needs for range optimizations. name may match, but feat location may not, but root of group may
ad: specified in root node. not convinced we need locationless features that aren't descented.
gh: we're not talking about locationless nodes now. parent has location, that's all you need to search on.
ad: use pieces, or whole range?
gh: the whole range, not piece by piece.
ad: why
aday: there can be things
gh: I argued against having mult locations, caused problems in bioperl, children with locations, and mult locatable features. so I didn't want to have mult locations, but got voted down. only thing it makes sense: when you want one feature to represent an alignment to things on genome. OK to represent with mult locs, but better to not.
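
Sketch of the rule gh states above: a feature group is returned whole when its root has at least one location that passes every range filter. This is an illustration only; the function names, the (segment, start, end) tuples, and the closed-interval convention are assumptions rather than anything taken from the spec:

```python
def overlaps(loc, query):
    """True if loc and query share at least one position (closed intervals)."""
    return loc[0] == query[0] and loc[1] <= query[2] and query[1] <= loc[2]


def inside(loc, query):
    """True if loc lies entirely within query."""
    return loc[0] == query[0] and query[1] <= loc[1] and loc[2] <= query[2]


def group_matches(root_locations, range_filters):
    """gh's rule: some single root location must pass every range filter."""
    return any(
        all(test(loc, query) for test, query in range_filters)
        for loc in root_locations
    )


# Transcript root spanning two widely spaced exons on a hypothetical segment.
root = [("chr1", 100, 900)]
query = ("chr1", 400, 600)  # falls between the exons

hit_overlaps = group_matches(root, [(overlaps, query)])
hit_inside = group_matches(root, [(inside, query)])
```

With a root modeled as one wide location, the overlaps query returns the group and the inside query does not. Modeling the same root as several small locations changes the answer, which is the point ad raises about the two widely spaced exons.
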
aday: offsets relative to the root.
gh: no. will confuse people a lot.
ad: any annotations that will go on mult segments in dna world.
aday: blast results, very common.
gh: every blast hit is a separate feature, avoids the problem. I use them in transforms, so I can say this feature maps to different genome assemblies. fine in a data model. but causes problems when it's in a spec, hard to describe when you should use one vs the other.
aday: what rules do you use internally?
gh: i know it when i see it.
ee: in genometry, these are equivalent regions on these genomes.
gh: right. the length of the range can be identical, but seq is different. genometry doesn't care about sequence identity. "this part of hg17 is equivalent to hg18". but this is getting tangential.
ad: question is what do you do for things that are mult segments. example where parent is wider than children
aday: you don't know where 3' end is
gh: haplotype block for a set of snps, you know it extends to the next block, so the block is bigger than the bounds of the snps used to construct it.
ad: curation tool, marked off three regions, one thing can extend over a broader range. tool automatically inserts. allows curator to stretch it out as need be.
sc: this is what fuzzy locations are used for at genbank.
gh: we don't have fuzzy locs. no needs for these at present.
ad: implicitly the parent is the min-max of its children. a db could optimize that way. curation tool gets data back from server. does curation tool know to change the parent range or not?
gh: it better
ad: if user changes the min/max exon bounds, will tool know to adjust parent transcript? the txt could be left extending past the current location of these.
gh: up to the client app to figure it out. a smart gui should say, you cannot extend the txt past the exons you have, but for a genotype block, it might allow such a change.
    in theory, your client would understand which elements in the sequence ontology you could do it for and which you could not.
ee: this is outside the spec. should say it's possible for parent to extend beyond bounds of children, and not possible for children to be outside of parent.
ad: which of these can be on multiple segments?
gh: if we're going to have mult locs, then everything can.
ee: if child can, then parent can.
aday: an argument for doing relative offsets I suggested. only allow parents to have relative offsets to children. no duplication of data.
gh: duplication of data is a red herring.
ad: more error prone to checking a string to see if it matches. hard to extend the parent to be a bit wider than children,
gh: range queries to apply to root of feature hierarchies, and at least one of the children to pass all range filters?
ad: why is this diff than requirement I gave?
gh: yours gives back partial feature groups. it's allowing filters to apply to any of the children, not just the root.
ad: only difference is if you have two widely spaced features, everything has an implicit convex hull. if your query hits the middle.
gh: [whiteboard drawing]
      +-----------+                          exon a in transcript c
                        +----------------+   exon b in transcript c
             |______________|
                    |
               inside query
ee: for overlaps you would include the parent, for inside query you would not.
ad: how will software guarantee this? min-max or just union of the children.
ee: min-max of all children.
ad: should be in the spec.
gh: allen: how do you do min and max of mRNA, implicit or explicit? for me, it's explicit.
aday: explicit.
ee: using gff1 where it's implicit, but our parsers force it to be explicit in our data model.
aday: in gff3 it can be implicit (using '.').
gh: gff, bed, psl, xml formats, raw blast output -- all explicit.
ad: does server verify that it meets this criteria. each feature coming in, if it has parent it can only have one segment id.
    for each segment in the parent, find each one that matches the range in the child, if any child has segment x, only one location on segment x
aday: can have mult locs on the same segment.
ad: why not model as one range?
aday: need to create the parent in two locations.
gh: as long as one loc of parent contains the loc of the child, it's ok.
ad: gregg saying that
aday: location only includes one instance of the children. two locations for exon a, b, c. first set of locations for these exons is different than the second set of locations for these exons. a logical grouping not simple collection of all parts. mult locations on the same segment is harder. check location of parents, verify that no two childs.
ad: spec now allows for dumb servers. by putting in these extra requirements, it doesn't make server easier, complicates life on clients.
gh: it makes clients' life simpler.
aday: location as two additional attribs: group, rank
      group - groups things together that are in the same segment
      rank - prioritized location
    conceptual grouping of things, to know which child locs match up with which parent locations, because locations can overlap.
gh: (aside) can you make them multiple feats rather than diff locations? when it comes out as das2xml.
ad: need to mention to lincoln and berkeley folks. specify what the algorithm is to

Topic: status reports
---------------------
gh: doing writeback to allen's writeback server. create new annot, edit location, add, remove, extend exons, can write them all back. keeps creating new features in the db instead of editing the ones that are there. plan: delete the old annot in the same doc that edits the new one.
aday: so you're leaving lots of old annots around.
aday: finishing touches. old uri - new uri mapping, so gregg knows. fixing bugs on writeback server. working on new das front end that takes incoming request, breaks down with modulus operation with configurable block size, filters the results, this is for caching. working well.
    can convert the typical 40-50s response times down to 7s on a single megabase region. takes a while to get cache populated. todo: automatically populate cache. add code to know when a block became stale, so server can flush cache to get new stuff.
bo: refactor domain factor response. found lots of hardcoded logic. went back to refactor. one object that populates hash structure of objects, handles. support for wiki stuff from lincoln, unique coord identifiers. todo: go ahead and update test suite now out of date. coord filter needs to be added in.
gh: server now supports full type uris and segment uris?
bo: yes, in cvs. todo make rpm package and install on production server.
gh: then public release of igb can start using full type uri.
bo: can communicate with you on it.
gh: congrats -- end of code sprint. good to get the writeback stuff going. spec changes are little, but feels very nailed down.
ad: finished off action items from yesterday. timestamp. reference server implementation.
ee: still working on gff3 parser. progress nothing to report.
sc: updated affy probe set alignments for drosophila arrays to be based on dm2 on our das/1 server (Ann's request). Restarted server. Worked on updating the affy das server info page in progress. todo: update the das2_server with latest improvements committed by gregg, then test the new and improved bp2 format for exon data. will need to deal with array prefix used by netaffx ('1:') rather than as used in CHP files ('HuEx:').

Post-teleconference Discussion
-------------------------------
gh: would you be willing to give up multiple locations in the spec?
aday: would you be willing to give up bidirectional parent-child pointers?
gh: let me think about it...
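
Sketch of the block-based caching aday describes in his status report: an incoming range request is broken into fixed-size blocks with a modulus operation, each block is cached, and the combined result is filtered back down to the requested range. This is an illustration only, not aday's implementation; the function names, the (id, start, end) feature tuples, the half-open coordinates, and the in-memory dict cache are all assumptions. Flushing stale blocks, listed as a todo, is not shown:

```python
def block_ranges(start, end, block_size):
    """Split [start, end) into the fixed-size blocks that cover it."""
    first = start - (start % block_size)  # round down to a block boundary
    return [(b, b + block_size) for b in range(first, end, block_size)]


def fetch(start, end, block_size, load_block, cache):
    """Serve a range from cached blocks, then filter to the requested range."""
    found = {}
    for block in block_ranges(start, end, block_size):
        if block not in cache:
            cache[block] = load_block(*block)  # only uncached blocks hit the backend
        for feat_id, f_start, f_end in cache[block]:
            # Blocks are wider than the request, so keep only overlapping
            # features; keying by id drops duplicates that span two blocks.
            if f_start < end and start < f_end:
                found[feat_id] = (feat_id, f_start, f_end)
    return list(found.values())


# Hypothetical backend standing in for the real feature database.
DATA = [("f1", 50, 150), ("f2", 1200, 1300), ("f3", 2500, 2600)]
calls = []


def load_block(b_start, b_end):
    calls.append((b_start, b_end))
    return [f for f in DATA if f[1] < b_end and b_start < f[2]]


cache = {}
first = fetch(1100, 2100, 1000, load_block, cache)
second = fetch(1250, 1900, 1000, load_block, cache)  # served from cache
```

The second request falls inside a block that is already cached, so the backend is not called again. This reuse is where the speedup would come from once the cache is populated.
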