Combined species testing

I made some artificial test files based on real data. Some files are thoroughly mixed together, others are straight cuts.
One very naive test file, just for the sake of interest, combined two species of bacteria, joined at around the 50% mark of one of their contigs. The GC Content chart then looked something like this:


Kinda out there!
Was this what I expected? Well, I did expect a noticeable shift like that, but the weird thing is how the standard deviation behaves. This may just be my lack of understanding, but looking at that chart you can SEE where the huge difference lies. Since the mean sits right across the middle, though, the standard deviation threshold only flags regions that are a fair bit above or below the mean.
This tells me that it’d be good to have another measure that looks for dramatic shifts in GC Content percentage like this. I’m not sure I have the time to do this right now though, so I’m jotting it down here as a note of something to do if I get more time after the bulk of my documentation, or as a future task.
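
To pin down what I mean, here is a rough sketch in Java. It is not code from the application: the class name, the method names, the non-overlapping windows and the 30-point jump threshold are all made up for illustration. It computes GC content per window, applies a mean plus/minus standard deviation threshold, and separately flags any window whose GC content jumps sharply compared with the window before it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names and thresholds are assumptions, not the application's code.
final class GcShiftSketch {

    /** GC percentage (0-100) of each non-overlapping window; a shorter final window is kept. */
    static double[] gcPercentPerWindow(String sequence, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        int windowCount = (sequence.length() + windowSize - 1) / windowSize;
        double[] percentages = new double[windowCount];
        for (int w = 0; w < windowCount; w++) {
            int from = w * windowSize;
            int to = Math.min(from + windowSize, sequence.length());
            int gc = 0;
            for (int i = from; i < to; i++) {
                char base = Character.toUpperCase(sequence.charAt(i));
                if (base == 'G' || base == 'C') {
                    gc++;
                }
            }
            percentages[w] = 100.0 * gc / (to - from);
        }
        return percentages;
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation of the window percentages. */
    static double standardDeviation(double[] values) {
        double average = mean(values);
        double squared = 0;
        for (double v : values) {
            squared += (v - average) * (v - average);
        }
        return Math.sqrt(squared / values.length);
    }

    /** Indexes of windows more than `factor` standard deviations away from the mean. */
    static List<Integer> outsideThreshold(double[] values, double factor) {
        double average = mean(values);
        double limit = factor * standardDeviation(values);
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (Math.abs(values[i] - average) > limit) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    /** Indexes of windows whose GC% differs from the previous window by more than `minJump` points. */
    static List<Integer> shifts(double[] values, double minJump) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 1; i < values.length; i++) {
            if (Math.abs(values[i] - values[i - 1]) > minJump) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

With the toy sequence "ATATATATGCGCGCGC" and a window size of 4, the windows come out as 0%, 0%, 100%, 100%. The mean is 50 and the standard deviation is also 50, so `outsideThreshold(gc, 1.0)` flags nothing even though the sequence is obviously two different things glued together, while `shifts(gc, 30.0)` flags window 2, right where the join is.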

Processing a huge file

I went to NCBI and found a large file. A file I thought my application might struggle with. It took time, but it processed! I imagine if this application were to actually be used, it should be hosted somewhere with a large amount of memory to be able to get through these sequences. Anyway, look, here’s some screenshots of what happened.

The contig itself is from:

In particular, Campylobacter coli strain CVM N287 N287_contig_8, which is 207270 characters long.
I’m still working on ‘Superframe’, but you can see that most of the GC Content percentage regions outside the mean threshold fall within the ORF Locations, which is exactly what I hoped to see.

Screenshots: huge file processed, full sequence (1 to 4)

The next step is to find a few more contigs, smaller in size for time’s sake, and run them individually to check that their GC Content % regions match their ORF Locations. Then I’ll start mixing them together at points where I know there should be differences in GC Content %, and see if I can spot those differences after processing the mixed contig data.
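
Building a mixed test contig could be as simple as the sketch below. This is an assumption about how the splice might be done (keep the start of one contig up to a cut point, then append the whole of another), and `ContigMixer`, `cutIndex` and `splice` are hypothetical names rather than anything in the application.

```java
// Illustrative only: one possible way to build a mixed test contig.
final class ContigMixer {

    /** Index in the contig where the cut falls for a fraction between 0.0 and 1.0. */
    static int cutIndex(String contig, double fraction) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0.0 and 1.0");
        }
        return (int) Math.round(contig.length() * fraction);
    }

    /** The start of `first` up to the cut, followed by all of `second`. */
    static String splice(String first, String second, double fraction) {
        return first.substring(0, cutIndex(first, fraction)) + second;
    }
}
```

The useful part is that `cutIndex` tells me exactly where the join is, so I know where a GC Content shift should show up in the processed output.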

Things, Stuff, Superframe

I started my final tech sprint last week, working towards something useful: finally finishing up the ORF Location viewer, with a lot of enhancements to where the parameters live.
It took a lot of refactoring and more time than I expected to get that sorted, and I made various changes to design and code structure.

Screenshots: pushing forward (1 to 4)

I’ve gotten started on the superframe now though. Don’t ask why it’s called that, it’s a WIP name, promise. In short though, it’s really just a very simple canvas displaying the overlaps/standout sections of the GC Content and ORF Locations, so a user can visually see where there might be problem areas or regular GC Content changes in their selected window size. Whether this is useful or not is another question..
Tomorrow I’ll be talking to Amanda about it, and showing her what I’ve done, as this may be the last I touch the code once the Superframe is modified a bit more.

WIP Superframe - Very WIP...

There are more ‘tech’ things to do (refactoring always, JavaDoc, double check test coverage, usability tests, additional test files), but as far as new functionality, there might not be any more. I would like to implement a quick k-mer frequency analysis section, but this depends on my feedback from Amanda tomorrow. If she says it’s okay, I may leave it until after my documentation is finished and I know if I have time. If she says I need to do much more, I’ll bring k-mer frequency analysis forward into this Sprint starting tomorrow, along with starting the documentation in the final push for hand-in.
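
For reference, the core of k-mer frequency analysis is just counting every overlapping substring of length k. A minimal sketch, assuming plain overlapping counts on a single strand; the names are hypothetical and nothing like this exists in the application yet:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: counts overlapping k-mers on the given strand.
final class KmerCounter {

    /** Maps each k-mer to the number of times it occurs, in alphabetical order. */
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String upper = sequence.toUpperCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + k <= upper.length(); i++) {
            // Add 1 to the existing count, or start at 1 for a new k-mer.
            counts.merge(upper.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, `countKmers("ATATA", 2)` gives AT=2 and TA=2.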

Contig List (Thymeleaf + code restructure)

Since I want to allow a user to look at multiple contigs at once, I’ve added a page that shows the full list of contigs within their submission. From there they can inspect each individual contig and return to the list later. The list is held in the user’s session while they are on the page.

It took some time to get Thymeleaf configured to find the data and display it, and then more so to actually pick up the data from the particular contig I wanted to inspect. This was mainly just my lack of understanding when it comes to Thymeleaf and Spring MVC mappings, however.

Each of the pages still needs cleaning up, and I want to move the parameters for inspecting each contig to the contig list page. Perhaps when a user clicks to inspect, a menu will appear asking them what parameters they want to set. The actual inspection page needs more work too, as the ORF Location view is still a bit unclean and isn’t yet lined up with GC Content in a way that makes it useful.

I’ve come to the conclusion that I set out to do one thing, got stuck on the smallest and easiest part of it, and made it into something bigger than it needed to be. I ended up missing the mark, and this tool won’t actually be useful to anybody. ;_; I’ve certainly learned a lot about new technologies, processes, and what does and doesn’t work for me, but with the deadline approaching I’m coming to terms with the fact that I missed the mark a bit.

Anyway, below can be seen the new pages, for what little excitement there is. Again, much cleaning needed!

Submitting multiple contigs

The contigs in a list

The contig selected

Viewing stuff!

I’ve been prototyping what the ORF Location viewer should look like (stand-alone, not tying into the GC Content stuff though) and working with Canvas to get this working, and allowing a user to click on sections that are ORF Locations for the info.

ORF Location view 1

The prototype is a little bit underwhelming and using dummy data, but once it’s tied together there should be a full list of the ORF Locations to scroll through beside the view, and when a user clicks an ORF Location, the content about it should be shown in more detail somewhere on the page. Right now this just comes up as an Alert, as below:

ORF Location view 2

I’m going to work on improving the display a little bit and then I’ll plug it into the actual web application. Once it’s in the application I can test it with the real data to see (hopefully) the expected results, and further work on how the results should be displayed on the page.

Overall the page layout and framework need some improvements, so I’m going to make a quick wireframe sketch once I’ve finished with the prototype to figure out how it should be displayed.

ORF Finding logic

Adding on to the previous post, the logic for finding ORF Locations in my application is (something) like this:

Get sequence ->
Break it into the first 3 frames -> 1: Sequence, 2: Sequence minus first character, 3: Sequence minus first and second character
Break it into the last 3 frames -> 4: Sequence in reverse, with each character swapped for its complementary base pair (the reverse complement), 5: as 4, minus first character, 6: as 4, minus first and second character
For each frame sequence, build a list of every Start Codon (ATG) and every Stop Codon (TAG, TGA, TAA), including the data for the start character of each codon
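
The frame and codon steps above can be sketched in Java roughly like this. It is a sketch under assumptions, not the application's code: `SixFrames` and its methods are made-up names, positions are 0-based indexes into the frame string, codons are only read in-frame (every third character from the start of the frame), and anything that isn't A, C, G or T is complemented to N.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: builds the six frames and lists in-frame codon positions.
final class SixFrames {

    static String reverseComplement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            result.append(complement(sequence.charAt(i)));
        }
        return result.toString();
    }

    private static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'G': return 'C';
            case 'C': return 'G';
            default:  return 'N'; // assumption: unknown characters become N
        }
    }

    /** Frames 1-3 come from the sequence, frames 4-6 from its reverse complement. */
    static List<String> frames(String sequence) {
        String reverse = reverseComplement(sequence);
        List<String> frames = new ArrayList<>();
        for (int offset = 0; offset < 3; offset++) {
            frames.add(dropLeading(sequence, offset));
        }
        for (int offset = 0; offset < 3; offset++) {
            frames.add(dropLeading(reverse, offset));
        }
        return frames;
    }

    private static String dropLeading(String text, int count) {
        return text.substring(Math.min(count, text.length()));
    }

    /** Start index of every in-frame codon that matches one of the given codons. */
    static List<Integer> codonStarts(String frame, String... codons) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + 3 <= frame.length(); i += 3) {
            String codon = frame.substring(i, i + 3);
            for (String wanted : codons) {
                if (wanted.equals(codon)) {
                    starts.add(i);
                    break;
                }
            }
        }
        return starts;
    }
}
```

For the toy sequence "ATGAAATAG", frame 4 is "CTATTTCAT", `codonStarts(frame1, "ATG")` gives [0] and `codonStarts(frame1, "TAG", "TGA", "TAA")` gives [6].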

For each frame, while there is still at least one Start Codon and one Stop Codon ->
For the list of Stop Codons, remove any that are before the first Start Codon in the sequence
For the list of Start Codons, look at the first one in the list, and remove any Start Codons that come between this Start Codon and the first Stop Codon in the sequence
For the list of Stop Codons, look at the next Start Codon in the list, find the first Stop Codon that comes before it, and remove any Stop Codons that come before that one in the sequence
If there is only one Start Codon left, remove all Stop Codons but the last
If there is still a Start and Stop Codon in each list, checking that the Start Codon comes before the Stop, create an OpenReadingFrameLocation object using the first of each in the list, then remove them from the lists and repeat the process, until either or both lists are empty
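
The list juggling above can also be collapsed into a single pass over the in-frame codons. To be clear, this is a simplification and not the steps exactly as written: it pairs the first Start Codon with the first Stop Codon that follows it, ignores any Start Codons in between, and doesn't handle the "only one Start Codon left" case specially. `Orf` is a hypothetical stand-in for OpenReadingFrameLocation, and I'm assuming 0-based indexes with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: single-pass pairing of Start and Stop Codons in one frame.
final class OrfScanner {

    /** Hypothetical stand-in for the application's ORF location object. */
    static final class Orf {
        final int start;      // index of the first character of the Start Codon
        final int end;        // index just past the last character of the Stop Codon
        final String content; // the characters from start to end

        Orf(int start, int end, String content) {
            this.start = start;
            this.end = end;
            this.content = content;
        }
    }

    static boolean isStop(String codon) {
        return codon.equals("TAG") || codon.equals("TGA") || codon.equals("TAA");
    }

    static List<Orf> findOrfs(String frame) {
        List<Orf> orfs = new ArrayList<>();
        int start = -1; // -1 means no open Start Codon yet
        for (int i = 0; i + 3 <= frame.length(); i += 3) {
            String codon = frame.substring(i, i + 3);
            if (start < 0 && codon.equals("ATG")) {
                start = i; // earliest Start Codon wins, so the ORF is as long as possible
            } else if (start >= 0 && isStop(codon)) {
                orfs.add(new Orf(start, i + 3, frame.substring(start, i + 3)));
                start = -1; // look for the next Start Codon after this Stop Codon
            }
        }
        return orfs;
    }
}
```

On "CCCATGAAAATGCCCTGAATGTAA" this finds two ORFs: "ATGAAAATGCCCTGA" at 3 to 18 (the second ATG is swallowed by the first) and "ATGTAA" at 18 to 24.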

Through this we should end up with the longest possible ORF Locations from the lists, so you can imagine the process might look something (naively) like this, where ‘-’ is something we don’t care about in this instance, when only looking at Start/Stop Codons:
Result ORF Location: “ATG------------------TGA”, Start index: 6, End index: 30

This is just me doing an almost napkin workout of what the logic looks like and would produce, not actual code output though, so take it with a grain of salt 🙂
After this, if the user has set a length threshold, the application removes any ORFs shorter than it, and then returns a list of every ORF Location found in each frame.
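
The length filter is the simplest part. A minimal sketch, assuming the threshold is a minimum number of characters and working on the ORF contents as plain strings (the class and method names are made up):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: keeps ORFs that are at least minLength characters long.
final class OrfLengthFilter {

    static List<String> atLeast(List<String> orfContents, int minLength) {
        return orfContents.stream()
                .filter(content -> content.length() >= minLength)
                .collect(Collectors.toList());
    }
}
```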

Longest ORFs

I’ve been slowly working through the code logic for finding the longest ORF Locations within the contig provided. I don’t feel it’s there yet and I’ve still to:
1. Correctly label the start/end character of ORF Locations in Frames 4, 5 and 6
2. Find a decent way to display them
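
For item 1, one way the relabelling could work is sketched below. It assumes 0-based indexes with an exclusive end, and that frames 4, 5 and 6 are the reverse complement with 0, 1 and 2 leading characters dropped. `ReverseFrameCoordinates` and `toForward` are hypothetical names, not what the application does today.

```java
// Illustrative only: maps a span found in frame 4, 5 or 6 back onto the original sequence.
final class ReverseFrameCoordinates {

    /**
     * @param sequenceLength length of the original contig
     * @param frame          4, 5 or 6
     * @param start          0-based start of the ORF within the frame string
     * @param end            0-based exclusive end of the ORF within the frame string
     * @return {forwardStart, forwardEnd} on the original sequence, end exclusive
     */
    static int[] toForward(int sequenceLength, int frame, int start, int end) {
        if (frame < 4 || frame > 6) {
            throw new IllegalArgumentException("frame must be 4, 5 or 6");
        }
        int offset = frame - 4;          // characters dropped from the reverse complement
        int reverseStart = start + offset;
        int reverseEnd = end + offset;
        // Position p on the reverse complement is position (length - 1 - p) on the original,
        // so the span flips around: its end becomes the new start.
        return new int[] { sequenceLength - reverseEnd, sequenceLength - reverseStart };
    }
}
```

As a check: the contig "GGCTATTTCATC" has the reverse complement "GATGAAATAGCC", so frame 5 is "ATGAAATAGCC" and contains the ORF "ATGAAATAG" at 0 to 9. `toForward(12, 5, 0, 9)` gives 2 to 11, and characters 2 to 11 of the original contig are "CTATTTCAT", which is that ORF read on the opposite strand.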

Right now, it finds the longest ORF Locations in each of the frames, creates an OpenReadingFrameLocation object and adds the data for the start and stop characters, what frame it is in and what the content of the ORF Location is. All the ORF Locations from each frame are put together in one OpenReadingFrameResult, ready to be used by the View at any time.
Aside from the missing bits mentioned above, I’m a little concerned about the actual ORF Locations I’m getting from the process. I’ve run the same contig through both my application and ORF Finder and compared the results. While the ORF Locations they have in common match up (characters, start/stop index), my application produces more output than ORF Finder. I’m not sure if this is due to something wrong in my application, the settings I’ve used with ORF Finder, or ORF Finder removing some ORF Locations that overlap or contain too many ‘N’ characters, etc.

Once I’ve gotten a visualisation similar to ORF Finder’s (I’ll customize more later, but the results it puts out seem a nice way of doing it so far), I can really compare results, see where my application finds ORF Locations that ORF Finder doesn’t, and try to understand why.

Overall my pace feels slow. Maybe it’s because I’m doing TDD, or because I spent a long time figuring out the logic of finding/constructing ORF Locations in my code, or because some days I’m just not working at all. Probably all of that! I do still intend on working with k-mer frequency analysis too, but at the rate I’m going I’ve got to push myself. On top of this, I need to write my documentation, and I’d like to start some of that (background reading, introduction, building my references) in the next week.
The results from my mid-project demo weren’t too bad (82%), but it highlighted my lack of domain knowledge, so I know I need to speak to Amanda, Wayne and potentially Sam some more to try and iron out what I think I know and fill in the gaps I don’t know, which I’d like to do at the end of the Easter holidays. I’m sure understanding more of this would help me make my application better. Amanda suggested having part of my system suggest GC Content window % sizes of interest to the user, but right now ‘knowing’ what is interesting to the user is a little weird when I think about the way my application is coded and my understanding of what the user is looking for. So, we’ll see.

I should blog more..