Things, Stuff, Superframe

I started my final tech sprint last week, working towards something useful: finally finishing the ORF Location viewer, with a lot of improvements to where the parameters live.
It took a lot of refactoring and more time than I expected to get that sorted, and I made various changes to design and code structure.


I’ve gotten started on the Superframe now though. Don’t ask why it’s called that, it’s a WIP name, promise. In short, it’s just a very simple canvas displaying the overlapping/standout sections of the GC Content and ORF Locations, so a user can see where there might be problem areas or regular GC Content changes at their selected window size. Whether this is useful or not is another question.
Tomorrow I’ll be talking to Amanda about it and showing her what I’ve done, as this may be the last time I touch the code once the Superframe has been modified a bit more.


WIP Superframe – Very WIP…

There are more ‘tech’ things to do (refactoring always, JavaDoc, double-checking test coverage, usability tests, additional test files), but as far as new functionality goes, there might not be any more. I would like to implement a quick k-mer frequency analysis section, but this depends on the feedback I get from Amanda tomorrow. If she says it’s okay, I may leave it until after my documentation is finished and I know whether I have time. If she says I need to do much more, I’ll bring k-mer frequency analysis forward into the Sprint starting tomorrow, along with starting the documentation in the final push for hand-in.


Contig List (Thymeleaf + code restructure)

Since I want to allow a user to look at multiple contigs at once, I’ve added a page listing all the contigs in their submission. From there they can inspect each individual contig and return to the list later. The list is held in the user’s session while they are on the page.

It took some time to get Thymeleaf configured to find the data and display it, and longer still to actually pick up the data for the particular contig I wanted to inspect. This was mainly down to my lack of understanding of Thymeleaf and Spring MVC mappings, however.

Each of the pages still needs cleaning up, and I want to move the parameters for inspecting each contig to the contigs list page. Perhaps when a user clicks to inspect, a menu will appear asking what parameters they want to set. The actual inspection page needs more work too, as the ORF Location view is still a bit unclean and isn’t yet lined up with the GC Content, so it isn’t useful yet.

I’ve come to the conclusion that I set out to do one thing, got stuck on the smallest and easiest part of it, and made it into something bigger than it needed to be. I ended up missing the mark, and this tool won’t actually be useful to anybody. ;_; I’ve certainly learned a lot about new technologies, processes, and what does and doesn’t work for me, but with the deadline approaching I’m coming to terms with the fact that I missed the mark a bit.

Anyway, the new pages can be seen below, for what little excitement there is. Again, much cleaning needed!

Submitting multiple contigs

The contigs in a list

The contig selected

Viewing stuff!

I’ve been prototyping what the ORF Location viewer should look like (stand-alone, not tied into the GC Content stuff yet), using Canvas to draw it and to let a user click on the sections that are ORF Locations to get more info.

ORF Location view 1

The prototype is a little bit underwhelming and uses dummy data, but once it’s tied together there should be a full list of the ORF Locations to scroll through beside the view, and when a user clicks an ORF Location, its details should be shown somewhere on the page. Right now they just come up as an alert, as below:

ORF Location view 2

I’m going to work on improving the display a little bit and then I’ll plug it into the actual web application. Once it’s in the application I can test it with the real data to see (hopefully) the expected results, and further work on how the results should be displayed on the page.

Overall the page layout and framework need some improvements, so I’m going to sketch a quick wireframe once I’ve finished with the prototype to figure out how it should be displayed.

ORF Finding logic

Adding on to the previous post, the logic for finding ORF Locations in my application is (something) like this:

Get sequence ->
Break it into the first 3 frames -> 1: Sequence, 2: Sequence minus first character, 3: Sequence minus first and second characters
Break it into the last 3 frames -> 4: Sequence reversed, with each character swapped for its pair (the reverse complement), 5: as 4, minus first character, 6: as 4, minus first and second characters
For each frame sequence, build a list of every Start Codon (ATG) and every Stop Codon (TAG, TGA, TAA), recording the index of the first character of each codon

For each frame, while there is still at least one Start and one Stop Codon ->
For the list of Stop Codons, remove any that are before the first Start Codon in the sequence
For the list of Start Codons, look at the first one in the list, and remove any Start Codons that come between this Start Codon and the first Stop Codon in the sequence
For the list of Stop Codons, look at the next Start Codon in the list, find the first Stop Codon that comes before it, and remove any Stop Codons that come before that one in the sequence
If there is only one Start Codon left, remove all Stop Codons but the last
If there is still a Start and Stop Codon in each list, checking that the Start Codon comes before the Stop, create an OpenReadingFrameLocation object using the first of each in the list, then remove them from the lists and repeat the process, until either or both lists are empty

Through this we should have the longest possible ORF Locations from the lists, so you can imagine the process might look something (naively) like this, where ‘-’ is something we don’t care about in this instance, when only looking at Start/Stop Codons:
Result ORF Location: “ATG------------------TGA”, Start index: 6, End index: 30

This is just me doing an almost back-of-a-napkin working-out of what the logic looks like and would produce, not actual code output, so take it with a grain of salt 🙂
After this, based on user input, the application may remove any ORFs under a length threshold the user has set, and then returns a list of every ORF Location found in each frame.
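
As a rough illustration of the steps above, here is a simplified Java sketch. It is not the application’s code: instead of building and pruning separate Start/Stop lists, it walks each frame one codon at a time, opens an ORF at the first ATG it meets and closes it at the next in-frame stop codon. The names (OrfSketch, Orf, findOrfs, minLength) are made up for the example, and indices for frames 4 to 6 are reported relative to the reverse-complement strand, which is an assumption rather than how the application labels them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Rough six-frame ORF scan; all names here are illustrative assumptions. */
final class OrfSketch {

    private static final Set<String> STOP_CODONS =
            new HashSet<>(Arrays.asList("TAG", "TGA", "TAA"));

    /** One ORF: frame 1-6, start (inclusive) and end (exclusive) on the scanned strand. */
    static final class Orf {
        final int frame;
        final int start;
        final int end;
        final String content;

        Orf(int frame, int start, int end, String content) {
            this.frame = frame;
            this.start = start;
            this.end = end;
            this.content = content;
        }
    }

    /** Reverses an upper-case sequence and swaps each base for its pair (A<->T, C<->G). */
    static String reverseComplement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            switch (sequence.charAt(i)) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:  result.append('N'); break; // anything else is treated as unknown
            }
        }
        return result.toString();
    }

    /** Scans all six frames and keeps ORFs of at least minLength bases (stop codon included). */
    static List<Orf> findOrfs(String sequence, int minLength) {
        String forward = sequence.toUpperCase();
        String reverse = reverseComplement(forward);
        List<Orf> orfs = new ArrayList<>();
        for (int offset = 0; offset < 3; offset++) {
            scanFrame(forward, offset, offset + 1, minLength, orfs); // frames 1-3
            scanFrame(reverse, offset, offset + 4, minLength, orfs); // frames 4-6
        }
        return orfs;
    }

    private static void scanFrame(String strand, int offset, int frame,
                                  int minLength, List<Orf> out) {
        int start = -1; // index of the currently open ATG, or -1 if none is open
        for (int i = offset; i + 3 <= strand.length(); i += 3) {
            String codon = strand.substring(i, i + 3);
            if (start < 0 && codon.equals("ATG")) {
                start = i; // the first start codon wins, so the ORF is as long as possible
            } else if (start >= 0 && STOP_CODONS.contains(codon)) {
                int end = i + 3;
                if (end - start >= minLength) {
                    out.add(new Orf(frame, start, end, strand.substring(start, end)));
                }
                start = -1; // look for the next start codon after this stop
            }
        }
    }

    public static void main(String[] args) {
        for (Orf orf : findOrfs("CCATGAAATTTTGACC", 0)) {
            System.out.println("Frame " + orf.frame + ": " + orf.content
                    + " [" + orf.start + ", " + orf.end + ")");
        }
    }
}
```

Stepping three characters at a time keeps start and stop codons in the same reading frame, which the list-based description has to enforce separately. An ATG that never meets a stop codon before the end of the strand is simply dropped here.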

Longest ORFs

I’ve been slowly working through the code logic for finding the longest ORF Locations within the contig provided. I don’t feel it’s there yet and I’ve still to:
1. Correctly label the start/end character of ORF Locations in Frames 4, 5 and 6
2. Find a decent way to display them

Right now, it finds the longest ORF Locations in each of the frames, creates an OpenReadingFrameLocation object and adds the data for the start and stop characters, what frame it is in and what the content of the ORF Location is. All the ORF Locations from each frame are put together in one OpenReadingFrameResult, ready to be used by the View at any time.
Aside from the missing bits mentioned above, I’m a little concerned about the actual ORF Locations I’m getting from the process. I ran the same contig I use in my application through ORF Finder and compared the results. While the ORF Locations match up (characters, start/stop index), my application produces more output than ORF Finder. I’m not sure if this is due to something wrong in my application, the settings I’ve used with ORF Finder, or ORF Finder removing some ORF Locations that overlap, contain too many ‘N’ characters, etc.

Once I’ve gotten a visualisation similar to ORF Finder’s (I’ll customize more later, but the results it puts out seem a nice way of doing it so far), I can really compare results, see where my application finds ORF Locations that ORF Finder doesn’t, and try to understand why.

Overall my pace feels slow. Maybe it’s because I’m doing TDD, because I spent a long time figuring out the logic of finding/constructing ORF Locations in my code, or because some days I’m just not working at all. Probably all of that! I do still intend to work on k-mer frequency analysis too, but at the rate I’m going I’ve got to push myself. On top of this, I need to write my documentation, and I’d like to start some of that (background reading, introduction, building my references) in the next week.
The results from my mid-project demo weren’t too bad (82%), but it highlighted my lack of domain knowledge, so I know I need to speak to Amanda, Wayne and potentially Sam some more to iron out what I think I know and fill in the gaps, which I’d like to do at the end of the Easter holidays. I’m sure understanding more of this would help improve my application. Amanda suggested having part of my system suggest GC Content window % sizes of interest to the user, but right now ‘knowing’ what is interesting to the user is a little weird when I think about the way my application is coded and my understanding of what the user is looking for. So, we’ll see.

I should blog more..

GC Content % and averages

My Plotly chart now shows where GC Content window percentages are above or below a user-specified threshold. This might need tweaking, and I’ll try to get some feedback on what could be done from Amanda and the others during the mock mid-project demonstration I have on Friday. I will also clean up the web page design and layout before then; this is just a basic start.
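
For illustration, the windowed GC calculation behind a chart like this could look something like the Java sketch below. It assumes consecutive, non-overlapping windows and a separate low and high threshold; the class and method names (GcWindowSketch, gcPercentages, flaggedWindows) are invented for the example and are not taken from the application.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of windowed GC content; names and window layout are assumptions. */
final class GcWindowSketch {

    /** GC percentage (0-100) of each consecutive window; the last window may be shorter. */
    static List<Double> gcPercentages(String sequence, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        List<Double> percentages = new ArrayList<>();
        for (int start = 0; start < sequence.length(); start += windowSize) {
            int end = Math.min(start + windowSize, sequence.length());
            int gc = 0;
            for (int i = start; i < end; i++) {
                char base = Character.toUpperCase(sequence.charAt(i));
                if (base == 'G' || base == 'C') {
                    gc++;
                }
            }
            percentages.add(100.0 * gc / (end - start));
        }
        return percentages;
    }

    /** Indices of the windows whose GC percentage is below low or above high. */
    static List<Integer> flaggedWindows(List<Double> percentages, double low, double high) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < percentages.size(); i++) {
            double percentage = percentages.get(i);
            if (percentage < low || percentage > high) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Double> gc = gcPercentages("GGCCATGCATAT", 4);
        System.out.println(gc + " flagged: " + flaggedWindows(gc, 25.0, 75.0));
    }
}
```

The flagged window indices, multiplied by the window size, give the positions that would be highlighted on the chart.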

MQT GC content first showing

Things I want to get done before Friday:

  • Create artificial contigs for use and testing
  • Write more tests for the GC content checking
  • ORF results
  • Display ORF results with the GC content chart
  • Create some slides for explaining metagenomes, the project and its aims
  • Allow user to upload a FASTA file or paste their contig directly
  • Make the page look prettier

With those done I’ll feel a bit more comfortable about giving my mid-project demo.
After that it’s working on the Job Application assignment for Prof. Issues due next week, work on anything that comes from feedback from the mock mid-project demo and then clean up the slides and notes for the actual mid-project demo on Monday.

At some point after that, I plan on going through everything I’ve done so far and turning my Stories and tasks (currently loose paper) into something electronic I can keep track of. My process has slipped a bit, but I’ve started using the Pomodoro technique to get back on track and work through things at a steady pace.

Thymeleaf and Plotly

I’ve been messing around with getting my stuff working in Java Spring Boot, mainly because I wanted to display my results from Java in a nice way (and what better way than a browser using JavaScript).
While I’m aware there are other options (no web at all and a GUI built with a Java library instead, or using Ruby/Rails), I figured it would be a good opportunity to learn a new technology (Spring Boot) while using things I enjoy programming in (Java, JavaScript, HTML).

I wrestled with getting my results to display for a while, and the examples I found of how to set up Thymeleaf showed the configuration done in many, many different ways. I also think, due to my lack of experience with Thymeleaf, there were a lot of moments when I was probably close to my desired result but messed up my code somewhere and so missed the mark.

Anyway, it works, hooray! I’ll show examples later as I’m being kicked out of the room I’m typing this in. 😀

Project Poster & Spring Boot

Very small update:

Been working on my Project Poster and got some feedback from Amanda. It’s due tomorrow and I’ve got a couple of changes to make. Should be okay. Not totally happy with it, but aside from Amanda’s changes I’m at a bit of a loss with it.

I’ve started working with Java Spring Boot. I wanted to look for a decent way to display the report, and then I found nice web things (Plotly.js), but didn’t want to move my code over from Java to JavaScript. So, using Spring Boot I should be able to tie them together instead. In theory anyway. I’m pretty clueless about Spring Boot, but I’m going to give it a go anyway and see how I get on.

Once that’s done and the poster is submitted I can get back to working on ORF highlighting and generating artificial metagenomes.

Things to do as stories

Looking at my tasks, there are a couple of epic stories, and I can see them being split into narrow slices, so let’s give that a go:

“As a researcher (user? We’ll say researcher from now on)
I want to get a report on the quality of my metagenome
So that I know whether it is of good or bad quality”

Okay, super high level. This can be broken down into:

“As a researcher
I want to get a report on the GC content of my metagenome
So that I can see where there might be inconsistencies”

So, that could be explained better (i.e. what are ‘inconsistencies’? Areas where there might be a split/chimera, or just gene encoding regions and completely natural).

“As a researcher
I want descriptions of the GC content of my metagenome
So that I can pinpoint areas of interest”

Perhaps a better way of making a story for GC content in this instance.
What about the report?

“As a researcher
I want a textual and graphical description of my metagenome quality
So that I can see and understand where there might be quality issues”

Again, quite high level, but not too bad. This could be broken down further.

“As a researcher
I want a graph plotted to show me the GC content in my metagenome
So I can visualise the distribution of GC content to better understand my metagenome assembly”

Some of these can be broken down into further tasks, so let’s do that with the last story I defined. I suppose, before we can do that though, since we don’t have an application developed, we might need some initial ‘setup’ stories.

“As a researcher
I want an application to read in my metagenome assembly
So I can see it outside of the FASTA file”

Maybe that’s pushing it a little. There’s not really much to be gained from this in business value, but, as far as development goes it can give us some nice little tasks:

  • Read in FASTA file
  • Output display of metagenome visually for researcher to understand

That’s just two simple tasks. Read in a file type, and with the contents, display it. It might not be much, but it’s a start where we can say to a hypothetical researcher “Okay, we’ve taken your file, and we can show you that your metagenome looks like this. There’s no processing done to it, but you can see how with this visualisation, there’s the room for labelling and noting the interesting points later. What do you think?”

Once that story is done we can move on to something like implementing a GC Content counter, and that then can be applied to the visualisation (whether automatically or by the user clicking a tab/option to turn on/switch the display is to be thought of later).

So I think this is where I will start. It’s a very small and humble start, but it’s something I can get to work on and begin my project with, where I may have something to show for it.
There are also some developer-based tasks here, needed because of my lack of knowledge:

  • Understand FASTA file format in order to read in
  • Research display options for UI in Java
  • Research display options for genome assemblies that work with Java
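
On the first of those tasks: a FASTA file is plain text in which each record starts with a header line beginning with ‘>’, followed by one or more lines of sequence. A minimal Java sketch of reading that in might look like the following. The class and method names (FastaSketch, parse) are made up for illustration, and it works on lines already loaded into memory, which is an assumption that only suits smallish files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal FASTA parsing sketch; the class and method names are illustrative. */
final class FastaSketch {

    /**
     * Turns the lines of a FASTA file into header -> sequence, in file order.
     * A header is the text after '>'; a repeated header overwrites the earlier entry.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> building = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no data
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                building.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line); // a sequence may be wrapped over several lines
            }
        }
        Map<String, String> contigs = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : building.entrySet()) {
            contigs.put(entry.getKey(), entry.getValue().toString());
        }
        return contigs;
    }

    public static void main(String[] args) {
        // In practice the lines could come from java.nio.file.Files.readAllLines(path).
        List<String> lines = Arrays.asList(">contig1 sample", "ACGT", "GGCC", ">contig2", "TTAA");
        parse(lines).forEach((header, sequence) ->
                System.out.println(header + ": " + sequence.length() + " bases"));
    }
}
```

Keeping the contigs in a LinkedHashMap preserves the order they appear in the file, which makes it easier to show them to the researcher in a familiar order.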

In summary
Current story:
“As a researcher
I want an application to read in my metagenome assembly
So I can see it outside of the FASTA file”


  • Understand FASTA file format in order to read in
  • Read in FASTA file
  • Research display options for UI in Java
  • Research display options for genome assemblies that work with Java
  • Output display of metagenome visually for researcher to understand

Sprint Goal:
To display a metagenome in an application after reading in a FASTA file, with a view to implementing GC Content counting should time allow.

Side note: I feel like I’m working too slowly. It began with me thinking I didn’t know what to do, then even once I started to understand the issues, I felt I had a mental block/paralysis. Now that I know what to do, it’s just a process of breaking everything down into manageable chunks I can get my teeth into and work with. I’ve done self-motivated projects before, but usually I understood the domain well enough that I could get straight into test/code cycles.
With the lack of knowledge of this field, I often find myself stumped with questions like “What window size should I use?”, “Do I need to do this, has someone else done it?”, “What even is it that I’m doing”, and end up pausing and getting mentally blocked again from continuing, like stuck in molasses. I think I’m getting there though. I’m setting aside time during the evenings/weekend to make up for what I internally feel has been ‘lost time’, even if that time itself was spent reading/researching/thinking/discussing about the project to reach the point I’m at now.

Quality measures in metagenomics

So, how do we measure quality in a metagenome? To be honest, this kind of beats me right now. I think I had it half-figured out at one point, but not really for certain, and this might be something I’m grappling with for some time.

To my most simplistic and naive knowledge, when we’re looking at an assembly from a metagenome, we’re looking for interesting genes, be they sections or the full contiguous read. We want to know that we’ve got bits and pieces that we might want to look at, synthesize in a lab, etc. So, to do this we want sections that are of ‘good quality’, as in, they’re genes that are actually found in nature. The interesting bits must exist in nature, and may well be part of one particular species (or sub-species), or might be shared between species. We won’t know what or how many things we have in our sample, so determining whether the assembly we have is a chimera of multiple species that an assembler put together incorrectly, or just didn’t have the data for, can be quite challenging.

When we talk about a chimera, we’re looking at the assembly and thinking “Does this exist in nature?”, “Is this sample actually comprised of multiple species and assembled incorrectly, such that the gene doesn’t belong to any one species, but parts of it belong to several different ones?”. So we want to find ways in which we can report on how likely it is that this has occurred.

Beyond misalignment or mis-assembly of the metagenome, we can also consider an assembly to be of bad quality where a contig is short enough that it is just one read, indicating that the assembler didn’t actually know what to do with this read and so left it on its own. Likewise, if a contig contains all the reads and its length is way above that of the majority of the other contigs produced by the assembler, it’s possible that this too is a bad quality contig (for a metagenome). I believe this can be measured with N50, but I need to do some more reading to understand this.
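
For reference, N50 is the contig length such that contigs of that length or longer together make up at least half of the total assembly length. A small Java sketch of that calculation follows; the class and method names (N50Sketch, n50) are assumptions for illustration. It only summarises the spread of contig lengths, so it would sit alongside other checks rather than replace them.

```java
import java.util.Arrays;

/** Sketch of the N50 statistic for a set of contig lengths; names are illustrative. */
final class N50Sketch {

    /** Returns the length L where contigs of length >= L cover at least half the assembly. */
    static int n50(int[] contigLengths) {
        if (contigLengths.length == 0) {
            throw new IllegalArgumentException("at least one contig length is required");
        }
        int[] sorted = contigLengths.clone();
        Arrays.sort(sorted); // ascending, so we walk it from the end for longest-first

        long total = 0;
        for (int length : sorted) {
            total += length;
        }

        long running = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            if (running * 2 >= total) {
                return sorted[i]; // shortest contig needed to reach half the total
            }
        }
        throw new IllegalStateException("unreachable for non-empty input");
    }

    public static void main(String[] args) {
        int[] lengths = {80, 70, 50, 40, 30, 20};
        System.out.println("N50 = " + n50(lengths));
    }
}
```

With the lengths in main the total is 290, and the two longest contigs (80 + 70 = 150) already pass the halfway point of 145, so the N50 is 70.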

We may want to find where the interesting or ‘bad quality’ sections of the metagenome assembly are. To do this, we might look at much smaller sections of the metagenome, do a GC content count on each, and see where the count varies widely across the assembly; we could then indicate in the report that there is something going on that might not be quite right. We can’t outright declare that there is a lack of quality, but we can point it out as something useful to a user. In this regard too, k-mer counting may be a useful technique for detecting quality issues in a similar way.
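
To make the k-mer idea concrete, a naive Java sketch of counting every overlapping k-mer in a sequence is shown below. The names (KmerSketch, countKmers) and the choice to skip k-mers containing ‘N’ are assumptions for illustration; it keeps everything in memory, so it is nowhere near as efficient as a dedicated k-mer counter.

```java
import java.util.Map;
import java.util.TreeMap;

/** Naive k-mer frequency sketch; names and the handling of 'N' are assumptions. */
final class KmerSketch {

    /** Counts every overlapping k-mer in the sequence, ignoring k-mers that contain 'N'. */
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String upper = sequence.toUpperCase();
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys make the output easy to read
        for (int i = 0; i + k <= upper.length(); i++) {
            String kmer = upper.substring(i, i + k);
            if (kmer.indexOf('N') >= 0) {
                continue; // unknown bases would only add noise to the counts
            }
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countKmers("ATATAT", 2));
    }
}
```

Running the same count over each window of a contig, rather than the whole sequence, would give per-window frequencies that could be compared in the same way as the GC content windows.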

Throughout my project I expect that how I check for quality will alter slightly, and through talking with those in the field and reading papers I may get a better idea of what would be considered useful. I’ll also have to consider the report output, how best to display the interesting sections of the assembly to a user, and how to convey why any results were reached. This will take some research into UI design, into what applications already exist, and into what software engineering approach I can take to design my program to carry out this task.

Right now I feel a little overwhelmed trying to understand the quality of the metagenome. It feels like I have three ideas but not much else. I’ve read a couple of papers (read, skimmed, tried to understand...) and it seems that using k-mer counting is the way to go. There are already some excellent applications that do efficient k-mer counting (Jellyfish, BFC), so I wonder whether it’s better to use their output rather than my own. If I do roll my own k-mer counting, it will give me more of a software engineering approach and allow me to find interesting areas specifically with my own application, but I wonder whether this will actually be useful to anyone when these tools exist. On the other hand, if I just use the output from other applications, what can my application actually do by itself to be a relevant software engineering solution? Users could run those applications and see the results for themselves, rather than compiling all the outputs into my application for a result that might not add much to what they have already seen.

It’s questions like these that keep me up at night, wondering where it is that I come into the problem of quality control in metagenomic assemblies. I feel like I’ve been told twice now what it is I should be doing, but then when I analyse it, I’ve either not quite grasped it, forgotten it, or chosen to pick holes in it.

I’m hoping my meeting(s) with my supervisor this week will give me some relief, as we can plan some tasks so I can begin working on something tangible and visualise the direction of my application. I know I’ll feel better when I have a concrete direction anyway. Until then, I’ll look through some more papers and give writing a GC content counter a go, using some data kindly provided by Sam Nicholls from a metagenome assembly of a limpet gut.

Side note: Go write some notes about all tasks done so far on card, and begin writing them on card every time I have a planning session, with or without supervisor!