Things to do as stories

Looking at my tasks, there are a couple of epic stories, and I can see how they could be split into narrow slices, so let’s give that a go:

“As a researcher (user? We’ll say researcher from now on)
I want to get a report on the quality of my metagenome
So that I know whether it is of good or bad quality”

Okay, super high level. This can be broken down into:

“As a researcher
I want to get a report on the GC content of my metagenome
So that I can see where there might be inconsistencies”

So, that could be explained better (i.e. what are ‘inconsistencies’? Areas where there might be a split/chimera, or just gene-encoding regions that are completely natural?).

“As a researcher
I want descriptions of the GC content of my metagenome
So that I can pinpoint areas of interest”

Perhaps a better way of making a story for GC content in this instance.
What about the report?

“As a researcher
I want a textual and graphical description of my metagenome quality
So that I can see and understand where there might be quality issues”

Again, quite high level, but not too bad. This could be broken down further.

“As a researcher
I want a graph plotted to show me the GC content in my metagenome
So I can visualise the distribution of GC content to better understand my metagenome assembly”

Some of these can be broken down into further tasks, so let’s take the last story I defined and do that. I suppose, before we can do that though, since we don’t have an application developed, we might need some initial ‘setup’ stories.

“As a researcher
I want an application to read in my metagenome assembly
So I can see it outside of the FASTA file”

Maybe that’s pushing it a little. There’s not really much to be gained from this in business value, but, as far as development goes it can give us some nice little tasks:

  • Read in FASTA file
  • Output display of metagenome visually for researcher to understand

That’s just two simple tasks. Read in a file type, and with the contents, display it. It might not be much, but it’s a start where we can say to a hypothetical researcher “Okay, we’ve taken your file, and we can show you that your metagenome looks like this. There’s no processing done to it, but you can see how with this visualisation, there’s the room for labelling and noting the interesting points later. What do you think?”
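As a rough sketch of the ‘Read in FASTA file’ task: a FASTA file is a series of records, each made of a header line starting with ‘>’ followed by one or more lines of sequence. The class and method names below (FastaReader, parse) are my own placeholders rather than anything from an existing library, and the sketch assumes the whole file fits comfortably in memory, which may not hold for a large assembly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any existing library.
final class FastaReader {

    // Turns the lines of a FASTA file into a map of header -> sequence,
    // keeping records in the order they appear in the file.
    // Assumption: headers are unique; a repeated header overwrites the earlier record.
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no information
            }
            if (line.startsWith(">")) {
                // A new header closes off the previous record, if there was one.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);
            } else if (header != null) {
                // Sequence data may be wrapped over several lines; join them up.
                sequence.append(line);
            }
            // Sequence lines that appear before any header are ignored.
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }
}
```

Something like FastaReader.parse(Files.readAllLines(Path.of("assembly.fasta"))) would then give one entry per contig, where "assembly.fasta" is just a stand-in file name.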

Once that story is done we can move on to something like implementing a GC Content counter, and that then can be applied to the visualisation (whether automatically or by the user clicking a tab/option to turn on/switch the display is to be thought of later).
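For the GC content counter, a minimal sketch might look like the following. GC content here means the fraction of bases in a sequence that are G or C. The name GcContent is a placeholder of mine, and skipping anything that isn’t A, C, G or T (such as N) is an assumption I’d want to check with people in the field.

```java
// Hypothetical helper for illustration; not part of any existing library.
final class GcContent {

    // Returns the fraction of G and C bases among the A, C, G and T bases
    // in the sequence, as a value between 0.0 and 1.0.
    // Assumption: any other character (e.g. N) is skipped rather than counted.
    static double gcFraction(String sequence) {
        int gc = 0;
        int counted = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char base = Character.toUpperCase(sequence.charAt(i));
            if (base == 'G' || base == 'C') {
                gc++;
                counted++;
            } else if (base == 'A' || base == 'T') {
                counted++;
            }
        }
        // A sequence with no countable bases is reported as 0.0 to avoid dividing by zero.
        return counted == 0 ? 0.0 : (double) gc / counted;
    }
}
```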

So I think this is where I will start. It’s a very small and humble start, but it’s something I can get to work on and begin my project with, where I may have something to show for it.
There are also some other developer-based tasks here, which come from my lack of knowledge:

  • Understand FASTA file format in order to read in
  • Research display options for UI in Java
  • Research display options for genome assemblies that work with Java

In summary
Current story:
“As a researcher
I want an application to read in my metagenome assembly
So I can see it outside of the FASTA file”


  • Understand FASTA file format in order to read in
  • Read in FASTA file
  • Research display options for UI in Java
  • Research display options for genome assemblies that work with Java
  • Output display of metagenome visually for researcher to understand

Sprint Goal:
To display a metagenome in an application after reading in a FASTA file, with a view to implementing GC content counting should time allow.

Side note: I feel like I’m working too slowly. It began with me thinking I didn’t know what to do, then even once I started to understand the issues, I felt I had a mental block/paralysis. Now I know what to do, it’s just a process of breaking everything down into sizeable chunks I can get my teeth into and work with. I’ve done self-motivated projects before, but often I understood the domain well enough that I could get straight into test/code cycles.
With my lack of knowledge of this field, I often find myself stumped by questions like “What window size should I use?”, “Do I need to do this, or has someone else done it?” and “What even is it that I’m doing?”, and I end up pausing and getting mentally blocked again, like being stuck in molasses. I think I’m getting there though. I’m setting aside time during the evenings/weekend to make up for what I internally feel has been ‘lost time’, even if that time itself was spent reading, researching, thinking about and discussing the project to reach the point I’m at now.


Quality measures in metagenomics

So, how do we measure quality in a metagenome? To be honest, this kind of beats me right now. I think I had it half-figured out at one point, but not really for certain, and this might be something I’m grappling with for some time.

To my most simplistic and naive knowledge, when we’re looking at an assembly from a metagenome, we’re looking for interesting genes, be they sections or the full contiguous sequence (contig). We want to know that we’ve got bits and pieces that we might want to look at, synthesize in a lab, etc. So, to do this we want sections that are of ‘good quality’, as in, they’re genes that really are found in nature. The interesting bits must exist in nature, and may well be part of one particular species (or sub-species), or might be shared between species. We won’t know what or how many things we have in our sample, and so determining whether the assembly we have is a chimera of multiple species that an assembler put together incorrectly, or simply didn’t have enough data to resolve, can be quite challenging.

When we talk about a chimera, we’re looking at the assembly and thinking “Does this exist in nature?”, “Is this sample actually comprised of multiple species and assembled incorrectly such that the gene doesn’t belong to any one species, but parts of it to multiple different?”. So we want to find ways in which we can report on how likely it is that this has occurred.

Quality isn’t only about misalignment or mis-assembly. We can also consider an assembly to be of bad quality where a contig is short enough that it is just one read, indicating that the assembler didn’t actually know what to do with this read and so left it on its own. Likewise, if a single contig contains all the reads and its length is way above the length of the majority of the other contigs produced by an assembler, it’s possible that this too is a bad quality contig (for a metagenome). I believe this can be measured with N50, but I need to do some more reading to understand this.
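As far as I understand it so far, N50 is the contig length L such that contigs of length L or longer together make up at least half of the total assembly length. It summarises the spread of contig lengths as a whole rather than flagging any one contig, so it would only be a starting point. A small sketch, with AssemblyStats and n50 as placeholder names of my own:

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any existing library.
final class AssemblyStats {

    // Returns the N50 of the given contig lengths: walking from the longest
    // contig downwards, the length of the contig at which the running total
    // first reaches at least half of the total assembly length.
    // Returns 0 when there are no contigs.
    static long n50(long[] contigLengths) {
        long[] sorted = contigLengths.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);                   // ascending order

        long total = 0;
        for (long length : sorted) {
            total += length;
        }

        long running = 0;
        for (int i = sorted.length - 1; i >= 0; i--) { // longest first
            running += sorted[i];
            if (running * 2 >= total) {
                return sorted[i];
            }
        }
        return 0;
    }
}
```

For contigs of lengths 80, 70, 50, 40, 30 and 20 (290 in total), the two longest add up to 150, which is over half, so the N50 is 70.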

We may want to find where the interesting or ‘bad quality’ sections of the metagenome assembly are. To do this, we might look at much smaller sections of the assembly, count the GC content of each, and see where it varies widely from one portion to the next; the report could then indicate that something is going on that might not be quite right. We can’t declare outright that there is a lack of quality, but we can point it out as something useful to a user. K-mer counting may be a useful technique for detecting quality in a similar way.
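To make the windowed idea a bit more concrete, here is a sketch that splits a sequence into fixed-size, non-overlapping windows, works out the GC fraction of each, and flags the windows that sit far from the mean. The names (GcWindows, windowGc, outlierWindows) are placeholders, and the window size and the deviation threshold are left as parameters because I don’t yet know what sensible values are. Using a flat distance from the mean is only my assumption of one way to define ‘varies widely’.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any existing library.
final class GcWindows {

    // GC fraction of each non-overlapping window; the last window may be shorter.
    // Assumption: every character counts towards the window length, including N.
    static double[] windowGc(String sequence, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        int windows = (sequence.length() + windowSize - 1) / windowSize; // round up
        double[] result = new double[windows];
        for (int w = 0; w < windows; w++) {
            int start = w * windowSize;
            int end = Math.min(start + windowSize, sequence.length());
            int gc = 0;
            for (int i = start; i < end; i++) {
                char base = Character.toUpperCase(sequence.charAt(i));
                if (base == 'G' || base == 'C') {
                    gc++;
                }
            }
            result[w] = (double) gc / (end - start);
        }
        return result;
    }

    // Indices of windows whose GC fraction differs from the mean of all
    // windows by more than maxDeviation. These are candidates to point out
    // in a report, not proof of a mis-assembly.
    static List<Integer> outlierWindows(double[] gcByWindow, double maxDeviation) {
        double sum = 0;
        for (double value : gcByWindow) {
            sum += value;
        }
        double mean = gcByWindow.length == 0 ? 0 : sum / gcByWindow.length;

        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < gcByWindow.length; i++) {
            if (Math.abs(gcByWindow[i] - mean) > maxDeviation) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```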

Throughout my project I expect that how I check for quality will alter slightly, and through talking with those in the field and reading papers I may get a better idea of what would be considered useful. I’ll also have to consider the report output, and how best to display the interesting sections of the assembly to a user and convey why any results were reached. This will take some research into UI design, into what applications already exist, and into what software engineering approach I can take to develop the design of my program.

Right now I feel a little overwhelmed, trying to understand the quality of the metagenome. It feels like I have three ideas but not much else. I’ve read a couple of papers (read, skimmed, tried to understand..) and it seems that using k-mer counting is the way to go. There are already some excellent applications that do efficient k-mer counting (Jellyfish, BFC) and so I wonder whether it’s better to use their output rather than my own. If I do roll my own k-mer counting, it will give me more of a software engineering approach, and allow me to specifically find interesting areas with my own application, but I wonder then if this will actually be useful to anyone when these tools exist. On the other hand, if I just use the output from other applications, what can my application actually do by itself in order to be a relevant software engineering solution? Would it be useful to users when they could instead run these applications and see the results for themselves, rather than compiling all the outputs into my application for a result that might not tell them anything new?
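If I did roll my own, the naive version is at least short. The sketch below slides a window of length k along a sequence and tallies each substring in a hash map. KmerCounter is a placeholder name, and this is only an illustration of the idea: it keeps every k-mer as a String in memory and doesn’t merge a k-mer with its reverse complement, so I wouldn’t expect it to cope with the data sizes the dedicated tools are built for.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of any existing library.
final class KmerCounter {

    // Counts how many times each substring of length k occurs in the sequence.
    // Case is ignored. A sequence shorter than k produces an empty map.
    static Map<String, Integer> count(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String upper = sequence.toUpperCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= upper.length(); i++) {
            // merge() inserts 1 for a new k-mer, or adds 1 to the existing count.
            counts.merge(upper.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, counting 3-mers in ACGACG gives ACG twice, and CGA and GAC once each.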

It’s questions like these that keep me up at night, wondering where it is that I come into the problem of quality control in metagenomic assemblies. I feel like I’ve been told twice now what it is I should be doing, but then when I analyse it, either I’ve not quite grasped it, forgotten it, or chosen to pick holes in it.

I’m hoping my meeting(s) with my supervisor this week will give me some relief. The plan is to set out some tasks so I can begin working on something tangible and visualise the direction of my application. I know I’ll feel better when I feel like I have a concrete direction anyway. Until then, I’ll look through some more papers, and give writing a GC content counter a go, using some data kindly provided by Sam Nicholls of a metagenome assembly of a limpet gut.

Side note: Go write some notes about all tasks done so far on card, and begin writing them on card every time I have a planning session, with or without supervisor!

Major Project Begins

“A Toolkit for reporting on the quality of a metagenome”

That is the title of my major project for my final year here at Uni. It’s a little bit challenging (perhaps more than a little bit), and there’s a fair bit of research needed for me to get up to speed with what metagenomes are, what the file formats I’m using are, which technologies already exist, and how best to tackle this from a software engineering standpoint.

I’m making a list of things I think may be useful as quality measures in my resulting application. It won’t be able to say whether something is necessarily of ‘good’ or ‘bad’ quality, simply because of the nature of a metagenome, but the hope is I can provide a useful report on a user’s metagenome that may help them be aware of problem or interesting areas in the metagenome provided. These quality measures will then help me decide what techniques and tools I may wish to use, or at least begin with, when writing my application, and I’m meeting my supervisor on Tuesday to discuss them.