The past couple weeks (maybe months?) I’ve been struggling with analyzing some fungal ITS data that we have for our Edge Effects side project. No one in our lab really specializes in fungal barcoding (or fungal anything) so we became sheep and followed the mainstream path. We amplified the ITS region, between the small subunit and large subunits of RNA, which was to our knowledge the “chosen one” for fungal barcoding, using ITS1F and ITS2 primers. ITS appears to serves its purpose in terms of detailed classification (family/genus taxonomic levels) but it is definitely not a perfect barcode – for one ITS reads cannot be aligned (perhaps due to too much variation between reads, insertions, deletions, length variation, etc) which makes the reads useless by themselves for phylogenetic approaches.
Before this particular dataset fell into my hands, it was in Jenna’s and when issues with the ITS dataset arose, she turned to twitter for answers (part 1 and part 2). The conclusion – due to our desire for phylogenetic analysis it is highly likely that future fungal analysis will not be done using ITS as ultimately we care more about phylogeny than taxonomy.
That is great – but we still have our ITS dataset, what do we do with it?
I essentially did what they do here in this tutorial which I of course found after figuring out what to do from scratch. I used the UNITE ITS database to cluster my forward unmerged reads into OTUs in QIIME using UCLUST. I also used UCLUST to assign taxonomy (because it was the default option). I then did some basic filtering using filter_taxa_from_otu_table.py and filter_otus_from_otu_table.py to remove singletons, mitochondria, chloroplasts and unassigned (at kingdom level) taxa. This is where things began to go wrong (if they weren’t already wrong to start with).
I summarized my biom table using biom summarize-table and I saw this:
Counts/Sample summary:
Min: 0.0
Max: 838.0
Median: 30.000
Mean: 100.512
Std. dev.: 201.292
What happened to all my sequences?? Better yet, are there even fungi on seagrass? Is what we are seeing the result of low fungal biomass????
Let the investigation begin. I decided to look at what my biom table looked like before I filtered out the unassigned reads. This is what I saw.
Min: 14.0
Max: 48889.0
Median: 2847.000
Mean: 6001.653
Std. dev.: 9287.627
Now, that looks a bit better… except that the “unassigned” reads could be anything (seagrass, jellyfish, bacteria, fungi, sponges, etc). Since we want to do a “fungal” analysis this just won’t do. So to investigate further, I downloaded NCBI’s nucleotide “nt” database. Approx ~4250 OTU’s in my dataset were classified as “Unassigned” so I pulled these out and locally blasted them against the “nt” database to get some idea of their taxonomy. What I found was that my “Unassigned” OTUs were seagrass, jellyfish, bacteria, sponges and lots and lots of uncultured fungi. Of my ~4250 OTU’s, ~3250 hit something in the “nt” database and ~700 of hit something with >70% identity over >70% of the query OTU length. So there are obviously fungi (or fungi-like sequences) in my dataset that aren’t being identified using the method for taxonomic assignment I’ve been using (UCLUST & UNITE).
On a whim while writing this blog post about the dreary nature of ITS, I took a second look at the earlier mentioned tutorial. On the surface, it looks identical to what I did with my dataset (reassuring), but I then noticed they were using a mysterious parameter file. Perhaps this parameter file was filled with rainbows, pixie dust and unicorns that could solve all my fungal problems? To investigate further, I downloaded and took a peak at this mythical parameter file. Cue dramatic music. Low and behold, they are using the “blast” method for taxonomic assignment over UCLUST. So I thought what the heck, I’ll try anything at this point to make this fungal data usable, let’s give it a go. Of course (because this is how my life seems to be going recently) using the “blast” method of taxonomic assignment worked like magic. My new biom table summary (and this is after removing OTUs with “No blast hit”) looks like this:
Min: 7.0
Max: 47199.0
Median: 2344.000
Mean: 5441.093
Std. dev.: 9146.485
According to the log file, using the “blast” method 4717 sequences were inspected and only 1796 could not be identified. This is a huge improvement from before where ~4250 were “Unassigned”. I will note here, that upon investigating the blast assigned taxonomies, I do see a lot of unidentified fungi so this solution might not work for you if you care about specific taxonomy. I still have to analyze this new biom table which since I can’t use phylogenetic approaches will be its own hurdle, but at least I have enough truly “fungal” data to analyze now. Thinking back on all of my struggles, I am so incredibly angry that one silly QIIME parameter was what was keeping me from moving forward. Even before this I was wary of what the default QIIME script options meant for my data, but moving forward I’ll be even more vigilant in my choice of programs and parameters. This entire situation is equal parts ridiculous, embarrassing, frustrating and dumb luck. Perhaps, the craziest part is that had I not decided to write a blog post about my problems with ITS, I would never have found the solution to this particular problem. I can’t be the only one to ever have had this issue – is this some well kept mycologist secret method to ITS success? My hope is that by writing this blog post, I can save others from weeks (or months) of mental anguish over poor quality ITS taxonomic classification when the answer is hidden (or not so hidden) away in a silly parameter file.
I clustered my ITS2 reads with swarm (d=1, 1 base pair difference) and assigned taxonomy with blast using the fungorum database, then I classify the blast output with FHiTINGS (LCA).
LikeLike