I went through an exercise this weekend, trying to list the questions I ask myself when designing a sequencing-based experiment to describe microbial communities. I find myself in many collaborations where I have to work through these questions with investigators who have little to no experience with sequencing and its opportunities and limitations.
I thought I’d put them here for some feedback from the universe.
Should you even perform/use/initiate sequencing?
What is your research question?
What is your budget?
What are your resources? Samples? Computational?
Do you have a hypothesis – or what do you expect to see? Do you have previous evidence to suggest your expectations?
What kind of data do you have already or plan on getting?
What kind of data outputs do you expect? need? want? have?
Are there datasets that already exist and can answer your questions?
How many treatments / gradients are being compared?
What kind of / how much sequencing do you need?
Do you want to characterize differences or identify significant differences? How many replicates do you minimally need?
Do you have appropriate positive / negative controls? (Thanks @markstenglein)
Are you trying to identify some specific genes? How much do you know about what you are looking for? How much is known in general?
Once you get this data, are you prepared for the analysis?
How much does the quality of the data matter – how much resolution do you need?
How specific? (Do you need to identify mobile genetic elements and their host species? Or carbon metabolism and phyla? Do you need to identify strain variation?)
How much do you need to sample? (e.g., do you want excellent characterization of the 10% most abundant organisms, or decent characterization of 90% of organisms?)
Do you have a good reference database? Or do you need to develop one? Is this reference database applicable to the samples you are studying?
If I describe every gene in your sample, how much will you actually use?
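For the "how much do you need to sample" question, a back-of-the-envelope number sometimes helps the conversation. The sketch below is my own illustration, not a recommendation: it assumes reads are drawn independently and at random from the community, so the chance of seeing at least one read from an organism at relative abundance p in N reads is 1 - (1 - p)^N. The function names are made up for this example, and the model ignores extraction bias, genome size, and the fact that one read is far from a "decent characterization", so treat the output as a lower bound.

```python
import math


def detection_probability(rel_abundance: float, n_reads: int) -> float:
    """Chance of at least one read from an organism, assuming simple random sampling."""
    if not 0 < rel_abundance < 1:
        raise ValueError("rel_abundance must be between 0 and 1 (exclusive)")
    # 1 - (1 - p)^N, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_reads * math.log1p(-rel_abundance))


def reads_for_detection(rel_abundance: float, confidence: float = 0.95) -> int:
    """Smallest N such that detection_probability(p, N) >= confidence."""
    if not 0 < rel_abundance < 1:
        raise ValueError("rel_abundance must be between 0 and 1 (exclusive)")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    # Solve 1 - (1 - p)^N >= confidence for N
    return math.ceil(math.log1p(-confidence) / math.log1p(-rel_abundance))


if __name__ == "__main__":
    # Hypothetical relative abundances, chosen only for illustration
    for p in (0.1, 0.01, 0.001, 0.0001):
        n = reads_for_detection(p, confidence=0.95)
        print(f"relative abundance {p:g}: at least {n} reads for one hit (95%)")
```

Multiplying the result by the coverage you would actually want per organism gives a rough sense of how quickly "decent characterization of 90% of organisms" gets expensive.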
What kind of collaborator are you / are you looking for?
Do you want a collaborator who helps you with understanding the biological question? Or data analysis assistance?
How happy will you be if:
I give you just the raw sequencing files
I give you an assembly of partial genes? Of whole genomes?
I give you a species/function-abundance matrix
10, 30, 50, or 80% of your sequencing reads can be identified
Sequences are identified as significantly different but we have little idea what they are
I tell you who is there
I tell you who is there and what they are doing
I develop a reference that is more specific to your system
This analysis takes 1, 2, 3, or >6 months
All your data and the analysis are openly accessible
I’d love to get feedback on what kinds of questions others are thinking of.