DecodeME’s GWAS data analysis plan (version 1) is available to read on our website (here).
What is a GWAS data analysis plan?
A data analysis plan outlines, in detail, how we intend to conduct our DNA analysis (our genome-wide association study, or GWAS). Over the coming months, we will continue to refine our data analysis plan ahead of performing the analyses.
Why is this analysis plan important?
We believe that it is important that crucial decisions are made ahead of performing the analysis. This allows us to be as objective as possible and should give others greater confidence in our results. This means not just that our science is robust and of the highest scientific value, but that it is seen to be so.
Why are we sharing this analysis plan?
Throughout our study we have worked closely with members of the ME/CFS community to facilitate a ‘co-production’ of research. This means that people with ME/CFS are at the heart of everything we do and have been key in shaping our project into what it is today. We are sharing our data analysis plan on our website so that it is accessible to the wider ME/CFS community who wish to read it. For those interested but who do not have a scientific background, we hope that this blog and the ‘lay summary’ are a good place to start.
A summary of our GWAS data analysis plan
The plan outlines how DecodeME recruits cases (via our questionnaire) and, separately, controls (from The UK Biobank). It is important that the only thing that differs between cases and controls is an ‘ME/CFS diagnosis’, and the plan discusses how we will ensure that the UK Biobank controls do not have ME/CFS (or any other form of post-viral illness). It also describes how we control for other differences between the datasets, such as the age, sex, and genetic ancestry of participants, so that these do not distort our findings.
To avoid false findings, we will try to ensure that each participant’s sample and each DNA variant used in the analysis are of the highest standard. For this we conduct rigorous quality control measures. The plan also describes what other factors (covariates) we will use in the analyses to minimise the chance of false discoveries. It then explains how we will ‘impute’ DNA variants to increase our chances of discovering a ‘causal DNA variant’. Imputation ‘fills in the gaps’ lying between the ~800,000 DNA variants that are read out during genotyping. This allows us to test variants from across the whole genome, covering virtually every gene.
Association testing: To perform the GWAS, DecodeME will use two methods, ‘REGENIE’ and ‘KNOCKOFF GWAS’, each with its own strengths and limitations. Together, they allow us to include people from a wide range of ancestral backgrounds. Historically, studies have largely been conducted on Europeans only.
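As a rough illustration of what variant-level quality control can look like, the sketch below checks two commonly used properties of a single DNA variant: how often it was successfully read (its call rate) and how common its rarer allele is (its minor allele frequency). The function names, the 0/1/2 genotype coding and both cut-offs are assumptions made for this example; they are not taken from the DecodeME plan.

```python
from typing import List, Optional

# Each genotype is the number of copies of the alternative allele (0, 1 or 2).
# None marks a genotype that could not be read for that participant.
Genotypes = List[Optional[int]]


def call_rate(genotypes: Genotypes) -> float:
    """Fraction of participants for whom this variant was successfully read."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Frequency of the rarer of the two alleles among successfully read genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_frequency = sum(called) / (2 * len(called))
    return min(alt_frequency, 1 - alt_frequency)


def passes_qc(genotypes: Genotypes,
              min_call_rate: float = 0.98,  # illustrative cut-off, not from the plan
              min_maf: float = 0.01) -> bool:  # illustrative cut-off, not from the plan
    """Keep a variant only if it is well measured and not extremely rare."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


well_measured = [0, 1, 2, 1, 0, 1, 0, 0, 1, 2] * 10
often_missing = [None] * 10 + [1] * 90
very_rare = [0] * 99 + [1]

print(passes_qc(well_measured), passes_qc(often_missing), passes_qc(very_rare))
```

In practice these checks are run by dedicated genetics software across all variants and all samples; the sketch only shows the logic for one variant.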
The plan explains that there are two GWAS that we intend to perform:
GWAS 1 – This compares all DecodeME DNA participants with hundreds of thousands of people from The UK Biobank who do not have ME/CFS, post-viral fatigue syndrome, or Long COVID with PEM.
GWAS 2 – This repeats the same analysis, but also removes from controls those with an electronic health record of R53 ‘Fatigue and Malaise’.
In addition to the above, DecodeME will perform ‘stratified analyses’ to see whether genetic factors differ between the sexes, between people with or without an infection prior to ME/CFS onset, or between people with or without a co-occurring condition, such as fibromyalgia. This will allow us to tell whether or not different sets of people with ME/CFS share genetic factors.
We will also test whether any of our associations with DNA variants are caused not by ME/CFS but by a co-occurring condition (e.g. fibromyalgia). To do this, we will remove the data from individuals with the co-occurring condition and see whether our finding remains or disappears.
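The idea of removing people with a co-occurring condition and checking whether a finding survives can be sketched in a few lines. The example below is a simplified illustration, not the method in the plan: the `Person` record, the made-up data and the use of a plain allelic odds ratio (rather than the regression methods DecodeME will use) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Person:
    is_case: bool          # True for a person with ME/CFS, False for a control
    alt_alleles: int       # copies of the variant of interest: 0, 1 or 2
    has_cooccurring: bool  # e.g. also has fibromyalgia (hypothetical flag)


def allelic_odds_ratio(people: List[Person]) -> float:
    """Odds of carrying the variant allele in cases divided by the odds in controls.

    Assumes every count is non-zero; a real analysis would guard against that.
    """
    case_alt = sum(p.alt_alleles for p in people if p.is_case)
    case_ref = sum(2 - p.alt_alleles for p in people if p.is_case)
    ctrl_alt = sum(p.alt_alleles for p in people if not p.is_case)
    ctrl_ref = sum(2 - p.alt_alleles for p in people if not p.is_case)
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


def sensitivity_analysis(people: List[Person]) -> Tuple[float, float]:
    """Return the odds ratio for everyone, then with the co-occurring group removed."""
    without_condition = [p for p in people if not p.has_cooccurring]
    return allelic_odds_ratio(people), allelic_odds_ratio(without_condition)


# Made-up data in which the signal comes entirely from cases with the other condition.
people = (
    [Person(True, 2, True)] * 10      # cases with the co-occurring condition
    + [Person(True, 1, False)] * 10   # cases without it
    + [Person(False, 1, False)] * 20  # controls
)

full_or, filtered_or = sensitivity_analysis(people)
print(full_or, filtered_or)
```

Here the odds ratio falls back to 1 (no association) once the co-occurring group is removed, which is the pattern that would suggest the original finding was driven by the other condition rather than by ME/CFS.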
There is also scope to further increase our ME/CFS sample size up to 29,000 people by including those who self-report ME/CFS in The UK Biobank (an additional roughly 4,000 cases).
Replication is essential for testing whether our findings are valid. Without it, we cannot be certain that our findings are not just due to the specific set of controls we are using (from The UK Biobank). We will try to replicate our findings with other controls. However, we haven’t yet decided on the most appropriate set of controls for replication, which needs to be large, diverse and free of people who have ME/CFS.
This data analysis plan explains how we will interpret any findings. For each of the millions of DNA variants tested, the analysis will output a p-value. This is a probability: it tells us how likely it is that we would see an association at least this strong through ‘chance’ alone, if the variant had no real link to ME/CFS. To be even more certain that our findings are not due to chance, we adjust for the millions of tests performed by using a ‘genome-wide significance threshold’. This threshold is very strict, so it reduces the risk of ‘false positives’ and helps us be confident in our findings. The data analysis plan explains this in more detail, along with how the results will be visualised in plots.
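To make the adjustment for many tests concrete, here is a minimal sketch. It assumes the threshold conventionally used in GWAS, 5×10⁻⁸, which is what you get by dividing the usual 0.05 cut-off by one million independent tests; the exact threshold DecodeME applies is set out in the plan itself. The variant names and p-values are invented for illustration.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Divide the usual significance level by the number of independent tests."""
    return alpha / n_tests


# Conventional genome-wide threshold: 0.05 / 1,000,000 = 5e-8 (assumed here).
GENOME_WIDE_THRESHOLD = bonferroni_threshold(0.05, 1_000_000)


def significant_variants(p_values: Dict[str, float],
                         threshold: float = GENOME_WIDE_THRESHOLD) -> List[str]:
    """Return the names of variants whose p-value falls below the threshold."""
    return sorted(name for name, p in p_values.items() if p < threshold)


# Invented p-values for three hypothetical variants.
example_p_values = {"variant_A": 3e-9, "variant_B": 2e-6, "variant_C": 0.04}

hits = significant_variants(example_p_values)
print(GENOME_WIDE_THRESHOLD, hits)
```

Only `variant_A` passes. `variant_C` would count as ‘significant’ in a single test at 0.05, which shows why the stricter threshold matters when millions of variants are tested at once.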
Up until this point all we have are ‘statistical associations’: DNA variants that are more (or less) common in people with ME/CFS than in the general population.
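What ‘more (or less) common in people with ME/CFS’ means statistically can be shown with a simple test on allele counts. This is a deliberately simplified stand-in: DecodeME will use REGENIE and KNOCKOFF GWAS, which are more sophisticated and account for covariates. The function names and counts below are invented for this example.

```python
import math
from typing import Tuple


def allelic_chi_square(case_alt: int, case_ref: int,
                       ctrl_alt: int, ctrl_ref: int) -> Tuple[float, float]:
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns the test statistic and its p-value.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were equally common in both groups.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> float:
    """Above 1: allele more common in cases. Below 1: less common in cases."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


# Invented counts: the allele makes up 15% of case alleles and 10% of control alleles.
chi2, p_value = allelic_chi_square(150, 850, 200, 1800)
print(round(chi2, 2), p_value, round(odds_ratio(150, 850, 200, 1800), 2))
```

An association found this way says only that the allele's frequency differs between the groups; it does not by itself say which variant or gene is responsible.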
We then need to interpret these variants and understand their biological relevance to ME/CFS. First, we have to work out which of the associated DNA variants is likely to be the ‘causal’ one. This is necessary because the causal DNA variant will likely be inherited together with many non-causal variants nearby, which will show an association too. Then, we need to work out which gene the causal DNA variant is affecting and, from there, the function of that gene, including which biological pathways it contributes to and which tissues and cell types it is active in. The data analysis plan explains the different approaches we can use to do this. It is also possible to see whether ME/CFS risk variants, or genes, are shared with any other conditions of interest. Combined, these steps aim to shed light on the biological processes causally involved in ME/CFS risk.