Fisher exact test in R, without using 'simulate.p.value=TRUE'

I'm repeating an analysis from a paper to see how well my results line up with theirs. I have the contingency table shown below and want to run a Fisher exact test on it (a chi-square test isn't appropriate because not all cell counts are >= 5). However, every time I run "fisher.test(ContTable)", I get the error message shown below the table.

I've tried adding 'simulate.p.value=TRUE' as the error message suggests, but then I don't get the same p-values as the paper I'm working from. Could someone please either explain why it won't work without 'simulate.p.value=TRUE', or explain how to get around this issue without that argument? I don't want simulated p-values; I'm trying to reproduce the result reported in the paper.

> ContTable
                data.age
data.DEATH_EVENT 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 60.667 61
               0  7  1  6  1  2 13  2  1  0  3 19  3  5  9  1 14  1  1  8  1 20      1  4
               1  0  0  1  0  0  6  1  0  2  1  8  1  0  1  1  3  0  1  2  3 13      1  0
                data.age
data.DEATH_EVENT 62 63 64 65 66 67 68 69 70 72 73 75 77 78 79 80 81 82 85 86 87 90 94 95
               0  4  8  3 18  2  2  3  1 18  2  3  5  1  2  1  2  1  0  3  0  0  1  0  0
               1  1  0  0  8  0  0  2  2  7  5  1  6  1  0  0  5  0  3  3  1  1  2  1  2

> fisher.test(ContTable)
Error in fisher.test(ContTable) : 
  FEXACT error 7(location). LDSTP=18210 is too small for this problem,
  (pastp=20.1432, ipn_0:=ipoin[itp=498]=11425, stp[ipn_0]=3.68888).
Increase workspace or consider using 'simulate.p.value=TRUE'

CodePudding user response：

The Fisher exact test for a table like yours needs a huge amount of time or memory. It has to enumerate every possible outcome that is at least as extreme as the one you've got, according to some measure of departure from independence. Apparently the algorithm that R uses needs a lot of memory for this.
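To make "at least as extreme" concrete, here is a minimal sketch for a small 2 by 2 table, where the enumeration is trivial. The table values are made up for illustration. It assumes the two-sided definition that fisher.test uses for 2 by 2 tables, namely summing the probabilities of all tables with the same margins that are no more probable than the observed one.

```r
# Hypothetical 2 x 2 table, for illustration only (not the question's data).
tab <- matrix(c(3, 1, 1, 3), nrow = 2)

m <- sum(tab[, 1])  # first column total
n <- sum(tab[, 2])  # second column total
k <- sum(tab[1, ])  # first row total
x <- tab[1, 1]      # observed top-left cell

# With the margins fixed, the top-left cell determines the whole table,
# so enumerating its possible values enumerates every possible table.
support <- max(0, k - n):min(k, m)
probs   <- dhyper(support, m, n, k)

# Two-sided p-value: total probability of the tables that are no more
# probable than the observed one (small tolerance for floating point ties).
p_obs    <- dhyper(x, m, n, k)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

c(manual = p_manual, fisher.test = fisher.test(tab)$p.value)
```

For a 2 by 47 table the same idea applies, but the number of tables with the given margins is far too large to list this way, which is why the memory and time requirements explode.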

The alternative way to calculate it is to approximate the result by sampling from all outcomes with the given margins, and finding the proportion of outcomes that are at least as extreme. This is the "simulation" method. It won't give exactly the same p-value each time, because it's a random approximation. If you set parameter B to a very large number you'll get quite consistent results, and they'll be close to the exact result.
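As a rough sketch of that, the code below rebuilds the table from the counts printed in the question and runs the simulated test twice with different seeds. The value B = 1e5 is just an assumption for the example, and the row names are only labels; larger B gives more stable results at the cost of run time.

```r
# Counts copied from the table in the question, columns ordered by age.
survived <- c(7, 1, 6, 1, 2, 13, 2, 1, 0, 3, 19, 3, 5, 9, 1, 14, 1, 1, 8, 1, 20, 1, 4,
              4, 8, 3, 18, 2, 2, 3, 1, 18, 2, 3, 5, 1, 2, 1, 2, 1, 0, 3, 0, 0, 1, 0, 0)
died     <- c(0, 0, 1, 0, 0, 6, 1, 0, 2, 1, 8, 1, 0, 1, 1, 3, 0, 1, 2, 3, 13, 1, 0,
              1, 0, 0, 8, 0, 0, 2, 2, 7, 5, 1, 6, 1, 0, 0, 5, 0, 3, 3, 1, 1, 2, 1, 2)
ContTable <- rbind(`0` = survived, `1` = died)

# Two independent Monte Carlo runs; fixing the seed makes each run reproducible.
set.seed(1)
p1 <- fisher.test(ContTable, simulate.p.value = TRUE, B = 1e5)$p.value
set.seed(2)
p2 <- fisher.test(ContTable, simulate.p.value = TRUE, B = 1e5)$p.value

# The two estimates should agree to roughly two decimal places.
c(run1 = p1, run2 = p2)
```

If the two runs still differ more than you can tolerate, increasing B further narrows the gap.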

If this doesn't give you an answer close to the one you're trying to replicate, then I'd suspect you're doing a different test than the paper did. I believe there are multiple possible definitions of "at least as extreme". It's also possible that the paper is doing the test incorrectly. You should probably post details and ask about the differences on the stats.stackexchange.com web site, not Stack Overflow, as this is a stats issue, not a programming issue.

CodePudding user response：

From calling help(fisher.test):

[workspace is] an integer specifying the size of the workspace used in the network algorithm. In units of 4 bytes. Only used for non-simulated p-values larger than 2 by 2 tables. Since R version 3.5.0, this also increases the internal stack size which allows larger problems to be solved, however sometimes needing hours. In such cases, simulate.p.values=TRUE may be more reasonable.

From the error you get, I really believe that the problem is the size of the workspace (see "Increase workspace or consider using 'simulate.p.value=TRUE'" in the message). As far as I know, Fisher's exact approach consists of permuting the treatment assignment vector so as to consider all possible treatment allocations. My background is mostly in causal inference, so I do not know how this applies to your case if no treatment exists. Anyway, what matters here is that the approach considers all possible allocations of your data (in a broad sense), and this may be computationally infeasible if you have a lot of observations. Thus, sometimes we are happy with considering just a subset of all the possible permutations, e.g., 10000, and this is what we ask for by setting simulate.p.value = TRUE. The more replications we consider (parameter B), the closer the resulting p-values are to the true ones.
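To get a feel for how much B matters, one can treat the simulated p-value as a binomial proportion and look at its approximate standard error. This is only a back-of-the-envelope sketch: the helper mc_se is a hypothetical name, the p-value of 0.05 is an assumed example value, and the binomial formula ignores the small "+1" correction that R applies to the simulated p-value.

```r
# Approximate Monte Carlo standard error of a simulated p-value,
# treating it as a binomial proportion estimated from B replicates.
mc_se <- function(p, B) sqrt(p * (1 - p) / B)

# B = 2000 is the default in fisher.test; the others are example choices.
B_values <- c(2000, 1e4, 1e5, 1e6)
data.frame(B = B_values, approx_se = mc_se(0.05, B_values))
```

The error shrinks with the square root of B, so each extra decimal place of stability costs about 100 times more replicates.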

So, answering your question, you should either use more workspace, so that your machine is able to perform all the needed computations (which can require hours, as mentioned in the documentation quoted above), or consider the Monte Carlo approach.
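The two options can be combined in a small wrapper, sketched below: try the exact test with a larger workspace and fall back to simulation only if it errors. The function name exact_or_simulated and the values 2e7 and 1e5 are assumptions for illustration. The demonstration uses the small Job satisfaction table from the fisher.test help page, because on the table in the question the exact computation may still fail or run for a very long time even with more workspace.

```r
# Try the exact test with an enlarged workspace (in units of 4 bytes);
# if that fails, report why and fall back to the Monte Carlo version.
exact_or_simulated <- function(tab, workspace = 2e7, B = 1e5) {
  tryCatch(
    fisher.test(tab, workspace = workspace),
    error = function(e) {
      message("Exact test failed: ", conditionMessage(e))
      fisher.test(tab, simulate.p.value = TRUE, B = B)
    }
  )
}

# Small 4 x 4 example table taken from the examples in ?fisher.test.
Job <- matrix(c(1, 2, 1, 0,  3, 3, 6, 1,  10, 10, 14, 9,  6, 7, 12, 11), 4, 4)

res <- exact_or_simulated(Job)
res$method   # tells you whether the exact or the simulated test was used
res$p.value
```

Checking res$method afterwards matters when replicating a paper, since a simulated p-value will not match an exact one to the last digit.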