R: help reading fixed-width format

Could you please help me read this file in R?

 Weekly SST data starts week centered on 3Jan1990

                Nino1 2      Nino3        Nino34        Nino4
 Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA
 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.1     28.6 0.5
 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.5
 17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.5
 24JAN1990     24.4-0.4     25.5-0.4     26.5-0.1     28.4 0.3
 31JAN1990     25.1-0.1     25.8-0.2     26.7 0.1     28.4 0.3
 07FEB1990     25.8 0.2     26.1-0.1     26.8 0.2     28.4 0.4
 14FEB1990     25.9 0.0     26.4 0.0     26.9 0.2     28.5 0.5
 21FEB1990     26.1 0.0     26.7 0.2     27.1 0.3     28.9 0.8

As you can see, below each NinoXX header, there are two data columns with SST and SSTA.

Any help appreciated!!

Answer:

Kludgy hack. It would be far better to ask the originating author(s) to provide a better format.

dat <- read.fwf(textConnection("
                Nino1 2      Nino3        Nino34        Nino4
 Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA
 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.1     28.6 0.5
 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.5
 17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.5
 24JAN1990     24.4-0.4     25.5-0.4     26.5-0.1     28.4 0.3
 31JAN1990     25.1-0.1     25.8-0.2     26.7 0.1     28.4 0.3
 07FEB1990     25.8 0.2     26.1-0.1     26.8 0.2     28.4 0.4
 14FEB1990     25.9 0.0     26.4 0.0     26.9 0.2     28.5 0.5
 21FEB1990     26.1 0.0     26.7 0.2     27.1 0.3     28.9 0.8"), c(15, 4,9, 4,9, 4,9, 4,4), skip = 2)
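# The first row read is the "Week SST SSTA ..." header line; use it for the column names
colnms <- trimws(unlist(dat[1,], use.names = FALSE))
# De-duplicate the repeated names by appending a running index (SST1, SSTA1, SST2, ...)
colnms <- paste0(colnms, ave(as.character(colnms), colnms, FUN = function(z) if (length(z) == 1) "" else seq_along(z)))
# Drop the header row and convert each column to its natural type
dat <- data.frame(lapply(setNames(dat[-1,], colnms), type.convert, as.is = TRUE))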
dat
#              Week SST1 SSTA1 SST2 SSTA2 SST3 SSTA3 SST4 SSTA4
# 1  03JAN1990      23.4  -0.4 25.1  -0.3 26.6   0.1 28.6   0.5
# 2  10JAN1990      23.4  -0.8 25.2  -0.3 26.6   0.1 28.6   0.5
# 3  17JAN1990      24.2  -0.3 25.3  -0.3 26.5  -0.1 28.6   0.5
# 4  24JAN1990      24.4  -0.4 25.5  -0.4 26.5  -0.1 28.4   0.3
# 5  31JAN1990      25.1  -0.1 25.8  -0.2 26.7   0.1 28.4   0.3
# 6  07FEB1990      25.8   0.2 26.1  -0.1 26.8   0.2 28.4   0.4
# 7  14FEB1990      25.9   0.0 26.4   0.0 26.9   0.2 28.5   0.5
# 8  21FEB1990      26.1   0.0 26.7   0.2 27.1   0.3 28.9   0.8
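
The renaming step is the least obvious part, so here it is in isolation. This is only a minimal sketch: hdr is typed in by hand (an assumption) and stands in for the trimmed first row of dat.

```r
# Hand-typed stand-in for trimws(unlist(dat[1,], use.names = FALSE))
hdr <- c("Week", rep(c("SST", "SSTA"), times = 4))

# Within each group of identical names, return "" when the name occurs once
# and a running index (coerced to text) when it repeats
idx <- ave(hdr, hdr, FUN = function(z) if (length(z) == 1) "" else seq_along(z))

colnms <- paste0(hdr, idx)
colnms
# [1] "Week"  "SST1"  "SSTA1" "SST2"  "SSTA2" "SST3"  "SSTA3" "SST4"  "SSTA4"
```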

If you have a file instead of just the text, you would use something like this for your first step.

dat <- read.fwf(filepath, c(15, 4,9, 4,9, 4,9, 4, 4), skip = 1)
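
How many lines to skip in a real file depends on what sits above the Week line. The following is a rough sketch, assuming the file looks like the listing in the question (title line and blank line included). The temporary file, path, and the two data rows are stand-ins for the real file.

```r
# Hypothetical stand-in for the real file: same layout as in the question
path <- tempfile(fileext = ".txt")
writeLines(c(
  " Weekly SST data starts week centered on 3Jan1990",
  "",
  "                Nino1 2      Nino3        Nino34        Nino4",
  " Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA",
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.1     28.6 0.5",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.5"
), path)

# Find the "Week  SST ..." header line and skip everything before it.
# The trailing \\s+SST keeps the "Weekly SST data ..." title from matching.
skip <- grep("^\\s*Week\\s+SST", readLines(path))[1] - 1
skip
# [1] 3

widths <- c(15, 4, 9, 4, 9, 4, 9, 4, 4)
# Row 1 of raw is the header line, the remaining rows are data
raw <- read.fwf(path, widths, skip = skip, stringsAsFactors = FALSE)
```

From here, raw can go through the same renaming and type.convert steps as dat above.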

Walk-through:

The widths (c(15, 4,9, ...)) were determined manually, nothing magical here. (Minor sub-note: I paired them visually as 15, then 4,9, etc; that is not a comma-decimal notation, it is merely showing visually that the 4 and 9 are logically assigned together; R ignores this and treats this as c(15, 4, 9, 4, 9, ...).)
skip=2 in the first code block is half aesthetic (for the answer), half functional. My first code block has a newline right after the opening quote; read.table would silently skip that empty line, but read.fwf does not, so skip=1 is needed just for that. Since I also want to skip the Nino* line, that becomes skip=2. For a real file that begins with the Nino* line, use skip=1. If the file also contains the title line and the blank line shown in the question, count those too (skip=3), so that the Week line is the first line read.
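
Since the widths were counted by hand, a quick sanity check on one record does no harm. A small sketch, using the first data line from the question and plain substring() to mimic the slicing; the variable names are only illustrative.

```r
widths <- c(15, 4, 9, 4, 9, 4, 9, 4, 4)
line <- " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.1     28.6 0.5"

# The widths should account for every character of the record
nchar(line) == sum(widths)

# Cut the line at the cumulative column boundaries
ends <- cumsum(widths)
starts <- ends - widths + 1
fields <- trimws(substring(line, starts, ends))
fields
# [1] "03JAN1990" "23.4" "-0.4" "25.1" "-0.3" "26.6" "0.1" "28.6" "0.5"
```

If a sign or digit ends up in the wrong field here, the widths are off by one somewhere.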

If you want to programmatically preserve the Nino number, then perhaps

ninos <- trimws(unlist(read.fwf(textConnection("
                Nino1 2      Nino3        Nino34        Nino4
 Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA"), c(15, 13, 13, 13, 8), skip = 1)[1,], use.names = FALSE))
ninos <- ninos[nzchar(ninos)]
colnames(dat)[-1] <- paste0(rep(ninos, each = 2), "_", colnms[-1])
dat
#              Week Nino1 2_SST1 Nino1 2_SSTA1 Nino3_SST2 Nino3_SSTA2 Nino34_SST3 Nino34_SSTA3 Nino4_SST4 Nino4_SSTA4
# 1  03JAN1990              23.4          -0.4       25.1        -0.3        26.6          0.1       28.6         0.5
# 2  10JAN1990              23.4          -0.8       25.2        -0.3        26.6          0.1       28.6         0.5
# 3  17JAN1990              24.2          -0.3       25.3        -0.3        26.5         -0.1       28.6         0.5
# 4  24JAN1990              24.4          -0.4       25.5        -0.4        26.5         -0.1       28.4         0.3
# 5  31JAN1990              25.1          -0.1       25.8        -0.2        26.7          0.1       28.4         0.3
# 6  07FEB1990              25.8           0.2       26.1        -0.1        26.8          0.2       28.4         0.4
# 7  14FEB1990              25.9           0.0       26.4         0.0        26.9          0.2       28.5         0.5
# 8  21FEB1990              26.1           0.0       26.7         0.2        27.1          0.3       28.9         0.8

Note that the names containing a space (here the Nino1 2 ones) are not syntactically valid in R, so you'll need backticks for them, e.g.,

dat$`Nino1 2_SST1`
# [1] 23.4 23.4 24.2 24.4 25.1 25.8 25.9 26.1

That can be remedied in any number of ways; over to you.
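
For what it's worth, here are two of those ways, sketched on a cut-down stand-in for dat (two rows, three columns, typed in by hand). Whether "Nino1 2" should become "Nino12" or something else is a guess on my part.

```r
# Hand-typed stand-in for the renamed data frame
dat <- data.frame(
  Week = c("03JAN1990", "10JAN1990"),
  `Nino1 2_SST1` = c(23.4, 23.4),
  `Nino1 2_SSTA1` = c(-0.4, -0.8),
  check.names = FALSE
)

# Option 1: let base R make the names syntactically valid (space becomes ".")
valid <- make.names(names(dat))
valid
# [1] "Week"          "Nino1.2_SST1"  "Nino1.2_SSTA1"

# Option 2: drop the space and the now-redundant trailing index
names(dat) <- sub("[0-9]+$", "", gsub(" ", "", names(dat)))
names(dat)
# [1] "Week"        "Nino12_SST"  "Nino12_SSTA"

dat$Nino12_SST   # no backticks needed any more
```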