I am working with R.

My data looks like this...

un01 <- structure(list(...1 = c("mean", "n", "SD"), hill = c(2.26315789473684, 
                                                       38, 2.0623289187203), user = c(1.13157894736842, 38, 1.57978877533674
                                                       ), name = c(2.42105263157895, 38, 1.96773403700762), paint = c(4.18421052631579, 
                                                                                                                               38, 1.81369062527503), smile = c(6.05263157894737, 38, 1.98572430003074
                                                                                                                               ), back = c(4.10526315789474, 38, 2.0767629916461), mouse = c(7.35135135135135, 
                                                                                                                                                                                             38, 1.75144727798686), potato = c(5.94736842105263, 38, 2.01276156159657
                                                                                                                                                                                             ), pain = c(1.34210526315789, 38, 1.69699759972873), life = c(4.05263157894737, 
                                                                                                                                                                                                                                                                      38, 1.99928863593695)), row.names = c(NA, -3L), class = c("tbl_df", 
                                                                                                                                                                                                                                                                                                                                "tbl", "data.frame"))

This is a summary of my data. My real data contains 3 observations and 160 variables. I also have 25 different csv's.

The name of the csv's are:

un01 un02 un03 un04 un05 pm01...pm05 kl01... jp01... tr01... (25 csv's)

Every cvs is organized in the same manner. Every csv that contains the code 01 has the same words, this mean that if the un01 (the one I am sahring above) has the words: hill, user, name, paint, smile, back, mouse, potato, pain, life. Then, all the 01 (pm01, kl01, ...) contain this very same words. However, they are in different documents because they are different variables.

So, what I need is my output to look like this...

word     un_mean  un_n  un_SD  pm_mean  pm_n  pm_SD  kl_mean  kl_n  kl_SD ...rest of the variables
hill
user
name
paint
smile
back
mouse
potato
pain 
life
*This first 10 words are the same for every 01 version
word one (1) of every variable coded with 02
word two of every variable coded with 02
word three of of every variable coded with 02
" " until word 10 of every variable coded with 02
" " same with every variable until you reach 05


Then you will reach to the total of words

So, I need the words in a single column and their values in each variable. This means that each row is an observation/word and then we have the value of that word on each variable (un, pm, kl, jp, tr)

How can I do this?

****ONE thing I forgot, the documents that are coded with a 05 do not contain the same amount of words as the rest of the document. So they contain 6 columns less.

adding un05 and pm01

un05

un05 <- structure(list(...1 = c("mean", "n", "SD"), house = c(5.08823529411765, 
                                                              34, 1.65817802638964), person = c(4.11764705882353, 34, 2.15694376772143
                                                              ), city = c(2.47058823529412, 34, 1.70978102651927), agressive = c(5, 
                                                                                                                                      34, 1.95401684183679)), row.names = c(NA, -3L), class = c("tbl_df", 
                                                                                                                                                                                                "tbl", "data.frame"))

pm01

pm01 <- structure(list(...1 = c("mean", "n", "SD"), hill = c(5.3030303030303, 
                                                         33, 1.64857606142484), user = c(4.24242424242424, 33, 1.96898297082375
                                                         ), name = c(7.18181818181818, 33, 1.18465568453838), 
               paint = c(3.27272727272727, 33, 1.71887912738082), 
               smile = c(1.15151515151515, 33, 4.8728523053289), back = c(4.3030303030303, 
                                                                             33, 2.02306772756204), mouse = c(7.06060606060606, 33, 1.57994054356869
                                                                             ), potato = c(3.34666666666667, 33, 1.36289077492212), pain = c(4.17666666666667, 
                                                                                                                                             33, 1.25762044965971), life = c(4.22121212121212, 33, 1.40884650861919
                                                                                                                                             )), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
                                                                                                                                             ))

CodePudding user response：

I think one good starting point could be to load all needed files in a list and all names of the files in another list or vector, e.g. something like

names2use <- list.files(path2direcitonwherefilesare)
fileList <- lapply(names2use, function(x) { read.csv(file.path(path2direcitonwherefilesare, x), ...)

Then you can create a matrix/data.frame an fill it up, e.g.

#my dummy data set
un01 <- data.frame(types=c("mean", "sd"), hill=c(0.1,0.2), boat=c(0.1,0.2))
un02 <- data.frame(types=c("mean", "sd"), trip=c(0.1,0.2), clip=c(0.1,0.2))
jp01 <- data.frame(types=c("mean", "sd"), hill=c(0.1,0.2), boat=c(0.1,0.2))
fileList <- list(un01, un02, jp01)
filenames <- c("un01", "un02", "jp01")

types <- unique(substr(filenames, start = 1, stop = 2))
infos <- as.character(un01[,1])
names2use <- c(paste(types, sort(rep(infos, length(types))), sep="_"))
rows <- unique(unlist(lapply(fileList, function(x) {names(x)})))[-1]

#creating empty an data.frame
data <- as.data.frame(matrix(NA, nrow=length(cols), ncol=length(names2use) 1))
data[,1] <- rows
names(data) <- c("word", sort(names2use))

i <- 1
for (file in fileList) {
  filename <- substr(filenames[[i]],1,2)
  for (n in 2:length(names(file))) {
    data[data$word==names(file)[n], substr(names(data),1,2) %in% filename] <- file[[names(file)[n]]]
  }
  i <- i   1
}

> data
  word jp_mean jp_sd un_mean un_sd
1 hill     0.1   0.2     0.1   0.2
2 boat     0.1   0.2     0.1   0.2
3 trip      NA    NA     0.1   0.2
4 clip      NA    NA     0.1   0.2

CodePudding user response：

Here's an elegant solution in the tidyverse, without any cumbersome looping. It assumes that all the variables {un01, ..., un05, pm01, ..., pm05, ..., tr01, ..., tr05} exist as data.frames in your workspace; though it could be easily modified to accommodate another situation.

Solution

# Use the tidyverse.
library(tidyverse)

First create the standardize_csv() function like so:

# Function to transform CSVs into a standard form.
standardize_csv <- function(csv_data, csv_name){
  # Extract a prefix to use when naming columns and indexing the results.
  csv_prefix <- gsub(
    x = csv_name,
    pattern = "^([a-z]*)(\\d*)$",
    replacement = "\\1"
  )
  
  # Turn the word columns into a single column.
  csv_data %>% pivot_longer(
    cols = !...1,
    names_to = "word"
  ) %>%
    # Include the CSV prefix for reference.
    mutate(
      csv_prefix = csv_prefix
    ) %>%
    # Rename and reformat the dataset.
    select(
      csv_prefix,
      word,
      metric = ...1,
      value
    )
}

Armed with standardize_csv(), you can do the rest in three easy steps:

# Find all CSV variables named by your convention, within your workspace.
csv_names <- ls(pattern = "^[a-z]{2,2}0[1-5]$")


# Consolidate into the final form:
final_df <- mapply(
  # Transform into standard form...
  FUN = standardize_csv,
  # ...the datasets...
  csv_data = sapply(
    X = csv_names,
    FUN = get,
    simplify = FALSE
  ),
  # ...from all the named variables.
  csv_name = csv_names,
  
  SIMPLIFY = FALSE
) %>%
  # Stack those standardized datasets into a single table.
  do.call(
    what = bind_rows
  ) %>%
  # Pivot the table to have one column for each family and metric ('un_mean',
  # 'pm_SD', etc.).
  pivot_wider(
    names_from = c(csv_prefix, metric),
    names_glue = "{csv_prefix}_{metric}",
    values_from = value,
  )

# View the result.
final_df

Result

Given the CSV variables {un01, un05, pm01} that you sampled

un01 <- structure(
  list(
    ...1 = c("mean", "n", "SD"),
    hill = c(2.26315789473684, 38, 2.0623289187203),
    user = c(1.13157894736842, 38, 1.57978877533674),
    name = c(2.42105263157895, 38, 1.96773403700762),
    paint = c(4.18421052631579, 38, 1.81369062527503),
    smile = c(6.05263157894737, 38, 1.98572430003074),
    back = c(4.10526315789474, 38, 2.0767629916461),
    mouse = c(7.35135135135135, 38, 1.75144727798686),
    potato = c(5.94736842105263, 38, 2.01276156159657),
    pain = c(1.34210526315789, 38, 1.69699759972873),
    life = c(4.05263157894737, 38, 1.99928863593695)
  ),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame")
)

un05 <- structure(
  list(
    ...1 = c("mean", "n", "SD"),
    house = c(5.08823529411765, 34, 1.65817802638964),
    person = c(4.11764705882353, 34, 2.15694376772143),
    city = c(2.47058823529412, 34, 1.70978102651927),
    agressive = c(5, 34, 1.95401684183679)
  ),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame")
)

pm01 <- structure(
  list(
    ...1 = c("mean", "n", "SD"),
    hill = c(5.3030303030303, 33, 1.64857606142484),
    user = c(4.24242424242424, 33, 1.96898297082375),
    name = c(7.18181818181818, 33, 1.18465568453838),
    paint = c(3.27272727272727, 33, 1.71887912738082),
    smile = c(1.15151515151515, 33, 4.8728523053289),
    back = c(4.3030303030303, 33, 2.02306772756204),
    mouse = c(7.06060606060606, 33, 1.57994054356869),
    potato = c(3.34666666666667, 33, 1.36289077492212),
    pain = c(4.17666666666667, 33, 1.25762044965971),
    life = c(4.22121212121212, 33, 1.40884650861919)),
  row.names = c(NA, -3L),
  class = c("tbl_df", "tbl", "data.frame")
)

the result for final_df should look like this:

# A tibble: 14 x 7
   word      pm_mean  pm_n pm_SD un_mean  un_n un_SD
   <chr>       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 hill         5.30    33  1.65    2.26    38  2.06
 2 user         4.24    33  1.97    1.13    38  1.58
 3 name         7.18    33  1.18    2.42    38  1.97
 4 paint        3.27    33  1.72    4.18    38  1.81
 5 smile        1.15    33  4.87    6.05    38  1.99
 6 back         4.30    33  2.02    4.11    38  2.08
 7 mouse        7.06    33  1.58    7.35    38  1.75
 8 potato       3.35    33  1.36    5.95    38  2.01
 9 pain         4.18    33  1.26    1.34    38  1.70
10 life         4.22    33  1.41    4.05    38  2.00
11 house       NA       NA NA       5.09    34  1.66
12 person      NA       NA NA       4.12    34  2.16
13 city        NA       NA NA       2.47    34  1.71
14 agressive   NA       NA NA       5       34  1.95

Note

Since no pm_05 was available, there are "blanks" (NAs) for the corresponding words under the pm_* columns. This illustrates how missing CSVs are handled, but as long as you provide all the CSV variables, there should be no blanks in your full result.