Pairwise similarity with consecutive points

I have a large document similarity matrix created with paragraph2vec_similarity from the doc2vec package. I converted it to a data frame and added a Title column at the beginning so I can sort or group it later.

Current Dummy Output:

Title	Header	Doc1.1	Doc1.2	Doc1.3	Doc2.1	Doc2.2
Doc1	Doc1.1	1.000000	0.7369358	0.6418045	0.6268959	0.6823404
Doc1	Doc1.2	0.7369358	1.000000	0.6544884	0.7418507	0.5174367
Doc1	Doc1.3	0.6418045	0.6544884	1.000000	0.6180578	0.5274650
Doc2	Doc2.1	0.6268959	0.7418507	0.6180578	1.000000	0.5755243
Doc2	Doc2.2	0.6823404	0.5174367	0.5274650	0.5755243	1.000000

What I want is a data frame giving the similarity between consecutive entries of each document: that is, the score for Doc1.1 and Doc1.2, then Doc1.2 and Doc1.3, and so on. I am only interested in similarity scores inside each individual document, i.e. the values directly next to the main diagonal within each document's block.

Expected Output

Title	Similarity for 1-2	Similarity for 2-3	Similarity for 3-4
Doc1	0.7369358	0.6544884	NA
Doc2	0.5755243	NA	NA
Doc3	0.6049844	0.5250659	0.5113757

I was able to produce a long data frame giving the similarity of each doc with all the remaining docs using x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(m)], similarity = c(m)). This is the closest I could get. Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it runs along the whole matrix and I lose the document grouping.

CodePudding user response：

If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from the Header values and the former column names, split both on the dot, and keep only the rows where the document number matches and the indices differ by one. Finally, a new column is generated to serve as the column names, and the data is made wide again:

library(tidyverse)

# set up / read in dummy data
df <- data.table::fread("Title  Header  Doc1.1  Doc1.2  Doc1.3  Doc2.1  Doc2.2
Doc1    Doc1.1  1.000000    0.7369358   0.6418045   0.6268959   0.6823404
Doc1    Doc1.2  0.7369358   1.000000    0.6544884   0.7418507   0.5174367
Doc1    Doc1.3  0.6418045   0.6544884   1.000000    0.6180578   0.5274650
Doc2    Doc2.1  0.6268959   0.7418507   0.6180578   1.000000    0.5755243
Doc2    Doc2.2  0.6823404   0.5174367   0.5274650   0.5755243   1.000000")

df %>%
    tidyr::pivot_longer(-c(Title, Header)) %>% 
    dplyr::mutate(across(c(Header, name), ~ stringr::str_remove(.x, "^[a-zA-Z]+"))) %>%
    tidyr::separate(Header, sep = "\\.", into = c("f1","f2")) %>%
    tidyr::separate(name, sep = "\\.", into = c("s1","s2")) %>% 
    dplyr::filter(f1 == s1 & (as.numeric(f2) - as.numeric(s2)) == 1) %>% 
    dplyr::mutate(cols = paste("Similarity for", s2, "-", f2)) %>% 
    tidyr::pivot_wider(id_cols = Title, names_from = "cols", values_from = value)


# A tibble: 2 x 3
  Title `Similarity for 1 - 2` `Similarity for 2 - 3`
  <chr>                  <dbl>                  <dbl>
1 Doc1                   0.737                  0.654
2 Doc2                   0.576                 NA
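
The filter above keeps only the entries that sit directly next to the main diagonal and stay within one document. As a cross-check, here is a minimal base R sketch of the same idea. It assumes the similarity matrix m has row and column names of the form Title.index, with the rows of each document adjacent and in order. The helper name consecutive_sim is hypothetical and is not part of doc2vec or the tidyverse.

```r
# Dummy similarity matrix from the question; names follow "<Title>.<index>".
ids <- c("Doc1.1", "Doc1.2", "Doc1.3", "Doc2.1", "Doc2.2")
m <- matrix(c(
  1.0000000, 0.7369358, 0.6418045, 0.6268959, 0.6823404,
  0.7369358, 1.0000000, 0.6544884, 0.7418507, 0.5174367,
  0.6418045, 0.6544884, 1.0000000, 0.6180578, 0.5274650,
  0.6268959, 0.7418507, 0.6180578, 1.0000000, 0.5755243,
  0.6823404, 0.5174367, 0.5274650, 0.5755243, 1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical helper: one output row per pair of adjacent rows
# that belong to the same title (assumes rows are already ordered).
consecutive_sim <- function(m) {
  n <- nrow(m)
  title <- sub("\\.[^.]*$", "", rownames(m))  # "Doc1.2" -> "Doc1"
  i <- seq_len(n - 1)                         # row i is paired with row i + 1
  keep <- title[i] == title[i + 1]            # drop pairs that cross titles
  data.frame(
    Title = title[i][keep],
    from = rownames(m)[i][keep],
    to = rownames(m)[i + 1][keep],
    similarity = m[cbind(i, i + 1)][keep],    # entries just above the diagonal
    stringsAsFactors = FALSE
  )
}

res <- consecutive_sim(m)
print(res)
```

The result stays in long format with one row per consecutive pair, so the pair that crosses from Doc1.3 to Doc2.1 is simply dropped. If the wide layout from the question is needed, it can be reshaped afterwards, for example with tidyr::pivot_wider.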

CodePudding user response：

Another solution:

df %>%
  group_by(Title) %>%
  summarize(name = embed(Header, 2), .groups = 'drop') %>%
  mutate(value = transform(df, row.names = Header)[name],
         name = str_remove_all(paste(name[,2],name[,1], sep = '_'), '[^_]+[.]'))%>%
  pivot_wider()

# A tibble: 2 x 3
  Title `1_2`     `2_3`    
  <chr> <chr>     <chr>    
1 Doc1  0.7369358 0.6544884
2 Doc2  0.5755243 NA