I'm trying to construct a co-occurrence matrix from my dataframe on Databricks using the pyspark.pandas API.
To build the matrix, I tried the method from "Constructing a co-occurrence matrix in python pandas".
The code works fine in pandas, but throws an error with pyspark.pandas:
coocc = psdf.T.dot(psdf)
coocc
I'm getting this error:
TypeError: Unsupported type DataFrame
I checked the docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.dot.html
pyspark.pandas.DataFrame.dot() takes a Series as input.
I tried converting the dataframe to a series using psdf.squeeze(), but it does not convert it, because my dataframe has multiple columns.
Is there any way to convert a pyspark.pandas.DataFrame to a pyspark.pandas.Series?
Or is there a different method in pyspark.pandas to construct a co-occurrence matrix?
CodePudding user response:
I solved it using csr_matrix, since the dataframe has '1' and '0' as values:
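import pyspark.pandas as ps
import scipy.sparse as sp

# .values collects the data to the driver, so this suits data that fits in memory
psdfx = sp.csr_matrix(psdf.astype(int).values)
# X^T X: entry (i, j) counts the rows where columns i and j are both 1
psdfc = psdfx.T * psdfx
# the diagonal holds each column's own count, so zero it out
psdfc.setdiag(0)
coocc = ps.DataFrame(psdfc.todense(), columns=psdf.columns, index=psdf.columns)
coocc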
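If you want to sanity-check what the product computes, here is a minimal dense sketch with plain NumPy on a tiny made-up table. The data and the column names a, b, c are hypothetical, not from the original dataframe. It mirrors the same idea (X^T X, then zero the diagonal) without Spark or SciPy:

```python
import numpy as np

# Hypothetical 4-row, 3-column 0/1 table; the column names are made up.
columns = ["a", "b", "c"]
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# Entry (i, j) of X.T @ X counts the rows where columns i and j are both 1.
coocc = X.T @ X
# The diagonal holds each column's own count, so zero it out.
np.fill_diagonal(coocc, 0)

print(coocc)
```

Here a and b are both 1 in two rows, a and c in two rows, and b and c in one row, so the off-diagonal entries should be 2, 2 and 1, and the matrix is symmetric.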
