I have a sparksql select query as below
select max(age),min(age),avg(age),max(sal),min(sal),avg(sal) from Emp;
Output dataframe is getting created as below:
| max(age) | min(age) | avg(age) | max(sal) | min(sal) | avg(sal) |
|---|---|---|---|---|---|
| 46 | 23 | 31 | 10000 | 2000 | 5000 |
My requirement is the dataframe should be as below using pyspark using transpose.
| columns | max | min | avg |
|---|---|---|---|
| age | 46 | 23 | 31 |
| sal | 10000 | 2000 | 5000 |
Thanks for the help in advance.
CodePudding user response:
The easiest way would be to run two queries (one for sal and one for age and union them.
select 'age' as column, max(age) as max, min(age) as min, avg(age) as avg from Emp;
select 'sal' as column, max(sal) as max, min(sal) as min, avg(sal) as avg from Emp;
Load those into two dataframes df_sal and df_age and union them:
final = df_sal.union(df_age)
Update:
In case only a single query can be done (as commented by the OP). In this case the stack method can help you.
df = spark.createDataFrame([
Row(avg_sal=1, max_sal=1, min_sal=1, avg_age=1, max_age=1, min_age=1)
])
df.show()
------- ------- ------- ------- ------- -------
|avg_sal|max_sal|min_sal|avg_age|max_age|min_age|
------- ------- ------- ------- ------- -------
| 1| 2| 3| 4| 5| 6|
------- ------- ------- ------- ------- -------
(
df
.select(F.expr("stack(2, 'sal', avg_sal, max_sal, min_sal, 'age', avg_age, max_age, min_age) as (column, avg, max, min)"))
.show()
)
------ --- --- ---
|column|avg|max|min|
------ --- --- ---
| sal| 1| 2| 3|
| age| 4| 5| 6|
------ --- --- ---
In the example I renamed the input columns to avoid problems with brackets in column names. This can be done directly in the SQL query.
CodePudding user response:
If your original dataframe df look like below,
----- ------
| age| sals|
----- ------
| 46| 2000|
| 23| 10000|
| ...| ...|
----- ------
then
org_cols = df.columns
# ['age', 'sal']
sql = spark.sql('select max(age),min(age),avg(age),max(sal),min(sal),avg(sal) from Emp;')
cur_cols = sql.columns
df2 = sql.rdd.flatMap(lambda row: [[org_cols [i], row[3*i], row[3*i 1], row[3*i 2]] for i in range(0, int(len(row) / 3))]).toDF(['columns', 'max', 'min', 'avg'])
df2.show()
------- ----- ---- ----
|columns| max| min| avg|
------- ----- ---- ----
| age| 46| 23| 31|
| sal|10000|2000|5000|
------- ----- ---- ----
