Finding the minimum value of a column using PySpark Map Reduce-CodePudding

I'm looking to find out how I can use the map and reduce functions on a PySpark RDD to determine the smallest value in each of my RDD columns.

I understand that the agg function can be used on dataframes, but I really want to be able to perform the function on a large dataset in parallel.

e.g. in the below RDD example, I'd want to find the smallest values in the Value 1 and Value 2 columns.

 ------- --------- --------- 
| Entry | Value 1 | Value 2 |
 ------- --------- --------- 
| A     | 7.034   | 0.342   |
 ------- --------- --------- 
| B     | 3.684   | 1.043   |
 ------- --------- --------- 
| C     | 2.963   | 0.085   |
 ------- --------- ---------

Thanks!

CodePudding user response：

Using the following sample.txt file as an example.

A,7.034,0.342
B,3.684,1.043
C,2.963,0.085

Using the following code we can get the minimum value of the val1, and val2 columns

Firstly, each line mapped to a row using the map function, then we are using reduce to find the minimum value in each column

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('word count')
sc = SparkContext(conf=conf)

def words_once(line):
    a = line.split(",")
    return (a[0], float(a[1]), float(a[2]))

def get_min_val1(x,y):
    if x[1] < y[1]:
        return x
    return y

def get_min_val2(x,y):
    if x[2] < y[2]:
        return x
    return y

text = sc.textFile("./sample.txt")
words = text.map(words_once)
words.foreach(lambda x: print(x, "yee"))

lowest_val1 = words.reduce(get_min_val1)[1]
lowest_val2 = words.reduce(get_min_val2)[2]

print("\n\n")
print("Val1 Min: {}".format(lowest_val1))
print("Val2 Min: {}".format(lowest_val2))
print("\n\n")

CodePudding user response：

I do something like this, there might be more efficient solutions.

words_df = words.toDF(schema)
min_colName1 = words_df.order_by('colName1').first()['colName1']
min_colName2 = words_df.order_by('colName2').first()['colName2']