I am currently using this piece of code :
class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        with Spark() as spark:
            return cls(spark)
My object obviously depends on the Spark object (another class that I wrote; I can add its code if needed, but I do not think it is required for this issue).
It can be used in two different ways, both resulting in the same behavior:
fs = FileSystem.without_spark()
# OR
with Spark() as spark:
    fs = FileSystem(spark)
My problem is that, even though FileSystem is a singleton, calling the class method without_spark enters (__enter__) the Spark context manager every time, which opens a connection to the Spark cluster and takes a lot of time. How can I make the first call to without_spark open the connection, while later calls only return the already created instance?
The expected behavior would be something like this :
@classmethod
def without_spark(cls):
    if not cls.exists:  # I do not know how to persist this information in the class
        with Spark() as spark:
            return cls(spark)
    else:
        return cls()
CodePudding user response:
I think you are looking for something like
import contextlib

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""
    spark = None

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if cls.spark is None:
            cm = cls.spark = Spark()
        else:
            cm = contextlib.nullcontext(cls.spark)
        with cm as s:
            return cls(s)
The first time without_spark is called, a new instance of Spark is created and used as a context manager. Subsequent calls reuse the same Spark instance and use a null context manager.
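To see the nullcontext trick in isolation, here is a minimal sketch. FakeSpark and get_session are hypothetical names used only for illustration; FakeSpark merely counts how often it is entered and does not talk to any cluster. Unlike the snippet above, this sketch caches the value returned by __enter__ rather than the context manager itself.

```python
import contextlib


class FakeSpark:
    """Hypothetical stand-in for the asker's Spark context manager."""
    enter_count = 0

    def __enter__(self):
        FakeSpark.enter_count += 1  # pretend this is the slow cluster connection
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


_session = None  # cached session, shared across calls


def get_session():
    """Enter FakeSpark only on the first call; reuse the cached session afterwards."""
    global _session
    if _session is None:
        cm = FakeSpark()
    else:
        # nullcontext(x) is a no-op context manager whose __enter__ returns x
        cm = contextlib.nullcontext(_session)
    with cm as s:
        _session = s
        return s


first = get_session()
second = get_session()
```

Note that the first with block still exits, so whether the cached object stays usable afterwards depends on what the real Spark.__exit__ does, which the question does not show.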
I believe your approach will work as well; you just need to initialize exists to False as a class attribute, then set it to True the first time the class method is called (setting it on every call would be harmless too).
class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""
    exists = False

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if not cls.exists:
            cls.exists = True
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()
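The bare cls() call only works if the Singleton metaclass returns the cached instance without calling __init__ again. The question does not show Singleton, so the sketch below assumes a common implementation; FakeSpark is again a hypothetical stand-in that just counts how often it is entered.

```python
class Singleton(type):
    """Assumed implementation: one cached instance per class."""
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Only construct (and run __init__) the first time the class is called
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class FakeSpark:
    """Hypothetical stand-in for the asker's Spark context manager."""
    enter_count = 0

    def __enter__(self):
        FakeSpark.enter_count += 1  # pretend this is the slow cluster connection
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


class FileSystem(metaclass=Singleton):
    exists = False

    def __init__(self, spark):
        self._spark = spark  # the real class would build _path and _fs here

    @classmethod
    def without_spark(cls):
        if not cls.exists:
            cls.exists = True
            with FakeSpark() as spark:
                return cls(spark)
        # Singleton.__call__ skips __init__ here, so no argument is needed
        return cls()


a = FileSystem.without_spark()
b = FileSystem.without_spark()
```

One caveat under this assumption: if FileSystem(spark) is created directly before without_spark is ever called, exists is still False, so the first without_spark call enters Spark once more even though the instance already exists.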
CodePudding user response:
Can't you make the constructor argument optional and initialize Spark lazily, e.g. in a property (or functools.cached_property)?
from functools import cached_property

class FileSystem(metaclass=Singleton):
    def __init__(self, spark=None):
        self._spark = spark

    @cached_property
    def spark(self):
        # Create Spark only on first access, and only if none was passed in.
        # (An assignment expression cannot target an attribute, so assign first.)
        if self._spark is None:
            self._spark = Spark()
        return self._spark

    @cached_property
    def path(self):
        return self.spark._jvm.org.apache.hadoop.fs.Path

    @cached_property
    def fs(self):
        with self.spark:
            return self.spark._jvm.org.apache.hadoop.fs.FileSystem.get(
                self.spark._jsc.hadoopConfiguration()
            )
