I am hosting a pretrained fastText model on S3 (uncompressed) and I am trying to load it in a Lambda function. I am using the gensim.models.fasttext module to load the model:
from gensim.models.fasttext import load_facebook_vectors
def load_model(obj):
    model = load_facebook_vectors(obj["path"])
with obj["path"] being the s3 path, but I keep getting the following error:
"errorMessage": "fileno"
"errorType": "UnsupportedOperation"
"stackTrace": [
...
" File \"/var/task/gensim/models/fasttext.py\", line 784, in load_facebook_vectors\n full_model = _load_fasttext_format(path, encoding=encoding, full_model=False)\n"
" File \"/var/task/gensim/models/fasttext.py\", line 808, in _load_fasttext_format\n m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)\n"
" File \"/var/task/gensim/models/_fasttext_bin.py\", line 348, in load\n vectors_ngrams = _load_matrix(fin, new_format=new_format)\n"
" File \"/var/task/gensim/models/_fasttext_bin.py\", line 282, in _load_matrix\n matrix = np.fromfile(fin, _FLOAT_DTYPE, count)\n"
]
CodePudding user response:
Unfortunately, the np.fromfile() function on which this load depends needs a real file on the local filesystem (hence the "fileno" error in your stack trace), so it doesn't work on a file streamed from S3.
Some alternate options include:
- download the S3 file to a local path first, then use load_facebook_vectors() from there; or…
- while having the FastText file local, load it locally, then use Python's pickle functionality to save it to a single file (now of Python's format), then put that file on S3, and in the future re-load it using Python's unpickling
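A minimal sketch of the first option, assuming boto3 is available in the Lambda environment, that obj["path"] is an s3://bucket/key URL, and that the model fits in the function's /tmp storage. The helper name split_s3_url and the local_dir parameter are made up for illustration:

```python
import os
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def load_model(obj, local_dir="/tmp"):
    # Imported lazily so the URL helper above can be used without these packages.
    import boto3
    from gensim.models.fasttext import load_facebook_vectors

    bucket, key = split_s3_url(obj["path"])
    local_path = os.path.join(local_dir, os.path.basename(key))

    # Reuse the file if a previous (warm) invocation already downloaded it.
    if not os.path.exists(local_path):
        boto3.client("s3").download_file(bucket, key, local_path)

    # np.fromfile() now gets a real local file, so the load can proceed.
    return load_facebook_vectors(local_path)
```

Calling load_model({"path": "s3://my-bucket/models/cc.en.300.bin"}) (a placeholder path) would download the file once per container and then load it from local disk.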
The utility functions pickle() and unpickle() in gensim.utils (which take a file path, including S3 URLs) may be helpful for the 2nd option, e.g.:
https://radimrehurek.com/gensim/utils.html#gensim.utils.unpickle
Since your prior code only shows using the vectors (via load_facebook_vectors()), not the whole model, you could just pickle & upload the model.wv subcomponent of the loaded model, rather than the whole model, to save some storage/bandwidth.
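The pickle-based option could look roughly like the sketch below. It assumes smart_open's S3 support is installed wherever these functions run; the function names and the paths you pass in are placeholders, not anything from your setup. Note that load_facebook_vectors() already returns just the vectors, so there is no separate .wv to extract in this case:

```python
def convert_and_upload(local_bin_path, s3_url):
    """One-time step, run where the Facebook-format .bin file is local."""
    # Imported lazily so defining these functions does not require gensim.
    from gensim.models.fasttext import load_facebook_vectors
    from gensim.utils import pickle as gensim_pickle

    # Returns only the vectors, not the full trainable model.
    vectors = load_facebook_vectors(local_bin_path)

    # gensim.utils.pickle(obj, fname) writes a single pickled file;
    # here fname is the destination S3 URL.
    gensim_pickle(vectors, s3_url)


def load_vectors(s3_url):
    """Step to run inside the Lambda function on each cold start."""
    from gensim.utils import unpickle

    # Reads the single pickled file back; no np.fromfile() involved.
    return unpickle(s3_url)
```

The Gensim version in the Lambda should match the one used for the one-time conversion, for the reasons discussed next.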
If the FastText-related classes change in shape or operation in future Gensim versions, an old pickled model might not load cleanly. In that case, you could either:
- go back to the original Facebook-format model file (which could then be loaded, & then re-saved in a modern format, again); OR...
- load your pickled model into the older Gensim where it works, save it locally using Gensim's native .save() (which may split it over multiple local files), then in the newer Gensim use Gensim's native FastText.load() to load those older files (which will usually handle older formats), then re-pickle that loaded model, for future re-unpickles into the matching latest Gensim.
CodePudding user response:
The documentation for load_facebook_vectors says:
This function uses the smart_open library to open the path. The path may be on a remote host (e.g. HTTP, S3, etc).
There are examples of accessing S3 objects in the smart_open documentation. I have not personally tried this, but I wanted to make sure you had eliminated all options before deciding to download the object and access it locally.
