Python version: 3.6.9

I've used pickle to dump a machine learning model into a file, and when I try to run a prediction on it using Flask, it fails with ModuleNotFoundError: No module named 'predictors'. How can I fix this error so that it recognizes my model, whether I try to run a prediction via Flask or via the Python command (e.g. python predict_edu.py)?

Here is my file structure:

 - video_discovery
   - __init__.py
   - data_science
     - model
     - __init__.py
     - predict_edu.py
     - predictors.py
     - train_model.py

Here's my predict_edu.py file:

import pickle

with open('model', 'rb') as f:
    bow_model = pickle.load(f)

Here's my predictors.py file:

from sklearn.base import TransformerMixin

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

# Custom transformer using spaCy
class predictor_transformer(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

Here's how I train my model:

python data_science/train_model.py

Here's my train_model.py file:

import pickle

from sklearn.pipeline import Pipeline

from predictors import predictor_transformer

# pipeline = Pipeline([("cleaner", predictor_transformer()), ('vectorizer', bow_vector), ('classifier', classifier_18p)])
pipeline = Pipeline([("cleaner", predictor_transformer())])

with open('model', 'wb') as f:
    pickle.dump(pipeline, f)

My Flask app is in: video_discovery/__init__.py

Here's how I run my Flask app:

FLASK_ENV=development FLASK_APP=video_discovery flask run

I believe the issue may be occurring because I'm training the model by running the Python script directly instead of using Flask, so there might be some namespace issues, but I'm not sure how to fix this. It takes a while to train my model, so I can't exactly wait on an HTTP request.

What am I missing that might fix this issue?

CodePudding user response：

It seems a bit strange that you get that error when executing predict_edu.py, since it is in the same directory as predictors.py; an absolute import such as from predictors import predictor_transformer (without the leading dot) should normally work there. However, below are a few options you could try if the error persists.

Option 1

You could add the directory that contains predictors.py to sys.path (Python's module search path, not the system PATH variable) before attempting to import the module.

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent))
from predictors import predictor_transformer
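
To check whether a path change like the one above took effect, you can ask the import system whether it can locate the module before unpickling. This is only a sketch: the helper name can_import is hypothetical, and "json" and "no_such_module_xyz" are stand-ins for a module that is and is not on sys.path (in your case the name to check would be predictors).

```python
import importlib.util


def can_import(module_name):
    """Return True if the import system can locate module_name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package of a dotted name cannot be imported.
        # ValueError: an already-loaded module has no usable __spec__.
        return False


# "json" stands in for a module that is on sys.path;
# the second name is assumed not to exist anywhere.
print(can_import("json"))
print(can_import("no_such_module_xyz"))
```

If the check fails for predictors in the process that calls pickle.load, the unpickling will fail with the same ModuleNotFoundError.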

Option 2

Use relative imports, e.g., from .predictors import ..., and run the script as a module from the directory that contains the video_discovery package (i.e., from outside the project's directory), as shown below. With the -m option, Python will "search sys.path for the named module and execute its contents as the __main__ module", so the module keeps its package context instead of being run as a standalone top-level script.

python -m video_discovery.data_science.predict_edu

However, the PEP 8 style guide recommends using absolute imports in general.

Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages) if the import system is incorrectly configured (such as when a directory inside a package ends up on sys.path)

In certain cases, however, absolute imports can get quite verbose, depending on the complexity of the directory structure, as shown below. On the other hand, "relative imports can be messy, particularly for shared projects where directory structure is likely to change". They are also "not as readable as absolute ones, and it is hard to tell the location of the imported resources". Read more about Python Import and Absolute vs Relative Imports.

from package1.subpackage2.subpackage3.subpackage4.module5 import function6
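
The difference between running a file as a script and running it with -m can be reproduced in isolation. The sketch below builds a throwaway package in a temporary directory (demo_pkg, helpers.py and main.py are hypothetical names, not part of your project) and runs main.py both ways.

```python
import os
import subprocess
import sys
import tempfile


def run(args, cwd):
    """Run a child Python process and return its CompletedProcess."""
    return subprocess.run(
        [sys.executable] + args,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


with tempfile.TemporaryDirectory() as root:
    # Hypothetical package layout: demo_pkg/{__init__,helpers,main}.py
    pkg = os.path.join(root, "demo_pkg")
    os.mkdir(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "helpers.py"), "w") as f:
        f.write("def greet():\n    return 'hello'\n")
    with open(os.path.join(pkg, "main.py"), "w") as f:
        f.write("from .helpers import greet\nprint(greet())\n")

    # Run as a plain script: the relative import has no parent package
    as_script = run([os.path.join(pkg, "main.py")], cwd=root)
    # Run as a module from the directory that contains demo_pkg
    as_module = run(["-m", "demo_pkg.main"], cwd=root)

print("script exit code:", as_script.returncode)
print("module output:", as_module.stdout.strip())
```

The script run is expected to exit with an import error, while the -m run prints the greeting, because only the latter knows that main.py belongs to demo_pkg.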

Option 3

Include the directory containing your package directory in PYTHONPATH and use absolute imports instead. PYTHONPATH sets the search path for user-defined modules, so that they can be imported directly into a Python script. Its value is a string listing the directories that Python should add to the sys.path list at startup. The primary use of this variable is to allow users to import modules that have not yet been made into an installable Python package.

For instance, let's say you wanted to add the directory /Users/my_user/code to the PYTHONPATH:

On Mac

Open Terminal.app
Open the file ~/.bash_profile in your text editor – e.g. atom ~/.bash_profile
Add the following line to the end: export PYTHONPATH="/Users/my_user/code"
Save the file.
Close Terminal.app
Start Terminal.app again, to read in the new settings, and type echo $PYTHONPATH. It should show something like /Users/my_user/code.

On Linux

Open your favorite terminal program
Open the file ~/.bashrc in your text editor – e.g. atom ~/.bashrc
Add the following line to the end: export PYTHONPATH=/home/my_user/code
Save the file.
Close your terminal application.
Start your terminal application again, to read in the new settings, and type echo $PYTHONPATH. It should show something like /home/my_user/code.

On Windows

Open This PC (or Computer), right-click inside and select Properties.
From the computer properties dialog, select Advanced system settings on the left.
From the advanced system settings dialog, choose the Environment variables button.
In the Environment variables dialog, click the New button in the top half of the dialog, to make a new user variable:
Give the variable name as PYTHONPATH and in value add the path to your module directory. Choose OK and OK again to save this variable.
Now open a cmd window and type echo %PYTHONPATH% to confirm the environment variable is correctly set. Remember to open a new cmd window to run your Python program, so that it picks up the new settings in PYTHONPATH.
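
Whichever platform you use, you can confirm from Python that a PYTHONPATH entry actually ends up in sys.path. The following is a minimal sketch: it starts a fresh interpreter with PYTHONPATH set to a temporary directory (standing in for /Users/my_user/code) and inspects that interpreter's sys.path. The helper names child_sys_path and same_dir are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def child_sys_path(pythonpath):
    """Start a fresh interpreter with PYTHONPATH set and return its sys.path."""
    env = dict(os.environ, PYTHONPATH=pythonpath)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


def same_dir(a, b):
    """Compare two directory paths after resolving symlinks and case."""
    return os.path.normcase(os.path.realpath(a)) == os.path.normcase(os.path.realpath(b))


# The temporary directory stands in for /Users/my_user/code
with tempfile.TemporaryDirectory() as code_dir:
    entries = child_sys_path(code_dir)
    on_path = any(same_dir(entry, code_dir) for entry in entries)

print("PYTHONPATH entry visible in sys.path:", on_path)
```

In your own shell, running python -c "import sys; print(sys.path)" after setting the variable gives the same information.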

Option 4

Another solution would be to install the package in editable mode (typically pip install -e . from the project root, which assumes the project has a setup.py or pyproject.toml), so that edits to the .py files are picked up without reinstalling. However, the amount of work required to set this up might make Option 3 a better choice for you.

CodePudding user response：

From https://docs.python.org/3/library/pickle.html:

pickle can save and restore class instances transparently, however the class definition must be importable and live in the same module as when the object was stored.

When you run python data_science/train_model.py and import from predictors, Python imports predictors as a top-level module and predictor_transformer is in that module.

However, when you run a prediction via Flask from the parent folder of video_discovery, predictor_transformer is in the video_discovery.data_science.predictors module.
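
The mismatch can be reproduced without any files. The sketch below creates a throwaway module at runtime ("predictors_demo" and Cleaner are hypothetical stand-ins for predictors.py and predictor_transformer), pickles an instance, and then tries to load it when that module name is no longer importable.

```python
import pickle
import sys
import types

# Create a throwaway module at runtime; "predictors_demo" is a hypothetical
# stand-in for predictors.py so the example needs no files on disk.
demo = types.ModuleType("predictors_demo")
exec(
    "class Cleaner:\n"
    "    def transform(self, X):\n"
    "        return [text.strip().lower() for text in X]\n",
    demo.__dict__,
)
sys.modules["predictors_demo"] = demo

# The pickle stores the class by reference (module name + class name),
# not the class's code.
payload = pickle.dumps(demo.Cleaner())
print(b"predictors_demo" in payload)

# Simulate a process in which that module name is not importable
del sys.modules["predictors_demo"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as exc:
    load_error = str(exc)
print(load_error)

# Once the same module name is importable again, loading succeeds
sys.modules["predictors_demo"] = demo
restored = pickle.loads(payload)
print(restored.transform(["  Hello World "]))
```

This is the situation in the question: the model file records the name predictors, but under Flask the class is only reachable as video_discovery.data_science.predictors.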

Use relative imports and run from a consistent path

train_model.py: Use relative import

# Before: from predictors import predictor_transformer
from .predictors import predictor_transformer

Train model: Run train_model with video_discovery as top-level module

# Before: python data_science/train_model.py
python -m video_discovery.data_science.train_model

Run a prediction via a Python command: Run predict_edu with video_discovery as top-level module

# Before: python predict_edu.py
python -m video_discovery.data_science.predict_edu

Run a prediction via Flask: (no change, already run with video_discovery as top-level module)

FLASK_ENV=development FLASK_APP=video_discovery flask run
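
To see the whole workflow succeed end to end, here is a self-contained sketch that mimics the layout in miniature. Everything in it is hypothetical (demo_pkg, ds, Cleaner, train.py, predict.py); it writes the files to a temporary directory, trains with -m, and then loads the pickle with -m from the same working directory.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical miniature of the layout above: demo_pkg/ds/{predictors,train,predict}.py
FILES = {
    "demo_pkg/__init__.py": "",
    "demo_pkg/ds/__init__.py": "",
    "demo_pkg/ds/predictors.py": (
        "class Cleaner:\n"
        "    def transform(self, X):\n"
        "        return [text.strip().lower() for text in X]\n"
    ),
    "demo_pkg/ds/train.py": (
        "import pickle\n"
        "from .predictors import Cleaner\n"
        "with open('model', 'wb') as f:\n"
        "    pickle.dump(Cleaner(), f)\n"
    ),
    "demo_pkg/ds/predict.py": (
        "import pickle\n"
        "with open('model', 'rb') as f:\n"
        "    model = pickle.load(f)\n"
        "print(type(model).__module__)\n"
        "print(model.transform(['  Some TEXT ']))\n"
    ),
}


def run_module(name, cwd):
    """Run `python -m name` in cwd and return its stdout lines."""
    result = subprocess.run(
        [sys.executable, "-m", name],
        cwd=cwd,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


with tempfile.TemporaryDirectory() as root:
    for rel_path, source in FILES.items():
        full_path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "w") as f:
            f.write(source)

    run_module("demo_pkg.ds.train", cwd=root)  # writes ./model in root
    lines = run_module("demo_pkg.ds.predict", cwd=root)

print(lines)
```

The first printed line shows the module path recorded for the class (demo_pkg.ds.predictors); because both steps run with the same top-level package, the load succeeds. Note that the sketch also relies on both commands being run from the same working directory, since the model file is opened with a relative path.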