Autopredictor Tutorial

In critical sectors like healthcare, the choice of an accurate and efficient predictive model is crucial, impacting decision-making and outcomes significantly. The autopredictor package is designed to simplify and expedite the process of model selection and evaluation for continuous data scenarios. This package is especially valuable for healthcare professionals, data scientists, and researchers, offering them more time for insightful data interpretation and strategic decision-making. This tutorial will demonstrate the use of autopredictor with a diabetes dataset, reflecting real-world health data scenarios.

Setting Up and Version Checking

To begin, install and import the autopredictor package and check its version to ensure compatibility with your dataset and analysis requirements.

import autopredictor

print(autopredictor.__version__)

0.2.3
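If importing the package fails, the installed version can also be queried without importing it. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and it assumes the package was installed under the distribution name autopredictor.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Assumption: "autopredictor" is the distribution name used at install time.
found = installed_version("autopredictor")
if found is None:
    print("autopredictor is not installed in this environment")
else:
    print(f"autopredictor {found} is installed")
```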

Importing Necessary Modules and Data

Import essential modules and the dataset. For this tutorial, the diabetes dataset from sklearn is used.

from autopredictor.fit import fit
from autopredictor.show_all import show_all
from autopredictor.bestscore import display_best_score
from autopredictor.select_model import select_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
import warnings

warnings.filterwarnings("ignore")

Limitations in Handling Continuous and Categorical Responses

The autopredictor package is optimized for continuous response variables, ideal for regression tasks where the outcome is a continuous value. While it can process categorical features within the input data, the response variable (y) should be continuous. This makes the package particularly suitable for scenarios like predicting numerical outcomes in healthcare or finance.

Regression Models

The autopredictor package includes a collection of regression models, each suited for different types of datasets and analysis requirements:

Linear Regression:
- A statistical method for modeling the relationship between a dependent variable and one or more independent variables.
- Assumes a linear relationship between inputs and the target output.
Linear Regression (L1 Regularization):
- A variant of linear regression that includes L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of coefficients.
- Helps in feature selection by shrinking coefficients of less important features to zero.
Linear Regression (L2 Regularization):
- Similar to L1, but uses L2 regularization where the penalty is the square of the magnitude of coefficients.
- Shrinks all coefficients toward zero without eliminating any of them, which helps avoid overfitting by penalizing large coefficients.
Linear Support Vector Machine (SVM):
- A type of SVM that’s used for regression tasks. It tries to find a line (or hyperplane in higher dimensions) that best fits the data.
- Effective in high-dimensional spaces and best suited for cases where the number of dimensions exceeds the number of samples.
Support Vector Machine:
- An extension of Linear SVM, capable of performing both linear and non-linear regression.
- Utilizes kernel functions to transform data into a higher dimension where a hyperplane can be used to perform the regression.
Decision Tree:
- A tree-like model of decisions where each node represents a feature, each branch represents a decision rule, and each leaf represents an outcome.
- Simple to understand and interpret, and can handle both numerical and categorical data.
Random Forest:
- An ensemble learning method that operates by constructing multiple decision trees during training.
- For regression tasks, it takes the average prediction of the individual trees, which usually results in better performance and less overfitting.
Gradient Boosting:
- An ensemble technique that builds models sequentially, with each new model being trained to correct the errors made by the previous ones.
- Combines weak predictive models to create a strong model, particularly effective for complex datasets with nonlinear relationships.
AdaBoost:
- Short for Adaptive Boosting, it combines multiple weak learners to increase the accuracy of the model.
- Increases the weights of poorly predicted instances so that subsequent learners focus more on difficult cases.
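To make the difference between the L1 and L2 variants in the list above more concrete, here is a small illustrative sketch. It relies on a textbook special case and is not how autopredictor fits its models: with orthonormal features and the objectives ½‖y − Xw‖² + α‖w‖₁ (L1) and ½‖y − Xw‖² + (α/2)‖w‖² (L2), each penalized coefficient can be computed directly from the unpenalized one. The function names and the starting coefficients are hypothetical.

```python
def lasso_shrink(coef, alpha):
    """L1 solution for one coefficient (orthonormal features): soft-thresholding."""
    if abs(coef) <= alpha:
        return 0.0
    return coef - alpha if coef > 0 else coef + alpha


def ridge_shrink(coef, alpha):
    """L2 solution for one coefficient (orthonormal features): proportional shrinkage."""
    return coef / (1.0 + alpha)


# Hypothetical unpenalized coefficients, chosen only for illustration.
ols_coefs = [4.0, -0.5, 0.25, -2.0]
alpha = 1.0

l1_coefs = [lasso_shrink(c, alpha) for c in ols_coefs]
l2_coefs = [ridge_shrink(c, alpha) for c in ols_coefs]

print("L1:", l1_coefs)  # small coefficients become exactly zero
print("L2:", l2_coefs)  # every coefficient shrinks but stays non-zero
```

The L1 result drops the two small coefficients entirely, which is the feature-selection effect described above, while the L2 result keeps all four at reduced magnitude.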

Scoring Metrics

The autopredictor package features a diverse array of scoring metrics, tailored to evaluate the performance of regression models across various datasets and analytical needs.

Mean Absolute Error (MAE): This is the average of the absolute differences between the predicted and actual values. A lower MAE value indicates a model with fewer errors.
Mean Absolute Percentage Error (MAPE): It measures the average percentage error between the predicted and actual values. Like MAE, a lower MAPE value signifies better model accuracy.
R-squared (R2): This metric quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R2 value (closer to 1) indicates a better model fit, while a negative value means the model performs worse than always predicting the mean of the target.
Mean Squared Error (MSE): MSE calculates the average of the squared differences between predicted and actual values. A lower MSE value is preferable, indicating higher precision of the model.
Root Mean Squared Error (RMSE): RMSE is the square root of MSE and provides a measure of the magnitude of the error. Like MSE, a lower RMSE indicates a more accurate model.
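The five metrics above can be computed by hand on a tiny example. The sketch below uses only the standard library and made-up values; the function name regression_metrics is hypothetical, and the dictionary keys simply mirror the names that appear in the fit output later in this tutorial.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MAPE, R2, MSE and RMSE from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE is reported as a fraction; it is undefined if any true value is 0.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "Mean Absolute Error": mae,
        "Mean Absolute Percentage Error": mape,
        "R2 Score": r2,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": rmse,
    }


# Made-up values for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 370.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```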

Each of these models and scoring metrics offers unique advantages and can be selected based on the specific requirements of your dataset and task. Understanding the underlying mechanics of these models and metrics will help you make informed decisions when using the autopredictor package.

Model Fitting with `fit`

The fit function streamlines the training and evaluation of multiple regression models. It takes the training and test datasets along with their target values, trains each of the regression models listed above, and returns performance scores for every model across all of the scoring metrics.

Criteria for Input Data Format

The fit function requires data to be preprocessed and formatted appropriately:

Data Cleaning: Ensure all irrelevant or redundant columns are removed. For instance, if your dataset contains columns like ‘ID’ or ‘Timestamp’ that are not relevant to the model, these should be dropped.
Data Type Conversion: Convert categorical data into a numerical format using techniques like OneHotEncoding for nominal categories or OrdinalEncoding for ordinal categories.
Handling Missing Values: Address any missing or NaN values either by imputing them or removing the rows/columns, depending on the scenario.
Data Splitting: The input data should be split into features (X) and the target variable (y). The features should be in a DataFrame format, and the target variable should be a Series or a single column DataFrame.
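The four preparation steps above might look as follows with pandas. This is a minimal sketch on a made-up table: the column names and values are hypothetical, and median imputation and one-hot encoding via pandas.get_dummies are just one possible choice for each step.

```python
import pandas as pd

# Hypothetical raw data; column names and values are illustrative only.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "age": [50.0, None, 42.0, 61.0],
    "sex": ["F", "M", "M", "F"],
    "progression": [151.0, 75.0, 141.0, 206.0],
})

# Data cleaning: drop an identifier column that carries no predictive signal.
cleaned = raw.drop(columns=["ID"])

# Handling missing values: impute the missing age with the column median.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# Data type conversion: one-hot encode the nominal column.
cleaned = pd.get_dummies(cleaned, columns=["sex"], dtype=float)

# Data splitting: features as a DataFrame, target as a Series.
X = cleaned.drop(columns=["progression"])
y = cleaned["progression"]

print(X)
print(y)
```

In a real analysis, imputation statistics are usually computed on the training split only, so that no information from the test set leaks into training.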

Before getting started, it is assumed the user will have the data split into X_train, X_test, y_train, and y_test. We recommend using scikit-learn’s sklearn.model_selection.train_test_split to achieve this.

For this example, we will be using the diabetes regression dataset from scikit-learn’s sklearn.datasets.load_diabetes().

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Fitting the Models

To get started, the user should call the fit function and input the parameters for X_train, X_test, y_train, and y_test. The user is given the option to also output the score for the training set. This can be done by setting the parameter return_train to True.

model_scores = fit(X_train, X_test, y_train, y_test, return_train=True)

Linear Regression trained.
Linear Regression (L1) trained.
Linear Regression (L2) trained.
Linear Support Vector Machine trained.
Support Vector Machine trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.
AdaBoost trained.

The function will return a tuple of two dictionaries. The first dictionary will always be the scores for the test set. If return_train=True the second dictionary will be the scores for the training set, otherwise it will be empty.

print(f'Test scores: \n{model_scores[0]}')

Test scores: 
{'Linear Regression': {'Mean Absolute Error': 42.794094679599944, 'Mean Absolute Percentage Error': 0.3749982636756113, 'R2 Score': 0.4526027629719195, 'Mean Squared Error': 2900.1936284934814, 'Root Mean Squared Error': 53.85344583676593}, 'Linear Regression (L1)': {'Mean Absolute Error': 49.73032753662261, 'Mean Absolute Percentage Error': 0.4711256345340608, 'R2 Score': 0.3575918767219115, 'Mean Squared Error': 3403.5757216070733, 'Root Mean Squared Error': 58.340172450954185}, 'Linear Regression (L2)': {'Mean Absolute Error': 46.138857666974516, 'Mean Absolute Percentage Error': 0.42569291627271466, 'R2 Score': 0.41915292635986556, 'Mean Squared Error': 3077.4159388272296, 'Root Mean Squared Error': 55.47446204180109}, 'Linear Support Vector Machine': {'Mean Absolute Error': 63.277793352149665, 'Mean Absolute Percentage Error': 0.4322026287578635, 'R2 Score': -0.27172042429785237, 'Mean Squared Error': 6737.767789617941, 'Root Mean Squared Error': 82.08390700751238}, 'Support Vector Machine': {'Mean Absolute Error': 56.02372412801096, 'Mean Absolute Percentage Error': 0.4902842140476431, 'R2 Score': 0.18211365770500287, 'Mean Squared Error': 4333.285954518086, 'Root Mean Squared Error': 65.82769899151941}, 'Decision Tree': {'Mean Absolute Error': 54.8876404494382, 'Mean Absolute Percentage Error': 0.4516643574275407, 'R2 Score': 0.04793807751200596, 'Mean Squared Error': 5044.168539325843, 'Root Mean Squared Error': 71.02231015199268}, 'Random Forest': {'Mean Absolute Error': 43.806179775280896, 'Mean Absolute Percentage Error': 0.3983128018887252, 'R2 Score': 0.4377081476557819, 'Mean Squared Error': 2979.1075606741574, 'Root Mean Squared Error': 54.5812015319758}, 'Gradient Boosting': {'Mean Absolute Error': 44.700362222244706, 'Mean Absolute Percentage Error': 0.4000398909188521, 'R2 Score': 0.4485686498415843, 'Mean Squared Error': 2921.5669720286805, 'Root Mean Squared Error': 54.05152145896247}, 'AdaBoost': {'Mean Absolute Error': 45.80278726454998, 'Mean Absolute Percentage Error': 0.439372987402809, 'R2 Score': 0.40958773407411686, 'Mean Squared Error': 3128.093779060867, 'Root Mean Squared Error': 55.92936419324707}}

print(f'Train scores: \n{model_scores[1]}')

Train scores: 
{'Linear Regression': {'Mean Absolute Error': 43.483503523980396, 'Mean Absolute Percentage Error': 0.38919947147960504, 'R2 Score': 0.5279193863361497, 'Mean Squared Error': 2868.549702835578, 'Root Mean Squared Error': 53.55884336723094}, 'Linear Regression (L1)': {'Mean Absolute Error': 52.95878032849505, 'Mean Absolute Percentage Error': 0.4954382200557504, 'R2 Score': 0.3646309911295581, 'Mean Squared Error': 3860.7549830123576, 'Root Mean Squared Error': 62.134973911737966}, 'Linear Regression (L2)': {'Mean Absolute Error': 48.8051936622374, 'Mean Absolute Percentage Error': 0.45064627040176647, 'R2 Score': 0.4424027835503954, 'Mean Squared Error': 3388.18261808013, 'Root Mean Squared Error': 58.208097530155804}, 'Linear Support Vector Machine': {'Mean Absolute Error': 70.49158898119674, 'Mean Absolute Percentage Error': 0.4671152132373019, 'R2 Score': -0.34602896397485394, 'Mean Squared Error': 8179.007722116542, 'Root Mean Squared Error': 90.43786663846367}, 'Support Vector Machine': {'Mean Absolute Error': 58.68582085598522, 'Mean Absolute Percentage Error': 0.49458193771732667, 'R2 Score': 0.16680377163060012, 'Mean Squared Error': 5062.831906490097, 'Root Mean Squared Error': 71.15357971662492}, 'Decision Tree': {'Mean Absolute Error': 0.0, 'Mean Absolute Percentage Error': 0.0, 'R2 Score': 1.0, 'Mean Squared Error': 0.0, 'Root Mean Squared Error': 0.0}, 'Random Forest': {'Mean Absolute Error': 17.661529745042493, 'Mean Absolute Percentage Error': 0.1533997548568964, 'R2 Score': 0.9206943508436927, 'Mean Squared Error': 481.89268895184125, 'Root Mean Squared Error': 21.95205432190439}, 'Gradient Boosting': {'Mean Absolute Error': 25.351692588287143, 'Mean Absolute Percentage Error': 0.22769769955622146, 'R2 Score': 0.8359025987996851, 'Mean Squared Error': 997.1211225895325, 'Root Mean Squared Error': 31.577224744893787}, 'AdaBoost': {'Mean Absolute Error': 40.796634345670725, 'Mean Absolute Percentage Error': 0.3807240683038259, 'R2 Score': 0.6371140328710272, 'Mean Squared Error': 2205.039569602491, 'Root Mean Squared Error': 46.95784886046731}}
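Because both results are plain nested dictionaries keyed by model name and then by metric name, they can be inspected with ordinary Python. As a sketch, the snippet below compares training and test R2 for three of the models, using values rounded from the output above; the helper name r2_gap is hypothetical. A large gap suggests that a model fits the training data far better than unseen data.

```python
# R2 values rounded from the output above, for three of the nine models.
test_scores = {
    "Linear Regression": {"R2 Score": 0.4526},
    "Decision Tree": {"R2 Score": 0.0479},
    "Random Forest": {"R2 Score": 0.4377},
}
train_scores = {
    "Linear Regression": {"R2 Score": 0.5279},
    "Decision Tree": {"R2 Score": 1.0},
    "Random Forest": {"R2 Score": 0.9207},
}


def r2_gap(train, test):
    """Return train R2 minus test R2 for every model present in `test`."""
    return {
        model: round(train[model]["R2 Score"] - test[model]["R2 Score"], 4)
        for model in test
    }


gaps = r2_gap(train_scores, test_scores)
# List the models from the largest gap to the smallest.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: train-test R2 gap = {gap}")
```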

Error prevention in `fit`

The fit function guards against common input errors in the following ways:

Input content check: Before training any model, the function checks that none of the four mandatory parameters (X_train, X_test, y_train, and y_test) is None. If any of them is missing or set to None, the function raises an exception asking the user to supply a valid DataFrame object.

Valid input check: The inherent design of scikit-learn functions requires that mandatory parameters be either DataFrames or Series. Consequently, supplying any other type of input will result in an error. To prevent such errors, verify that all parameters inputted conform to these expected types.
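The two checks described above can also be run by the caller before invoking fit. The following is a hypothetical pre-flight helper, not autopredictor’s own implementation; in particular, the helper name and the choice of ValueError and TypeError are assumptions made for this sketch.

```python
import pandas as pd


def check_fit_inputs(X_train, X_test, y_train, y_test):
    """Hypothetical pre-flight check mirroring the two checks described above."""
    named = {
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test,
    }
    for name, value in named.items():
        # Input content check: nothing may be missing.
        if value is None:
            raise ValueError(f"{name} is None; pass a DataFrame or Series")
        # Valid input check: only pandas containers are accepted.
        if not isinstance(value, (pd.DataFrame, pd.Series)):
            raise TypeError(
                f"{name} must be a DataFrame or Series, got {type(value).__name__}"
            )


# Tiny made-up inputs for illustration.
X_demo = pd.DataFrame({"bmi": [0.06, -0.05, 0.04]})
y_demo = pd.Series([151.0, 75.0, 141.0])
check_fit_inputs(X_demo, X_demo, y_demo, y_demo)  # passes silently
```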

Evaluating all models with `show_all`

After executing the fit function, both the training and testing scores are available in dictionary format. The show_all function converts these raw model scores into a structured DataFrame so that they can be reviewed at a glance. This user-friendly format saves time and ensures compatibility with other functions in the workflow, such as display_best_score and select_model. It is an efficient way to begin the model evaluation process, setting the stage for more detailed analysis.

By converting the dictionary into an organized format and sorting the results alphabetically by model name, show_all offers a quick and efficient means of comprehending and comparing regression model performance. The tabular presentation enhances readability, simplifies the process of identifying specific model scores, and contributes to a streamlined model evaluation workflow. Consider a real-life application such as the load_diabetes dataset from sklearn: a researcher can train multiple models to predict diabetes progression and then use show_all to get a user-friendly view of all the models and their respective metrics.

While it’s possible to achieve similar conversions using pandas manipulation, show_all is purpose-built for this package, ensuring the validity of scoring metrics in the dictionary.
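For comparison, the plain-pandas conversion mentioned above might look like the sketch below. It is an approximation of what show_all produces rather than its actual implementation: the function name scores_to_frame and the SHORT_NAMES mapping are hypothetical, and no validation of the metrics is performed.

```python
import pandas as pd

# Hypothetical mapping from the long metric names to short column labels.
SHORT_NAMES = {
    "Mean Absolute Error": "MAE",
    "Mean Absolute Percentage Error": "MAPE",
    "R2 Score": "R2",
    "Mean Squared Error": "MSE",
    "Root Mean Squared Error": "RMSE",
}


def scores_to_frame(scores):
    """Turn {model: {metric: value}} into a DataFrame sorted by model name."""
    frame = pd.DataFrame.from_dict(scores, orient="index")
    return frame.rename(columns=SHORT_NAMES).sort_index()


# Two entries in the shape returned by fit, with values rounded from the test scores.
example_scores = {
    "Random Forest": {
        "Mean Absolute Error": 43.8062,
        "Mean Absolute Percentage Error": 0.3983,
        "R2 Score": 0.4377,
        "Mean Squared Error": 2979.11,
        "Root Mean Squared Error": 54.5812,
    },
    "AdaBoost": {
        "Mean Absolute Error": 45.8028,
        "Mean Absolute Percentage Error": 0.4394,
        "R2 Score": 0.4096,
        "Mean Squared Error": 3128.09,
        "Root Mean Squared Error": 55.9294,
    },
}

table = scores_to_frame(example_scores)
print(table)
```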

Visualizing training scores with show_all

scores_train = show_all(model_scores[1]) #results_train 

|                               |     MAE |     MAPE |        R2 |      MSE |    RMSE |
|-------------------------------|---------|----------|-----------|----------|---------|
| AdaBoost                      | 40.7966 | 0.380724 |  0.637114 | 2205.04  | 46.9578 |
| Decision Tree                 |  0      | 0        |  1        |    0     |  0      |
| Gradient Boosting             | 25.3517 | 0.227698 |  0.835903 |  997.121 | 31.5772 |
| Linear Regression             | 43.4835 | 0.389199 |  0.527919 | 2868.55  | 53.5588 |
| Linear Regression (L1)        | 52.9588 | 0.495438 |  0.364631 | 3860.75  | 62.135  |
| Linear Regression (L2)        | 48.8052 | 0.450646 |  0.442403 | 3388.18  | 58.2081 |
| Linear Support Vector Machine | 70.4916 | 0.467115 | -0.346029 | 8179.01  | 90.4379 |
| Random Forest                 | 17.6615 | 0.1534   |  0.920694 |  481.893 | 21.9521 |
| Support Vector Machine        | 58.6858 | 0.494582 |  0.166804 | 5062.83  | 71.1536 |

Visualizing test scores with show_all

scores_test = show_all(model_scores[0]) #results_test

|                               |     MAE |     MAPE |         R2 |     MSE |    RMSE |
|-------------------------------|---------|----------|------------|---------|---------|
| AdaBoost                      | 45.8028 | 0.439373 |  0.409588  | 3128.09 | 55.9294 |
| Decision Tree                 | 54.8876 | 0.451664 |  0.0479381 | 5044.17 | 71.0223 |
| Gradient Boosting             | 44.7004 | 0.40004  |  0.448569  | 2921.57 | 54.0515 |
| Linear Regression             | 42.7941 | 0.374998 |  0.452603  | 2900.19 | 53.8534 |
| Linear Regression (L1)        | 49.7303 | 0.471126 |  0.357592  | 3403.58 | 58.3402 |
| Linear Regression (L2)        | 46.1389 | 0.425693 |  0.419153  | 3077.42 | 55.4745 |
| Linear Support Vector Machine | 63.2778 | 0.432203 | -0.27172   | 6737.77 | 82.0839 |
| Random Forest                 | 43.8062 | 0.398313 |  0.437708  | 2979.11 | 54.5812 |
| Support Vector Machine        | 56.0237 | 0.490284 |  0.182114  | 4333.29 | 65.8277 |

Error prevention in `show_all`

To ensure a smooth and error-free experience with show_all, keep the following considerations in mind:

Type Check: The show_all function expects the input result to be a dictionary. Providing an input of a different type, such as a list or string, will trigger a TypeError. Always verify the input type to ensure compatibility.

Empty Dictionary: Ensure that the input argument contains scores for at least one model. Passing an empty dictionary would result in a ValueError. Before invoking show_all, check that your dictionary is populated with relevant data.

Valid Scoring Metrics: This function expects the dictionary of scores produced by the fit function, and it checks that the scoring metrics are valid and complete. Passing invalid scoring metrics as the dictionary’s values will result in a ValueError.

By adhering to these guidelines, you can maximize the utility of the show_all function while preventing potential errors in its usage.
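The three checks above can be expressed as a small guard that runs before show_all is called. This is a hypothetical sketch rather than the package’s code: the helper name check_scores is made up, the exception types follow the descriptions above, and the expected metric names are taken from the fit output shown earlier.

```python
EXPECTED_METRICS = {
    "Mean Absolute Error",
    "Mean Absolute Percentage Error",
    "R2 Score",
    "Mean Squared Error",
    "Root Mean Squared Error",
}


def check_scores(scores):
    """Hypothetical guard mirroring the type, emptiness and metric checks."""
    if not isinstance(scores, dict):
        raise TypeError("scores must be a dictionary")
    if not scores:
        raise ValueError("scores is empty; fit at least one model first")
    for model, metrics in scores.items():
        if not isinstance(metrics, dict) or set(metrics) != EXPECTED_METRICS:
            raise ValueError(f"unexpected scoring metrics for model {model!r}")


# A well-formed entry with made-up values passes silently.
valid_scores = {"Demo Model": {name: 0.0 for name in EXPECTED_METRICS}}
check_scores(valid_scores)
```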

Selecting the best model with `display_best_score`

Following the execution of the show_all function, a DataFrame is generated containing scoring metric results sorted alphabetically by model names. The display_best_score function within autopredictor plays a crucial role in this accelerated workflow. It simplifies the complex process of determining the optimal model by swiftly identifying the best-performing one based on a specified regression scoring metric.

Selecting the best model based on the scoring metric MSE:

display_best_score(scores_train,'MSE') # Based on Training Set

|               |   MSE |
|---------------|-------|
| Decision Tree |     0 |

	MSE
Decision Tree	0.0

display_best_score(scores_test,'MSE') # Based on Test Set

|                   |     MSE |
|-------------------|---------|
| Linear Regression | 2900.19 |

	MSE
Linear Regression	2900.193628

The display_best_score function ranks models based on a chosen scoring metric. For metrics like MAE, MAPE, MSE, and RMSE, where lower values indicate better model performance, the function identifies the model with the minimum value. In contrast, for the R2 metric, where a higher value (closer to 1) indicates a better fit, the function selects the model with the maximum R2 value. This approach ensures that the best model is selected based on the most appropriate metric for your specific analysis needs.
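The selection rule just described (minimum for error metrics, maximum for R2) can be sketched in a few lines of plain Python. The function name best_model and the dictionary layout are hypothetical; the values are rounded test scores for three of the models in this tutorial.

```python
# Metrics for which a larger value means a better model.
HIGHER_IS_BETTER = {"R2"}


def best_model(table, metric):
    """Return (model name, score) of the best model for the given metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    name = pick(table, key=lambda model: table[model][metric])
    return name, table[name][metric]


# Rounded test scores for three of the models.
test_table = {
    "Linear Regression": {"MSE": 2900.19, "R2": 0.452603},
    "Random Forest": {"MSE": 2979.11, "R2": 0.437708},
    "Gradient Boosting": {"MSE": 2921.57, "R2": 0.448569},
}

print(best_model(test_table, "MSE"))  # ('Linear Regression', 2900.19)
print(best_model(test_table, "R2"))   # ('Linear Regression', 0.452603)
```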

Selecting the best model based on the scoring metric R2:

display_best_score(scores_train,'R2') # Based on Training Set

|               |   R2 |
|---------------|------|
| Decision Tree |    1 |

	R2
Decision Tree	1.0

display_best_score(scores_test,'R2') # Based on Test Set

|                   |       R2 |
|-------------------|----------|
| Linear Regression | 0.452603 |

	R2
Linear Regression	0.452603

In the context of the diabetes dataset, display_best_score swiftly identifies the most effective model for predicting diabetes progression under a specified scoring metric; in the runs above, that is Linear Regression on the test set for both MSE and R2. The Decision Tree’s perfect training scores (an MSE of 0 and an R2 of 1), set against its much weaker test scores, indicate overfitting, so rankings based on the test set are the more reliable guide. This feature is valuable for researchers and data scientists, allowing them to choose the most suitable model based on their specific needs, particularly in healthcare and other fields where the choice of metric significantly affects research outcomes and real-world applications.

Error prevention and troubleshooting in `display_best_score`

Here are some refined strategies to boost error prevention and troubleshooting in the display_best_score function:

Input Type Check: Prior to utilizing the display_best_score function, validate that the input result is a DataFrame. Passing any other data type will raise a TypeError.

Empty DataFrame: Ensure that the DataFrame provided as an argument contains at least one model’s scoring metrics. Attempting to use an empty DataFrame will result in a TypeError.

Valid Scoring Metrics: Verify that the scoring metrics provided are both valid and comprehensive. If an invalid scoring metric is passed or if the DataFrame lacks essential metrics, a ValueError will be raised.

Users can improve their usage of the display_best_score function by adhering to these guidelines, which helps in minimizing the chance of encountering errors during its application.

Inspecting a specific model with `select_model`

Following the execution of show_all, the select_model function allows the user to select a specific model from the DataFrame and view its performance metrics. If the model is present in the DataFrame, the function returns the performance metrics for that model; otherwise, it provides a list of available models. This function is particularly useful for zooming in on a specific model’s performance or for retrieving the performance of the best model based on a particular metric.

In the context of a diabetes dataset, for instance, if an analyst suspects that a certain model, like a Random Forest Regressor, might be particularly well-suited to handling the complexities of diabetes data (due to its ability to model non-linear relationships and interactions between variables), the select_model function allows them to isolate and closely examine the performance of just this model. This is especially useful in situations where a multitude of models have been trained and evaluated, and there’s a need to drill down into the specifics of one model without getting overwhelmed by the broader data.

Viewing Random Forest model performance on dataset

select_model(scores_test, 'Random Forest')

	MAE	MAPE	R2	MSE	RMSE
Random Forest	43.80618	0.398313	0.437708	2979.107561	54.581202

Expanding beyond healthcare, this function has broad applicability in various fields. For example, in finance, an analyst might want to specifically evaluate the performance of a particular model in predicting stock prices or market trends. Similarly, in environmental science, a researcher could use this function to singularly assess a model’s accuracy in forecasting climate patterns or pollution levels.

The ability to selectively examine a model is crucial when comparing models that might have different strengths and weaknesses depending on the context. This targeted approach enables a more thoughtful and focused analysis, allowing analysts to make more informed decisions about which model to deploy based on specific criteria relevant to their field or problem at hand. It’s a tool that enhances precision in model selection.

Viewing Linear Regression model performance on dataset

select_model(scores_train, 'Linear Regression')

	MAE	MAPE	R2	MSE	RMSE
Linear Regression	43.483504	0.389199	0.527919	2868.549703	53.558843

Error prevention in `select_model`

For a seamless and trouble-free experience, be sure to adhere to these tips:

Input Type Check: Similar to display_best_score, before employing the select_model function, ensure that the input result is a DataFrame. Passing any other data type will raise a TypeError.

Empty DataFrame: Make sure that the DataFrame provided contains at least one model and its respective scoring metrics. Attempting to use an empty DataFrame will result in a TypeError.

The select_model function is not only beneficial for focusing on a specific model’s performance but also serves as a useful tool for verifying the presence of a model within the dataset. When an analyst specifies a model, the function checks if that model is included in the DataFrame’s index. If the model is not found, the function doesn’t just stop at returning an error message; it goes a step further by providing a list of the models that are included. This feature is particularly helpful in multiple ways:

Model Inventory Check: It essentially acts as a quick inventory check, allowing users to confirm which models have been trained and evaluated. This is especially useful in collaborative environments where multiple team members might be working on the same dataset but focusing on different models. It ensures that everyone is aware of the models that are already included in the analysis, helping to avoid redundant work.
Informed Decision Making: By providing a list of available models, it aids in informed decision-making. Analysts can quickly scan through the available models and decide which ones to focus on based on their specific criteria or hypothesis, without having to look through the entire DataFrame.

selected_model_name = 'Other Regressor'
select_model(scores_test, selected_model_name)

"Model 'Other Regressor' not found. Here is the list of the models available: AdaBoost, Decision Tree, Gradient Boosting, Linear Regression, Linear Regression (L1), Linear Regression (L2), Linear Support Vector Machine, Random Forest, Support Vector Machine."

For example, in a real-life scenario, an environmental scientist analyzing a dataset on air quality might be interested in examining a specific model’s ability to predict pollution levels. If they aren’t sure whether the model has been included in the analysis, they can use the select_model function. If the model is not found, the function’s feedback not only informs them of this but also shows which models are available, allowing the scientist to make an informed decision on whether to proceed with an available trained model or to train and evaluate the model of interest.
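The lookup-with-fallback behaviour described in this section can be sketched in plain Python. The helper name lookup_model is hypothetical and works on a plain dictionary rather than a DataFrame; the wording of the message imitates the output shown above, and the scores are rounded test values for two of the models.

```python
def lookup_model(table, model_name):
    """Return the metrics for `model_name`, or a message listing what exists."""
    if model_name in table:
        return table[model_name]
    available = ", ".join(sorted(table))
    return (
        f"Model '{model_name}' not found. "
        f"Here is the list of the models available: {available}."
    )


# Rounded test scores for two of the models.
scores = {
    "Random Forest": {"MAE": 43.8062, "R2": 0.437708},
    "AdaBoost": {"MAE": 45.8028, "R2": 0.409588},
}

print(lookup_model(scores, "Random Forest"))
print(lookup_model(scores, "Other Regressor"))
```

Returning the list of available names instead of failing outright lets the caller correct a misspelled model name without inspecting the whole table.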

Conclusion

The autopredictor package offers a streamlined approach to model training, evaluation, and selection, making it a valuable tool in fields where accuracy is crucial. It facilitates the training and evaluation of a diverse array of models, provides a detailed set of performance metrics for thorough assessment, and ensures a clear, user-friendly presentation of results, aiding in the informed selection of the most effective model. Its robust functionality allows for an efficient workflow, catering to data scientists of all levels and to professionals in various domains. Whether it’s healthcare, finance, or environmental science, autopredictor simplifies the complex task of model selection, empowering users to make informed, data-driven decisions.

Best practices

Always split your dataset into training and testing sets so that model performance is evaluated on unseen data.
Consider the specific problem and dataset characteristics when choosing a regression model.
Regularly check for updates to the autopredictor package for improved functionality and bug fixes.

References

Pandala, S. R. (2022). LazyPredict. Retrieved from https://pypi.org/project/lazypredict/

Credits

autopredictor was created with cookiecutter and the py-pkgs-cookiecutter template.