Title: | Machine Learning Pipelines for R |
---|---|
Description: | A framework for defining 'pipelines' of functions for applying data transformations, model estimation and inverse-transformations, resulting in predicted value generation (or model-scoring) functions that automatically apply the entire pipeline of functions required to go from input to predicted output. |
Authors: | Alex Ioannides |
Maintainer: | Alex Ioannides <[email protected]> |
License: | Apache License 2.0 |
Version: | 0.1.1.900 |
Built: | 2025-03-06 03:18:15 UTC |
Source: | https://github.com/alexioannides/pipeliner |
cbind_fast
This is not as 'safe' as using cbind: for example, if df1 has columns with the same name as columns in df2, then they will be over-written.
cbind_fast(df1, df2)
df1 |
A data.frame. |
df2 |
Another data.frame |
A data.frame equal to df1 with the columns of df2 appended.
    ## Not run:
    df1 <- data.frame(x = 1:5, y = 1:5 * 0.1)
    df2 <- data.frame(a = 6:10, b = 6:10 * 0.25)
    df3 <- cbind_fast(df1, df2)
    df3
    #   x   y  a    b
    # 1 1 0.1  6 1.50
    # 2 2 0.2  7 1.75
    # 3 3 0.3  8 2.00
    # 4 4 0.4  9 2.25
    # 5 5 0.5 10 2.50
    ## End(Not run)
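The over-writing behaviour described above can be illustrated in base R. The sketch below uses a hypothetical helper, `cbind_by_name`, that appends columns by name. It is an assumption about how such a function might behave, not the package's actual source.

```r
# Hypothetical stand-in for a name-based column bind (not the package's code).
cbind_by_name <- function(df1, df2) {
  df1[names(df2)] <- df2  # columns with matching names are replaced in place
  df1
}

df1 <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3))
df2 <- data.frame(y = c(10, 20, 30), z = c("a", "b", "c"))

safe <- cbind(df1, df2)          # base cbind keeps both 'y' columns
fast <- cbind_by_name(df1, df2)  # the original 'y' is over-written by df2$y

names(safe)
names(fast)
fast$y
```

With base `cbind` the result carries two columns named `y`, whereas the name-based version keeps a single `y` holding the values from `df2`.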
Helper function that checks if the object returned from a ml_pipeline_builder method is a data.frame (if it isn't NULL), and if it isn't, throws an error that is customised with the name of the function that returned it.
check_data_frame_throw_error(func_return_object, func_name)
func_return_object |
The object returned from a |
func_name |
The name of the function that returned the object. |
    ## Not run:
    transform_method <- function(df) df
    data <- data.frame(y = c(1, 2), x = c(0.1, 0.2))
    data_transformed <- transform_method(data)
    check_data_frame_throw_error(data_transformed, "transform_method")
    # NULL
    ## End(Not run)
predict method defined
Helper function that checks if the object returned from the estimate_model method has a predict method defined for it.
check_predict_method_throw_error(func_return_object)
func_return_object |
The object returned from the |
    ## Not run:
    estimation_method <- function(df) lm(eruptions ~ 0 + waiting, df)
    data <- faithful
    model_estimate <- estimation_method(data)
    check_predict_method_throw_error(model_estimate)
    # NULL
    ## End(Not run)
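One way such a check could be written in base R is to look up an S3 `predict` method for each class of the object. The helper name `has_predict_method` is hypothetical and the approach is an assumption, not necessarily what the package does.

```r
# Hypothetical check: does any class of 'obj' have a registered S3 predict method?
has_predict_method <- function(obj) {
  found <- vapply(
    class(obj),
    function(cls) !is.null(utils::getS3method("predict", cls, optional = TRUE)),
    logical(1)
  )
  any(found)
}

model_ok <- has_predict_method(lm(eruptions ~ 0 + waiting, faithful))  # lm has predict.lm
model_bad <- has_predict_method(list(a = 1))                           # no predict.list
```

Here `optional = TRUE` makes `getS3method` return `NULL` instead of raising an error when no method is found.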
Helper function that checks if a ml_pipeline_builder method is a unary function (if it isn't a NULL returning function), and if it isn't, throws an error that is customised with the method function name.
check_unary_func_throw_error(func, func_name)
func |
A |
func_name |
The name of the |
    ## Not run:
    transform_method <- function(df) df
    check_unary_func_throw_error(transform_method, "transform_method")
    # NULL
    ## End(Not run)
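A unary-function test can be sketched in base R by counting formal arguments. The helper `is_unary` is hypothetical and only illustrates the idea; the package's own check may differ.

```r
# Hypothetical check: a function is 'unary' if it declares exactly one formal argument.
is_unary <- function(f) {
  is.function(f) && length(formals(f)) == 1
}

unary_ok <- is_unary(function(df) df)       # one argument
unary_bad <- is_unary(function(df, k) df)   # two arguments
```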
A function that takes as its argument another function defining how a machine learning model should be estimated based on the variables available in the input data frame. This function is wrapped (or adapted) for use within a machine learning pipeline.
estimate_model(.f)
.f |
A unary function of a data.frame that returns a fitted model object, which must have
a |
A unary function of a data.frame that returns a fitted model object that has a predict.{model-class} method defined. This function is assigned the classes "estimate_model" and "ml_pipeline_section".
    data <- head(faithful)
    f <- estimate_model(function(df) {
      lm(eruptions ~ 1 + waiting, df)
    })
    f(data)
    # Call:
    # lm(formula = eruptions ~ 1 + waiting, data = df)
    #
    # Coefficients:
    # (Intercept)      waiting
    #    -1.53317      0.06756
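The idea of wrapping a function and tagging it with pipeline classes can be sketched in base R as follows. The helper `tag_section` is hypothetical, and this is only an illustration of class assignment on a function, not the package's implementation.

```r
# Hypothetical adapter: wrap a function and attach S3 classes to the wrapper.
tag_section <- function(.f, section_class) {
  stopifnot(is.function(.f))
  wrapped <- function(df) .f(df)
  class(wrapped) <- c(section_class, "ml_pipeline_section")
  wrapped
}

est <- tag_section(function(df) lm(eruptions ~ 1 + waiting, df), "estimate_model")
fit <- est(head(faithful))   # the tagged function is still callable
class(est)
```

Tagging sections with a shared class such as "ml_pipeline_section" lets downstream code identify them with `inherits()`.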
Custom error handler for printing the name of an enclosing function with error messages.
func_error_handler(e, calling_func)
e |
A |
calling_func |
A character string naming the enclosing function (or closure) for printing with error messages |
NULL - throws error with custom message
    ## Not run:
    f <- function(x) x ^ 2
    tryCatch(f("a"), error = function(e) func_error_handler(e, "f"))
    # Error in x^2 : non-numeric argument to binary operator
    # ---> called from within function: f
    ## End(Not run)
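A handler of this kind can be sketched in base R by re-raising the caught condition with the caller's name appended. The helper `rethrow_with_caller` and its exact message layout are assumptions for illustration.

```r
# Hypothetical handler: re-raise an error with the enclosing function's name appended.
rethrow_with_caller <- function(e, calling_func) {
  msg <- paste0(conditionMessage(e),
                "\n--> called from within function: ", calling_func)
  stop(msg, call. = FALSE)
}

f <- function(x) x ^ 2

# The outer tryCatch captures the re-raised message so it can be inspected.
caught <- tryCatch(
  tryCatch(f("a"), error = function(e) rethrow_with_caller(e, "f")),
  error = function(e) conditionMessage(e)
)
cat(caught, "\n")
```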
A function that takes as its argument another function defining an inverse response variable transformation, and wraps (or adapts) it for use within a machine learning pipeline.
inv_transform_response(.f)
.f |
A unary function of a data.frame that returns a new data.frame containing only the inverse transformed response variable. An error will be thrown if this is not the case. |
A unary function of a data.frame that returns the input data.frame with the inverse transformed response variable column appended. This function is assigned the classes "inv_transform_response" and "ml_pipeline_section".
    data <- head(faithful)
    f1 <- transform_response(function(df) {
      data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
    })
    f2 <- inv_transform_response(function(df) {
      data.frame(eruptions2 = df$y * sd(df$eruptions) + mean(df$eruptions))
    })
    f2(f1(data))
    #   eruptions waiting          y eruptions2
    # 1     3.600      79  0.5412808      3.600
    # 2     1.800      54 -1.3039946      1.800
    # 3     3.333      74  0.2675649      3.333
    # 4     2.283      62 -0.8088457      2.283
    # 5     4.533      85  1.4977485      4.533
    # 6     2.883      55 -0.1937539      2.883
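The transform and inverse-transform pair in this example is standardisation and its inverse. A minimal base R check of the round trip, written without the package, might look like this:

```r
# Standardise the response, then invert the standardisation.
y <- head(faithful$eruptions)
z <- (y - mean(y)) / sd(y)        # forward transform
y_back <- z * sd(y) + mean(y)     # inverse transform

round_trip_ok <- isTRUE(all.equal(y, y_back))
round_trip_ok
```

The inverse recovers the original values up to floating-point tolerance, which is why the eruptions2 column in the example matches eruptions.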
Building machine learning models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions, to go from the original input to original output variables (via the model).
ml_pipline_builder()
This function produces an object in which it is possible to: define transformation and inverse-transformation functions; fit a model on training data; and then generate a prediction (or model-scoring) function that automatically applies the entire pipeline of transformation and inverse-transformation to the inputs and outputs of the inner-model's predicted scores.
Calling ml_pipline_builder() will return an 'ml_pipeline' object (actually an environment or closure), whose methods can be accessed as one would access any element of a list. For example, ml_pipline_builder()$transform_features will allow you to get or set the transform_features function to use in the pipeline. The full list of methods for defining sections of the pipeline (documented elsewhere) is: transform_features; transform_response; inv_transform_response; and estimate_model.
The pipeline can be fit, predictions generated and the inner model accessed using the following methods: fit(.data); predict(.data); and model_estimate().
An object of class ml_pipeline.
transform_features, transform_response, estimate_model and inv_transform_response.
    data <- faithful
    lm_pipeline <- ml_pipline_builder()
    lm_pipeline$transform_features(function(df) {
      data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
    })
    lm_pipeline$transform_response(function(df) {
      data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
    })
    lm_pipeline$inv_transform_response(function(df) {
      data.frame(pred_eruptions = df$pred_model * sd(df$eruptions) + mean(df$eruptions))
    })
    lm_pipeline$estimate_model(function(df) {
      lm(y ~ 0 + x1, df)
    })
    lm_pipeline$fit(data)
    head(lm_pipeline$predict(data))
    #   eruptions waiting         x1 pred_model pred_eruptions
    # 1     3.600      79  0.5960248  0.5369058       4.100592
    # 2     1.800      54 -1.2428901 -1.1196093       2.209893
    # 3     3.333      74  0.2282418  0.2056028       3.722452
    # 4     2.283      62 -0.6544374 -0.5895245       2.814917
    # 5     4.533      85  1.0373644  0.9344694       4.554360
    # 6     2.883      55 -1.1693335 -1.0533487       2.285521
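The "environment or closure whose methods are accessed like list elements" pattern can be sketched in base R as below. The constructor `new_builder` is a hypothetical, heavily simplified stand-in with only two sections; it is not the package's implementation and omits the response transforms and all validation.

```r
# Hypothetical, simplified builder: state lives in the enclosing environment.
new_builder <- function() {
  sections <- list()
  model <- NULL
  list(
    # get or set the feature transformation
    transform_features = function(.f) {
      if (!missing(.f)) sections$transform_features <<- .f
      invisible(sections$transform_features)
    },
    # get or set the model estimation function
    estimate_model = function(.f) {
      if (!missing(.f)) sections$estimate_model <<- .f
      invisible(sections$estimate_model)
    },
    # apply the transform, then fit and store the model
    fit = function(.data) {
      df <- cbind(.data, sections$transform_features(.data))
      model <<- sections$estimate_model(df)
      invisible(model)
    },
    # apply the same transform, then append the model's predictions
    predict = function(.data) {
      df <- cbind(.data, sections$transform_features(.data))
      df$pred_model <- unname(stats::predict(model, df))
      df
    }
  )
}

b <- new_builder()
b$transform_features(function(df) data.frame(x1 = log(df$waiting)))
b$estimate_model(function(df) lm(eruptions ~ 1 + x1, df))
b$fit(faithful)
out <- b$predict(head(faithful))
names(out)
```

Because the section functions and the fitted model are stored in the closure's environment, `predict` can replay the same transformation on new data without the caller repeating it.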
Building machine learning models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions, to go from the original input to original output variables (via the model).
pipeline(.data, ...)
.data |
A data.frame containing the input variables required to fit the pipeline. |
... |
Functions of class |
This function takes individual pipeline sections - functions with class "ml_pipeline_section" - together with the data required to estimate the inner models, returning a machine learning pipeline capable of predicting (scoring) data end-to-end, without having to repeatedly apply input variable (feature and response) transformations and their inverses.
A "ml_pipeline" object containing the pipeline prediction function ml_pipeline$predict() and the estimated machine learning model nested within it ml_pipeline$inner_model().
    data <- faithful
    lm_pipeline <- pipeline(
      data,
      transform_features(function(df) {
        data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
      }),
      transform_response(function(df) {
        data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
      }),
      estimate_model(function(df) {
        lm(y ~ 1 + x1, df)
      }),
      inv_transform_response(function(df) {
        data.frame(pred_eruptions = df$pred_model * sd(df$eruptions) + mean(df$eruptions))
      })
    )
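To make the motivation concrete, the logarithm example from the description can be written out by hand in base R. This sketch does not use the package. It shows the transform, fit and inverse-transform steps that a pipeline is meant to bundle, and the variable names `log_x`, `log_y` and `score` are illustrative choices.

```r
# Manual log-log workflow: transform, fit, predict, inverse-transform.
train <- data.frame(
  log_y = log(faithful$eruptions),  # response transform
  log_x = log(faithful$waiting)     # feature transform
)
fit <- lm(log_y ~ 1 + log_x, train)

# Scoring new data means repeating the feature transform and undoing the response transform.
score <- function(df) {
  log_pred <- predict(fit, data.frame(log_x = log(df$waiting)))
  unname(exp(log_pred))             # inverse of the log response transform
}

preds <- score(head(faithful))
preds
```

Every call site that scores data has to remember both the `log` on the way in and the `exp` on the way out, which is the repetition the pipeline removes.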
Allows you to define, fit and predict machine learning pipelines.
A helper function that takes as its argument an estimated machine learning model and returns a prediction function for use within a machine learning pipeline.
predict_model(.m)
.m |
An estimated machine learning model. |
A unary function of a data.frame that returns the input data.frame with the predicted response variable column appended. This function is assigned the classes "predict_model" and "ml_pipeline_section".
    ## Not run:
    data <- head(faithful)
    m <- estimate_model(function(df) {
      lm(eruptions ~ 1 + waiting, df)
    })
    predict_model(m(data))(data, "pred_eruptions")
    #   eruptions waiting pred_eruptions
    # 1     3.600      79       3.803874
    # 2     1.800      54       2.114934
    # 3     3.333      74       3.466086
    # 4     2.283      62       2.655395
    # 5     4.533      85       4.209219
    # 6     2.883      55       2.182492
    ## End(Not run)
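A prediction function of this shape can be sketched in base R as a closure over the fitted model. The name `make_predictor` and the default column name are assumptions for illustration, not the package's code.

```r
# Hypothetical factory: capture a fitted model, return a function that appends predictions.
make_predictor <- function(.m) {
  function(df, pred_var = "pred_model") {
    df[[pred_var]] <- unname(stats::predict(.m, df))
    df
  }
}

data <- head(faithful)
m <- lm(eruptions ~ 1 + waiting, data)
scored <- make_predictor(m)(data, "pred_eruptions")
names(scored)
```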
Predict method for ML pipelines
    ## S3 method for class 'ml_pipeline'
    predict(object, data, verbose = FALSE, pred_var = "pred_model", ...)
object |
An estimated pipeline object of class |
data |
A data.frame in which to look for input variables with which to predict. |
verbose |
Boolean - whether or not to return data.frame with all input and interim variables as well as predictions. |
pred_var |
Name to assign to the column of predictions from the 'raw' (or inner) model in the pipeline. |
... |
Any additional arguments that need to be passed to the underlying model's predict methods. |
A vector of model predictions or scores (default); or, a data.frame containing the predicted values, input variables, as well as any interim transformed variables.
    data <- faithful
    lm_pipeline <- pipeline(
      data,
      estimate_model(function(df) {
        lm(eruptions ~ 1 + waiting, df)
      })
    )
    in_sample_predictions <- predict(lm_pipeline, data)
    head(in_sample_predictions)
    # [1] 4.100592 2.209893 3.722452 2.814917 4.554360 2.285521
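The S3 dispatch and the effect of the verbose flag can be illustrated with a toy class in base R. The class name `toy_pipeline` and its structure are hypothetical. The sketch mirrors the documented signature but is not the package's method.

```r
# Hypothetical toy class holding only an inner model.
toy <- structure(
  list(model = lm(eruptions ~ 1 + waiting, faithful)),
  class = "toy_pipeline"
)

# S3 predict method: a vector by default, a data.frame when verbose = TRUE.
predict.toy_pipeline <- function(object, data, verbose = FALSE,
                                 pred_var = "pred_model", ...) {
  preds <- unname(stats::predict(object$model, data, ...))
  if (!verbose) {
    return(preds)
  }
  data[[pred_var]] <- preds
  data
}

vec <- predict(toy, head(faithful))
tbl <- predict(toy, head(faithful), verbose = TRUE)
names(tbl)
```

Calling the generic `predict` dispatches on the object's class, so the pipeline can be scored with the same call used for ordinary model objects.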
Helper function that ensures the output of applying a transform function is a data.frame and that this data frame does not duplicate variables from the original (input data) data frame. If duplicates are found they are automatically dropped from the data.frame that is returned by this function.
process_transform_throw_error(input_df, output_df, func_name)
input_df |
The original (input data) data.frame - the transform function's argument. |
output_df |
The transform function's output. |
func_name |
The name of the |
If the transform function is not NULL then a copy of the transform function's output data.frame, with any duplicated inputs removed.
    ## Not run:
    transform_method <- function(df) cbind_fast(df, data.frame(q = df$y * df$y))
    data <- data.frame(y = c(1, 2), x = c(0.1, 0.2))
    data_transformed <- transform_method(data)
    process_transform_throw_error(data, data_transformed, "transform_method")
    # transform_method yields data.frame that duplicates input vars - dropping the following columns: 'y', 'x'
    #   q
    # 1 1
    # 2 4
    ## End(Not run)
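Dropping columns that duplicate the input can be sketched in base R with set operations on column names. The helper `drop_duplicated_inputs` and its message wording are hypothetical.

```r
# Hypothetical helper: keep only output columns that are not already in the input.
drop_duplicated_inputs <- function(input_df, output_df) {
  stopifnot(is.data.frame(input_df), is.data.frame(output_df))
  dup <- intersect(names(output_df), names(input_df))
  if (length(dup) > 0) {
    message("dropping duplicated columns: ", paste(dup, collapse = ", "))
  }
  output_df[setdiff(names(output_df), names(input_df))]
}

data <- data.frame(y = c(1, 2), x = c(0.1, 0.2))
data_transformed <- cbind(data, data.frame(q = data$y * data$y))
cleaned <- drop_duplicated_inputs(data, data_transformed)
cleaned
```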
A function that takes as its argument another function defining a set of feature variable transformations, and wraps (or adapts) it for use within a machine learning pipeline.
transform_features(.f)
.f |
A unary function of a data.frame that returns a new data.frame containing only the transformed feature variables. An error will be thrown if this is not the case. |
A unary function of a data.frame that returns the input data.frame with the transformed feature variable columns appended. This function is assigned the classes "transform_features" and "ml_pipeline_section".
    data <- head(faithful)
    f <- transform_features(function(df) {
      data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
    })
    f(data)
    #   eruptions waiting         x1
    # 1     3.600      79  0.8324308
    # 2     1.800      54 -1.0885633
    # 3     3.333      74  0.4482320
    # 4     2.283      62 -0.4738452
    # 5     4.533      85  1.2934694
    # 6     2.883      55 -1.0117236
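The contract described here, where the supplied function returns only the new columns and the wrapped function returns the input with those columns appended, can be sketched in base R. The adapter `adapt_transform` is hypothetical and skips the class tagging and duplicate handling.

```r
# Hypothetical adapter: call .f, validate its output, append it to the input.
adapt_transform <- function(.f) {
  function(df) {
    new_cols <- .f(df)
    if (!is.data.frame(new_cols)) {
      stop("transform function must return a data.frame")
    }
    cbind(df, new_cols)
  }
}

f <- adapt_transform(function(df) {
  data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
})
out <- f(head(faithful))
names(out)
```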
A function that takes as its argument another function defining a response variable transformation, and wraps (or adapts) it for use within a machine learning pipeline.
transform_response(.f)
.f |
A unary function of a data.frame that returns a new data.frame containing only the transformed response variable. An error will be thrown if this is not the case. |
A unary function of a data.frame that returns the input data.frame with the transformed response variable column appended. This function is assigned the classes "transform_response" and "ml_pipeline_section".
    data <- head(faithful)
    f <- transform_response(function(df) {
      data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
    })
    f(data)
    #   eruptions waiting          y
    # 1     3.600      79  0.5412808
    # 2     1.800      54 -1.3039946
    # 3     3.333      74  0.2675649
    # 4     2.283      62 -0.8088457
    # 5     4.533      85  1.4977485
    # 6     2.883      55 -0.1937539
Custom tryCatch configuration for pipeline segment functions
try_pipeline_func_call(.f, arg, func_name)
.f |
Pipeline segment function |
arg |
Argument of |
func_name |
(Character string). |
Returns the same object as .f does (a data.frame or model object), unless an error is thrown.
    ## Not run:
    data <- data.frame(x = 1:3, y = 1:3 / 10)
    f <- function(df) data.frame(p = df$x ^ 2, q = df$wrong)
    try_pipeline_func_call(f, data, "f")
    # Error in data.frame(p = df$x^2, q = df$wrong) :
    #   arguments imply differing number of rows: 3, 0
    # --> called from within function: f
    ## End(Not run)