Machine learning with PHP

Artificial intelligence (AI) is one of the most talked-about topics these days. The more precise technical term is machine learning, but artificial intelligence sounds more appealing. Nearly every app that we consider AI is built through a process called machine learning, and most of these apps are built with one of two major technologies: Python, with its powerful set of tools such as Scikit-learn, Keras, and Matplotlib, and R, a programming language for statistical computing and graphics that is popular with data scientists.

In this article, I will show how to create machine learning models using a programming language that you may not expect to see outside web development: PHP. Crazy, right? This won't be a deep dive into machine learning or artificial intelligence; there are plenty of those resources available from more experienced authors. I will just share my own experience of using it, testing it, playing around with it, and seeing how well it performs in a data science context. So let's dive in!

⁉️ What is Machine Learning?

Let’s start with a short explanation of what Machine Learning (ML) is. Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. In other words, machine learning algorithms can learn from data and improve their performance over time without being explicitly told what to do.

Machine learning is used in a wide variety of applications, including fraud detection, product recommendation, and medical diagnosis. It is also used in many popular web applications, such as Google Search and Netflix.

📃 Types of Machine Learning

Classical machine learning is a type of artificial intelligence that allows computers to learn without being explicitly programmed. It is a data-driven approach to problem-solving, where algorithms are trained on data to improve their performance over time.

There are four main types of machine learning:

  • Supervised learning: In supervised learning, the algorithm is trained on labelled data, where each input data point has a known output. The algorithm learns to predict the output for new input data points based on the patterns it has learned from the training data.

  • Unsupervised learning: In unsupervised learning, the algorithm is trained on unlabelled data, where the output is unknown. The algorithm learns to find patterns in the data without being explicitly told what to look for.

  • Semi-supervised learning: Semi-supervised learning is a hybrid approach that combines supervised and unsupervised learning. The algorithm is trained on a mixture of labelled and unlabelled data.

  • Reinforcement learning: In reinforcement learning, the algorithm learns to behave in an environment by trial and error. It is rewarded for taking actions that lead to desired outcomes and penalized for taking actions that lead to undesired outcomes.

The type of machine learning algorithm that is used depends on the type of data that is available and the task that needs to be solved. For example, if you have a dataset of labelled customer reviews, you could use a supervised learning algorithm to predict the sentiment of new customer reviews. If you have a dataset of unlabelled images, you could use an unsupervised learning algorithm to cluster the images into different categories.

So where does PHP fit in? After doing some research, I found that PHP can also be used for machine learning with the right library. PHP gets faster with every release, and libraries such as Rubix ML and PHP-ML bring machine learning algorithms to the language.

🔍 Rubix ML Overview

Rubix ML is a free, open-source machine learning library that allows you to build programs that learn from your data using PHP. It provides tools for the entire machine learning life cycle from ETL to training, cross-validation, and production with over 40 supervised and unsupervised learning algorithms.

⁉️ Why choose Rubix ML?

  1. Comprehensive

    Rubix ML provides a wide range of machine learning algorithms, including supervised and unsupervised learning algorithms. It also provides tools for feature selection, model selection, and hyperparameter tuning.

  2. Easy to use

    Rubix ML has a developer-friendly API that is easy to learn and use.

  3. Powerful

    Rubix ML is a powerful library that can be used to solve complex Machine Learning problems.

  4. Open source

    Rubix ML is an open-source library that is free to use and distribute.

🚀 Installation

Install Rubix ML into your project using Composer:

$ composer require rubix/ml

🌸 Iris example dataset

This example is a lightweight introduction to machine learning in Rubix ML using the famous Iris dataset and the K Nearest Neighbors algorithm.

The Iris dataset consists of 50 samples for each of three species of Iris flower: Iris setosa, Iris virginica, and Iris versicolor.

Each sample comprises 4 measurements, or features: sepal length, sepal width, petal length, and petal width. Our objective is to train a K Nearest Neighbors (KNN) classifier on the Iris dataset and use it to determine the species of a set of unknown test samples. Let's get started! We begin by loading the samples and their labels from an NDJSON file into a Labeled dataset object.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\NDJSON;

$training = Labeled::fromIterator(new NDJSON('dataset.ndjson'));

Next, we will select 10 random samples from the training dataset to use later for making example predictions and evaluating the model. To ensure that the samples are truly random, we will use the randomize() method on the dataset object to shuffle the data. Then, we will use the take() method to extract the first 10 rows from the shuffled training dataset and place them in a separate dataset object. We do this because we want to test the model on data that it has not been trained on. This will give us a more accurate assessment of the model's performance.

$testing = $training->randomize()->take(10);
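To make the split concrete, here is a minimal plain-PHP sketch of the same shuffle-then-take idea on an ordinary array, without Rubix ML. The helper name takeRandom() and the toy rows are made up for illustration; the library's own randomize() and take() methods work on dataset objects instead.

```php
<?php

/**
 * Shuffle the rows, then split off the first $n as a held-out test set.
 * Hypothetical helper for illustration; not part of Rubix ML.
 *
 * @return array{0: array, 1: array} [testing rows, remaining training rows]
 */
function takeRandom(array $rows, int $n): array
{
    shuffle($rows);                         // random order; $rows is a local copy
    $testing = array_splice($rows, 0, $n);  // removes the first $n rows from $rows

    return [$testing, $rows];
}

// Toy data: 150 placeholder rows, matching the size of the Iris dataset.
$rows = [];
for ($i = 0; $i < 150; $i++) {
    $rows[] = ['id' => $i];
}

[$testing, $training] = takeRandom($rows, 10);

echo count($testing), ' testing rows, ', count($training), " training rows\n";
```

The important detail is that the held-out rows are removed from the training set, so the model is never evaluated on samples it has already seen.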

Next, we'll instantiate the K Nearest Neighbors classifier and choose the value of the k hyper-parameter. KNN is a distance-based algorithm that finds the k closest samples from the training set and predicts the most common label among them. For example, if we set k to 5, we may find that 4 of the nearest neighbors are labelled Iris setosa and 1 is labelled Iris virginica, in which case the prediction is Iris setosa.

use Rubix\ML\Classifiers\KNearestNeighbors;

$estimator = new KNearestNeighbors(5);
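To illustrate what the classifier does conceptually, the following is a rough plain-PHP sketch of nearest-neighbor voting. It is not the Rubix ML implementation: the function names euclidean() and knnPredict() and the toy measurements are assumptions made for this example.

```php
<?php

/** Euclidean distance between two equal-length numeric vectors. */
function euclidean(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

/**
 * Predict a label by majority vote among the $k nearest training samples.
 * Illustrative helper; not the Rubix ML implementation.
 */
function knnPredict(array $samples, array $labels, array $query, int $k): string
{
    $distances = [];
    foreach ($samples as $i => $sample) {
        $distances[$i] = euclidean($sample, $query);
    }
    asort($distances); // nearest first, sample indices preserved as keys

    $nearest = array_slice(array_keys($distances), 0, $k);

    $votes = [];
    foreach ($nearest as $i) {
        $votes[$labels[$i]] = ($votes[$labels[$i]] ?? 0) + 1;
    }
    arsort($votes); // most common label first

    return (string) array_key_first($votes);
}

// Toy samples: [petal length, petal width] in cm (made-up values).
$samples = [[1.4, 0.2], [1.5, 0.2], [1.3, 0.3], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5]];
$labels  = ['setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'];

echo knnPredict($samples, $labels, [1.6, 0.25], 3), "\n"; // prints "setosa"
```

Because every prediction compares the query with all training samples, this naive version gets slower as the training set grows.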

Now, we're ready to train the learner by calling the train() method with the training set we prepared earlier.

$estimator->train($training);

With the model trained, we can make predictions using the testing data by calling the predict() method on the testing set.

$predictions = $estimator->predict($testing);

We can also measure how accurate the predictions are using one of the metrics provided by the library. Here we use Accuracy, which compares the predictions with the true labels of the testing set.

use Rubix\ML\CrossValidation\Metrics\Accuracy;

$metric = new Accuracy();

$score = $metric->score($predictions, $testing->labels());

We obtained the following result:

[2023-10-08 17:15:35] INFO: Loading data into memory
[2023-10-08 17:15:35] INFO: Training
[2023-10-08 17:15:35] INFO: Making predictions
[2023-10-08 17:15:35] INFO: Accuracy is 0.9

An accuracy of 0.9 means that 90% of the predictions were correct, that is, 9 of the 10 test samples.

Rubix ML also provides regressors, which predict a continuous value instead of a class label. A nice use case is predicting house prices.
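The accuracy score itself is simple to compute by hand. Below is a small plain-PHP sketch of the calculation, with a hypothetical accuracy() helper and made-up labels; it only mirrors the idea behind the library's metric.

```php
<?php

/** Fraction of predictions that match the true labels. Illustrative helper. */
function accuracy(array $predictions, array $labels): float
{
    if (count($labels) === 0 || count($predictions) !== count($labels)) {
        throw new InvalidArgumentException('Need two non-empty arrays of equal length.');
    }

    $correct = 0;
    foreach ($predictions as $i => $prediction) {
        if ($prediction === $labels[$i]) {
            $correct++;
        }
    }

    return $correct / count($labels);
}

// Made-up example: 10 test samples, one of them misclassified.
$labels = [
    'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor',
    'versicolor', 'virginica', 'virginica', 'virginica', 'virginica',
];
$predictions = $labels;
$predictions[5] = 'virginica'; // the single wrong prediction

echo accuracy($predictions, $labels), "\n"; // prints 0.9
```

With only 10 test samples, each mistake moves the score by 0.1, so a larger testing set gives a more reliable estimate.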

🏠 Housing Price example dataset

This example project predicts house prices using a Gradient Boosted Machine (GBM) and a popular dataset from a Kaggle competition.

The data is in two separate CSV files: house-price-labeled.csv, which has labels for training, and house-price-unlabeled.csv, which has no labels and is used for predicting. Each feature column is denoted by a title in the CSV header, which we'll use to identify the column with our Column Picker. Column Picker allows us to select and rearrange the columns of a data table while the data is in flight. It wraps another iterator object such as the built-in CSV extractor. In this case, we don't need the Id column in the first extractor because it is uncorrelated with the outcome, so we'll only specify the columns we need. We also create a second extractor that picks only the Id column, which we will use later. When instantiating a new Labeled dataset object via the fromIterator() method, the last column of the data table is taken to be the labels.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Extractors\ColumnPicker;

$extractor = new ColumnPicker(new CSV('house-price-labeled.csv', true), [
    'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
    'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
    'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
    'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
    'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
    'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2',
    'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir',
    'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
    'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces',
    'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
    'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF',
    'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
    'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold',
    'SaleType', 'SaleCondition', 'SalePrice',
]);

$idExtractor = new ColumnPicker(new CSV('house-price-labeled.csv', true), ['Id']);

$dataset = Labeled::fromIterator($extractor);

Next, we'll apply a series of transformations to the training set to prepare it for the learner. By default, the CSV extractor imports everything as a string, so we must first convert the numerical values to integers and floating point numbers so that the learner recognizes them as continuous features. For this, we can use the Numeric String Converter. Since some feature columns contain missing data, we'll also apply the Missing Data Imputer, which replaces missing values with a reasonable guess.

Let's apply these transformations:

use Rubix\ML\Transformers\NumericStringConverter;
use Rubix\ML\Transformers\MissingDataImputer;

$dataset->apply(new NumericStringConverter())
    ->apply(new MissingDataImputer())
    ->transformLabels('intval');
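As a rough picture of what these two feature transformers do, here is a plain-PHP sketch that converts numeric strings and fills a missing value with the column mean. The helper names, the '?' placeholder, and the sample rows are assumptions for illustration and do not reproduce how Rubix ML implements either transformer.

```php
<?php

/** Turn numeric strings into ints or floats; leave other values untouched. */
function convertNumericStrings(array $rows): array
{
    return array_map(function (array $row): array {
        return array_map(function ($value) {
            // "8450" + 0 gives int 8450, "65.5" + 0 gives float 65.5
            return is_string($value) && is_numeric($value) ? $value + 0 : $value;
        }, $row);
    }, $rows);
}

/** Replace the placeholder in one column with the mean of the known values. */
function imputeMean(array $rows, int $column, string $placeholder = '?'): array
{
    $known = [];
    foreach ($rows as $row) {
        if ($row[$column] !== $placeholder) {
            $known[] = $row[$column];
        }
    }
    if ($known === []) {
        return $rows; // nothing to compute a mean from
    }

    $mean = array_sum($known) / count($known);
    foreach ($rows as $i => $row) {
        if ($row[$column] === $placeholder) {
            $rows[$i][$column] = $mean;
        }
    }

    return $rows;
}

// Made-up rows: [LotFrontage, LotArea, Street], as strings like a CSV import.
$rows = [
    ['65', '8450', 'Pave'],
    ['?', '9600', 'Pave'],
    ['80', '11250', 'Grvl'],
];

$rows = imputeMean(convertNumericStrings($rows), 0);

var_export($rows[1]); // the missing LotFrontage is now 72.5
```

Categorical columns such as Street stay as strings, which is how the learner can tell them apart from continuous features.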

Since the labels should also be continuous, we'll apply a separate transformation to the labels using the standard PHP function intval() as a callback, which converts the values to integers.

A Gradient Boosted Machine (GBM) is a machine learning algorithm that combines multiple weak learners to form a strong learner. A weak learner is a simple model whose predictions are only slightly better than random guessing. A strong learner is a model that makes accurate predictions.

GBMs work by iteratively adding weak learners to the ensemble. At each iteration, the new weak learner is trained on the negative gradient of the loss function with respect to the predictions of the current ensemble. For squared error loss, that negative gradient is simply the residual, so the new weak learner focuses on correcting the errors of the current ensemble.
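The boosting loop can be sketched in a few dozen lines of plain PHP. This is a toy version under stated assumptions: one input feature, one-split "stumps" as the weak learners, squared error loss, and made-up data. The function names are hypothetical and the code is far simpler than the Gradient Boost learner in Rubix ML.

```php
<?php

/**
 * Fit a one-split regression stump on 1-D inputs: try each input value as a
 * threshold and keep the split with the lowest squared error.
 */
function fitStump(array $xs, array $targets): array
{
    $best = ['error' => INF, 'threshold' => 0.0, 'left' => 0.0, 'right' => 0.0];

    foreach ($xs as $threshold) {
        $left = $right = [];
        foreach ($xs as $i => $x) {
            if ($x <= $threshold) {
                $left[] = $targets[$i];
            } else {
                $right[] = $targets[$i];
            }
        }
        if ($left === [] || $right === []) {
            continue; // not a real split
        }
        $leftMean = array_sum($left) / count($left);
        $rightMean = array_sum($right) / count($right);

        $error = 0.0;
        foreach ($left as $t) {
            $error += ($t - $leftMean) ** 2;
        }
        foreach ($right as $t) {
            $error += ($t - $rightMean) ** 2;
        }
        if ($error < $best['error']) {
            $best = ['error' => $error, 'threshold' => $threshold, 'left' => $leftMean, 'right' => $rightMean];
        }
    }

    return $best;
}

/** Train the ensemble: each new stump is fitted to the current residuals. */
function boost(array $xs, array $ys, int $rounds, float $rate): array
{
    $base = array_sum($ys) / count($ys); // initial prediction: the mean target
    $predictions = array_fill(0, count($ys), $base);
    $stumps = [];

    for ($round = 0; $round < $rounds; $round++) {
        $residuals = [];
        foreach ($ys as $i => $y) {
            $residuals[$i] = $y - $predictions[$i]; // negative gradient of squared loss
        }
        $stump = fitStump($xs, $residuals);
        foreach ($xs as $i => $x) {
            $predictions[$i] += $rate * ($x <= $stump['threshold'] ? $stump['left'] : $stump['right']);
        }
        $stumps[] = $stump;
    }

    return ['base' => $base, 'rate' => $rate, 'stumps' => $stumps];
}

/** Predict by adding the shrunken outputs of all stumps to the base value. */
function boostPredict(array $model, float $x): float
{
    $prediction = $model['base'];
    foreach ($model['stumps'] as $stump) {
        $prediction += $model['rate'] * ($x <= $stump['threshold'] ? $stump['left'] : $stump['right']);
    }

    return $prediction;
}

/** Mean squared error of the model on the given data. */
function meanSquaredError(array $model, array $xs, array $ys): float
{
    $sum = 0.0;
    foreach ($xs as $i => $x) {
        $sum += ($ys[$i] - boostPredict($model, $x)) ** 2;
    }

    return $sum / count($xs);
}

// Made-up data: a size score per house and its price in thousands.
$xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
$ys = [100.0, 120.0, 130.0, 200.0, 210.0, 260.0];

$before = meanSquaredError(boost($xs, $ys, 0, 0.1), $xs, $ys);
$after = meanSquaredError(boost($xs, $ys, 50, 0.1), $xs, $ys);

echo "MSE with the mean only: {$before}\n";
echo "MSE after 50 boosting rounds: {$after}\n";
```

The learning rate of 0.1 means each stump only corrects a tenth of the error it finds, which is why many small steps are needed and why the training loss falls gradually from epoch to epoch.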

We can create a GBM learner instance by instantiating Gradient Boost.

use Rubix\ML\Regressors\GradientBoost;
use Rubix\ML\Regressors\RegressionTree;

$estimator = new GradientBoost(new RegressionTree(4), 0.1);

The first two hyper-parameters of Gradient Boost are the booster and the learning rate. For this example, we'll use a standard Regression Tree with a maximum depth of 4 as the booster and a learning rate of 0.1, although other learners can be used as the booster as well.

Now, we're ready to train the learner by calling the train() method with the training dataset as an argument. As we did before...

$estimator->train($dataset);

During training, the progress of each epoch is logged:

[2023-10-19 17:56:30] INFO: Training Gradient Boost (booster: Regression Tree (max height: 4, max leaf size: 3, max features: null, min purity increase: 1.0E-7, max bins: null), rate: 0.1, ratio: 0.5, epochs: 1000, min change: 0.0001, window: 5, hold out: 0.1, metric: RMSE)
[2023-10-19 17:56:30] INFO: Epoch: 1, L2 Loss: 6436384156.2221, Loss Change: ↓INF, RMSE: -71708.764286891
[2023-10-19 17:56:30] INFO: Epoch: 2, L2 Loss: 5630242565.7001, Loss Change: ↓806141590.52197, RMSE: -66407.137811077
[2023-10-19 17:56:31] INFO: Epoch: 3, L2 Loss: 4770622503.1757, Loss Change: ↓859620062.52446, RMSE: -61971.853448989
[2023-10-19 17:56:31] INFO: Epoch: 4, L2 Loss: 4043008935.4127, Loss Change: ↓727613567.76301, RMSE: -57810.401653993
[2023-10-19 17:56:31] INFO: Epoch: 5, L2 Loss: 3386871864.078, Loss Change: ↓656137071.3347, RMSE: -52869.658027195
[2023-10-19 17:56:31] INFO: Epoch: 6, L2 Loss: 2881408466.9165, Loss Change: ↓505463397.16143, RMSE: -49126.644816384
[2023-10-19 17:56:31] INFO: Epoch: 7, L2 Loss: 2463658494.2966, Loss Change: ↓417749972.61996, RMSE: -46003.262222827
[2023-10-19 17:56:31] INFO: Epoch: 8, L2 Loss: 2108573957.896, Loss Change: ↓355084536.40055, RMSE: -43538.115210196
[2023-10-19 17:56:32] INFO: Epoch: 9, L2 Loss: 1838640861.0124, Loss Change: ↓269933096.8836, RMSE: -41170.475905339
[2023-10-19 17:56:32] INFO: Epoch: 10, L2 Loss: 1597249748.4343, Loss Change: ↓241391112.57807, RMSE: -39078.526587176
[2023-10-19 17:56:32] INFO: Epoch: 11, L2 Loss: 1377364288.7838, Loss Change: ↓219885459.65057, RMSE: -36556.015988622
[2023-10-19 17:56:32] INFO: Epoch: 12, L2 Loss: 1220782616.4792, Loss Change: ↓156581672.30458, RMSE: -35054.973980198
[2023-10-19 17:56:32] INFO: Epoch: 13, L2 Loss: 1083909422.4105, Loss Change: ↓136873194.06866, RMSE: -33642.037883807
[2023-10-19 17:56:32] INFO: Epoch: 14, L2 Loss: 976162100.86812, Loss Change: ↓107747321.54242, RMSE: -32753.024359858
[2023-10-19 17:56:33] INFO: Epoch: 15, L2 Loss: 877908777.47182, Loss Change: ↓98253323.396307, RMSE: -31491.748610261
[2023-10-19 17:56:33] INFO: Epoch: 16, L2 Loss: 812525758.20675, Loss Change: ↓65383019.26507, RMSE: -30794.105388278
[2023-10-19 17:56:33] INFO: Epoch: 17, L2 Loss: 742968087.22223, Loss Change: ↓69557670.984522, RMSE: -29889.285010238
[2023-10-19 17:56:33] INFO: Epoch: 18, L2 Loss: 684773684.74305, Loss Change: ↓58194402.479171, RMSE: -29883.999471749
[2023-10-19 17:56:33] INFO: Epoch: 19, L2 Loss: 644918395.24951, Loss Change: ↓39855289.49354, RMSE: -29269.832098457
[2023-10-19 17:56:34] INFO: Epoch: 20, L2 Loss: 607709730.21106, Loss Change: ↓37208665.038457, RMSE: -29067.257462084
[2023-10-19 17:56:34] INFO: Epoch: 21, L2 Loss: 581257898.46296, Loss Change: ↓26451831.748101, RMSE: -28814.753357758
[2023-10-19 17:56:34] INFO: Epoch: 22, L2 Loss: 552260769.5807, Loss Change: ↓28997128.882253, RMSE: -28616.458759402
[2023-10-19 17:56:34] INFO: Epoch: 23, L2 Loss: 528735705.56274, Loss Change: ↓23525064.017968, RMSE: -28542.151094341
[2023-10-19 17:56:34] INFO: Epoch: 24, L2 Loss: 500014462.23253, Loss Change: ↓28721243.330209, RMSE: -28085.448905534
[2023-10-19 17:56:35] INFO: Epoch: 25, L2 Loss: 486252300.75998, Loss Change: ↓13762161.472549, RMSE: -27955.365091657
[2023-10-19 17:56:35] INFO: Epoch: 26, L2 Loss: 471179557.08196, Loss Change: ↓15072743.678016, RMSE: -27541.67537264
[2023-10-19 17:56:35] INFO: Epoch: 27, L2 Loss: 456869450.78453, Loss Change: ↓14310106.297434, RMSE: -27610.582905963
[2023-10-19 17:56:35] INFO: Epoch: 28, L2 Loss: 444041161.13959, Loss Change: ↓12828289.644939, RMSE: -27485.426923627
[2023-10-19 17:56:35] INFO: Epoch: 29, L2 Loss: 427145836.58761, Loss Change: ↓16895324.551974, RMSE: -27344.733854838
[2023-10-19 17:56:36] INFO: Epoch: 30, L2 Loss: 416259219.30291, Loss Change: ↓10886617.284702, RMSE: -27218.292857536
[2023-10-19 17:56:36] INFO: Epoch: 31, L2 Loss: 404938178.83119, Loss Change: ↓11321040.471723, RMSE: -27411.143265803
[2023-10-19 17:56:36] INFO: Epoch: 32, L2 Loss: 395135112.16554, Loss Change: ↓9803066.6656482, RMSE: -27324.247250815
[2023-10-19 17:56:36] INFO: Epoch: 33, L2 Loss: 384453056.68525, Loss Change: ↓10682055.480296, RMSE: -27152.817531035
[2023-10-19 17:56:36] INFO: Epoch: 34, L2 Loss: 379789135.84006, Loss Change: ↓4663920.8451867, RMSE: -27269.680926672
[2023-10-19 17:56:37] INFO: Epoch: 35, L2 Loss: 370961571.07631, Loss Change: ↓8827564.7637495, RMSE: -27157.25989971
[2023-10-19 17:56:37] INFO: Epoch: 36, L2 Loss: 365542631.05462, Loss Change: ↓5418940.0216874, RMSE: -27165.776720778
[2023-10-19 17:56:37] INFO: Epoch: 37, L2 Loss: 360459872.10913, Loss Change: ↓5082758.9454899, RMSE: -27131.32879548
[2023-10-19 17:56:37] INFO: Epoch: 38, L2 Loss: 350958599.22501, Loss Change: ↓9501272.8841182, RMSE: -26961.769954788
[2023-10-19 17:56:37] INFO: Epoch: 39, L2 Loss: 348819400.64492, Loss Change: ↓2139198.5800916, RMSE: -26906.291474111
[2023-10-19 17:56:37] INFO: Epoch: 40, L2 Loss: 343310542.18563, Loss Change: ↓5508858.4592963, RMSE: -26861.164591495
[2023-10-19 17:56:38] INFO: Epoch: 41, L2 Loss: 339369923.15262, Loss Change: ↓3940619.0330026, RMSE: -26796.865049005
[2023-10-19 17:56:38] INFO: Epoch: 42, L2 Loss: 334218230.52533, Loss Change: ↓5151692.6272951, RMSE: -26683.211520229
[2023-10-19 17:56:38] INFO: Epoch: 43, L2 Loss: 329859710.99604, Loss Change: ↓4358519.5292879, RMSE: -26473.740677573
[2023-10-19 17:56:38] INFO: Epoch: 44, L2 Loss: 324846295.37684, Loss Change: ↓5013415.619204, RMSE: -26255.345766531
[2023-10-19 17:56:38] INFO: Epoch: 45, L2 Loss: 317414799.99782, Loss Change: ↓7431495.379018, RMSE: -26234.191216171
[2023-10-19 17:56:39] INFO: Epoch: 46, L2 Loss: 311649575.38238, Loss Change: ↓5765224.6154344, RMSE: -26182.272643805
[2023-10-19 17:56:39] INFO: Epoch: 47, L2 Loss: 307952526.91157, Loss Change: ↓3697048.470809, RMSE: -26129.674044143
[2023-10-19 17:56:39] INFO: Epoch: 48, L2 Loss: 300672369.44642, Loss Change: ↓7280157.4651565, RMSE: -25988.23966495
[2023-10-19 17:56:39] INFO: Epoch: 49, L2 Loss: 297869431.07494, Loss Change: ↓2802938.3714737, RMSE: -26023.519200172
[2023-10-19 17:56:39] INFO: Epoch: 50, L2 Loss: 293476541.42341, Loss Change: ↓4392889.6515364, RMSE: -26046.382216165
[2023-10-19 17:56:40] INFO: Epoch: 51, L2 Loss: 290400510.78881, Loss Change: ↓3076030.6345947, RMSE: -25984.784412911
[2023-10-19 17:56:40] INFO: Epoch: 52, L2 Loss: 285575907.04241, Loss Change: ↓4824603.7464082, RMSE: -26005.007942422
[2023-10-19 17:56:40] INFO: Epoch: 53, L2 Loss: 280953745.41581, Loss Change: ↓4622161.626599, RMSE: -26143.27895326
[2023-10-19 17:56:40] INFO: Epoch: 54, L2 Loss: 277628568.90542, Loss Change: ↓3325176.510388, RMSE: -26206.483797568
[2023-10-19 17:56:40] INFO: Epoch: 55, L2 Loss: 273730914.82035, Loss Change: ↓3897654.0850713, RMSE: -26140.948991003
[2023-10-19 17:56:41] INFO: Epoch: 56, L2 Loss: 268587011.86472, Loss Change: ↓5143902.9556223, RMSE: -26051.624836158
[2023-10-19 17:56:41] INFO: Model state restored to epoch 51
[2023-10-19 17:56:41] INFO: Training complete

During training, the learner records the validation score and the training loss at each iteration, or epoch. The validation score is calculated using the default RMSE metric on a hold-out portion of the training set. The RMSE values in the log are negative because Rubix ML negates error metrics so that a higher score is always better. The training loss, in contrast, is the value of the cost function (in this case the L2 or quadratic loss) computed over the training data. Training stopped at epoch 56 of a possible 1000 because the validation score had not improved for 5 epochs (the window), and the model was restored to its best state at epoch 51.

Rubix ML also helps us visualize the training progress. To get the scores and losses, call the steps() method on the learner instance, then export the iterator it returns to a CSV file that can be plotted.
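The end of the training log above shows early stopping at work. The sketch below replays the validation scores of the last eleven epochs through a simplified stopping rule: stop once the score has not improved for a full window of epochs, then go back to the best epoch. The earlyStop() helper is hypothetical and the rule is an assumption about the general idea; the real learner also takes other settings, such as the minimum change in loss, into account.

```php
<?php

/**
 * Return [epoch at which training stops, epoch with the best score] for
 * validation scores keyed by epoch number, where higher is better.
 * Simplified, assumed stopping rule for illustration only.
 */
function earlyStop(array $scores, int $window): array
{
    $bestScore = -INF;
    $bestEpoch = null;
    $lastEpoch = null;
    $sinceBest = 0;

    foreach ($scores as $epoch => $score) {
        $lastEpoch = $epoch;
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestEpoch = $epoch;
            $sinceBest = 0;
        } else {
            $sinceBest++;
        }
        if ($sinceBest >= $window) {
            break; // no improvement for a full window
        }
    }

    return [$lastEpoch, $bestEpoch];
}

// Validation scores (negated RMSE) copied from epochs 46 to 56 of the log.
$scores = [
    46 => -26182.272643805,
    47 => -26129.674044143,
    48 => -25988.23966495,
    49 => -26023.519200172,
    50 => -26046.382216165,
    51 => -25984.784412911,
    52 => -26005.007942422,
    53 => -26143.27895326,
    54 => -26206.483797568,
    55 => -26140.948991003,
    56 => -26051.624836158,
];

[$stoppedAt, $bestEpoch] = earlyStop($scores, 5);

echo "Stopped at epoch {$stoppedAt}, restoring epoch {$bestEpoch}\n";
// prints "Stopped at epoch 56, restoring epoch 51"
```

Note that the training loss was still falling at epoch 56; it is the score on the hold-out data that stopped improving, which is a typical sign that further epochs would only overfit.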

use Rubix\ML\Extractors\CSV;

$extractor = new CSV('progress.csv', true);

$extractor->export($estimator->steps());

We can also save the trained model to use later! To do this, we wrap the learner in a PersistentModel object before training and give it a persister that tells it where to store the model.

use Rubix\ML\PersistentModel;
use Rubix\ML\Regressors\GradientBoost;
use Rubix\ML\Regressors\RegressionTree;
use Rubix\ML\Persisters\Filesystem;

$estimator = new PersistentModel(
    new GradientBoost(new RegressionTree(4), 0.1),
    new Filesystem('housing.rbx', true)
);

After training, call the save() method.

$estimator->save();
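The idea behind persistence is to turn the trained object into bytes, write them to disk, and rebuild the object later. The sketch below shows that round trip in plain PHP with the native serialize() function and a stand-in TinyModel class, both chosen only for illustration; Rubix ML uses its own persisters and serialization format for this.

```php
<?php

/** A stand-in for a trained model: it just stores a slope and an intercept. */
final class TinyModel
{
    private float $slope;
    private float $intercept;

    public function __construct(float $slope, float $intercept)
    {
        $this->slope = $slope;
        $this->intercept = $intercept;
    }

    public function predict(float $x): float
    {
        return $this->slope * $x + $this->intercept;
    }
}

// Hypothetical file location for the example.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'tiny-model.ser';

// "Training" script: save the model to disk.
$model = new TinyModel(2.0, 1.0);
file_put_contents($path, serialize($model));

// "Prediction" script: load the model back and use it.
$restored = unserialize(
    file_get_contents($path),
    ['allowed_classes' => [TinyModel::class]] // only rebuild the class we expect
);

echo $restored->predict(3.0), "\n";

unlink($path); // clean up the example file
```

Separating the two steps like this is what lets training run once, offline, while predictions are served from a much lighter script.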

Now we're ready to execute the training script by calling it from the command line.

$ php .\rubix-ml-cli train:house-price-predictor

Our goal is to predict the correct sale prices of each house given a list of unknown samples. We'll start by importing the unlabeled samples from the house-price-unlabeled.csv file.

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Transformers\NumericStringConverter;

$dataset = Unlabeled::fromIterator(new CSV('house-price-unlabeled.csv', true))
    ->apply(new NumericStringConverter());

Now, let's load the persisted Gradient Boost estimator built in the last step into our script. We do this with the static load() method on the Persistent Model class, passing it a Persister instance that points to the model in storage.

use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;

$estimator = PersistentModel::load(new Filesystem('housing.rbx'));

To obtain the predictions from the model, call the predict() method with the dataset containing the unknown samples.

$predictions = $estimator->predict($dataset);

Then we'll use the CSV extractor to export the IDs and predictions to a file that we'll submit to the competition.

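use Rubix\ML\Extractors\ColumnPicker;
use Rubix\ML\Extractors\CSV;

use function Rubix\ML\array_transpose;

// Read only the Id column of the unlabeled samples.
$extractor = new ColumnPicker(new CSV('house-price-unlabeled.csv', true), ['Id']);

$ids = array_column(iterator_to_array($extractor), 'Id');

// Prepend the header names so they become the first row of the output.
array_unshift($ids, 'Id');
array_unshift($predictions, 'SalePrice');

$extractor = new CSV('predictions.csv');

// Transpose the two columns into rows of [Id, SalePrice] and write them out.
$extractor->export(array_transpose([$ids, $predictions]));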
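The export step relies on transposing two columns into rows. As a minimal sketch of that idea, the plain-PHP helper below (a hypothetical transpose() function with made-up IDs and prices) zips the columns together with array_map().

```php
<?php

/**
 * Turn a list of two or more columns into a list of rows.
 * Illustrative helper; with a single column array_map() would return it as is.
 */
function transpose(array $columns): array
{
    return array_map(null, ...$columns);
}

// Made-up values, with the header names already prepended to each column.
$ids = ['Id', '1461', '1462', '1463'];
$predictions = ['SalePrice', 128000, 159500, 184250];

foreach (transpose([$ids, $predictions]) as $row) {
    echo implode(',', $row), "\n";
}
```

Each printed line is one row of the final CSV file: the header first, then one Id and SalePrice pair per house.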

Then we run the prediction script from the command line, passing it the saved model file:

$ php .\rubix-ml-cli load:house-price-predictor housing.rbx

Now you can also submit the predictions with their IDs to the contest page to see how well you did. 😊

Have a look at the Gradient Boost documentation page to get a better sense of what the learner can do. Try tuning the hyper-parameters for better results. Consider filtering out noise samples from the dataset by using methods on the dataset object. For example, you may want to remove extremely large and expensive houses from the training set.

📏 Tokenizers

The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text. Rubix ML provides several of them.

Word Stemmer

Word Stemmer reduces inflected and derived words to their root form using the Snowball method. For example, the sentence "Majority voting is likely foolish" stems to "Major vote is like foolish."

use Rubix\ML\Tokenizers\WordStemmer;

$tokenizer = new WordStemmer('english');

Word Tokenizer

The Word tokenizer uses a regular expression to tokenize the words in a blob of text.

use Rubix\ML\Tokenizers\Word;

$tokenizer = new Word();
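For intuition, regex-based word tokenization can be sketched in plain PHP as follows. The tokenizeWords() helper and its pattern are assumptions made for this example and are not the pattern Rubix ML uses.

```php
<?php

/**
 * Split text into word tokens with a regular expression.
 * The pattern (letters, digits, apostrophes) is an assumption for this sketch.
 */
function tokenizeWords(string $text): array
{
    preg_match_all('/[\p{L}\p{N}\']+/u', $text, $matches);

    return $matches[0];
}

print_r(tokenizeWords("PHP can do machine learning, can't it?"));
// tokens: PHP, can, do, machine, learning, can't, it
```

Punctuation and whitespace never match the pattern, so they act as separators and are dropped from the output.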

Whitespace Tokenizer

Tokens are delimited by a user-specified whitespace character.

use Rubix\ML\Tokenizers\Whitespace;

$tokenizer = new Whitespace(',');

Sentence Tokenizer

This tokenizer matches sentences starting with a letter and ending with a punctuation mark.

use Rubix\ML\Tokenizers\Sentence;

$tokenizer = new Sentence();

N-gram Tokenizer

N-grams are sequences of n words of a given string. The N-gram tokenizer outputs tokens of contiguous words ranging from min to max number of words per token.

use Rubix\ML\Tokenizers\NGram;

$tokenizer = new NGram(1, 3);
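To see what word n-grams look like, here is a small plain-PHP sketch with a hypothetical ngrams() helper. It splits on whitespace only and ignores sentence boundaries, so the tokens and their order may differ from what the Rubix ML tokenizer returns.

```php
<?php

/**
 * Build all word n-grams with between $min and $max words.
 * Illustrative helper; not the Rubix ML implementation.
 */
function ngrams(string $text, int $min, int $max): array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $tokens = [];

    for ($start = 0; $start < $total; $start++) {
        for ($n = $min; $n <= $max; $n++) {
            if ($start + $n > $total) {
                break; // not enough words left for a longer token
            }
            $tokens[] = implode(' ', array_slice($words, $start, $n));
        }
    }

    return $tokens;
}

print_r(ngrams('the quick brown fox', 1, 2));
// tokens: the, the quick, quick, quick brown, brown, brown fox, fox
```

Longer n-grams capture more context, such as "machine learning" as a single token, at the cost of a much larger vocabulary.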

K-Skip-N-Gram Tokenizer

K-skip-n-grams are a technique similar to n-grams. In addition to forming n-grams from adjacent words, the tokenizer also skips up to the next k words and forms n-grams from the resulting forward-looking sequences. The tokenizer outputs tokens ranging from min to max number of words per token; the example below uses 2 to 3 words with a skip of 2.

use Rubix\ML\Tokenizers\KSkipNGram;

$tokenizer = new KSkipNGram(2, 3, 2);

🤗 Thank you for reading!

Thank you for sticking with me until the end. I hope you will benefit from this article and incorporate PHP into your Machine Learning experiments.

📕 References

Rubix ML

Kaggle Competition

Repository