3.10 AutoML#
An automated workflow for hyper-parameter tuning and model selection
In this tutorial, we will try a technique that is widely used to make AI/ML work less tedious and to boost the efficiency of your ML workflow.
If you have worked through 3.6, you may have been impressed, but also annoyed, by all the parameter tuning and the many back-and-forth iterations needed to figure out which configuration is optimal for your case. This is widely seen as a major source of low productivity in AI/ML. Since most of that tuning and iteration is mechanical, a natural question arises: can we automate it? The answer is yes, and that is the technique we introduce here: AutoML.
There are many AutoML solutions on the market, e.g., AutoKeras, auto-sklearn, H2O, and Auto-WEKA. Here we will focus on PyCaret, which is popular in both academia and industry and very easy to use.
In the following tutorial, we will use the PyCaret Docker image. In a terminal, call docker to pull the PyCaret image and start a Jupyter notebook:
docker pull pycaret/full
docker run -it -p 8888:8888 -e GRANT_SUDO=yes pycaret/full
Installation on an M1 Mac can be tricky, especially with the lightgbm library. If the Docker image does not work for you, try installing both pycaret and lightgbm locally, as shown in the sketch below.
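A minimal sketch of that local install, run from a notebook cell (on Apple Silicon you may also need the OpenMP runtime, e.g. Homebrew's libomp package; treat the exact steps as machine-dependent):
# Install PyCaret and LightGBM into the active environment
!pip install pycaret lightgbm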
You will then be able to edit a notebook with the following cells:
First, we get the data ready#
As usual, data collection is the first step. To better demonstrate the point of AutoML, we will use the same data as in 3.6 Random Forest.
!pip install wget
Requirement already satisfied: wget in /Users/marinedenolle/opt/miniconda3/envs/mlgeo/lib/python3.9/site-packages (3.2)
import wget
wget.download("https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj")
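If the download stalls or gets interrupted (as can happen with Google Drive links), a simple fallback is to fetch the file with the requests library and write it to temps.csv yourself. This is only a sketch and assumes the link serves the raw CSV directly, which it does for small files:
# Fallback download: fetch the CSV with requests and save it locally
import requests

url = "https://docs.google.com/uc?export=download&id=1pko9oRmCllAxipZoa3aoztGZfPAD2iwj"
resp = requests.get(url, timeout=60)  # fail fast instead of hanging indefinitely
resp.raise_for_status()               # surface HTTP errors early
with open("temps.csv", "wb") as f:
    f.write(resp.content)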
Display the data columns#
Show the columns and settle on the target variable and the input variables; the columns are described below.
# Pandas is used for data manipulation
import pandas as pd
# Read in data and display first 5 rows
features = pd.read_csv('temps.csv')
features.columns
Temp_2: Maximum temperature two days before today.
Temp_1: Maximum temperature yesterday.
Average: Historical average temperature.
Actual: Actual measured temperature today.
Forecast_NOAA: Temperature forecast by NOAA.
Friend: Forecast by a friend (a randomly selected number within plus or minus 20 of the historical average).
We will use the actual column as the label and all the other variables as features.
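As a quick sanity check (not required by PyCaret, just a good habit), you can peek at the label column before handing the data over:
# Inspect the distribution of the target column before modeling
features['actual'].describe()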
Check the data shape#
features.shape
# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows of the remaining columns (skipping the first 5)
features.iloc[:,5:].head(5)
Split training and testing#
As we already did all the quality checks in 3.6, we will not repeat them here and will go directly to the AutoML experiment. First, split the data into training and testing subsets.
train_df = features[:300]
test_df = features[300:]
print('Data for Modeling: ' + str(train_df.shape))
print('Unseen Data For Predictions: ' + str(test_df.shape))
train_df
Run PyCaret (no hassle)#
Let's get straight to the point and let PyCaret tell us if anything is going wrong. It should automatically recognize the columns and assign appropriate data types to them.
As a first step, PyCaret needs you to confirm that the data columns are correctly parsed and that their data types match their values. If they are, press Enter in the pop-up text field.
from pycaret.regression import *
exp_reg101 = setup(data = train_df,
target = 'actual',
# imputation_type='iterative',
fold_shuffle=True,
session_id=123)
Compare Models#
Once you have confirmed that the data types are correct, run the comparison with a single line of code:
best = compare_models(exclude = ['ransac'])
Get Best Model#
It looks great! PyCaret automatically did all the work under the hood and hands us the best model. Look at the RMSE and R2 columns in the comparison table: the best RMSE and R2 are both achieved by Random Forest, which is much clearer at a glance and saves you a lot of time. These scores come from k-fold cross-validation on the training data, so each candidate is evaluated on held-out folds rather than the data it was fit on, which makes the comparison reliable.
The next step is to extract the best model's hyper-parameter configuration; with that, you can consider the hyper-parameter tuning step done and go ahead and train your model (a short sketch follows below).
best
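Since best is a fitted scikit-learn-style estimator, its full hyper-parameter configuration can be read off directly:
# Inspect the hyper-parameters of the winning model
best.get_params()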
If you don't think the best model is the most cost-effective one and want to check more models, you can ask for several at once with top3 = compare_models(exclude = ['ransac'], n_select = 3); top3 will then be a list containing the top 3 models.
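As a minimal sketch of the "train your model" step (assuming the PyCaret session created by setup() above is still active), you can refit the winning model on the full training data and score the held-out set with PyCaret's finalize_model and predict_model:
# Refit the best model on all of train_df, then predict on the unseen test_df
final_model = finalize_model(best)
holdout_preds = predict_model(final_model, data=test_df)
holdout_preds.head()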
Model Interpretation#
You can get more detail about why the best model is the best. PyCaret provides a function called interpret_model that produces a figure showing the influence of each input variable on the predictions. Under the hood it uses the SHAP library, which PyCaret integrates.
interpret_model(best)
Evaluate More Metrics#
PyCaret provides handy widgets and plots that make it easy to visualize and check many other useful metrics from training.
evaluate_model(best)
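If you prefer static figures over the interactive widget (for example, when exporting the notebook), plot_model renders individual diagnostics; 'residuals' and 'error' are standard plot names in PyCaret's regression module, but check the options available in your version:
# Render individual diagnostic plots for the best model
plot_model(best, plot='residuals')  # residual plot
plot_model(best, plot='error')      # prediction error plot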
Troubleshooting#
First-time runners on an M1 Mac might hit this issue: https://github.com/microsoft/LightGBM/issues/1369. Reinstall pycaret and lightgbm and see whether the problem goes away (a sketch follows below). If not, please open a new issue on the GitHub repository's issue page.
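A hedged sketch of that reinstall (the libomp step applies to macOS, assumes Homebrew is available, and may not be needed inside the Docker image):
# Reinstall the two libraries that commonly cause the issue on Apple Silicon
!pip uninstall -y lightgbm pycaret
!pip install lightgbm pycaret
# The OpenMP runtime may also be missing; install it from a terminal:
#   brew install libomp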