Transcript

Inspired by a desire to understand why more engineers weren’t using Ruby for their machine learning (ML) projects, I embarked on a journey to determine if I could build a project to do something non-trivial using ML in Ruby. Once again, it turns out that you can indeed use Ruby to do amazing things! By leveraging various libraries, building machine learning models becomes a breeze.

There is a large array of tools, libraries, and resources available that help Rubyists build ML models, and the list keeps growing. There has never been a better time to dive into machine learning and build interesting projects. It is clear that AI has dramatically shaped the software industry. As such, it makes sense that Rubyists will also partake in these rapidly evolving technologies and create tools and applications never before thought possible!

Resources

00:00
(upbeat music)
00:04
- Welcome. We had a little bit of technical difficulties
00:07
but things seem to be rolling right now, so...
00:10
Thank you for coming to my talk.
00:14
Oops, wrong way.
00:16
Hi, I'm Landon.
00:18
I'm a Senior Monkey Patcher at Test Double.
00:21
That is a name I came up with for myself.
00:24
I'm a senior software consultant at Test Double.
00:28
If you'd like to reach out to me, I'm on LinkedIn,
00:31
Mastodon, and the Bird app.
00:34
So the reason I'm giving this talk is
00:39
that several months ago, maybe a year ago,
00:42
I'd been thinking about it for a while.
00:44
I was thinking about machine learning, AI, and Ruby
00:48
and a lot of people are doing machine learning and Ruby,
00:52
or sorry, machine learning and Python.
00:54
And I was curious why isn't anyone doing machine learning
00:58
and Ruby?
00:59
Like that's what I want to do, that's my native language.
01:02
I don't want to have to write Python, right?
01:04
Can I get a clap for that? I kept hearing that.
01:06
Yeah, I don't want to have to write Python every
01:09
time I want to do something in my main coding language.
01:12
So I want to use Ruby.
01:14
So this talk is going to walk
01:17
through an entire project that I did
01:19
and I have a gift for you all at the end.
01:21
But I'm going to walk through the entire project
01:24
and kind of present to you how to go
01:27
about doing machine learning projects
01:29
because I want you to be able to do it
01:31
in Ruby and not have to learn a bunch of Python.
01:34
So to kind of start us off
01:36
this is sort of like the agenda for the talk.
01:38
So I'm going to set up a problem.
01:41
We're going to collect a little bit of data,
01:43
we're going to do some data preparation.
01:44
We're going to train our own machine learning model.
01:47
So for many of you,
01:48
this is going to be the first time doing that,
01:50
and then we're going to make some predictions.
01:52
So before we get to that, I want to talk about two things.
01:56
I want to talk about tools and I want to talk about libraries.
02:01
So as developers, one of our main tools is our code editor.
02:08
But when you're doing data science work,
02:12
one of the main tools is going to be Jupyter Notebooks,
02:15
which is a program that lets you kind of
02:18
build out your data science project
02:21
in a way that's shareable.
02:23
And you can also execute code so it kind of runs top down.
02:27
So this is an example, sorry.
02:30
And so traditionally, Jupyter notebooks
02:33
have Python in them.
02:35
So you write your Python code in the notebook
02:37
and then you can execute the code in the notebook.
02:42
I'll click down so you can actually see the notebook there.
02:45
And so it'll be Python,
02:46
but here we're going to execute Ruby instead.
02:49
And we're using a tool called iRuby to do that.
02:54
So here we're doing some basic addition
02:57
in Ruby, and then we have
03:00
a method I defined that just prints hello world.
03:03
And you can just do that sequentially.
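As a rough sketch of what those notebook cells contain (plain Ruby that runs anywhere, not just inside an IRuby notebook):

```ruby
# Cell 1: basic arithmetic — the notebook echoes the value of the last expression
sum = 1 + 2

# Cell 2: define a method in one cell...
def hello
  "hello world"
end

# Cell 3: ...and call it in a later cell — state persists across cells,
# which is what makes the top-down execution style work
greeting = hello
```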
03:07
In Jupyter notebook,
03:08
you can also have some really cool visualization tools.
03:10
So here this is the only bit of like Python code
03:14
that will be in this presentation
03:15
but I'm calling a Python library that does
03:19
some visualization stuff.
03:20
There's also some visualization Ruby gems as well.
03:24
But I just want to show you like, hey,
03:25
you can have some visualizations
03:27
so you can kind of download this file
03:30
and like show your business stakeholders
03:32
and kind of show them a whole project that you've done.
03:36
So next I want to talk a little bit about libraries.
03:39
So for this machine learning project
03:42
I'm using three libraries, one's called numo
03:45
one's called daru and one's called rumale.
03:49
Numo is a numerical, N-dimensional array class
03:52
for fast data processing and easy manipulation.
03:56
Daru is a gem that gives you a data structure
04:01
called a data frame, which allows you to do analysis
04:05
manipulation, and visualization of data.
04:07
So I'm not sure how familiar you are with Python
04:11
but numo and daru have analogous Python libraries,
04:17
called NumPy and Pandas.
04:20
So those are replacements for those.
04:22
And then rumale is a gem that allows you to
04:28
use different machine learning algorithms.
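To make the data-frame idea concrete without requiring the gem, this stdlib-only sketch mimics the shape of a Daru::DataFrame with a plain Hash of equal-length columns; the real gem layers indexing, filtering, and plotting on top of this idea, so the Daru calls in the comments are the rough equivalents.

```ruby
# A data frame is conceptually a set of named, equal-length columns.
df = {
  "PRCP" => [0.00, 0.12, 0.00],  # precipitation
  "TMAX" => [71, 68, 74]         # max temperature
}

tmax  = df["TMAX"]             # column access (Daru: df["TMAX"])
nrows = df.values.first.length # row count (Daru: df.nrows)
```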
04:31
So first we're going to set up the problem.
04:34
So I want to predict the weather
04:36
'cause I think that's super cool.
04:38
And specifically I want to predict the max temperature
04:42
for a weather data set.
04:43
So first we need to collect our data.
04:47
So I went online and I found a data set
04:50
from the National Centers for Environmental Information
04:53
and they have a ton of weather data that you can download,
04:57
you can use.
05:00
And specifically I downloaded the weather data set
05:03
for the Atlanta airport and it goes back
05:06
to like 1960 something 'cause I thought it'd be cool
05:11
like we're all in Atlanta and so let's predict
05:16
the max temperature for some given input.
05:20
The next step is data preparation.
05:23
So now that we have our data, we're going to prepare it
05:26
and we're going to import that data into our Jupyter notebook.
05:31
And then we're going to note the rows and the columns.
05:33
We'll see that there's about 20,000 rows
05:36
and there's like 48 columns there.
05:39
And the next line is just duplicating that data.
05:44
So when you're working on a data science project,
05:46
you have, you want to pull in your data
05:47
and there's going to be a lot of changes
05:49
that you're going to make to that data.
05:51
You don't want to actually change the data that
05:52
you're importing 'cause you might have to reference
05:54
that later.
05:55
So say you have 48 columns and you drop a bunch
05:58
of them and you only have five columns left
06:00
you might want to reference those other columns
06:02
but you drop them so they're not there.
06:04
So I'm making a duplication that I can work
06:06
off of and continue working on my project.
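The duplicate-before-you-mutate idea can be sketched in plain Ruby (a Hash of columns stands in for the Daru::DataFrame loaded from the CSV):

```ruby
# Imported data — stand-in for the data frame loaded from the weather CSV
raw = { "PRCP" => [0.0, 0.1], "TMAX" => [71, 68] }

# Duplicate each column so edits to the working copy never touch the original
working = raw.transform_values(&:dup)
working.delete("PRCP")  # drop a column from the working copy...
working["TMAX"][0] = 0  # ...and mutate a value in it

# The original still has every column and value for later reference
raw.key?("PRCP")
```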
06:09
So we're actually going to do that.
06:10
I'm going to drop, go up for a minute,
06:12
I'm going to drop all the rows,
06:15
sorry, all the columns except five.
06:18
So the data set that I got from the website shows
06:24
that there are like five core values that they define.
06:26
So I'm going to use these five core values
06:29
to kind of simplify my project for this example.
06:31
And I'm going to use these as the predictors to predict
06:34
the future max temperature.
06:37
So I'm going to go ahead and I'm going to drop all the
06:40
I'm going to create a new data frame
06:42
I'm going to drop all the other columns
06:44
and then this dot head method will just look
06:46
at the top five rows in that data frame and you can,
06:51
it's just basically like the CSV file's columns of data.
06:56
So it just kind of gives you
06:57
like an overview of what the data looks like.
07:02
And as part of this data processing,
07:05
a lot of times,
07:07
we're going to have to clean up the data.
07:09
So you can't just use the data that you get
07:12
and just throw it into a machine learning algorithm.
07:14
There's a lot more work you have to do
07:16
and that work takes a lot of time.
07:18
So sometimes you have to manage or handle missing values.
07:22
You'll have like nils, you have to decide
07:25
well, do I just want to make the nil a zero
07:27
but that's going to really throw off my data set, right?
07:29
Or do I want to just drop the nil rows, or do
07:32
I want to try to do something called imputing
07:35
where you can take an average
07:36
of all the values for like that specific column
07:39
and just like drop it in there.
07:41
There's a lot of nuance there and you're going to
07:43
have to decide how you're going to want to handle it.
07:45
Sometimes you'll have outliers that are going to
07:47
like really throw off your data set.
07:49
So you might have one through a hundred
07:50
and then you might have a million
07:51
and that's going to affect how your model performs
07:54
and it's going to over-optimize
07:56
for this outlier data point.
07:59
And you don't want that.
08:00
You're going to have to handle that.
08:01
Sometimes you're going to have malformed data,
08:03
you're going to have misspellings.
08:05
Sometimes you're going to have duplicate rows
08:08
in your data as well
08:09
and you're going to have to handle that as well.
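The two nil-handling strategies mentioned here — dropping versus imputing with the column mean — can be sketched on a single column in plain Ruby:

```ruby
# One column with missing values (Ruby nils)
col = [61.0, nil, 64.0, nil, 70.0]

# Option 1: drop the nils entirely
dropped = col.compact

# Option 2: impute — replace each nil with the mean of the present values
present = col.compact
mean    = present.sum / present.size # (61 + 64 + 70) / 3 = 65.0
imputed = col.map { |v| v || mean }
```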
08:11
So, so this is me.
08:17
I had to clean up some data using the daru library.
08:20
So this is actually, I'm just dropping the nil rows.
08:25
The code here is a little bit gnarly
08:27
and I'm not very happy with it.
08:29
There's different data frame libraries
08:32
that just give you a really nice function
08:33
that you can just drop those nil values.
08:37
But I didn't use those this time.
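Independent of any data frame library, dropping every row that contains a nil is a one-liner over rows-as-arrays; this stdlib sketch shows the shape of the operation the gnarlier daru code performs:

```ruby
# Rows as [PRCP, TMAX] pairs; a nil anywhere disqualifies the whole row
rows  = [[0.0, 71], [nil, 68], [0.2, nil], [0.1, 74]]
clean = rows.reject { |row| row.any?(&:nil?) }
```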
08:40
Whew, now I'm tired.
08:42
Data cleaning turns out to be a lot of work.
08:45
So much so that there's a name for it.
08:48
It's the 80/20 rule for data scientists,
08:53
and basically it says that you're going to spend about 80%
08:55
of your time cleaning up the data,
08:59
doing all that data manipulation stuff that I talked about.
09:03
And you're going to spend about 20%
09:04
of your time like building models
09:06
and trying different models and doing everything else.
09:09
So it's very time consuming, it's very tedious.
09:12
But the good thing is we're already 80% of the way there.
09:15
So the last 20% we're going to train our model
09:18
and make those predictions.
09:22
So as we go about training our models,
09:26
we're going to have to split the dataset
09:27
before we're able to train.
09:30
So about 80% of the dataset is going to be used
09:34
for training data and about 20%
09:36
of the dataset is going to be used for testing.
09:40
So what's the difference
09:41
between the training dataset and the testing dataset?
09:44
Well, so the training data is going to be used
09:49
to train the model and then you're going to need a way
09:52
to like validate that it works
09:55
and you're going to want some data points to like put
09:57
into your model to kind of test it.
09:59
So that's what the testing dataset is designed for.
10:06
So what I did here is I just split the data into two.
10:12
This looks really complicated.
10:13
I basically took the first 80%
10:15
of the rows and I said that's going to be my training dataset.
10:21
And I said the last 20% are just going to be
10:23
my testing dataset.
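The split itself is just array slicing; a stdlib sketch of taking the first 80% of rows for training and the rest for testing:

```ruby
rows  = (1..10).to_a            # stand-in for the cleaned weather rows
split = (rows.size * 0.8).floor # index where the 80% cut falls

train     = rows.first(split)   # first 80% → training data
test_rows = rows.drop(split)    # last 20% → testing data
```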
10:26
And since we're using linear regression
10:28
I want to talk a little bit about that.
10:29
That's like the model I chose.
10:31
So there's a lot of different models that you can choose.
10:33
I'm picking linear regression
10:34
'cause I think it's a little bit simpler to understand
10:37
'cause I think some of us have had like maybe exposure
10:40
to some algebra.
10:42
You might, I guess I'll read this.
10:46
Linear regression is an attempt to model the relationship
10:49
between two variables
10:50
by fitting a linear equation to the observed data.
10:55
So you may remember this equation.
10:57
Does anybody know what this is?
10:59
Slope?
11:00
- If you don't, your teachers yelled at you, failed you.
11:03
No, just kidding. Yeah, slope.
11:04
It's the, this is the equation for a line.
11:08
So y equals mx plus b someone said slope.
11:11
So the m is the slope and the b is the y intercept.
11:15
So I prefer it written this way because it,
11:21
it kind of pulls upon sort of our intuition
11:25
as developers where we program with functions and methods
11:29
and I see f of x equals mx plus b, that's just a function
11:34
and I can put some x value in which, oh, our x values
11:37
turn out to be all the data that we want to use to
11:40
predict some other value, which is our y value.
11:44
So if you can imagine like all the data that we have
11:48
we're going to put into that x and out
11:50
it's going to pop some prediction.
11:52
For this example, it's technically multiple linear regression
11:56
'cause we have multiple x values, not just one.
11:59
And those are going to be the columns, the five data points
12:02
that we kind of separated out, the precipitation
12:06
and the snowfall and things like that.
12:08
We're going to use those to predict the max temperature.
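A multiple linear regression prediction is just f(x) = m1·x1 + m2·x2 + … + b. This sketch uses hypothetical coefficients (not the talk's real trained model) to show how several predictors combine into one predicted max temperature:

```ruby
# Hypothetical fitted coefficients — illustrative only
weights   = { "PRCP" => -1.2, "SNOW" => -0.5, "TMIN" => 0.9 }
intercept = 10.0

# One day's predictor values (the x's)
x = { "PRCP" => 0.0, "SNOW" => 0.0, "TMIN" => 60.0 }

# f(x) = m1*x1 + m2*x2 + ... + b
predicted_tmax = intercept + weights.sum { |name, m| m * x[name] }
```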
12:13
So imagine
12:15
and this isn't actually what our data set looks like
12:17
but imagine our data set, if we plotted it
12:20
it looks sort of like this, right?
12:23
It kind of has this like linear pattern.
12:26
So if we're doing a linear regression model
12:30
and we want to plot a line,
12:33
we have to plot the line somehow through this data
12:37
so that it is close to all the different points.
12:41
And that's not really necessarily the best way to do this.
12:45
Like there are other machine learning models
12:48
that can kind of trace through
12:51
all the different data points, you know,
12:53
and have really fine tuned predictions.
12:56
So I'm going to leave that to all of you to kind of look
12:59
at my project and tear it apart and be like, oh,
13:02
I found a different model that works better
13:04
than what the presenter presented.
13:06
So, and I would love for you to message me
13:08
and throw it in my face and say like, look what I did.
13:11
I would love that.
13:14
So this line, this straight line that minimizes the distance
13:19
between all the data points.
13:21
This is called like the best fit line.
13:26
So in order to build the linear regression model,
13:30
it's super, super simple.
13:32
So all you have, this is basically all the code
13:35
that you need to train your model.
13:37
So you're taking all of those x values, the precipitation,
13:40
the minimum temperature, I think it was like the snowfall
13:45
and you're shoving them into the x value and then the
13:50
the model's going to fit your data
13:54
and produce a linear regression model
13:56
and you're going to be able to use that to do predictions.
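The talk trains with rumale's linear regression and its fit call; as a stdlib-only sketch of what "fitting" actually computes, ordinary least squares for a single predictor reduces to two closed-form sums for the slope and intercept:

```ruby
# Toy data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

mean_x = xs.sum / xs.size
mean_y = ys.sum / ys.size

# Slope m = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), intercept b = ȳ - m·x̄
m = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) } /
    xs.sum { |x| (x - mean_x)**2 }
b = mean_y - m * mean_x
```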
13:59
So we have our model, we're done.
14:02
Okay, now I can go home
14:05
hope you had a great Rails Conf, see you next year.
14:09
So that's basically it.
14:11
Now where does Rails kind of come
14:14
into this, building applications?
14:16
You showed me this really interesting project
14:18
and you did a bunch of stuff like what does that mean?
14:20
Like, how can I use this in my app?
14:22
Well, we're going to use it to make predictions
14:26
and this is the line of code that we're going to,
14:32
we have our test data and we're going to put our test data
14:35
into this predict function and it's going to pop
14:40
out that y value, remember y equals mx plus b
14:45
or the way I like to write it is f of x equals mx plus b.
14:48
So we shove all those x values, our predictors, into it,
14:52
those are called independent variables.
14:55
And then out pops a dependent variable which is
14:58
the prediction that we want.
15:00
So theoretically if you're writing a Rails app
15:04
and you did all these steps and then you have your model,
15:07
you can wrap this code right here in some sort of method
15:12
and call it anytime you wanted to predict something
15:15
for your users on your Rails app.
15:17
So I think that's really nifty.
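One way that wrapping might look in a Rails app — the class and method names here are hypothetical, not from the talk's repo, and a stub stands in for the trained model:

```ruby
# Hypothetical wrapper around a trained model that responds to #predict
class TemperaturePredictor
  def initialize(model)
    @model = model # e.g. a trained regression model loaded at boot
  end

  def call(features)
    @model.predict(features)
  end
end

# Usage with a stub standing in for the trained regressor
stub_model = Object.new
def stub_model.predict(features)
  64.0
end

predictor = TemperaturePredictor.new(stub_model)
result    = predictor.call([0.0, 0.0, 60.0])
```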
15:21
So we set up our problem, we collected some data,
15:25
we have some data preparation
15:27
we trained the model and we made some predictions.
15:30
So really that's all there is to it.
15:36
So I just want to thank some folks,
15:38
I want to thank Test Double.
15:40
Andrew Kane is someone who, he's been working a lot
15:46
on machine learning in the Ruby space.
15:49
You should check out his blog.
15:51
And then I've been taking some courses
15:53
from Great Learning just to kind of build
15:55
out Python projects, and I've been trying to
16:00
figure out how to adapt it towards Ruby.
16:03
So I also have a present for you.
16:06
I told you all
16:07
I have something special to kind of give away.
16:09
So I published the project onto GitHub
16:14
and you can kind of look at it.
16:16
So my goal ultimately for this talk is
16:19
that people can download this, look at it, tweak it,
16:22
and kind of use their own data sets that they have
16:26
in their work or just fun data sets that
16:29
they found, and to see that, like,
16:32
this project that I have really isn't that complicated
16:36
and I'm hoping that you'll use your own data sets
16:40
and tweak it and do something interesting with it
16:43
because ultimately, the only way we're going to
16:46
get more machine learning into Ruby and Rails
16:51
is if all of you start working on projects
16:53
and you really don't need some sort of PhD to do this stuff.
16:59
I think there's like academic side
17:01
and there's a place for that.
17:02
But then there's a place for all of us who just
17:04
want to, like, tinker around and play with things.
17:06
So I hope you help me out with that.
17:11
And that's all I had.
17:13
So Test Double, we have an email list
17:16
that you can sign up for, I wanted you to check that out.
17:20
And I'm just going to leave a little bit of time for like
17:22
Q and A and questions as I know a lot of folks have time
17:26
for that or are interested.
17:27
So I see three here, I don't know who was first?
17:30
1, 2, 3
17:33
When we were doing the prediction
17:35
was the x value
17:37
like the day of the year
17:38
and the y value is the temperature? I guess...
17:42
- Is that the prediction line?
17:45
Okay. Yeah, so the x value is all the values
17:48
that we kind of set aside.
17:50
So that would've been the
17:53
let's see if I go here, that would've been these values.
18:00
So it's the precipitation, snowfall,
18:02
snow depth, and the minimum and maximum temperature,
18:07
which, you could use the minimum and maximum temperature,
18:08
but it's basically, I just reduced the number
18:12
of parameters that we're using to predict
18:14
just to kind of simplify it.
18:16
There's not as many things there
18:17
and you can see it kind of in the project.
18:20
I think one piece I sort of missed was
18:22
like I was also using the maximum temperature to
18:25
predict the future maximum temperature
18:28
for like the next day.
18:29
So I kind of took the like,
18:32
for today, there's a max temperature
18:35
for today and then tomorrow there will be a max temperature.
18:38
But I kind of like took tomorrow's max temperature
18:40
from the historical data
18:42
and like kind of moved it upward in the data set.
18:45
You'll see that in the project
18:46
and I'm using the max temperature from the day
18:48
before to predict the max temperature for the next day.
18:51
'Cause those seem to be like slightly correlated.
18:53
Like if it's 60 degrees today,
18:55
it's either going to be like, you know,
18:57
maybe 60 or 62 the next day.
18:59
Also, this isn't like a perfect science and full disclaimer
19:02
like a lot of the things that I did here were to
19:06
kind of present like what a project would look like.
19:09
I would not use linear regression model
19:12
for like a real like forecasting sort of thing like this,
19:16
I'd use something different.
19:18
There's different things you can use.
19:21
Can we still use that?
19:22
Yeah, yeah, so the question is like how
19:24
do you actually like use the model?
19:26
Like, you know, is there a way to like persist it?
19:29
If I recall correctly
19:31
like I haven't gotten to that part yet.
19:32
I think there's a way to like
19:33
there's definitely a way to export the model.
09:38
So in your app also, if you train the model
19:41
and you're using it, it's going to be the same model.
19:45
You're not going to have to retrain it.
19:46
The only time you're going to have to retrain it is, you know,
19:48
you deploy the model, you get new data,
19:51
you want to optimize your model to be a little bit better
19:53
and you can kind of like redeploy it.
19:56
But yeah, I probably need to look more
20:00
into like the exporting and actually like inputting
20:03
into like the Rails app, but it shouldn't be too hard.
20:06
There's like a way to export the model and things like that.
20:08
So a lot of that's going to be in
20:10
like the daru documentation and Andrew Kane site as well.
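One common way to persist a trained Ruby model object between training and a Rails deploy is Marshal serialization; this sketch uses a plain Hash as a stand-in for the model object, and the file path in the comment is illustrative:

```ruby
# A plain Hash stands in for the trained model object
model = { "m" => 2.0, "b" => 1.0 }

# Serialize after training...
blob = Marshal.dump(model)
# ...in a real app you would write `blob` to a file or blob store, e.g.:
#   File.binwrite("model.bin", blob)

# ...and load it back inside the Rails app at boot
restored = Marshal.load(blob)
```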
20:17
I think I'm at time, but thank you.
20:20
If you have any questions, you can talk to me after.

Landon Gray

Status: Double Agent
Code Name: Agent 0083
Location: Rochester, NY