Hey y'all! Same blog, different (Lanyon themed) look.
If you are here for my theoretical articles or notes, they are here.
The sidebar also contains all posts separated by categories here.
Hope you enjoy!


Falling Forward, Not Back: Drake

Keywords: Aubrey Drake Graham, Falling Back Off, Random

This post is for pseudo-informational purposes and entertainment only.

So by this point, you’ve probably heard of Drake. He’s the guy who gave us the infamous Hotline Bling meme, the record for the most Billboard Music Awards, and dozens of Grammy nominations. Needless to say, he is a big deal.

However, a lot of people have argued that his music has been steadily declining since 2015. I wouldn’t go that far, but there is some truth in the fact that I find myself listening to a lot more of his past songs than his current ones. Maybe that’s a problem within rap itself, but with other current rap artists - Migos, 21 Savage, Juice WRLD (RIP) - I find myself listening to their 2017-2019 songs, whereas with Drake I frequently go as far back as 2011-2013.

Ironically, Headlines - easily one of Drake’s best songs - has this line in it:

I had someone tell me I fell off, ouh I needed that

Yet somehow, still so many feel he fell off.

The Good: Drake’s Best

Let’s start off with the good. Drake certainly has plenty of good songs - after all, you don’t enjoy success without some talent.

A list of (my opinion, ofc) Drake’s best songs:

  • Headlines (2011)
  • Marvin’s Room (2011)
  • Energy (2015)
  • Forever (2009)
  • Too Good (2016)
  • Do Not Disturb (2017)
  • Fake Love (2017)
  • Worst Behavior (2013)
  • From Time (2013)
  • Take Care (2011)
  • Hold On, We’re Going Home (2013)
  • Pound Cake + Paris Morton 2 (2013)
  • A Keeper (2022)
  • Sticky (2022)
  • Portland (2017)
  • Falling Back (2022)

I didn’t bold the 2017 songs, even though that’s five years ago!

You might be surprised by the last two. I do think those are decent songs from recent Drake, and perhaps promising of a new wave of talent. The same goes for his featured verse on Churchill Downs - his verse was good to the point where videos have popped up of his section alone.

But from the list, I think it’s clear that, for the most part, we’re 5 years past his prime.

Why the downfall?

I’m not really experienced in analyzing Drake or celebrities in general, but based on my preliminary research I think it mostly comes down to a lack of originality and high predictability.

A lot of people criticize Drake for not trying since his album Views From the Six. Since then, he has released the infamous albums Certified Lover Boy and Honestly, Nevermind. Aside from Falling Back or Sticky, neither really had songs as great as his past work. CLB in particular has been described as having boring beats and a monotone, tuned-out voice. This is in contrast to past songs like Marvin’s Room and Headlines, which had more complex tunes and beats. Worst Behavior and Energy also demonstrate a passionate voice much better than something tuned-out like Girls Want Girls. While the argument could be made that there is just a natural appreciation for loud, energetic sounds, a lack of diversity in an album is never a pro. Maybe this is more of a meta point about rap songs by artists these days than anything else.

Is this to be expected?

To some extent, yes. The pressure of being a star is immense, and expecting consistent hits is unreasonable. But artists like Kendrick Lamar have delivered consistently creative songs. People wouldn’t pay this much attention to Drake’s supposed downfall if they didn’t believe he could do better.

This is the conclusion of this rant. Don’t take anything here seriously.

Gradient Descent Revisited As Euler's Method

Keywords: Gradient Descent, Euler’s Method, Differential Equation

I’ve already talked a fair amount about Gradient Descent. One of the most fascinating things about Gradient Descent is the number of ways in which it secretly packs calculus concepts together. Aside from the fact that it literally uses gradients, or partial derivatives, gradient descent is also derived from a Taylor Series, which I’ve detailed before. Today I’d like to look at Gradient Descent from yet another angle: as an application of Euler’s Method.

Gradient Descent as a Differential Equation

As a refresher Gradient Descent is:

\[\theta_{t} = \theta_{t - 1} - \alpha\nabla_{\theta_{t - 1}} J\]

Unlike in the Taylor Series article, this time we will keep the learning rate \(\alpha\) in consideration.

Gradient Descent can actually be summarized by this differential equation:

\[\dfrac{\partial\theta_{t}}{\partial t} = -\alpha \dfrac{\partial J(\theta_{t})}{\partial\theta_{t}}\]

This differential equation generally cannot be solved analytically. A common way of approximating a function given only an unsolvable differential equation is Euler’s method.

Primer on Euler’s Method

Euler’s Method is a way to approximate any function’s value given an initial starting point and its differential equation. For the given \(y(0)=0\) and the differential equation \(\dfrac{dy}{dx} = \lvert x \rvert\) (treated as if we couldn’t solve it analytically), we can approximate \(y(2)\) by choosing a set number of iterations \(n\) and working our approximation up.

So, for example, if we want to do only 2 iterations, we would increase \(x\) by 1 each time. The first step, calculating \(y(1)\), is shown below.

\[y(1) \approx y(0) + \Delta x*y'(0)\]

where \(y'(0)\) represents the derivative of \(y\) at \(x=0\) and \(\Delta x\) is the change in \(x\) (which in our case is 1). This gives us an approximation of \(y(1) \approx 0\).

We could repeat this one more time and get \(y(2) \approx y(1) + \Delta x*y'(1)\), yielding \(y(2) \approx 1\). Of course this approximation is not accurate - however, as the step size \(\Delta x\) approaches 0, the approximations get more accurate, at the cost of many more iterations. Hey, with gradient descent they always say a lower learning rate \(\alpha\) does better but requires more iterations – I sense some very big similarities.
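
To make this concrete, here’s a minimal Python sketch of Euler’s method on this exact example (the function name and setup are my own, not from any library). Shrinking the step size pushes the estimate toward the true value of \(y(2) = 2\):

def euler_approximate(dydx, y0, x_start, x_end, n_steps):
    """Approximate y(x_end) given y(x_start) = y0 and the derivative function dydx."""
    dx = (x_end - x_start) / n_steps  # step size (Delta x)
    x, y = x_start, y0
    for _ in range(n_steps):
        y += dx * dydx(x)  # y(x + dx) is approximately y(x) + dx * y'(x)
        x += dx
    return y

print(euler_approximate(abs, 0.0, 0.0, 2.0, 2))     # coarse, 2 steps: 1.0
print(euler_approximate(abs, 0.0, 0.0, 2.0, 1000))  # finer, 1000 steps: ~1.998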

Gradient Descent Conclusion

Let’s look at our algorithm again - gradient descent is given by \(\theta_{t} = \theta_{t - 1} - \alpha\nabla_{\theta_{t - 1}} J\). We can see that it is basically Euler’s method with a step size of \(\alpha\), just run downhill instead of uphill (hence the negative sign on the step).
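
As a quick illustration (my own toy objective, \(J(\theta) = \theta^{2}\), whose gradient is \(2\theta\)), the Euler-style loop below is literally the gradient descent update:

def gradient_descent(grad_J, theta0, alpha, n_steps):
    """Euler's method on d(theta)/dt = -grad_J(theta), with the learning rate as the step size."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)  # one Euler step = one gradient descent step
    return theta

# Minimizing J(theta) = theta^2 starting from theta = 5.0; the result heads toward the minimum at 0.
print(gradient_descent(lambda theta: 2 * theta, theta0=5.0, alpha=0.1, n_steps=100))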

Yet another perspective on how to view Gradient Descent. More to come.

Photograph Birds List

Keywords: Bird Species, Random

Note: These are all birds near San Francisco, California. East-coast birds that I have seen are mostly excluded here.

Bird Background

For the last year, I’ve gone outside and taken photos specifically of birds (yes, those things) in my backyard, neighborhood, and local parks and forests. It’s pretty fun, and one of the bird things I’ve always wanted to do is maintain a list of all the birds I’ve photographed (the second is to make a tier list).

Birds Photographed

  • California Scrub Jay
  • Lesser Goldfinch
  • Black Phoebe
  • House Finch
  • Western Bluebird
  • American Robin
  • White-Throated Sparrow
  • Chestnut-Backed Chickadee
  • Dark-eyed Junco
  • California Towhee
  • Mourning Dove
  • White-crowned sparrow
  • Golden-crowned Sparrow
  • Barn Swallow
  • Yellow-rumped warbler
  • Anna’s Hummingbird
  • Ruby Throated Hummingbird
  • Bald Eagle
  • Pine Siskin
  • American Tree Sparrow
  • Brown-headed cowbird
  • Tree Swallow
  • Acorn Woodpecker
  • Brewer’s Blackbird

Locations Photographed

Includes locations with even a single photograph.

  • Bay Area, California
  • San Francisco, California
  • Half Moon Bay, California
  • Denali, Alaska
  • Princeton, New Jersey
  • Philadelphia, Pennsylvania
  • Anne Arundel County, Maryland
  • Ithaca, New York 😉

Bird Tiers

I will now rank the birds in terms of my experiences photographing them and my general appreciation of their beauty in my photos.

Tier 3: Mid Birds

  • Brewer’s Blackbird
  • Pine Siskin
  • Brown-Headed cowbird
  • Mourning Dove
  • California Towhee

Tier 2: Birds that I’ll happily take a photograph of

  • Lesser Goldfinch
  • Pine Siskin
  • Black Phoebe
  • Dark-Eyed Junco
  • House Finch
  • Western Bluebird
  • White-crowned / Golden-crowned Sparrow
  • Yellow-rumped warbler

Tier 1: Birds that Are Rare

  • California Scrub Jay
  • American Robin
  • Tree Swallow
  • Barn Swallow
  • Song Sparrow
  • Bald Eagle
  • Spotted Towhee
  • American Tree Sparrow
  • Acorn Woodpecker

Hidden Taylor Series in Theoretical Machine Learning

Keywords: Taylor Series, Calculus, Gradient Descent, Polynomial Regression, Theoretical ML

This article hopes to provide an alternate look at some of the most foundational algorithms in machine learning, namely Gradient Descent and Polynomial Regression, from an angle of Taylor Series. Taylor Series are an extremely nice approximation method from calculus and are actually quite common in machine learning.

Machine Learning is about creating functions to model (i.e. approximate) functions in data - Taylor series are about approximating functions as well; it should come as no surprise that they are related in many cases.

Taylor Series Primer

So, what are Taylor series? I won’t go into the proof here, because I don’t remember it and I don’t think it’s required, but the main detail is that Taylor series are a way to approximate any (well-behaved) function \(f(x)\) around a given point \(c\) using an infinite sum of polynomials. The formula is shown below:

\[f(x)=\sum_{n=0}^\infty f^{(n)}(c)\frac{(x-c)^n}{n!} = f(c) + f^{(1)}(c)(x-c) + \dfrac{f^{(2)}(c)(x-c)^2}{2} + \ldots\]

Where \(f^{(n)}(c)\) represents the \(n\)-th derivative (or just \(f(c)\) if \(n\) is 0) evaluated at the point \(x=c\). Above, we only explicitly show the summation up to the second-degree term - if we drop the remaining terms \(\ldots\), we are left with the second-degree approximation of \(f(x)\). Using finite approximations of a Taylor series will become important later, as infinite summations are not always possible in a computer.
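
As a tiny illustration (a toy snippet of my own), here’s the second-degree approximation of \(e^{x}\) centered at \(c = 0\) compared against the real value - it’s close near the center and drifts the further out you go:

import numpy as np

def taylor_degree_2(f, df, d2f, c, x):
    """Second-degree Taylor approximation of f around the point c."""
    return f(c) + df(c) * (x - c) + d2f(c) * (x - c) ** 2 / 2

# Every derivative of e^x is e^x, so f, f', and f'' are all np.exp.
x = 0.5
print(taylor_degree_2(np.exp, np.exp, np.exp, c=0.0, x=x))  # 1.625
print(np.exp(x))                                            # ~1.6487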

On with the examples!

Taylor Series in Gradient Descent

Before reading about the usage of Taylor Series in gradient descent, keep in mind that many different calculus concepts (not just taylor series!) play into gradient descent.

Gradient Descent is an iterative algorithm that optimizes some function over time based on its gradient (a vector of partial derivatives). The gradient is a vector that points in the direction of fastest ascent of a curve - with gradient descent you try to go down the curve, and thus you constantly move in the negative of this gradient (moving with the positive is called gradient ascent). Gradient Descent is shown below.

\[\theta_{t} = \theta_{t - 1} - \alpha * \nabla_{\theta_{t - 1}} J\]

For more clarification, \(\theta_{t}\) are the parameters of the model (gradient descent is model agnostic, so this could range from linear regression to GPT-3) at timestep \(t\), and \(\nabla_{\theta_{t - 1}} J\) represents the gradient of the objective function \(J\) that we are trying to minimize. It’s not shown, but \(J\) takes in the data \(x, y\) and the parameters \(\theta\). As stated, we move in the direction of the negative of the gradient. \(\alpha\) is the learning rate and is applied to scale the gradient and prevent overly large updates.

Gradient Descent can be re-thought of as finding some value \(\Delta \theta\) to adjust \(\theta_{t - 1}\) by at each iteration \(t\) such that \(J(\theta_{t - 1} + \Delta \theta)\) is less than \(J(\theta_{t - 1})\). This is where Taylor series come in - they help us find this required change. The Taylor series approximating the new value of the objective function to the first degree is shown below.

Disclaimer: My linear algebra and multivariable calculus skills are kinda DNE, so please don’t try to inspect every term and make sure that the shapes all match up (probably missing a transpose here and there). Instead, try to build intuition for the general approach.

\[J(\theta_{t - 1} + \Delta \theta) \approx J(\theta_{t - 1}) + \nabla_{\theta_{t - 1}} J * \Delta \theta\]

This is a bit confusing, so let’s clarify: in this case the \(x\) in the approximation of \(f(x)\) is actually \(\theta_{t - 1} + \Delta \theta\), and the Taylor series is centered at the point \(c = \theta_{t - 1}\). This means that the first-degree factor \(x - c\) equals just \(\Delta \theta\).

This is where it gets clever. We want \(J(\theta_{t - 1} + \Delta \theta)\) to be less than \(J(\theta_{t - 1})\), so looking at the right side of the approximation, the first-degree term \(\nabla_{\theta_{t - 1}} J * \Delta\theta\) needs to be negative so that it pulls \(J(\theta_{t - 1})\) down. We could just set \(\Delta \theta\) to \(-1\), but because \(\nabla_{\theta_{t - 1}} J\) is a whole vector of numbers, there is no guarantee that all of the numbers inside of it are the same sign. Thus we need to find some value of \(\Delta \theta\) that makes every element of its product with \(\nabla_{\theta_{t - 1}} J\) negative.

This is actually much easier than expected. Just set \(\Delta \theta\) to the negative of \(\nabla_{\theta_{t - 1}} J\): the product becomes the negative of the gradient times itself, and since a gradient squared yields all non-negative values, every element of the first-degree term is negative (or zero). This guarantees the right side can only go down. As we add more degrees to the Taylor polynomial, this approximation keeps getting better.

This means that in our (minimization) algorithm of Gradient Descent, with \(\alpha\) for stability, we have:

\[\theta_{t} = \theta_{t - 1} + \Delta \theta_{t - 1} = \theta_{t - 1} - \alpha * \nabla_{\theta_{t - 1}} J\]

That’s gradient descent!

A lot of this information was taken from here and summarized here. We can expand this method to actually include second and third derivatives as well - but those aren’t used as much due to requiring more computational power (sorry for the explanation that literally everyone gives.)
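
As a quick numerical sanity check of the first-degree reasoning (a sketch with a toy objective of my own, \(J(\theta) = \lVert \theta \rVert^{2}\)), the loss after one descent step lands very close to what the truncated Taylor series predicts, and both sit below the original loss:

import numpy as np

def J(theta):
    """Toy objective: sum of squared parameters."""
    return np.sum(theta ** 2)

def grad_J(theta):
    """Gradient of the toy objective."""
    return 2 * theta

theta = np.array([1.0, -2.0, 0.5])  # J(theta) = 5.25
alpha = 0.01
grad = grad_J(theta)

after_step = J(theta - alpha * grad)                # actual loss after one gradient descent step
predicted = J(theta) - alpha * np.sum(grad * grad)  # first-degree Taylor prediction
print(after_step, predicted)                        # ~5.0421 vs. 5.04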

Taylor Series in Polynomial Regression

Weeee, that was a lot of work for gradient descent! Luckily, the application of Taylor series in polynomial regression is much more straightforward.

Polynomial regression is basically a type of linear regression where a single input (let’s stick to one-dimensional input for simplicity) is raised to higher powers and then the appropriate coefficients are found. The prediction function for a two-degree polynomial regression is shown below.

\[\hat{y} = h(x) = \beta_{0} + \beta_{1}x + \beta_{2}x^{2}\]

As shown above, \(\hat{y}\) is the prediction and \(\beta_{n}\) is the coefficient for \(x\) raised to the \(n\)-th power. Already looks like a Taylor series (or a Maclaurin series, since \(c = 0\)), right?

In fact, it is! It’s that simple. Let’s see if this actually works in practice.

Empirical Experiment with Exponentials

Alliteration, huh? Okay, so the famous taylor series of the function \(e^{x}\) is shown below.

\[e^{x} = \sum_{n = 0}^{\infty} \dfrac{x^{n}}{n!} = 1 + x + \dfrac{x^{2}}{2} + \ldots\]

Will Polynomial Regression (to a degree of 2) actually learn this specific Taylor series? To reiterate: if I train a Linear Regression model with scikit-learn in Python that takes in \(x\) and \(x^{2}\) as input, will it learn \(1, 1, 0.5\) as \(\beta_{0}, \beta_{1}, \beta_{2}\) respectively?

Will this taylor series be learned, from an algorithm derived from taylor series 🤯 ?

from sklearn.linear_model import LinearRegression
import numpy as np

"""
we'll generate from a normal distribution as the
second-degree polynomial approximates best in this range.
"""
X = np.abs(np.random.randn(1000, )) # 1000 standard normal samples
y = np.power(np.e, X) # labels

We can now construct the \(X^{2}\) data. We’ll merge into X_poly which will contain two columns - the first one for \(X\) and the second one for \(X^{2}\).

X_2 = np.power(X, 2)
X_poly = np.concatenate((X.reshape(-1, 1), X_2.reshape(-1, 1)), axis=1)

Let’s apply Linear Regression.

lin_reg = LinearRegression() # lin reg algorithm
lin_reg.fit(X_poly, y)

Let’s inspect the coefficients \(\beta_{1}, \beta_{2}\).

>>> lin_reg.coef_
array([-1.24106858,  2.25427569])

And the bias \(\beta_{0}\).

>>> lin_reg.intercept_
1.5053263303981406

Looks like it won’t.

My hypothesis is that if you use a higher-degree approximation (and of course much more data), it becomes more likely that the coefficients will slowly fall in line with the true Taylor polynomial. In our case, given that the model only saw a sample of \(e^{x}\) and not the whole infinite set of values, it is to be expected that the precise values were not found - the values it did find likely minimized the mean-squared error more for this specific training set \(X\).
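
One way to probe this hypothesis (just a sketch - the degree and sample size here are arbitrary choices of mine, and I haven’t verified how close the fit actually gets) is to fit a higher-degree polynomial regression and compare its coefficients against the true Taylor coefficients \(\dfrac{1}{n!}\):

from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

degree = 4
X = np.abs(np.random.randn(100000))  # much more data than before
y = np.exp(X)

# Columns are x, x^2, x^3, x^4.
X_poly = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X.reshape(-1, 1))
lin_reg = LinearRegression().fit(X_poly, y)

print(lin_reg.intercept_, lin_reg.coef_)                 # fitted beta_0, then beta_1 through beta_4
print([1 / factorial(n) for n in range(1, degree + 1)])  # Taylor targets: 1, 0.5, ~0.167, ~0.042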

Also, if polynomial regression is a Taylor series polynomial, is linear regression then a first-degree Taylor series?

Review

Congrats on making it here!

Taylor Series are a great way to approximate any function with polynomials. They have a lot of hidden applications in machine learning mathematics. We went over their role in helping determine how to move iteratively in Gradient Descent and their very clear application to Polynomial Regression.

Please let me know what more theoretical machine learning applications you find! Thanks for reading.

An explanation of Pareto's Principle

Keywords: Pareto’s Principle, 80/20 rule, Distribution of Outcomes, Applications

A while ago (more like 4 years ago) I discovered Pareto’s principle - also known as the 80/20 rule - which states that 80% of outputs result from 20% of the inputs. Less abstractly, this means that 20% of the input to anything, such as time or effort, will lead to 80% of the actual outputs. For example, only 20% of the features in Microsoft Word will actually be used by 80% of the users, whereas the remaining 80% of the features will be used by only 20% of the users.

Pareto’s distribution is stunningly applicable. 80% of crimes are committed by 20% of registered criminals, and 80% of a business’s wealth comes from only 20% of its clients. 20% of the apps on your phone are likely used 80% of the time. The split does not need to be exactly 80/20; in some cases it can skew even further, to 95/5 or even 99/1 (e.g. the distribution of wealth across a population).

So, how can we use Pareto’s principle in our daily lives? Internalizing the exponential nature of outcomes is probably the best way to apply it. If we want a 90 on a test, we’ll only need to put in about half of the effort and studying time compared to getting a 97+. The same goes for standardized testing; scores get progressively harder to earn at the top. Keeping the 80/20 rule in mind can help us decide when the extra mile is excessive or essential.

So now we’ve gone over the how and the what - only the why remains. Any ideas on why our world and its outcomes are distributed this way? My guess is that the underlying reason for the 80/20 rule is disparity and inequity - one group having much more extreme outliers that overpower all others. The world is moving at an exponential rate; the most successful stocks (which are extremes in themselves) almost always follow exponential curves rather than linear ascents. Moore’s law literally forecasts that the number of transistors on a microchip doubles every 2 years; it’s no secret that we also think the world is constantly growing faster than it ever has before. In short, Pareto’s principle seems to show up wherever disparity, or a gap, grows uncontrollably. In our world, that’s not too uncommon.