Probably Credible
https://www.probablycredible.com/index.html
A blog on tools and techniques in Bayesian inference and machine learning

Phone Number Collisions in the Waiting Room
Hector — Wed, 01 Nov 2023
https://www.probablycredible.com/blog/phone-number-collision-waiting-room/index.html
Some medical waiting rooms use the last 4 digits of a phone number (in America at least) as a unique and private identifier for patients. When checking in, the patients provide their phone number and then they can check wait times on a screen that displays a list of the last 4 digits of phone numbers. In addition to quasi-anonymity, I think this system also has the benefit of allowing an automated audio system to call a person by their number, rather than mangle a last name. But I recently began to wonder how frequently there might be a collision, where two people in the waiting room share the same last 4 digits of their phone numbers. I couldn’t stop thinking about it, so I decided to find out.

North American Telephone System

In America, we have a 10-digit telephone system with three sections:

area code: unique code assigned to a (mostly) geographic region

telephone prefix: unique code assigned to smaller regions within an area code

line number: 4-digit code unique within area code+telephone prefix

In theory, if all phone numbers were available and valid (they’re not), there could be at most 10^{10} (10,000,000,000) telephone numbers in North America.

Considering the 4-digit line number, there would be 10^{4} (10,000) unique numbers. Out of 10,000 possible line numbers, what are the chances that two people share the same one? At first, this might look like a version of the birthday problem.

Birthday Problem

The birthday problem is a probability question that asks: in a set of randomly selected people, what is the probability that at least two people share a birthday? Or asked another way, how many people do you need to select to have a greater than 50% chance of getting a matching birthday? I heard this problem for the first time during my first year of graduate school and was surprised that as few as 23 people are needed. The easiest way to solve this problem is to look at it from the other direction:

What is the probability that in a group of people, no one shares a birthday?

If $p(n)$ is the probability that at least two people in a group of $n$ share a birthday, then $\bar{p}(n)$ is the probability that no one shares a birthday ($\bar{p}(n) = 1 - p(n)$).

To avoid anyone sharing a birthday, we would want to add people to the group who have a new birthday. As we keep adding people to the group, the pool of available dates grows smaller. The first person added to the group can have any of the 365 days of the year, the second person must have one of the 364 remaining days, and the third person must have one of the 363 remaining days. The probability of selecting these people (and dates) is grounded in the 365 days of a year. The probability of picking a first person who doesn't share a birthday is $\frac{365}{365}$, the probability of picking a second person who doesn't share a birthday is $\frac{364}{365}$, and the probability of picking a third person who doesn't share a birthday is $\frac{363}{365}$. Since these events are independent, we multiply them to get the probability that three selected people don't share a birthday:

$$\bar{p}(3) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365}$$

For every additional person, we would multiply in an additional term:

$$\bar{p}(n) = \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - n + 1}{365}$$

This simplifies^{1} to:

$$\bar{p}(n) = \frac{{}_{365}P_{n}}{365^{n}}$$

${}_{365}P_{n}$ indicates a permutation, which is the number of ways that we can draw $n$ items from a pool of 365 items, where order matters.
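As a quick check (my addition, not part of the original post), the explicit product and the permutation form agree for a group of three:

```python
from math import perm

# probability that 3 people all have distinct birthdays, written two ways
explicit = (365 / 365) * (364 / 365) * (363 / 365)
via_perm = perm(365, 3) / 365**3  # 365P3 / 365^3

assert abs(explicit - via_perm) < 1e-12
```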

Let’s run this calculation in Python and compare a range of group sizes (Figure 2).

```python
from math import perm

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from cycler import cycler
from matplotlib import rcParams

%config InlineBackend.figure_format = 'retina'
%load_ext watermark

petroff_6 = cycler(color=["#5790FC", "#F89C20", "#E42536", "#964A8B", "#9C9CA1", "#7A21DD"])
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
rcParams["axes.prop_cycle"] = petroff_6

# array of 2-100 people in our group
n_array = np.arange(2, 101, dtype="int")
prob_array = np.zeros(shape=n_array.shape)

# stopping criteria if our probability is nearly 1
eps = 1e-3

for i, n in enumerate(n_array):
    # calculate probability for each number of people in group
    prob_array[i] = 1 - perm(365, n.item()) / (365 ** n.item())
    # stop early if we approach probability of 1
    if (1 - prob_array[i]) < eps:
        prob_array = prob_array[:i]
        n_array = n_array[:i]
        break

# plot probability for each group size, indicating first group size with probability > 0.5
def plot_match_curve(n_array, prob_array, ax):
    ax.plot(n_array, prob_array)
    idx = np.argmax(prob_array > 0.5)
    print(f"Group size (n): {n_array[idx]}, probability: {prob_array[idx]:.3f}")
    y_lim = ax.get_ylim()
    x_lim = ax.get_xlim()
    ax.vlines(n_array[idx], ymin=y_lim[0], ymax=prob_array[idx], linestyle="--", color="C1")
    ax.hlines(xmin=x_lim[0], xmax=n_array[idx], y=prob_array[idx], linestyle="--", color="C1")
    ax.set_xlim(x_lim)
    ax.set_ylim(y_lim)
    ax.set_xlabel("$n$")
    ax.set_ylabel("$p(n)$")

fig, ax = plt.subplots(figsize=(7, 5), layout="constrained")
plot_match_curve(n_array, prob_array, ax);
```

Group size (n): 23, probability: 0.507

We got the right answer ($n = 23$), which is good. Let's apply the same method to our telephone line number collision.

Line Number Collision as Birthday Problem

If we follow the birthday problem structure, instead of 365 days of the year, we have 10,000 unique line numbers. This gives us a new probability formula,

$$\bar{p}(n) = \frac{{}_{10000}P_{n}}{10000^{n}}$$

which we can use in our Python code (Figure 3):

```python
# array of 1-500 people in our group
n_array = np.arange(1, 501, dtype="int")
prob_array = np.zeros(shape=n_array.shape)

# stopping criteria if our probability is nearly 1
eps = 1e-3

for i, n in enumerate(n_array):
    # calculate probability for each number of people in group
    prob_array[i] = 1 - perm(10000, n.item()) / (10000 ** n.item())
    # stop early if we approach probability of 1
    if (1 - prob_array[i]) < eps:
        prob_array = prob_array[:i]
        n_array = n_array[:i]
        break

# plot probability for each group size, indicating first group size with probability > 0.5
fig, ax = plt.subplots(figsize=(7, 5), layout="constrained")
plot_match_curve(n_array, prob_array, ax);
```

Group size (n): 119, probability: 0.506

This tells us that if we have at least 119 people in the waiting room at the same time, there is at least a 50% probability that at least two people share the same 4-digit line number. I don't think a waiting room would ever hold more than 50 people, so 119 wouldn't be a worry. But, like the birthday problem, it's surprising that with 10,000 line numbers, we only need 119 people to have a greater than 50% chance of a collision.

Solve Without Replacement

But this is a simple approximation of our real problem. In the birthday problem, when adding people, we are drawing ‘with replacement’. All dates are valid (although in our calculation, we were trying to add people with new birthdays), and our denominators remained the same (365).

But telephone numbers are finite and unique. Combining the permutations of both the area code and prefix numbers, there are $10^{6}$ telephone numbers that can share a given line number. We will call this quantity $m$. As we add people to our waiting room, their phone numbers are no longer available in the pool of telephone numbers. In this case, we are drawing 'without replacement'. Additionally, with every person/phone number added to our waiting room, there are an additional $10^{6}$ numbers we are trying to avoid moving forward.

$$p(n) = 1 - \prod_{i=1}^{n} \frac{(10^{4} - i + 1)\,m}{10^{4}\,m - i + 1}$$

where $\prod$ is the product operator and $m = 10^{6}$ is the number of area code + prefix combinations that share each line number.

Let’s see how handling the finite reality of phone numbers changes our result (Figure 4):

```python
# array of 1-2000 people in our group
n_array = np.arange(1, 2_000, dtype="int64")
prob_array = np.zeros(shape=n_array.shape)

num_line_numbers = 10**4
num_area_prefixes = 10**6

# stopping criteria if our probability is nearly 1
eps = 1e-3

prob = 1
for i, n in enumerate(n_array):
    # calculate probability for each number of people in group
    prob_num = (num_line_numbers - n + 1) * num_area_prefixes
    prob_denom = num_area_prefixes * num_line_numbers - n + 1
    prob_n = prob_num / prob_denom
    prob = prob * prob_n
    prob_array[i] = 1 - prob
    # stop early if we approach probability of 1
    if (1 - prob_array[i]) < eps:
        prob_array = prob_array[:i]
        n_array = n_array[:i]
        break

# plot probability for each group size, indicating first group size with probability > 0.5
fig, ax = plt.subplots(figsize=(7, 5), layout="constrained")
plot_match_curve(n_array, prob_array, ax);
```

Group size (n): 119, probability: 0.506

Interestingly, we get the same result! We must have so many phone numbers per line number that it is as though we are sampling with replacement.
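As a sanity check (my addition, not part of the original analysis), we can simulate the finite pool directly: draw distinct 10-digit phone numbers and look for matching last-4-digit line numbers. With 119 people, the estimated collision probability should land near 0.5.

```python
import random

random.seed(0)
TRIALS = 2000
N_PEOPLE = 119

collisions = 0
for _ in range(TRIALS):
    # 119 distinct phone numbers drawn from the full 10-digit pool (without replacement)
    phones = random.sample(range(10**10), N_PEOPLE)
    line_numbers = [p % 10**4 for p in phones]  # last 4 digits
    if len(set(line_numbers)) < N_PEOPLE:
        collisions += 1

p_hat = collisions / TRIALS
print(f"Estimated collision probability: {p_hat:.3f}")
```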

But that makes me wonder: what is the threshold at which $m$ starts to make a difference? $10^{6}$ is clearly high enough, but let's see where the results change (Figure 5).

```python
n_array = np.arange(1, 2_000, dtype="int64")
prob_array = np.zeros(shape=n_array.shape)

num_line_numbers = 10**4
num_area_prefixes_array = np.arange(2, 101)
greater_half_prob = np.zeros(shape=num_area_prefixes_array.shape)

eps = 1e-3

for j, num_area_prefixes in enumerate(num_area_prefixes_array):
    prob = 1
    for i, n in enumerate(n_array):
        # calculate probability for each number of people in group
        prob_num = (num_line_numbers - n + 1) * num_area_prefixes
        prob_denom = num_area_prefixes * num_line_numbers - n + 1
        prob_n = prob_num / prob_denom
        prob = prob * prob_n
        if (1 - prob) > 0.5:
            greater_half_prob[j] = n_array[i]
            break

fig, ax = plt.subplots(figsize=(7, 5), layout="constrained")
ax.step(num_area_prefixes_array, greater_half_prob, where="post")
ax.set_xlabel("#area-prefixes")
ax.set_ylabel("$n$");
```

We can see that with 60 or more area-prefix combinations ($m \geq 60$), we need 119+ people in the waiting room to have a greater than 50% chance of a collision. And with fewer than 60, we need 120-167 people to have a 50% chance of a collision. 60 is far lower than the $10^{6}$ we were previously working with!

In reality, the United States (and its territories) uses 358 area codes, and 210 telephone prefixes are invalid (none start with 0, prefixes that end with 11 are reserved, and two others aren't used). This leaves 790 valid prefixes. Even if we further restrained our problem to consider a waiting room that only served a single area code, there would still be 790 telephone prefixes, and therefore $m = 790$ combinations in use.

If we revisit the probability formula we created above:

$$p(n) = 1 - \prod_{i=1}^{n} \frac{(10^{4} - i + 1)\,m}{10^{4}\,m - i + 1}$$

we can see that as $m$ grows, the $-i + 1$ term in the denominator becomes insignificant and effectively goes to 0. At that point, $m$ can be factored out of every term in the fraction, and we are left with:

$$p(n) = 1 - \prod_{i=1}^{n} \frac{10^{4} - i + 1}{10^{4}}$$

which is the same setup as our birthday problem formula.
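We can also check this limit numerically (my addition): for $m = 10^{6}$, the without-replacement product and the with-replacement (birthday-style) formula agree to well within plotting precision at $n = 119$.

```python
from math import perm

n = 119
m = 10**6   # area code + prefix combinations per line number
L = 10**4   # distinct line numbers

# with replacement (birthday-problem form)
p_birthday = 1 - perm(L, n) / L**n

# without replacement (finite phone-number pool)
no_match = 1.0
for i in range(1, n + 1):
    no_match *= (L - i + 1) * m / (L * m - i + 1)
p_finite = 1 - no_match

assert abs(p_birthday - p_finite) < 1e-3
```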

Conclusion

Starting with a simplification and using the birthday problem as a model, we found that we need at least 119 people in the waiting room to have a greater than 50% chance of a telephone line number collision. Even when considering the valid/in-use area codes and prefixes used in the United States, we still need at least 119 people. This holds true even if we used a single area code (and all valid prefixes within it)! There is still a chance of collisions, but we can see how unlikely they are in a normal-sized waiting room.

Appendix: Environment Parameters

%watermark -n -u -v -iv -w

Last updated: Wed Nov 01 2023
Python implementation: CPython
Python version : 3.11.5
IPython version : 8.16.1
seaborn : 0.13.0
numpy : 1.25.2
matplotlib: 3.8.0
Watermark: 2.4.3

Footnotes

Birthday problem probability simplification: $\frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^{n}} = \frac{365!}{(365 - n)!\,365^{n}} = \frac{{}_{365}P_{n}}{365^{n}}$ ↩︎

Tags: Probability, Python

How Much Will My Dog Weigh? Using Bayesian Modeling to Predict My Dog's Weight
Hector
https://www.probablycredible.com/blog/bayesian-model-dog-weight/index.html
Earlier this year, I got a puppy and always wondered how large she would grow to be. From the moment she came home, I have been weighing her consistently to get a feel for how fast she was growing and guess her weight at maturity. Based on her type, I thought she might grow to 45 lbs, but it was hard to tell where she might stop during the rapid growth phase of puppyhood. My dog’s growth has started to slow down, so I figured I might start to predict my dog’s weight at maturity. I thought it would be fun to write up that process and share it with you all! We will use both Bayesian inference with pymc and non-linear least squares and compare results.

Growth Models

I am sure there is a lot of research around dog or mammal growth curves, and which models are most appropriate. But, I wanted to compare a few models, so I selected three growth models from GraphPad’s website:

Logistic Growth:

$$y = \frac{y_{M}\,y_{0}}{y_{0} + (y_{M} - y_{0})\,e^{-kt}}$$

Gompertz:

$$y = y_{M}\left(\frac{y_{0}}{y_{M}}\right)^{e^{-kt}}$$

Exponential Plateau:

$$y = y_{M} - (y_{M} - y_{0})\,e^{-kt}$$

These three models share the same three parameters, which are used differently in each equation:

$t$: independent time variable. This will be the age of the dog in weeks.

$y_{0}$: initial weight at $t = 0$ (birth). You can see that at $t = 0$, the exponential terms are raised to zero (which is equal to 1), and the equations all collapse to $y_{0}$.

$y_{M}$: maximum asymptotic weight. You can also see that at $t = \infty$, the exponential terms are raised to negative infinity, which zeroes out the term, leaving $y_{M}$.

$k$: growth rate during the exponential phase. All three equations share $k$, but the placement of this term alters how fast the growth really is.

For comparison, here are all three curves with $y_{0} = 10$, $y_{M} = 50$, and $k = 1$ (Figure 1). You can see that Logistic Growth starts growing more slowly than the rest, picks up, but still grows slower than the others. Exponential Plateau starts growing much faster than the rest. I am not sure which of these models will fit our data best, but let's find out! Next, we'll start to build our Bayesian models.
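To make the comparison concrete, here are the three curves as Python functions (my sketch, using the GraphPad parameterizations I believe the post is referring to), along with checks that each collapses to $y_0$ at $t = 0$ and approaches $y_M$ for large $t$:

```python
import numpy as np

def logistic_growth(t, y0, ym, k):
    # symmetric sigmoid: slow start, fast middle, slow plateau
    return ym * y0 / (y0 + (ym - y0) * np.exp(-k * t))

def gompertz(t, y0, ym, k):
    # asymmetric sigmoid: inflection earlier than logistic
    return ym * (y0 / ym) ** np.exp(-k * t)

def exponential_plateau(t, y0, ym, k):
    # no inflection: fastest growth at t = 0
    return ym - (ym - y0) * np.exp(-k * t)

# all three collapse to y0 at t = 0 and approach ym for large t
for f in (logistic_growth, gompertz, exponential_plateau):
    assert np.isclose(f(0.0, 10, 50, 1), 10.0)
    assert np.isclose(f(100.0, 10, 50, 1), 50.0)
```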

Bayesian Inference

In this section, we have the following objectives:

Outline Bayesian model

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Bayesian Model

The bulk of this post is devoted to using Bayesian methods to estimate the parameters of our growth curves and make inferences. So, we will need to start building a Bayesian model (it will be similar for the three curve models mentioned above). We're going to assume that a weight predicted by a growth curve is an idealized weight, but our observations will be distributed around that weight. So given a particular age ($t$), there is an ideal/average weight ($\mu$), from which many individual weights ($y$) can be sampled:

$$y \sim \mathcal{N}(\mu, \sigma)$$

$\sigma$ is the standard deviation of this distribution and indicates the natural variance or error in our measurements.

We already know that $\mu$ is governed by our growth curve models and input ages, but the other parameters of the growth curves are also sampled from distributions. These are our priors, which we will reason through specifying in the next section.

Bayesian inference status:

Outline Bayesian model

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Priors

We will be fitting some models to arrive at our best estimates of the $y_{0}$, $y_{M}$, and $k$ terms. I'll use the same priors for these parameters across the three models. To start, let's pick a prior for $y_{0}$.

Again, this is the weight of the puppy at birth. (Aside: I don't really care about learning the birth weight, but it's part of the model. I won't be upset if our model doesn't cover the weight at younger ages well.) We know $y_{0}$ must be positive and will be relatively close to zero. From some online searches, puppies in general can be anywhere from 0.25 lbs to 2.25 lbs at birth. Given our proximity to 0, let's use a Gamma distribution as a prior for $y_{0}$.

We can use a midpoint of this birthweight range as our mean ($\mu = 1.75$ lbs) and assume this range corresponds to ±3 standard deviations ($\sigma = 0.5$). This will break down a bit because the Gamma distribution is not symmetric, but it's fine for our purposes. We will use the $\alpha$ (shape) and $\beta$ (rate) parametrization of Gamma, and can solve for those parameters from this mean and standard deviation:

$$\alpha = \frac{\mu^{2}}{\sigma^{2}} = 12.25, \qquad \beta = \frac{\mu}{\sigma^{2}} = 7$$
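Moment-matching the Gamma's mean and standard deviation, written out as a quick check (my addition), recovers the shape and rate that appear later in the pm.Gamma("y0", 12.25, 7) call:

```python
mu, sd = 1.75, 0.5

# Gamma(alpha, beta): mean = alpha / beta, variance = alpha / beta**2
alpha = mu**2 / sd**2  # shape
beta = mu / sd**2      # rate

print(alpha, beta)  # 12.25 7.0
```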

The probability density function of this prior is below with the others (Figure 2).

For this dog type, I'm led to believe that 45 lbs is a decent guess for final weight. I could imagine this final weight ranging anywhere from 37.5 to 52.5 lbs, so let's use that range as ±3 standard deviations and use a Normal distribution with a mean of 45 and a sigma of 2.5 for our $y_{M}$ prior. This value should also be positive, but it's far enough from 0 that I'm not worried about drawing negative numbers from this distribution. The probability density function of this prior is below with the others (Figure 2).

I know much less about $k$. My intuition is that it should be small, on the order of 1. In our case, it should also be positive, so let's use a Gamma distribution again. We can set its mean at 1 and give it a standard deviation of 1, which corresponds to $\alpha = 1$ and $\beta = 1$.

A Gamma distribution with $\alpha = 1$ is an Exponential distribution! The Exponential distribution is a special case of the Gamma distribution.

The probability density function of this prior is below with the others (Figure 2). You can see that this really is an Exponential distribution, with a shape different from the typical Gamma distribution.

Visualize Priors

We can use the Python package preliz (from the team behind arviz) to easily plot these priors below. You could also use this package to pick priors more interactively.

```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
from cycler import cycler
from matplotlib import rcParams

%config InlineBackend.figure_format = 'retina'
%load_ext watermark

petroff_6 = cycler(color=["#5790FC", "#F89C20", "#E42536", "#964A8B", "#9C9CA1", "#7A21DD"])
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
rcParams["axes.prop_cycle"] = petroff_6
```

Bayesian inference status:

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Prior Predictive Check

With these priors in hand, for all three models, let's perform a prior predictive check to see if we would get reasonable parameter estimations without seeing any data. Since we are choosing to use the same priors for all three models, we can share the same prior predictive sampling for all three models. We'll generate 50 samples, build growth curves using the $y_{0}$, $y_{M}$, and $k$ from each of the 50 samples, and see if our growth curves look reasonable and valid (Figure 3).

```python
RANDOM_SEED = 14
rng = np.random.default_rng(RANDOM_SEED)

with pm.Model() as model_priors:
    ym = pm.Normal("ym", 45, 2.5)
    y0 = pm.Gamma("y0", 12.25, 7)
    k = pm.Gamma("k", 1, 1)
    idata_priors = pm.sample_prior_predictive(samples=50, random_seed=rng)
```

Looking at the prior predictive checks for all three models, we can see that they mostly seem to plateau around 40 - 50 lbs (we already knew that from Figure 2). But we can also see the range of different growth rates. We did give $k$ a wide prior, so our data will be helpful here in identifying the most probable rates. Some of these curves seem to plateau within 10-20 weeks (2.5-5 months), which seems unlikely. I'm guessing $k$ will end up on the lower end of our prior.

Bayesian inference status:

Outline Bayesian model

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Final Bayesian Model

We now have a final Bayesian model we can use for each curve model. We can now load our data and start building our Bayesian models in pymc.

Data Inspection

So far, we haven’t touched the data at all. We have collected some different growth models and created some priors for the growth model parameters from our pre-existing knowledge or best guesses. Now, we can load the data and finish building our Bayesian models in pymc.

I collected this data since my dog was 9.5 weeks old, and I measured her weight roughly two times a week. The data shown in this post includes up to 38.5 weeks old; hopefully in a few months we’ll be able to use the subsequent data to see which model was truly the best. The weight was measured after a morning walk, but before breakfast, with the same scale, so hopefully the data is as consistent as reasonably possible.

Our goal is to pick a model that fits our current data as well as future months’ weight data. I will reserve the last 4 weeks of this data to validate these models. Hopefully the remaining data captures enough of the slowdown in growth to give an idea of the final weight and ideal model.
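The data-loading code isn't in this extract, so here is a minimal sketch (my reconstruction; the filename and exact loading call are unknown, so I build a few rows by hand) showing how the columns presumably relate, with Age (weeks) derived from Age (days):

```python
import pandas as pd

# hypothetical stand-in rows; the real post presumably loads a CSV of measurements
df = pd.DataFrame({"Age (days)": [67, 71, 74], "Weight (lbs)": [13.1, 13.2, 13.6]})
df["Age (weeks)"] = df["Age (days)"] / 7  # e.g. 67 days -> 9.571429 weeks
```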

```python
pd.concat((df.head(), df.tail()))
```

```
    Age (days)  Age (weeks)  Weight (lbs)
0           67     9.571429          13.1
1           71    10.142857          13.2
2           74    10.571429          13.6
3           77    11.000000          14.2
4           81    11.571429          15.2
48         256    36.571429          39.6
49         259    37.000000          39.2
50         263    37.571429          39.0
51         266    38.000000          39.6
52         270    38.571429          40.2
```

```python
# Label data within the final four weeks as Test set, rest is Train
df["Set"] = ""
max_age = df["Age (weeks)"].max()
df.loc[df["Age (weeks)"] >= (max_age - 4), "Set"] = "Test"
df.loc[df["Age (weeks)"] < (max_age - 4), "Set"] = "Train"
```

```python
df_train = df.loc[df["Set"] == "Train"]
df_test = df.loc[df["Set"] == "Test"]
print(f"Train Set Size: {df_train.shape[0]}, "
      + f"Train Set Percentage: {df_train.shape[0] / df.shape[0]:.1%}")
print(f"Test Set Size: {df_test.shape[0]}, "
      + f"Test Set Percentage: {df_test.shape[0] / df.shape[0]:.1%}")
```

Train Set Size: 44, Train Set Percentage: 83.0%
Test Set Size: 9, Test Set Percentage: 17.0%

We’ll now build the Bayesian models we outlined in Final Bayesian Model section using pymc. Every line in our model has been translated to a line of code.

Instead of using the raw training input directly (df_train["Age (weeks)"]), we are plugging it into pm.MutableData(). This has the benefit of allowing us to run our model once, update the mutable input data, and run the model again for a new purpose (like making predictions). This will save us a lot of effort.

Finally, pm.sample() runs our model with a specified number of tuning steps, samples to collect, and parallel chains.

Let’s start to compare the model fits by looking at the summary of the parameter posterior distributions. For one of our models, we can generate a summary with az.summary():

```python
az.summary(idata_log, kind="stats")
```

```
        mean     sd  hdi_3%  hdi_97%
ym    39.921  0.475  39.027   40.799
y0     4.460  0.230   4.033    4.886
k      0.139  0.004   0.131    0.147
sigma  0.569  0.065   0.449    0.689
```

For all our parameters (remember sigma is a feature of our regression, but not the growth models themselves) we have means, standard deviations (sd) and 94% highest density interval (HDI) (hdi_3% -> hdi_97%). We will use the 94% HDI as our Bayesian credible interval. Given our model and data, we are 94% certain that the true value of our parameters lies within the 94% HDI.

From this summary of the Logistic Growth model, we can see that the most likely value for $y_{M}$ is 39.9 lbs, but it may range between 39.0 and 40.8 lbs. We ultimately care about comparing the whole models, but we can aggregate these summaries for the different models and get some insight into differences between the fits.

Let's take this dataframe and make some simple forest plots of the means and HDIs of each parameter (Figure 5). From the table above and the graphs below, we can see that for $y_{M}$, Logistic Growth and Gompertz Growth estimate similar values of 39.9 lbs and 42.7 lbs, respectively. The 94% HDIs of these two estimates are close but don't overlap. $y_{M}$ from the Exponential Plateau model, on the other hand, is estimated to be 54.1 lbs, with a 94% HDI that spans 8 lbs.

I didn't really care about $y_{0}$ because my purpose is to predict future weights, but there is also quite a range of dog birthweights predicted by the different models. The estimates from Gompertz Growth and Exponential Plateau are reasonable, but the Logistic Growth model estimates a birthweight of 4.5 lbs, which is extremely high!

I originally guessed that the $k$ parameters for the models would be somewhere around 1 (for no particular reason), but it looks like I was off by an order of magnitude or two. Finally, sigma reflects the distribution of error of our observed data with respect to the fit model. sigma for the Exponential Plateau is extremely high! This is a good early warning sign that we may have a poor fit with that model.

Bayesian inference status:

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Compare Growth Curve Posteriors

Let’s finally look at the estimates of these models, putting together the estimates of the individual parameters. We can generate posterior predictive samples using our existing models, which are built off our training data.

Following our Bayesian principles, instead of taking the mean values of our fit parameters, we will plot the curves using 94% HDI bands (Figure 6). Using our training data, and only looking at the range of ages covered by our training data, we can see that the Logistic Growth and Gompertz Growth models are nearly identical and seem to be a good approximation of the underlying function. The Exponential Plateau model, on the other hand, doesn’t match the underlying function. It overestimates the weight at low ages and underestimates it at higher ages. This is why sigma and the current HDI are so large.

Bayesian inference status:

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Predict Future Dog Weights

Now, let’s see how well our models perform on the 4 weeks of recent data that we withheld from the training. We will generate samples from the posterior predictive, again. We want to get these samples based off our actual testing dataset, as well as a range of ages that will show us predicted weights in the future. This is why we used pm.MutableData() for our input age data when building the models. For each model, we can update the input data, and run pm.sample_posterior_predictive(). Since we need to do this for three models, and for two different sets of test data, we will make a small function (ironically, the lines of code are now longer to run it all!).

```python
# Age (week) values ranging from final value in training set, to 70 weeks
weeks_test_range = np.linspace(df_train["Age (weeks)"].max(), 70)

def evaluate_posterior_predictive(model, idata, test_input, random_seed):
    with model:
        # Update mutable input data `weeks`
        model.set_data("weeks", test_input)
        # Generate new posterior predictive samples
        return pm.sample_posterior_predictive(trace=idata, random_seed=random_seed)

posterior_predictive_gompertz_test_range = evaluate_posterior_predictive(
    model=model_gompertz,
    idata=idata_gompertz,
    test_input=weeks_test_range,
    random_seed=rng,
)
posterior_predictive_log_test_range = evaluate_posterior_predictive(
    model=model_log,
    idata=idata_log,
    test_input=weeks_test_range,
    random_seed=rng,
)
posterior_predictive_exp_test_range = evaluate_posterior_predictive(
    model=model_exp,
    idata=idata_exp,
    test_input=weeks_test_range,
    random_seed=rng,
)
```

We'll make plots similar to Figure 6, but will include the testing data and future predictions. It's a lot of duplicate code, so we'll make a function to handle it all for us.

Looking at the plots in Figure 7, we can see that the testing data is already starting to deviate from the projection in the Logistic Growth model, but the Gompertz Growth model still looks good.

Lastly, we can calculate Root Mean Square Error (RMSE) from our test set and samples from the posterior predictive. The testing dataset has some strange trends in it (due to changes in my dog’s appetite and activity, I’m guessing), but we can still use RMSE as a metric of fit. We will update our input data with the ages in our test set and generate new samples.

So, we will take the average across chains and draws to give a single predicted value, which we can compare to our original observed data when calculating the RMSE.
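The calc_rmse helper isn't in this extract; here is a minimal sketch (my reconstruction), assuming the samples arrive as a plain array shaped (chain, draw, n_obs) — the post's version presumably pulls this array out of the posterior predictive InferenceData first:

```python
import numpy as np

def calc_rmse(samples, observed):
    # collapse chains and draws to one predicted value per observation,
    # then compute root mean square error against the observations
    predicted = samples.mean(axis=(0, 1))
    return float(np.sqrt(np.mean((predicted - np.asarray(observed)) ** 2)))

# tiny synthetic check: predictions sit 1.0 lb above the observations -> RMSE of 1.0
samples = np.full((2, 3, 4), 2.0)
observed = np.ones(4)
print(calc_rmse(samples, observed))  # 1.0
```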

```python
gomp_rmse = calc_rmse(posterior_predictive_gompertz_test, df_test["Weight (lbs)"])
print(f"Gompertz Growth Model RMSE: {gomp_rmse:.2f}")
exp_rmse = calc_rmse(posterior_predictive_exp_test, df_test["Weight (lbs)"])
print(f"Exponential Plateau Model RMSE: {exp_rmse:.2f}")
log_rmse = calc_rmse(posterior_predictive_log_test, df_test["Weight (lbs)"])
print(f"Logistic Growth Model RMSE: {log_rmse:.2f}")
```

Gompertz Growth Model RMSE: 0.59
Exponential Plateau Model RMSE: 0.85
Logistic Growth Model RMSE: 1.12

On average, we would expect to see an error of 0.59 lbs when predicting weight in the future with the Gompertz Growth model, 0.85 lbs with the Exponential Plateau model, and 1.12 lbs (!) with the Logistic Growth model. I think these values would be greater if our test data came from the exponential growth period.

Although the Exponential Plateau model seemed to fit the training data more poorly, and has greater uncertainty, it still seems valid for our small testing dataset.

Bayesian inference status:

Outline Bayesian model

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Final Growth Curve Selection

From what we’ve seen thus far, I think the Gompertz Growth model fits our data best. If we look into the future (Figure 8), at around 60 weeks, the separation between the Gompertz Growth and Logistic Growth models will have stabilized. If one of those models is correct, my dog’s weight will have stabilized too! I’ll make a follow-up post a bit later (April 2024) to see if my dog’s weight is between 41.6 - 43.9 lbs.

Bayesian inference status:

Select priors that we will use in our Bayesian models for the growth curves

Check the priors can give us sensical models (without data)

Run Bayesian inference models

Evaluate the quality of posteriors

Evaluate the quality of our models with some reserved recent data

Determine when we can be more certain which model is right, and what age my dog will reach her final weight

Non-linear Least Squares

Well, that ended up much longer than I thought, but I hope you’re still with me. We will quickly fit the same models using Non-linear Least Squares using scipy’s curve_fit() and plot confidence bands with the help of the uncertainties package.

curve_fit() accepts a function (in the style we already defined above for each model), independent data, dependent data, and an optional initial guess of the parameters. The initial guess isn't always needed, but can speed up the process, or allow it to converge at all. I will use the means of our priors from above (with a closer guess for $k$, because it really needed it).

The uncertainties package holds onto numbers along with their uncertainties. It has its own functions that mimic numpy functions but are built to handle and carry uncertainty. After running curve_fit(), we’ll pass our optimal parameters and covariance matrix through it to get uncertain versions of our parameter fits. Since we have np.exp() in our functions, we will also define new versions of our functions that use unp.exp() instead.

import uncertainties as unc
import uncertainties.unumpy as unp
from scipy.optimize import curve_fit
from uncertainties.core import AffineScalarFunc

When comparing the least-squares estimates and the means of our Bayesian posteriors, the estimates of the final-weight parameter from Logistic Growth and Gompertz Growth are within 0.3 lbs. The estimate from Exponential Plateau is within 4.6 lbs. The Exponential Plateau fit is pretty true to the data this time.

You may have noticed that this is because it was allowed to fit the birthweight parameter to a negative value! When using our Gamma distribution prior in the Bayesian modeling, we prevented negative birthweights. We could constrain the parameters when using curve_fit(), too (via its bounds argument). If you don’t need a fully valid model and only care about predicting future weights, loosening our priors could result in a better fit.

Conclusion

We now have some different growth curve models to verify in a few months. I’m going to follow the Gompertz model for now, and predict my dog will reach 41.6 - 43.9 lbs.

Appendix: Environment Parameters

%watermark -n -u -v -iv -w

Last updated: Sat Oct 28 2023
Python implementation: CPython
Python version : 3.11.5
IPython version : 8.16.1
preliz : 0.3.6
xarray : 2023.10.1
numpy : 1.25.2
arviz : 0.16.1
pandas : 2.1.1
matplotlib : 3.8.0
seaborn : 0.13.0
pymc : 5.9.1
uncertainties: 3.1.7
Watermark: 2.4.3

Tags: Python, PyMC, Bayesian, Time Series
https://www.probablycredible.com/blog/bayesian-model-dog-weight/index.html
Sat, 28 Oct 2023 07:00:00 GMT

Version Control Your Jupyter Notebooks with Jupytext
Hector
https://www.probablycredible.com/blog/version-control-jupyter-with-jupytext/index.html

Version Control with Jupyter Notebooks is Bothersome

If you’ve ever accidentally opened a Jupyter notebook (.ipynb) in a text editor, you’ve seen something like this:
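As a rough illustration (a minimal sketch, not a real exported notebook), the raw JSON for a notebook with a single code cell looks something like this:

```json
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": ["print(\"hello\")"]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
```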

This is just a simple Jupyter notebook with a markdown cell and two Python cells:

This is what a simple Jupyter notebook looks like, but once you start generating plots or displaying images, you will end up with a file full of base64 blobs that look like this:

This is how Jupyter stores your images in the notebook, which allows it to display the plots when you open a previously executed notebook. These outputs, along with the JSON plumbing behind the scenes, make Jupyter notebooks messy to version control. The image/figure outputs make Git want to treat the whole .ipynb file as a binary file, which means it can be added or updated, but you won’t get the line-by-line changes you would get with other file types. There are several solutions for tracking changes in Jupyter notebooks (discussed at the end), but the one I’ve used since 2021 is Jupytext.

What is Jupytext?

Jupytext is a tool that synchronizes your Jupyter notebooks (.ipynb) with plain-text file types, which can therefore be used seamlessly with Git. Jupytext was created by Marc Wouts and first publicly announced in a blog post in September 2018.

Jupytext lets you pair Jupyter notebooks with multiple other code or markdown file formats. My preferred format is the “percent Script” which I will detail below. Returning to our simple notebook containing a single line of code, here it is in “percent Script” .py form:
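As a sketch, a percent-format pairing of a tiny notebook looks like this (the header fields below are typical but will vary with your Jupytext version and kernel):

```python
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

# %% [markdown]
# A markdown cell describing the notebook.

# %%
message = "Hello from a percent script"

# %%
print(message)
```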

You’ll notice we have a header with some of our Jupytext configuration, as well as our Jupyter kernel details. Besides the header and the # %% comments, it looks a lot like ordinary Python code. Depending on the purpose of your notebook, you could run this from the command line like any other Python file. Each notebook cell is delimited by a # %% marker, and IDEs like Spyder, VS Code, or PyCharm can execute these cells individually. You’ll also notice that we no longer have the output of our cells; this is just our code (which could be a deal breaker for you). The output still exists in the original .ipynb file, but this .py file is kept clean. Depending on how you configure Jupytext, the paired code files can update every time you save the .ipynb file.

Working with .py Files is Easy!

Most of the benefit of using Jupytext (for me) stems from the ease of working with Python files directly. If this is not a big deal for you, there isn’t much reason to use Jupytext.

These are the benefits I see:

Version control is simple and clean

All your favorite formatters and linters (black, isort, ruff) work natively with .py files, although .ipynb files are getting more support than they used to

Jupyter’s interactivity is unbeatable, but IDEs like VS Code or PyCharm are more feature-rich and more responsive when writing code

If you use the “percent Script” file format, you can run your notebook in other IDEs, too

Jupytext Usage

Once Jupytext is installed (I install it in my main Python environment where Jupyter is installed, not in each Python virtual environment/Jupyter kernel), you can choose to pair individual Jupyter notebooks with the format of your choice by bringing up the Jupyter Command Palette (Ctrl/Command + Shift + C):

This will create a file in the same folder with the same file name, but with a new file extension according to the selected format. If you delete either of these paired files, Jupytext will re-generate it after you save the other one! Using the Jupyter Command Palette method, you can pair multiple file types as long as they all have different file extensions. To stop pairing, open the Jupyter Command Palette and uncheck what you selected (or select Unpair Notebook at the bottom).
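If you prefer the command line, the jupytext CLI does the same pairing (the notebook name here is just a placeholder):

```shell
# Pair a notebook with a percent-format script
jupytext --set-formats ipynb,py:percent notebook.ipynb

# After editing either file, bring the pair back in sync
jupytext --sync notebook.ipynb
```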

Open Files as Notebooks

After installing Jupytext, you’ll see that Jupytext-eligible file types get notebook icons, even if they haven’t been paired to anything.

Furthermore, if you right click on one of these Jupytext-eligible files, you can choose to open it as a notebook. With the default settings (which can be changed: Always Open .py as Notebook), all of these files will still open as their plain, original file type in Jupyter.

You can treat .py files (ideally ones that start out empty) as Jupyter notebooks by opening them this way, without ever pairing them to an actual .ipynb. Jupytext maintains one for you somewhere else, but you don’t have to worry about that.

Configuring Jupytext with pyproject.toml/jupytext.toml

In addition to pairing notebooks on an individual basis, you can also use a configuration file in your folder or repo to set some default Jupytext behavior. For example, this is how I like to setup my analysis projects:

I work with the notebook .py files and completely ignore the .ipynb files that are tucked away in nb_ipynb/. I assume they are there, but I never look at them! I ignore .ipynb files in my .gitignore too!

Here is the jupytext.toml file I use to accomplish this:
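A minimal jupytext.toml along these lines would pair percent scripts in the working folder with notebooks tucked into nb_ipynb/ (treat this as a sketch of the idea rather than an exact copy):

```toml
# Pair each notebook with a percent script in the working folder,
# keeping the .ipynb twin in nb_ipynb/
formats = "nb_ipynb//ipynb,py:percent"
```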

I hardly ever open plain Python files in Jupyter; I use Jupyter for interactive coding but leave package development and writing shared files to VS Code. Because of that, you can set Jupytext as the default viewer for various file types, so that double clicking a .py or .rmd file opens it as a Jupyter notebook. You can still right click one of these files and choose to open it in its original form.

Building a Quarto website with Jupytext (this website!)

I have flipped things about a bit, but I am also using Jupytext with this website, which is made using Quarto. Quarto pages can be rendered from .md, .qmd, or .ipynb files. If you are running Python cells in a .qmd file, I believe the .qmd file is converted to an .ipynb file first (and run) before being rendered. While it would be easy to keep everything in Jupyter notebooks, RStudio does offer some benefits when working with .qmd files, so it has been convenient to keep that file type on hand. Some posts (like this one) don’t need any Python at all and are pure .qmd. When building the website, I believe Quarto looks for the file types it can render and renders them all. It ignores files or folders that start with an underscore, so I put the paired files there. This is how my directory looks:

notebook_metadata_filter="all,-widgets,-varInspector"
# Pair .qmd, .py versions in folders to .ipynb in main folder
formats="ipynb,_notebooks_py//py:percent,_notebooks//qmd"

You can see the format at the end is a little different, but this is how I was able to get the results I wanted.

Who would not want to use Jupytext?

I really enjoy using Jupytext, and I hope you will try it, but it’s not for everyone or every situation.

To start, you can see that none of the output is included in the synced Jupytext files. All that information still exists in the .ipynb files, but not in the .py files. Not even text-based output. I am mostly fine with this. This behavior is similar to that of .rmd files, so it must work for R users.

If you rely on noticing changes in plots when re-running notebooks to detect something has changed, you will need some other method. I like to save modeling fits to files anyway (and version control them), which will notify me if something has unexpectedly changed.

If you are collaborating with others and don’t have tightly adhered requirements, your collaborator might re-run your notebook and not realize that they’re seeing a different plot than you are. Again, generating and versioning text file artifacts can help here.

It’s really nice to read through a tutorial for a Python package on its website, and download its original .ipynb file. I always re-run the files anyway, but it would be less useful if the notebook were devoid of all output.

Jupytext alternatives

I believe nbdime is the most popular tool to work with Jupyter notebooks and version control. You can download and install it yourself, but I believe it is incorporated in notebook diff comparisons in VS Code and GitHub. Although Jupytext is another tool to rely on, I like converting to plain text files ahead of time, so I can just use Git or GitHub and not have to worry about fussing with a rich, graphical diff of a file.

I hope you give Jupytext a try and contribute to the Jupytext community if there is a special case that is not covered yet.

Tags: Jupyter, Version Control, Python
https://www.probablycredible.com/blog/version-control-jupyter-with-jupytext/index.html
Wed, 25 Oct 2023 07:00:00 GMT