Goal: Predict the time to calculate a full 443392 element correlation matrix based on timings for smaller matrices
%matplotlib notebook
import matplotlib.pyplot as plt
from scipy.stats import linregress
import numpy as np
N = [10.,100.,500.,1000.,2000.,10.,100.,1000.,2000.,9740.]
t = [1.16e-3,85.2e-3,1.71,6.71,26.5,27e-3,80.7e-3,6.83,26.6,10*60.+38]
fig = plt.figure()
plt.plot(N,t,"bo")
plt.xlabel("transcripts")
plt.ylabel("time/s")
We expect quadratic scaling. More generally, the times should fit:
t=aNb
This is linear in log space:
logt=blogaN=bloga+blogN=A+blogN
So, log transform:
lN = np.log(N)
lt = np.log(t)
fig = plt.figure()
plt.plot(lN,lt,"bo")
plt.xlabel("ln(transcripts)")
plt.ylabel("ln(time/s)")
Yep, looks linear (but note the noise for very small N).
Linear fit:
slope, intercept, r_value, p_value, std_err = linregress(lN,lt)
fig = plt.figure()
plt.plot(lN,lt,"bo")
plt.plot([lN[0],lN[-1]],[lN[0]*slope+intercept,lN[-1]*slope+intercept],"r-")
plt.xlabel("ln(transcripts)")
plt.ylabel("ln(time/s)")
print "lt = %s*lN+%s r = %s p = %s std_err = %s" % (slope, intercept, r_value, p_value, std_err)
Nice fit, but note that we underpredict the last data point. Also note that the slope is smaller than the expected value of 2 (suggesting subquadratic scaling).
Let's drop the noisy small N points and refit:
N = [100.,500.,1000.,2000.,100.,1000.,2000.,9740.]
t = [85.2e-3,1.71,6.71,26.5,80.7e-3,6.83,26.6,10*60.+38]
lN = np.log(N)
lt = np.log(t)
slope, intercept, r_value, p_value, std_err = linregress(lN,lt)
fig = plt.figure()
plt.plot(lN,lt,"bo")
plt.plot([lN[0],lN[-1]],[lN[0]*slope+intercept,lN[-1]*slope+intercept],"r-")
plt.xlabel("ln(transcripts)")
plt.ylabel("ln(time/s)")
print "lt = %s*lN+%s r = %s p = %s std_err = %s" % (slope, intercept, r_value, p_value, std_err)
Improved fit with better prediction for large N. Also, note that the slope is now close to 2, as expected for quadratic scaling.
Let's plot the curve in the original space:
x = np.arange(1,9740,10)
y = np.exp(np.log(x)*slope+intercept)
fig = plt.figure()
plt.plot(N,t,"bo")
plt.plot(x,y,"r-")
plt.xlabel("transcripts")
plt.ylabel("time/s")
Note that we're still slightly underpredicting large N. We could try improving our linear estimate with non-linear regression or by collecting additional data between 4000 and 8000, but this is good enough for a rough extrapolation.
What is our prediction for the full matrix?
In seconds:
44339*slope+intercept
In days:
(44339*slope+intercept)/60./60./24.
So, last night's estimate was too high. The full calculation would take a bit more than a day.
(Given sufficient memory)