Paleowave

Pubsplained #1: How to fit a straight line through a set of points with uncertainty in both directions?

Publication

Thirumalai, K., Singh, A., & Ramesh, R. (2011). A MATLAB™ code to perform weighted linear regression with (correlated or uncorrelated) errors in bivariate data. Journal of the Geological Society of India, 77(4), 377–380. 
doi: 10.1007/s12594–011–0044–1

Summary

We present a code that fits a line through a set of points (“linear regression”). It is based on math first described in 1966 that provides general and exact solutions to the multitude of linear regression methods out there. Here is a link to our code.

Pubsplainer

Fitting a straight line through a bunch of points with X and Y uncertainty.

Fitting a straight line through a bunch of points with X and Y uncertainty.

My first peer-reviewed publication in the academic literature described a procedure to perform linear regression, or, in other words, build a straight line (of “best fit”) through a set of points. We wrote our code in MATLAB and applied it to a classic dataset from Pearson (1901).

“Why?”, you may ask, perhaps followed by “doesn’t MATLAB have linear regression built into it already?” or “wait a minute, what about polyfit?!”

Good questions, but here’s the kicker: our code numerically solves this problem when there are errors in both x and y variables… and… get this, even when those errors might be correlated! And if someone tells you that there is no error in the x measurement or that errors are rarely correlated - I can assure you that they are most probably erroneous.

York was the first to find general solutions for the “line of best fit” problem when he was working with isochron data where the abscissa (x) and ordinate (y) axis variables shared a common term (and hence resulted in correlated errors). He first published the general solutions to this problem in 1966 and subsequently published the solutions to the correlated-error problem in 1969.

If these solutions were published so long ago, why are there so many different regression techniques detailed in the literature? Well, it’s always useful to have different approaches to solving numerical problems, but as Wehr & Saleska (2017) point out in a nifty paper from last year, the York solutions have largely remained internal to the geophysics community (in spite of 2000+ citations), escaping even the famed “Numerical Recipes” textbooks. Furthermore, they state that there is abundant confusion in the isotope ecology & biogeochemistry community about the myriad available linear regression techniques and which one to use when. I can somewhat echo that feeling when it comes to calibration exercises in the (esp. coral) paleoclimate community. A short breakdown of these methods follows.

Ordinary Least Squares (OLS) or Orthogonal Distance Regression (ODR) or Geometric Mean Regression (GMR): which one to use?!

Although each one of these techniques might be more appropriate for certain sets of data versus others, the ultimate take-home message here is that all of these methods are approximations of York’s general solutions, when particular criteria are matched (or worse, unknowingly assumed).

  • OLS provides unbiased slope and intercept estimates only when the x variable has negligible errors and when the y error is normally distributed and does not change from point to point (i.e. no heteroscedasticity).
  • ODR, formulated by Pearson (1901), works only when the variances of the x and y errors do not change from point-to-point, and when the errors themselves are not correlated. ODR also fails to handle scaled data i.e. slopes and intercepts devised from ODR do not scale if the x or y data are scaled by some factor. Note that ODR is also called “major axis regression”.
  • GMR transforms x and y data and can thus scale estimates of the slope and intercept but works only under the condition when the ratio of the standard deviation of x to the standard deviation of the error on x is equal to that same ratio in the y coordinate.

Most importantly, and perhaps quite shockingly, NONE of these methods involve the actual measurement uncertainty from point-to-point in the construction of the ensuing regression. Essentially, each method is an algebraic approximation of York’s equations, and whereas his equations have to be solved numerically in their most general form, they provide the most unbiased estimates of the slope and intercept for a straight line. In 2004, York and colleages showed that his 1969 equations, (based on least-square estimation) were also consistent with (newer) methods based on maximum likelihood estimation when dealing with (correlated or uncorrelated) bivariate errors. Our paper in 2011 provides a relatively fast way to iteratively solve for the slope and estimate.

In our publication, besides the Pearson data, we also applied our algorithm to perform “force-fit” regression - a unique case where one point is almost exactly known (i.e. very little error and near-infinite weight) - on meteorite data and showed that our results were consistent with published data.

All in all, if you want to fit a line through a bunch of points in an X-Y space, you won’t be steered too far off course by using our algorithm.

References

Uncertainty in paleorecords built from foraminifera

PSU Solver applied to a paired record of planktic foraminiferal measurements in the Pigmy Basin.

PSU Solver applied to a paired record of planktic foraminiferal measurements in the Pigmy Basin.

We have a new paper out in Paleoceanography which describes a new computational toolkit that I built during my Ph.D. to characterize uncertainty in foraminiferal reconstructions of temperature and salinity. The toolkit, called Paleo Seawater Uncertainty Solver (or PSU Solver), constrains analytical, sampling, calibration, dating, and preservation-based uncertainty in reconstructions that use paired measurements of stable isotopes of oxygen (δ18O; apologies that I cannot make letters superscript in this interface) and magnesium-to-calcium (Mg/Ca) ratios in foraminifera (or forams: a type of plankton that deposit calcium carbonate shells). In the paper, we show how signficant this characterization of uncertainty can be while inferring climate processes in the past using forams.

δ18O has been measured in foram-calcite for over half-a-century to understand past climatic processes. In a closed system, changes in foram-δ18O ought to be governed by thermodynamics. However, in the open ocean, the foram-δ18O composition reflects not just the temperature of the seawater, but also the δ18O content of the seawater in which the foram grew its calcite shell (δ18Osw; where sw is in subscript and stands for 'seawater'). This parameter, δ18Osw, which may vary independent of temperature, can be used as a proxy for salinity changes, and on longer timescales, a proxy for ice-volume (more info here and here). The problem is, if you only measure the δ18O forams, how do you know which changes are due to temperature and which changes are due to δ18Osw of the ocean? In other words, the foram-δ18O signal is convoluted by temperature and salinity/ice-volume signals.

This is where the Mg/Ca paleothermometer comes into the picture: over the last two decades, researchers have shown that Mg/Ca ratios in forams can give quantitative insights into past seawater temperature changes i.e. foram-based Mg/Ca varies as a function of temperature! This appears to solve the deconvolution problem – if one measured both Mg/Ca and δ18O on co-deposited forams, then you could solve for δ18Osw and temperature of the seawater where the forams lived. Mathematically,

  • δ18O = g(T, δ18O­sw)
  • Mg/Ca = f(T)

So, with both of these measurements on the foram shells, one could tease out the δ18Osw and temperature of the seawater to infer past climate change. This technique has been used successfully to understand a variety of significant climate processes (we list many of them in our paper).

However, as more nuanced calibrations (i.e. as functions f and g above keep getting updated) and inferences are made from these paired measurements, and as climate model-data comparisons increasingly demand for more quantitative paleorecords, there is a strong need to quantify the uncertainty in the above deconvolution of temperature and δ18O­sw. Transferring analytical (/measurement/instrumental) errors, sampling uncertainty due to the number of foraminifera used in the measurement, calibration uncertainties, preservation effects, and age-model uncertainties into temperature and δ18O­sw space is not straightforward due to a variety of reasons, one of which is that the Mg/Ca-T relationship is non-linear in forams. Theoretical error propagation exercises predict uncertainties way too large to interpret even glacial-interglacial signals where such practices cannot handle complex issues such as the influence of salinity on Mg/Ca variability and the non-stationarity of the relationship between salinity and δ18O­sw.

In any case, all these things call for a computational approach to address quantitative uncertainty in foram paleoclimate. This is what PSU Solver does. The code is written in MATLAB and is available here, on Mathworks, or on GitHub. It can produce uncertainty profiles for temperature, δ18O­sw, and salinity paleorecords from paired foramininferal δ18O­ and Mg/Ca datasets. It can be as customized as the user wants it to be (or as simple as a user wants it to be) where the user dictates all options that they require. The user input can be as easy as time, δ18O,­ and Mg/Ca, if so desired.

Go ahead, download it, and try to play around with the code (here are the help files)! It can easily be coupled to more sophisticated algorithms or different sampling schemes as well. Furthermore, there’s an option to couple output from BACON to incorporate radiocarbon-based uncertainty in PSU Solver as well. I will have a post up shortly that explains how PSU Solver works in detail soon. Until then, here’s the type of plot that you could be producing right now (paleorecord from the Pigmy basin, Gulf of Mexico). Go try it out! We show that particular excursions in your reconstructions may (or may not) be sensitive to easily-changable options in the PSU Solver algorithm, and hence be an artifact of transposition issues. Do let me know if you have any questions/doubts/criticisms!

P.S. If you are not a super expert in coding (which I am not either), PSU Solver has been written keeping us in mind: it is not meant to intimidate! Try it out...! Thanks to Julie Richey for being patient with the algorithm and also to Chris Maupin for testing PSU Solver.

Sticky Statistics: Getting Started with Stats in the Lab

Courtesy:   xkcd

Courtesy: xkcd

A strong grasp of statistics is an important tool that any analytical laboratory worker should possess. I think it is immensely important to understand the limitations of the process by which any data is measured, and the associated precision and accuracy of the instruments used to measure said data. Apart from analytical constraints, the samples from which data are measured aren't perfect indicators of the true population (true values) and hence, sampling uncertainty must be carefully dealt with as well (e.g. sampling bias).

In most cases, both analytical (or measurement) uncertainty and sampling uncertainty are equally important in influencing the outcome of a hypothesis test. In certain cases, analytical uncertainty may be more pivotal than sampling uncertainty, whereas in others, sampling uncertainty may prove to be more influential to the outcome while testing a hypothesis. Regardless, in all these cases, both analytical and sampling uncertainty must be accounted for when testing (and conceiving) a hypothesis.

Consider a paleoclimate example where we measure stable oxygen isotopes in planktic foraminiferal shells with a mass spectrometer whose precision is 0.08‰ (that's 0.08 parts per 1000), based on known standards. With foraminifera, we take a certain number of shells (say, n) from a discrete depth in a marine sediment core and obtain a single δ18O number for that particular depth interval. This depth interval represents Y years, where Y can represent decades to millennia depending on the sedimentation rate at the site where the core was collected. The lifespan of foraminifera is about a month (Spero, 1998). Therefore the measurement represents the mean of n months in Y years. It does not give you the mean of the continuous δ18O during that time interval (true value). Naturally, as n increases and/or Y decreases, the sampling uncertainty decreases. There may be several additional sampling complications such as the productivity and habitat of the analyzed species' shells that may bias the data to say, summer months (as opposed to a mean annual measurement), or deeper water δ18O (as opposed to sea-surface water) etc. Hence, both foraminiferal sampling uncertainty (first introduced by Schiffelbein and Hills, 1984) along with the analytical uncertainty must be considered while testing a hypothesis (e.g. "mean annual δ18O signal remains constant from age A to age D" - the signal-to-noise ratio invoked by your hypothesis will determine which uncertainty plays a bigger role).

Here are two recent papers that are great starting points for working with experimental statistics in the laboratory (shoot me an email if you want pdf copies):

  1. Know when your numbers are significant - David Vaux
  2. Importance of being uncertain - Martin Krzywinski and Naomi Altman

Both first-authors have backgrounds in biology, a field which I am led to believe that heinous statistical crimes are committed on a weekly (journal) basis. Nonetheless, statistical crimes tend to occur in paleoclimatology and the geosciences too (and a myriad of other fields too I'm sure). The first paper urges experimentalists to use error bars on independent data only:

Simply put, statistics and error bars should be used only for independent data, and not for identical replicates within a single experiment.

What does this mean? Arvind Singh, a friend and co-author at GEOMAR (whom I have to thank for bringing these papers to my attention), and I had an interesting discussion that I think highlights what Vaux is talking about:

Arvind: On the basis of Vaux's article, errors bars should be the standard deviation of 'independent' replicates. However, it is difficult (and almost impossible) to do this for my work, e.g., I take 3 replicates from the same Niskin bottle for measuring chlorophyll but then they would be dependent replicates so I cannot have error bars based on those samples. And as per Vaux's statistics, it appears to me that I should've taken replicates from different depths or from different locations, but then those error bars would be based on the variation in chlorophyll due to light, nutrient etc, which is not what I want. So tell me how would I take true replicates of independent samples in such a situation. I've discussed this with a few colleagues of mine who do similar experiments and they also have no clue on this.

Me: I think when Vaux says "Simply put, statistics and error bars should be used only for independent data, and not for identical replicates within a single experiment." - he is largely talking about the experimental, hypothesis-driven, laboratory-based bio. community, where errors such as analytical error may or may not be significant in altering the outcome of the result. In the geo/geobio community at least, we have to quantify how well we think we can measure parameters especially field-based measurements, which easily has the potential to alter the outcome of an experiment. In your case, first, what is the hypothesis you are trying to put forth with the chlorophyll and water samples? Are you simply trying to see how well you can measure it at a certain depth/location such that an error bar may be obtained, which will subsequently be used to test certain hypotheses? If so, I think you are OK in measuring the replicates and obtaining a std. dev. However, even here, what Vaux says applies to your case, because a 'truly independent' measurement would be a chlorophyll measurement on a water sample from another Niskin bottle from the same depth and location. This way, you are removing codependent measurement error/bias which could potentially arise due to sampling from the same bottle. So, in my opinion, putting an error bar to constrain the chlorophyll mean from a particular depth/location can be done using x measurements of water samples from n niskin bottles; where x can be = 1.

While Vaux's article focuses on analytical uncertainty, the second paper details the importance of sampling uncertainty and the central limit theorem. The Krzywinski and Altman article introduced me to the Monty Hall game show problem, which highlights that statistics can be deceptive on first glance!

Always keep in mind that your measurements are estimates, which you should not endow with “an aura of exactitude and finality”. The omnipresence of variability will ensure that each sample will be different.

In closing, another paper that I would highly recommend for beginners is David Streiner's 1996 paper, Maintaining Standards: Differences between the Standard Deviation and Standard Error, and When to Use Each, which has certainly proven handy many times for me!