PROBABILITIES IN EXPERIMENTAL PHYSICS: EPISTEMIC LESSONS AND CHALLENGES

There is one way in which Nature responds to the questions we direct at her about the correctness of our understanding of her ways: by means of experiments. In this paper, the pivotal role probability theory plays in experimental physics is presented: it allows us to combine observations that are seemingly analytically incompatible. The main concepts used for this task are introduced and explained. A brief historical sketch of the development of some of these concepts is drawn and used as a case study to defend the position that physics and philosophy are interlinked affairs. Some philosophical consequences of how the intrinsically probabilistic character of experimentation reverberates in our epistemic access to the world are also drawn.


Introduction
"What is probability?" is a question, one could argue, no one can answer. This statement is supported by the vast literature produced by philosophers and scientists struggling (A) to come up with a non-circular definition of "probability" and (B) to interpret its meaning (Sklar, 1979). The persistence of such debates to this day is, by itself, astonishing if we consider that probability theory had its bases built (at least) more than three hundred years ago and has, since then, been an indisputably successful tool for assessing the world. Such success makes probability an object of interest for a wide range of academics from different areas. Philosophers, for instance, often write about its interpretation and definition (Earman, 1992). Scientists, on the other hand, use probability on a daily basis to interpret experiments, to solve problems and even to forge core concepts of theories (Von Weizsäcker, 1973; Sachkov, 1928). Looking at this scenario, it is clear that an opportunity arises from the debate: the discussion about the meaning and definition of probability is a battlefield where philosophers and scientists fight side by side. Consequently, the history of probability is an experiment in which epistemologists can find high-quality data; it is a rich source of information about the intersection between philosophy and science. From this perspective, this article aims to present (a) a summary of how scientists use probability for measuring physical quantities, including constant parameters of functions, and (b) a brief description of the history of the development of some of these mathematical tools. Once (a) and (b) are completed, we shall advance towards (c) identifying which features of the presented facts are of interest to academics working at the intersection between science and philosophy, and which questions these features may help to answer.

Measurement and uncertainty
Our first task is to present the status quo regarding how scientists measure physical quantities and what role probability plays in measurements. The content of this section is no more than a summary of what physics undergraduates usually learn in their first couple of experimental physics courses. The same facts can be found in any introductory textbook on data analysis (see, for instance, Vuolo, 1996; Taylor, 1997).
To begin, suppose that, in the context of an experiment, a physical quantity, call it w, must be measured. We can imagine, for simplicity, that w is the size of an object. To measure it, it is necessary to choose a measuring instrument: a measuring tape, a ruler, a caliper or a micrometer, for instance.
After choosing the instrument, the measurer measures the object and gets a result w1. However, as strange as this may seem to those not used to experimentation, if another measurement is performed on the same object, a different result w2 will be obtained. Repeating the procedure N times will yield a set (w1, w2, …, wN) of measurements.1 We can represent this set in the form of a histogram, i.e., a column graph representing how many results were obtained between the real numbers2 (x1, x1+Δx), (x1+Δx, x1+2Δx) and so on. Note that this distribution enables a quick calculation of relative frequencies. The relative frequency of a measurement being within a certain interval is the height of the column relative to this interval over the sum of the heights of all columns.
We can abstract the situation to a continuous case with infinitesimally small intervals and represent the histogram by a curve. In this case, the relative frequency of an interval will be given by the area under the curve delimited by that interval over the total area under the curve, i.e., the relative frequency equals the relative area. If we take relative frequencies to represent probabilities, then the area under the curve delimited by an interval is the probability of a measurement being within that interval.
For this reason, this kind of curve is called a probability density function (PDF). Now, we can finally understand that the final result of our experiment is not a single numerical value from the set of real numbers, as one would expect. It is, rather, a distribution, which we commonly represent by a PDF.
There is, nonetheless, a way of representing (at least partially) the PDF with a couple of real numbers. If the curve is symmetric, for instance, it is intuitively clear that we can represent the distribution by its central value (mean) plus a number representing the distribution's "width". In fact, this is exactly what scientists do. Usually, experimental results are presented in the form w̄ ± σ, where w̄ is the arithmetic mean of w over the N measurements and σ, called the "standard deviation", is nothing more than an indication of how far the data fluctuate around the mean. One way of grasping the meaning of σ is the following: it is expected that, if another measurement is performed, it will give a result whose difference from the mean is not much larger than σ.
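To make the procedure concrete, here is a minimal sketch in Python (not part of the original text; the true value, the spread and the bin width are arbitrary choices for illustration) that builds a set of simulated measurements, the relative-frequency histogram and the summary w̄ ± σ described above.

```python
import math
import random

# Simulate N repeated measurements of a size w (hypothetical values:
# true value 10.0 and spread 0.2, chosen only for this illustration).
random.seed(0)
N = 1000
measurements = [random.gauss(10.0, 0.2) for _ in range(N)]

# Arithmetic mean of w over the N measurements.
mean = sum(measurements) / N

# Standard deviation: how far the data fluctuate around the mean.
sigma = math.sqrt(sum((w - mean) ** 2 for w in measurements) / (N - 1))

# Histogram: count how many results fall in each interval (bin) of width dw.
dw = 0.1
lo = min(measurements)
counts = {}
for w in measurements:
    k = int((w - lo) / dw)          # index of the bin containing w
    counts[k] = counts.get(k, 0) + 1

# Relative frequency of a bin = its count over the total number of results.
rel_freq = {k: c / N for k, c in counts.items()}

print(f"result: {mean:.2f} +/- {sigma:.2f}")
```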
So far, we have given a description of what we have called an "experimental result" and we have explained how it ought to be expressed in mathematical terms. If, however, we want to use such a result to, say, falsify3 a theory or model, then an interpretation of its meaning must be presented. That meaning can be expressed as follows: the existence of a true value4 w0 for the quantity w is assumed; the direct measurement of w0 is, however, impossible due to (i) the fallibility of the measurer, (ii) the fallibility of the measuring instrument and (iii) environmental interference with the measured object.
The "width" of the distribution is determined by these three factors combined. The only course of action the experimenter can take in this scenario is to extract from the distribution the best possible estimate of w0 and, more importantly, to assess how good this estimate is. This is done by putting the so-called central limit theorem (CLT) to use. The CLT implies that the means of sets of N measurements of w are themselves distributed according to a Gaussian curve centered on the true value, with standard deviation σ/√N; the standard deviation of this Gaussian distribution is called the uncertainty. Consequently, the probability that a new experimental result w̄, obtained by measuring the size w N times, lies within a certain interval of the true value w0 can be calculated. That is because the CLT yields the PDF, the Gaussian curve, that describes the distribution of means, allowing the exact calculation of areas under the curve, that is, of probabilities.
Figure 1: illustration of the central limit theorem. The distributions of the measurements do not need to be Gaussian for the distribution of the means to be Gaussian.
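The behavior illustrated in Figure 1 can be checked numerically. The following hypothetical sketch (all parameters are arbitrary choices for the illustration) draws the means of many simulated experiments from a non-Gaussian, uniform parent distribution and compares their spread with σ/√N.

```python
import math
import random

# Even when single measurements follow a non-Gaussian (here uniform)
# parent distribution, the means of sets of N measurements cluster
# around the true value with a spread of sigma / sqrt(N).
random.seed(1)

N = 25          # measurements per simulated experiment
trials = 2000   # number of repeated experiments

# Uniform parent distribution on [9, 11]: sigma = width / sqrt(12).
parent_sigma = 2.0 / math.sqrt(12)

means = []
for _ in range(trials):
    sample = [random.uniform(9.0, 11.0) for _ in range(N)]
    means.append(sum(sample) / N)

grand_mean = sum(means) / trials
spread = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / (trials - 1))

print(f"spread of means: {spread:.4f}")
print(f"sigma / sqrt(N): {parent_sigma / math.sqrt(N):.4f}")
```

The two printed numbers should agree closely, which is the quantitative content of the uncertainty of the mean.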
We close this section by making the following remark: falsification of theories and comparison between experimental results can only be expressed in terms of probability. An experiment attempting to falsify a theory can only yield conclusions such as "there is a high (or low) probability that such-and-such theory is compatible with the experimental results" and nothing of greater epistemological import.
Probabilistic considerations, note, condition our very epistemic relation with experience.

Least squares method
The least squares method (LSM) is a method for measuring parameters of functions or, equivalently, for measuring functions. Its importance can be summed up as follows: when one has data in one's hands, one can depict it in a graph that makes explicit important physical traits of such data (the velocity of an object as a function of its position, say). Graphical analysis allows one to grasp important knowledge about the data depicted. There are, however, many ways functions can be drawn. More precisely, even when one knows one's data can be described by a certain kind of function, a criterion is still needed for choosing, among all functions of that kind, the one that best describes the data. At first glance, LSM may appear to have nothing (or little) to do with the quantification of the uncertainty of one single physical quantity, but there are at least three strong connections between them: (I) a single physical quantity can be thought of as the particular case of a function f(x) = constant; (II) parameters of functions must also have uncertainties associated with them; (III) historically, LSM and the uncertainty concept are closely related. Without further delay, let us see how the method works.
In the previous section, we saw that the usual measurement procedure for a physical quantity leads to a set of data which is interpreted as being not identical to, but distributed around, the true value of this quantity. Now, let us use this language in a situation where a function is to be measured, i.e., where the data is a set of pairs (xi, yi); x here denotes the independent variable and y the dependent one. Reproducing the previously presented argument, the parameters which determine the function are also taken to have true values. Consequently, for each value the variable x can take, y will also have a true value. Given that all we can do is measure fluctuations around true values, however, it follows that each measured pair (xi, yi) is a snapshot of the fluctuation around the function. In this scenario, our job is to make the best possible estimate of the "true function" (the true values of the function's parameters).
The first thing to do is to establish which function we are trying to measure, i.e., which kind of function best describes the behavior of the data. Let us take the simple case of the equation of a straight line:

f(x) = ax + b. (1)

To make it concrete, think of a body in uniform linear motion. In this case, x denotes time and f(x) denotes the position (as a function of time, of course). These are the quantities one will measure: at the end of the day, one will have a data set of measured pairs (x, f(x)). a and b, in turn, denote the velocity and the initial position, respectively (not directly measured). These are the parameters one seeks to quantify indirectly. Assuming the existence of true values for the parameters a and b, we conclude that, for each measurement xi of x, there is an f(xi) representing the true value of the quantity y; let us call it μi. However, our measurement yi is just a snapshot of the unavoidable fluctuation around μi.
We call "error", and denote by Ei, the difference between the measurement yi and the true value μi:

Ei = yi − μi. (2)

From a practical point of view, this definition of error is not very useful, because we never know the true value μi. The closest we can get to quantifying the error is to quantify the difference between the measurement yi and the estimated value f(xi). This difference is called the residual:

Ri = yi − f(xi). (3)

We can also define, so to speak, a kind of total residual for the whole set of measurements. One convenient way of doing so is to take the sum over i of Ri² divided by the squared uncertainty σi² of yi:

χ² = Σi Ri²/σi². (4)

But why should we do so? Well, (4) allows us to define a criterion for estimating a and b: the best estimates for a and b are those which minimize the value of χ². The way we defined χ² implies that it has an expected value, which allows an assessment of the quality of the estimates. The denominator in expression (4) works as a statistical weight for each term of the sum: the more precise the measurement, the greater its weight in the sum.
Our final task is to find the expressions for a and b which minimize χ². Once χ² is defined, finding the needed expressions is just a simple calculus exercise. The value of a for which χ² is minimum is that which satisfies the equation

∂χ²/∂a = 0, (5)

and the same is true for b:

∂χ²/∂b = 0. (6)

Solving these equations yields the expressions for the estimates of a and b, which depend only on the measured (xi, yi) values and their uncertainties.
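As an illustration of the calculus exercise just mentioned, the following Python sketch implements the closed-form solution of the minimization for the straight-line case. The function name fit_line and the sample values are hypothetical, chosen only for the example; the formulas are the standard weighted least squares expressions for a line.

```python
# Minimize chi^2 = sum_i (y_i - a*x_i - b)^2 / s_i^2 by solving
# d(chi^2)/da = 0 and d(chi^2)/db = 0 in closed form.

def fit_line(xs, ys, sigmas):
    """Return (a, b): weighted least squares slope and intercept."""
    w = [1.0 / s ** 2 for s in sigmas]      # statistical weights
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx ** 2
    a = (S * Sxy - Sx * Sy) / delta          # slope (e.g. velocity)
    b = (Sxx * Sy - Sx * Sxy) / delta        # intercept (e.g. initial position)
    return a, b

# Noiseless check: points lying exactly on y = 2x + 1 are recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys, sigmas=[0.1] * 4)
print(f"a = {a:.3f}, b = {b:.3f}")  # -> a = 2.000, b = 1.000
```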
The reader must keep in mind that the quality of the least squares estimate must itself be assessed. A detailed presentation of how such an analysis works in practice would take us too far astray, however, and will not be given here. It suffices to note that an assessment of an LSM fit is done by checking whether the obtained value of χ² is probabilistically compatible with its PDF. Empowered by such tools, note, we are able not only to assess how well the experiment was performed and how the experimental results reflect on the theory, but also to guarantee that such an analysis is solid. The problems displayed in sections two and three are, in fact, the same: combining observations (elements of a data set) which are apparently analytically incompatible.5 The short introduction given above may hide the nontrivial character of this problem. Its complexity is made explicit by acknowledging the long history behind its solution, a history that is briefly sketched in the next section, with focus on the development of LSM. With such a sketch we hope to pave the way and give ground to the philosophical and epistemological considerations drawn in section 5.
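The compatibility check mentioned above can be sketched in simplified form. Rather than computing the full χ² PDF, the hypothetical example below compares the obtained χ² with its expected value, which equals the number of degrees of freedom (number of data points minus number of fitted parameters); all names and data are illustrative.

```python
# Simplified quality check for a straight-line fit f(x) = a*x + b.

def chi_square(xs, ys, sigmas, a, b):
    """chi^2 for the straight-line model f(x) = a*x + b."""
    return sum(((y - (a * x + b)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # fluctuating around y = 2x + 1
sigmas = [0.15] * 5

chi2 = chi_square(xs, ys, sigmas, a=2.0, b=1.0)
dof = len(xs) - 2                 # 2 fitted parameters: a and b

# chi2/dof far from 1 signals a bad model or mis-estimated uncertainties.
print(f"chi2 = {chi2:.2f}, expected about {dof}")  # -> chi2 = 4.89, expected about 3
```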

A brief history of the conception of uncertainty
The problems of determining the value of a physical quantity from a set of measurements of this quantity and of determining parameters of functions from a set of measurements both date from at least 300 BCE. For instance, between 500 and 300 BCE, Babylonians developed mathematical tools, which required the estimation of parameters, for calculating the motion of some celestial bodies as a function of time. Unfortunately, no material remains indicating how such estimates were made; it is only clear that they had to perform estimations somehow in order to use their tools. Another example is reported by Manitius (1913): Hipparchus' endeavor, in the second century BCE, to determine whether or not the passage of the Sun through the same solstitial point is truly periodic. Hipparchus concluded that the Sun does not pass the same point periodically by comparing his measurements with an error estimate of his own creation. He determined that the maximum variation in the duration of a year is ¾ of a day, whilst his measurements' error could not be higher than ¼ of a day. He did not establish a way of calculating a representative value from a set of measurements, nor did he construct a universal method for quantifying error; neither the mathematics nor the concepts were ripe enough for that. His procedure did contain, however, at least implicitly, the idea of "fluctuations of measurements" due to errors.
A very interesting historical case in which the necessity of comparing experimental values proved pressing is the so-called "trial of the Pyx", which was extensively studied by Stigler (1977).
The trial of the Pyx is an event which has occurred from time to time in Great Britain since at least 1248. Its purpose is to evaluate the quality of the coins produced by the Royal Mint. Even though the details of the trial have changed through time, its general aspects can be summarized as follows: every day, one coin out of a pre-established number of coins is taken from the Mint's production and stored in a box called the "Pyx". After two or three years of repeated storage, the trial happens; the Pyx is opened and an assessment of the stored coins is performed in order to check whether the coins meet predetermined standards regarding their weight and fineness. If they do not, the master of the Mint could face severe punishment.
The trial, in a nutshell, is a straightforward case of estimating a population out of a sample; and we know this cannot be done without considering statistical fluctuations. What is particularly interesting here is that they did, in fact, account for statistical fluctuations, even if in a rudimentary manner: a remedy (tolerance) was allowed, and the coins had to be within this remedy. Stigler (1977) concluded that the remedy was very permissive with the master of the Mint. The remedy was too large; a skilled master could enrich himself greatly by pocketing a small fraction of silver and gold while still staying within the remedy.
Ironically, the most prominent master of the Mint was Sir Isaac Newton, a mind no one would object to calling "skilled". Newton held the position from 1699 to 1727. It is a historical fact that he became wealthy during his years in service of the Royal Mint, which raises the question of whether or not he was taking advantage of the excessive tolerance. He actually faced charges regarding the fineness of the coins during the 1710 trial. However, de Villamil's (1931) and Craig's (1946) investigations lead us to believe Newton became wealthy as a result of his fair earnings and his financial management.
Newton's "not guilty" verdict raises another question: why did Newton not defraud the Royal Mint? Was he following a moral compass, or did he simply not realize the shortcoming in the trial's assessment? The answer is unclear, but Newton's last work (The Chronology of Ancient Kingdoms Amended, published posthumously in 1728) allows for speculation that he did know something about error theory. The mentioned work, a report to Princess Caroline, is an attempt to estimate the mean duration of reigns. Newton had at his disposal a table with the mean reign durations of 12 kingdoms, with values ranging from 11.6 (Babylon) to 25.18 (Egypt) years. In the absence of statistical knowledge, we can speculate, one would state that the mean duration of a reign is something in the neighborhood of 18.4 ± 6.8 years, i.e., the average between the extremes ± half the difference between the extremes; or the arithmetic mean ± half the difference between the extremes; or just a straightforward "from 11.6 to 25.18". However, Newton's assertion was that kingdoms last "about eighteen to twenty years".
Here is the astonishing part: the calculation, using Newton's table, of the arithmetic mean ± uncertainty, as presented in section 2, leads to 19.1 ± 1.0 years! Saying that Newton had all of error theory solved is certainly too much of an extrapolation. However, at the risk of being accused of whiggish historicism, one can speculate that it is very likely that Newton understood, at least intuitively, the inverse relation between uncertainty and the size of the data set.
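For the reader who wishes to verify the naive estimates quoted above, here is a small computation using only the two extreme values given in the text (11.6 and 25.18 years); it reproduces the 18.4 ± 6.8 figure.

```python
# Naive "average between extremes +/- half the difference between extremes",
# using only the two table endpoints quoted in the text.
lo, hi = 11.6, 25.18

midpoint = (lo + hi) / 2          # average between the extremes
half_range = (hi - lo) / 2        # half the difference between the extremes

print(f"{midpoint:.1f} +/- {half_range:.1f}")  # -> 18.4 +/- 6.8
```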
Let us now present the historical facts more directly connected to the development of tools for combining observations. Before the late XVI century, it was common procedure, when one wanted to compare a result with a set of results, to arbitrarily choose values within that set to draw conclusions. Plackett (1958) states that Tycho Brahe, in the 1580s, appears to have been the first to combine measurements in order to obtain a single value of a physical quantity. Brahe measured the right ascension6 of the star α Arietis using different techniques and calculated the arithmetic mean of the obtained values, in a clear attempt to remove systematic errors7 from his results (Plackett, 1958). By 1750, some astronomers had realized that combining observations by means of the arithmetic mean could somehow be advantageous, but they would only combine measurements considered to have the same accuracy (Stigler, 1986: 16), i.e., measurements performed under exactly the same conditions (same measurer, time, place, instrument etc.). Possibly, Roger Cotes was the first to express this idea objectively, in a posthumously published work of 1722.

6 α Arietis is the brightest star in the northern zodiacal constellation of Aries. Right ascension is, together with declination, a celestial coordinate for indicating a point on the celestial sphere.

7 Systematic error is an error generated by an accuracy deficiency in an experiment. A clear example of systematic error is given by an uncalibrated instrument, which will cause an equal shift from the true value in all elements of a data set. Unlike in the uncalibrated-instrument example, however, systematic errors sometimes cannot be avoided.
Leonhard Euler's study of Saturn and Jupiter shows clearly that, by the mid-XVIII century, combining observations was still not a well-established idea (Stigler, 1986: 25). In 1748, the Academy of Sciences in Paris announced a prize for whoever would provide the best explanation for the inequalities observed in the orbits of Saturn and Jupiter. Over 50 years earlier, Halley had observed that the former planet appeared to be retarding while the latter was accelerating; he also proposed that the mutual attraction between the planets was the reason behind these inequalities. Euler, in his work of 1749, engaged in solving this problem. Assuming that Saturn and Jupiter orbit the Sun elliptically and that the ellipses are not in the exact same plane, Euler came up with an equation involving fifteen quantities, seven of which were directly observable variables, while the other eight were constants that could only be extracted by fitting. Euler then faced an embarrassment of riches: he had seventy-five sets of observations, i.e., equations with experimental values, at his disposal, but only eight unknowns to find. He combined small sets of equations with similar coefficients, subtracting one from the other in an attempt to make those coefficients disappear. However, there were not enough observations with similar coefficients, leading him to a dead end. Even though Euler managed to win the prize offered by the Academy, he clearly failed to provide any meaningful way of combining observations. His failure catches the eye even more when contrasted with Tobias Mayer's successful work, published just one year later.
Like many great scientific endeavors of the eighteenth century, Mayer's work was also about astronomy and was closely related to the technology and matters of state of the time (Stigler, 1986: 16). In 1714, seeking to improve the localization of ships at sea, England established the "commissioners for the discovery of longitude at sea", an institution that offered prizes to those who would help find a way of determining longitude at sea. In 1747, Mayer engaged in solving the commissioners' problem by studying the Moon, a task that led to a publication, in 1750, on the libration of the Moon.8 During the two preceding years, Mayer had performed measurements of the position of some lunar features, which allowed him to infer characteristics of the lunar orbit. His observations were described by a linear equation with 3 measurable variables and 3 unknown constants. Therefore, he only needed three observations to solve the problem mathematically. His measurements, however, were far more numerous than that.