How calibrated are you?

epistemic status: write up of a presentation I will give/am currently giving/gave to my flatmates

Intro

In thinking about future events, we have to deal with uncertainty. We are not sure if X will happen, so we have to make a probability estimate. We have to quantify our uncertainty. Usually this happens in a coarse-grained way and without much thought (“I think it is likely/not likely that…”). Being precise about our uncertainty (e.g. “40%” rather than “maybe”) helps us think more clearly and makes our predictions falsifiable.

We cannot always be right. We are working under the constraint of incomplete information and limited computation power and time, so we have to take guesses and give probability estimates. We cannot change the fact that we live in an uncertain world, so we will never predict every future outcome correctly, but we can improve our calibration. Calibration here means the reliability of a probability estimate: if you give an estimate of 70% on 10 events, you should be correct on about 7 of those events. If your guess was right on all 10 events, you are underconfident. Likewise, if you were right only four times, you are overconfident. The two have different real-world effects, but we want to avoid both.

In giving estimates, we can either give discrete probabilities for binary or categorical outcomes (e.g. yes/no, win/loss/draw), or we can use confidence intervals for a continuous variable (e.g. “how many liters of water will I drink this week?” “90% sure it is between 12 and 24 liters”). In the following, we will focus on discrete probabilities. You can also turn any continuous question into a discrete one by choosing suitable “buckets” of intervals as the discrete options.
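As a rough illustration of the bucketing idea, here is a minimal Python sketch. The bucket boundaries and probabilities are hypothetical, loosely based on the water example above, and treating each bucket as "lower bound included, upper bound excluded" is an assumption made for simplicity.

```python
# Hypothetical buckets for "how many liters of water will I drink this week?"
# Each entry: (lower bound inclusive, upper bound exclusive, assigned probability).
buckets = [
    (0.0, 12.0, 0.05),
    (12.0, 24.0, 0.90),
    (24.0, float("inf"), 0.05),
]

# The probabilities over all buckets should sum to 1.
assert abs(sum(prob for _, _, prob in buckets) - 1.0) < 1e-9


def bucket_probability(buckets, outcome):
    """Return the probability assigned to the bucket that contains the outcome."""
    for low, high, prob in buckets:
        if low <= outcome < high:
            return prob
    raise ValueError("outcome falls outside every bucket")
```

Once the week is over, `bucket_probability(buckets, liters_drunk)` gives the probability you had placed on what actually happened, which can then be scored like any other discrete prediction.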

Scoring Rules

To measure one’s calibration, we can use scoring rules. These generate a single numeric value, based on your estimates and the actual outcomes, that indicates the quality of your prediction. Taken over many different predictions, they give a sense of how calibrated you are.

There are infinitely many possible scoring rules. One simple example is the logarithmic score, where you take the natural logarithm of the probability you assigned to the outcome that actually happened. E.g. if you guessed 80% and the event happened, ln(0.8) ≈ -0.22. The score is never positive, and the goal is to get as close to zero as possible. $S(\mathbf{p}, i) = \ln(p_{i})$
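A minimal Python sketch of the logarithmic score for yes/no questions follows. The function names are made up for this example, and forecasts are assumed to be the probability given to "yes", with outcomes recorded as 1 (happened) or 0 (did not happen).

```python
import math


def log_score(probability_of_outcome):
    """Natural log of the probability assigned to the outcome that happened."""
    return math.log(probability_of_outcome)


def mean_log_score(forecasts, outcomes):
    """Average log score over several yes/no predictions.

    forecasts: probabilities assigned to "yes"
    outcomes:  1 if the event happened, 0 if it did not
    """
    scores = []
    for forecast, outcome in zip(forecasts, outcomes):
        # Probability that was placed on what actually happened.
        p = forecast if outcome == 1 else 1.0 - forecast
        scores.append(math.log(p))
    return sum(scores) / len(scores)
```

Note that a forecast of exactly 0% or 100% that turns out wrong makes `math.log` raise an error, which mirrors the fact that the logarithmic score punishes misplaced certainty without limit.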

We now want to look specifically at one famous scoring rule which has the property of being a “proper scoring rule”: one where participants maximize their expected payout by reporting their true beliefs. In expectation, they will always perform worse if they make a prediction that does not match their real belief. One scoring rule that satisfies this property is the so-called Brier Score.
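To see why a quadratic penalty rewards honesty, here is a short sketch for a single yes/no question. It assumes (these symbols are not part of the original presentation) that your true belief is $q$, that you report $p$, and that the penalty is the squared difference between report and outcome, where lower is better. Your expected penalty is then

$$E(p) = q\,(1-p)^{2} + (1-q)\,p^{2}, \qquad \frac{dE}{dp} = -2q(1-p) + 2(1-q)p = 2(p-q).$$

The derivative is zero only at $p = q$, and the second derivative is $2 > 0$, so under these assumptions reporting your true belief is the unique way to minimize the expected penalty.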

Brier Score

The Brier Score is a type of quadratic scoring rule: we take the squared difference between each estimate and the actual outcome (it happened = 1, didn’t happen = 0), add these up, and divide by the number of predictions we looked at. See the equation for it here, where N = total number of predictions, $f_t$ = assigned probability of the outcome for item t, and $o_t$ = actual outcome of item t: $BS=\frac{1}{N}\sum\limits_{t=1}^{N}(f_{t}-o_{t})^{2}$

If we calculate this, we get a score between 0 and 1 where we want to get as close to 0 as possible, which would mean perfect predictions. Always guessing 50% would result in a score of 0.25, always confidently picking the wrong choice would result in a score of (or close to) 1.
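The calculation can be sketched in a few lines of Python. The function name and the input format (probabilities for "yes", outcomes as 1 or 0) are choices made for this example.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecasts and outcomes.

    forecasts: probabilities assigned to "the event happens"
    outcomes:  1 if the event happened, 0 if it did not
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Always guessing 50% gives 0.25, whatever happens.
always_fifty = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Being fully confident and wrong every time gives 1.
always_wrong = brier_score([1.0, 0.0], [0, 1])

# Saying 70% on ten events and being right on seven gives roughly 0.21.
well_calibrated = brier_score([0.7] * 10, [1] * 7 + [0] * 3)
```

The last example shows that even a well-calibrated forecaster does not reach 0: a score of 0 requires saying 100% or 0% and being right every time.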

Plotting our stated probabilities against how often those predictions actually came true, we can see our calibration curves. The following image gives three example predicting behaviours: one is perfect (not achievable by humans), one is overconfident, one is underconfident. We all lean towards either over- or underconfidence, depending on personality and context.

[Figure: calibration curves for a perfect, an overconfident, and an underconfident predictor]
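The numbers behind such a curve can be computed by grouping predictions by their stated probability and checking how often each group came true. The following Python sketch assumes forecasts are rounded to the nearest 10% to form groups; the function name and the sample data are invented for illustration.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Group predictions by stated probability (rounded to one decimal).

    Returns a sorted list of (stated probability, observed frequency, count).
    """
    groups = defaultdict(list)
    for forecast, outcome in zip(forecasts, outcomes):
        groups[round(forecast, 1)].append(outcome)
    return [
        (stated, sum(results) / len(results), len(results))
        for stated, results in sorted(groups.items())
    ]


# Made-up data: ten predictions at 70% (seven came true)
# and five predictions at 90% (only three came true).
forecasts = [0.7] * 10 + [0.9] * 5
outcomes = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 2
table = calibration_table(forecasts, outcomes)
```

In this made-up data the 70% group is well calibrated, while the 90% group came true only 60% of the time, which would show up as overconfidence on the curve. With only a handful of predictions per group the observed frequencies are very noisy, so the table becomes informative only after many predictions.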

Getting calibrated

Getting more calibrated is as easy as making more predictions and getting feedback on them, so that you can update your world model. Some possibilities to do that:

Confidence interval training to increase precision of knowledge. By default we are bad at confidence intervals: studies have shown that when people give a “90% confidence interval,” the true answer falls inside the range only about 50% of the time. Some smart people developed a tool where you can give your confidence interval on a huge number of questions, see how calibrated you are, and improve in the process. Try it out here
Prediction Journal. Tracking private forecasts, for example with Cleodora
Prediction Markets. The same, but public, on larger questions and with the possibility to win (play) money. See Manifold Markets, for example
Betting on stuff. Use betting as a way to draw out the real probabilities you assign to certain outcomes. Talk is cheap, and people are often overconfident in their claims, probably to keep up the appearance of being confident. Having something on the line that you could lose might make you reevaluate your claims and be more honest. “Betting is a tax on bullshit”. Use this tool for example to automatically calculate payoffs based on your Brier Scores

Exercise

One way to try this out is by going through the following example questions. Give a prediction for each one and note it down in a spreadsheet or something similar. Bonus points if you also write a short sentence for each question about the thinking behind your probability assignment. Then, after two weeks, go through the questions and note down the actual result. Calculate the Brier score, but also think about each question and your reasoning. Where might it have gone wrong?
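If you keep your predictions in a spreadsheet, the scoring step can be done on a CSV export. The sketch below is one possible way to do it in Python: the column names (`question`, `probability`, `outcome`) and the three sample rows are hypothetical and only stand in for your own log.

```python
import csv
import io

# Hypothetical prediction log; questions and numbers are made up for illustration.
# probability: your estimate that the answer is "yes"
# outcome:     1 if it resolved "yes", 0 if it resolved "no"
LOG = """question,probability,outcome
Rain on more than 7 of 14 days,0.30,0
Thunderstorm in the next 14 days,0.60,1
Physical letter arrives,0.90,1
"""


def score_log(csv_text):
    """Return the Brier score of a prediction log given as CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    squared_errors = [
        (float(row["probability"]) - int(row["outcome"])) ** 2 for row in rows
    ]
    return sum(squared_errors) / len(squared_errors)


score = score_log(LOG)
```

To score a real file instead of the inline text, read it with `open("predictions.csv").read()` (the file name is again just a placeholder) and pass the result to `score_log`.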

The following example questions are more tailored towards the context in which I used this introduction first, i.e. with my flatmates. You can choose different questions, e.g. by letting your favorite LLM generate some for you. Big questions about world states are good to get calibrated on objective things, but I also like to predict behavioral stuff about myself, e.g. “will I go to the gym 3 times per week for a month” where you can influence the outcome but which can give valuable insight into how you work.

Will it rain on more than half the days (>7 of 14) in [city] over the next two weeks? Resolves via: weather.com / weather station data
Will the highest temperature recorded in [city] over the next 14 days exceed 25°C? Resolves via: DWD / weather station data
Will there be a thunderstorm in [city] at any point in the next 14 days? Resolves via: DWD storm records
Will a new film top the German cinema charts (Kinocharts) this coming weekend that wasn’t #1 last weekend? Resolves via: Blickpunkt:Film / kinokino.de
Will a new album debut at #1 in the German Albumcharts (Offizielle Deutsche Charts) in the next two weeks? Resolves via: offiziellecharts.de
Will any member of the flat cook a meal they have never cooked before in the next 14 days? Resolves via: Group honour system
Will the flat have a spontaneous guest stay overnight at least once in the next 14 days? Resolves via: Group honour system
Will anyone in the flat receive a physical letter (not a package or flyer) in the next 14 days? Resolves via: Check the mailbox
Will there be a power or internet outage in the flat lasting more than 15 minutes in the next 14 days? Resolves via: Group memory
Will the most-streamed song on Spotify Germany change at least once in the next 14 days? Resolves via: Spotify charts (charts.spotify.com)