Part 2, Types of Sampling

This is the second of a three-part series on sampling. The third part will come more quickly than the second part did. 🙂

There are four types of sampling: simple random, stratified, cluster, and systematic. I will give a brief definition as well as an example of each.

A simple random sample is the most basic. It is one in which every person of interest (the population) has an equal chance of being selected for the survey. We will talk more about it later.

Stratified sampling is one in which the population is divided into groups, and the sample is obtained with respect to the relative sizes of each group. For instance, if the sample is to come from individuals in either of two cities, and one city has 1000 people and the other city has 2000 people, then the sample would consist of 1/3 of its subjects (people) from the first city (since it has 1/3 of the total) and 2/3 from the second city.
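To make the arithmetic concrete, here is a minimal sketch in Python using the numbers from the example above (the function name and the sample size of 30 are mine, purely for illustration):

```python
# Proportional allocation for a stratified sample: each group's share of
# the sample matches its share of the total population.
def stratified_allocation(group_sizes, sample_size):
    total = sum(group_sizes)
    return [round(sample_size * g / total) for g in group_sizes]

# City 1 has 1000 people and city 2 has 2000, so a sample of 30 people
# takes 1/3 (10 people) from city 1 and 2/3 (20 people) from city 2.
print(stratified_allocation([1000, 2000], 30))  # [10, 20]
```

(With awkward group sizes the rounding can leave the allocations a person off from the target sample size; real survey software handles that more carefully.)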

Cluster sampling is similar to stratified sampling in that the population is divided into groups (called clusters in this case), but for cluster sampling, one (or more) of the clusters is chosen and represents the other clusters.

So, for example, assume that voters across the country are to be surveyed. Assume also that we would like to sample in proportion to the states’ populations. But instead of going to each state, the pollsters may sample from just a handful of states if it is believed that one state is representative of others. Perhaps they sample only from Oregon to represent the three states on the west coast.

Finally, and in no particular order, there is systematic sampling. Systematic sampling occurs when every kth member is selected (where k is some natural number, such as 4). So, for example, assume that one wants to sample hospital patients and is interested in patients in some 24-hour period, perhaps a Saturday. Assume also that about 400 patients are expected in a day, and that a sample of about 100 people is desired. Sampling every fourth person on the register sheet gets there. The advantage of this is that by sampling people throughout the day, they are more apt to avoid peculiarities related to time of day.

For example, if they sample the first 100 people who go to the hospital some Saturday morning, they might be getting a different type of patient. Perhaps the people going early in the day are more apt to be giving blood. Thus, if the sample is meant to ascertain the reasons people go in, it is likely to give a distorted picture.
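The every-kth-patient scheme above can be sketched in a few lines (the register of 400 patients is hypothetical, just mirroring the numbers in the example):

```python
# Systematic sampling: take every kth entry from an ordered list.
# Here k = 4, so a register of 400 patients yields 100 sampled patients.
def systematic_sample(register, k, start=0):
    return register[start::k]

register = [f"patient_{i}" for i in range(1, 401)]  # hypothetical register sheet
sample = systematic_sample(register, 4)
print(len(sample))  # 100
```

In practice the starting point is usually chosen at random from the first k entries, which keeps the spread-through-the-day property while adding a random element.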

Back to the simple random sample. When taking a simple random sample, it is usually impossible (for all intents and purposes) to give everyone an equal chance of being selected. For instance, in the example of polling for the Presidential election, I am not quite sure of the pollsters’ exact methods, but one thing I am sure of is that not every registered voter has an equal chance of being selected. If they poll by telephone, not everybody has a telephone (though in this day and age, just about everybody does), some people may not pick up their phone, and others, if asked, do not want to divulge whom they are leaning toward.

Also, just about any sample that is not a simple random sample is going to include a simple random sample (loosely speaking, as discussed in the paragraph above) within it. Take the example given above with stratified sampling. Perhaps it is not expedient to give everyone an equal chance of being selected. You might not have their names, their phone numbers, etc. But you try to make it as random as is reasonably possible.

At the crux of all sampling is bias, and specifically the ability to avoid it. Bias occurs when the sample is a distorted representation of the population.

In part 3, we will discuss the most famous case of bias, as well as the polling in the Presidential election.

Many aspects to discuss with that.  Everybody seems to have an opinion on it. 

Trying to make sense of all of this

You sure do hear a lot of numbers flying around with respect to COVID-19.  It is often hard to believe (or understand) what you hear from the media, as well as from the experts. 

What makes this complex is that there are many factors involved, and it is hard to tease out the importance of each as well as the relationship between each (i.e. is one factor mutually exclusive from another?).

What makes the understanding of all of this so important is that decisions, and many of them life altering, will be made from these numbers.

You hear all kinds of probabilities. From advocates of opening everything up, you might hear, “A person has a 99.3% chance of surviving COVID-19.” Is that accurate? Is that even good? How do you put all of this in context?

Currently, many states are trying to decide what restrictions to put on people and businesses.  To do this, we need to understand these numbers, and probability is involved in much of this as it usually is.

As I have mentioned before, and as my website name implies, probability is at the crux of so much of decision making.

So, where to start? In most places, most businesses are fully able to operate except where there may be too much of a crowd: places like concerts, games, and restaurants.

Let’s start with an assessment of restaurants.  Should restaurants, by and large, be open?

I will leave it to you to answer for yourself, based on my interpretation of the numbers (which will give a plausible range).

Perhaps the best way to go about this is to compare it to the flu.   The CDC estimates that about 50,000 people die each year from the flu. 

I think there should be no question that some kind of intervention had to happen. The fact is that there have been viruses that have killed tens of millions in the past. But with the hindsight of at least observing what has happened over the last few months, is it time to allow restaurants to open, and allow them to define their capacity with social distancing required? Note: I think it is too hard to require that face masks be worn (obviously, one cannot wear a mask while eating). We could require them to be worn when entering and leaving the building.

To date, about 100,000 Americans have died from COVID-19. This is with intervention. But many died before precautions were taken. In other words, over the last several months, you have had a ‘mixed bag’ of people dying, of all ages and all health types, both before intervention and after (though we are pretty certain that intervention has helped to some extent).

Again, to focus on restaurants being open for now, what if we allowed people to go with social distancing but without the face masks?

The great majority of the 100,000 that have passed away due to COVID-19 died after intervention was put in place. But also to be considered: what part of the 100,000 were people that were sickly to begin with, or were in a nursing home where the chances of getting it were much greater, compared to people in reasonable health that are generally going about their lives? This is the most important piece of the puzzle in my opinion, and we are essentially missing this information; although it is possible to piece it together from various sources, it will still be inexact.

Let’s approach it this way. Let’s define group A as people that are either in nursing homes or will not venture out (whether due to age, sickness, or desire).

Let’s define group B as all other people. How many are in each group? Just thinking of the people that I know and see, and having ballpark ideas of numbers with respect to some demographics, let’s put it at

100,000,000 people in group A and the rest (about 230,000,000) in group B.

How many of the 100,000 deaths thus far were people in group B? The reason this number is important is that the assumption is that, by and large, only group B will be the ones venturing out to restaurants.

What if it is 50,000 people? 

50,000 people in two months is about 300,000 over a whole year, and further, since there are about 230,000,000 people in that group, if we extrapolate out to 330,000,000 (the population of the U.S.) to index it to the flu numbers I gave earlier, then we have roughly 430,000 people, nearly a ninefold difference between that and the flu.

But it sure seems silly to assume that it is 50,000. I surmise it is probably between 5,000 and 20,000, and I would guess that it is closer to 5,000.

What if it is 20,000? Then that translates to roughly 170,000 people per year.

What if it is 5,000? That translates to roughly 43,000 per year, about the same rate as the flu.
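The extrapolation above can be written out explicitly. Keep in mind that every input here is one of my rough guesses from this post (group sizes, the two-month window, the 50,000 flu figure), not real data:

```python
# Annualize group B's two-month death count and index it to the whole
# U.S. population so it can be compared with ~50,000 yearly flu deaths.
def annualized_vs_flu(group_b_deaths_2mo, group_b_pop=230_000_000,
                      us_pop=330_000_000, flu_deaths=50_000):
    per_year = group_b_deaths_2mo * 6            # two months -> twelve
    indexed = per_year * us_pop / group_b_pop    # scale to full population
    return indexed, indexed / flu_deaths

for deaths in (50_000, 20_000, 5_000):
    indexed, ratio = annualized_vs_flu(deaths)
    print(f"{deaths:>6} in two months -> ~{indexed:,.0f}/year, {ratio:.1f}x the flu")
```

Running it gives roughly 430,000, 170,000, and 43,000 per year for the three guesses, i.e. about nine times, three and a half times, and one times the flu.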

Now, this assumes the precautions of social distancing, but not using face masks.

If you were told you could go to your favorite restaurant without a face mask (of course, the social distancing would take care of itself; the restaurant would not put you too close to others), and you had three or four times the chance of catching and dying from the flu, many people would still go.

If it were roughly the same rate as the flu, a great majority of people would go.

In conclusion, these numbers would put the increased risk at somewhere between roughly the flu’s rate and about three and a half times it.

There was a Gilligan’s Island episode where Wrong Way Feldman, an aviator who had a horrible sense of direction, tried to describe where the island was, and by what he told the authorities, it was somewhere between the Bay of Naples and the Arctic Ocean! Now, my range leaves a lot to be desired, and I am certain I am not allowing for all the relevant factors (they are too numerous to mention here), but I think I have narrowed it down a little better than Wrong Way.

These numbers are mostly meant to be a partial “off the cuff” assessment, a plausible ‘drill down’ of what is going on.

One thing I am pretty sure of: I am glad I do not have to make the decisions related to lockdowns.

Sampling

This is the first of a three-part installment on sampling and inferences made from sampling.

As the Presidential race kicks into full gear (we hope it does anyway) we will see more and more polls come out as we get closer to election time.

There is much debate about polls, even among experts.

But to the non-expert, allow me to explain a few things. 

First, a few definitions.  A population is every item (or in many cases, person) of interest to a researcher.  For example, if a person is interested in the average age of all people who go into a store one day (and is only interested in that particular day), then the population would be all people that walk into the store that day.

If, for example, 100 people walked through that door on that day, then the 100 people would represent the population.  However, the population could also be defined as the 100 ages of the 100 people.  In other words, either the ages or the people themselves can be considered the population.

The population can be defined in any way the investigator wants. It will normally depend upon his or her interest.

The sample is a subset of the population.  In the example above, if you determined the ages of, say, the first 10 people in the store, then the sample would be those 10 people (or the 10 ages).
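To make the definitions concrete, here is a small sketch with made-up ages (the 100 random ages stand in for the population; the first 10 through the door are the sample):

```python
import random

random.seed(0)
population = [random.randint(15, 80) for _ in range(100)]  # ages of all 100 visitors
sample = population[:10]                                   # first 10 through the door

# The sample is simply a subset of the population.  Whether its average
# age resembles the population's is exactly the question of bias below.
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean age: {pop_mean:.1f}, sample mean age: {sample_mean:.1f}")
```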

Perhaps the biggest factor that investigators want to focus on is something called bias.  Perhaps I should say the biggest pitfall one wants to avoid is that of bias, as bias can completely distort your findings.  Bias occurs when a sample does not represent the population.

Perhaps the most famous case of bias in sampling came in 1948.

We will talk more about bias (along with types of sampling) in the second part.

The third part will focus more on this year’s Presidential election (and a look back at the 2016 election).

The importance of definitions

Back in the 1980s, Bill James, the pioneer of sabermetrics (the study of statistics in baseball), wrote an article that distinguished between a player’s value with respect to the peak of his career and his value regarding his career as a whole. Of course, most people at the time (or even now) would not have thought much about this, and perhaps would have told James, “you are making this complicated; why can’t we just have a discussion?”

Why? Because clarification/definition in just about any discussion is very important. If you are a baseball fan, you have no doubt been embroiled in debates regarding the greatest players of all time. One such discussion might have been “who is the greatest left-handed pitcher in history?” But let’s pare this down to simply a debate between Warren Spahn and Sandy Koufax. Most people would probably say Sandy Koufax, if only because his legend was bigger. If you look at their career statistics, you will see that Spahn’s statistics tower over Koufax’s (and not just in longevity). In other words, if you look at career value, it would be extremely hard to say that Koufax was better. The point here is that unless one defines what is meant by “greatest,” it is really hard to answer the question.

This is true for just about any discussion. Look at the situation with COVID-19. The media and supposed experts are in the news every day (heck, every hour), and they have done a poor job, if any job at all, of defining what they mean by a “COVID-related death.”

If somebody is 91, has all kinds of illnesses, and essentially dies of old age while also having high cholesterol, is it fair to say they “died of high cholesterol”?

Even if they are assessing these deaths in the same manner they define flu-related deaths, a flu-related death has never been clearly defined, and further, if we are given a definition, is it a fair definition? (By the way, allow me to make an important distinction. When I say clearly defined, I mean talked about, or written about, enough to make the general public aware. If it is written on page 24 of some dusty document, that doesn’t count as being ‘clearly defined’ in this context.)

There are just short of 80,000 deaths in the U.S. as of this writing. How did they count these? What if somebody with an assortment of maladies, in their late eighties, tests positive for COVID-19 and passes away? Should that count as a COVID-19 death?

I believe there are two reasonable ways to go about this. Assume a person has three ailments and they pass away. We could pick what we believe had the strongest impact on the death (and count that as THE condition they died from). So, in the case of a person having conditions A, B, and C: if they would have lived two years without A (while having B and C), but just two months under each of the other two scenarios, then it seems reasonable to say they died due to condition A and A only.

Or we could try to assess how long they would have lived without having a particular condition. If it is believed they would have lived only a few days without COVID-19, it seems absurd to say they died from COVID-19. If they would have lived a year, it seems reasonable to say that they died from COVID-19. So, what is a reasonable cutoff (i.e., threshold)? Whatever you deem it to be, define it. Note that by this second way, you can die from more than one illness.
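One hypothetical way to encode that second rule (the one-year threshold and the patient numbers are arbitrary choices of mine, purely for illustration):

```python
# Second approach: a condition "caused" the death if, without it, the
# person would have survived at least `threshold_days` longer.  Note
# that by this rule a death can be attributed to multiple conditions.
def causes_of_death(extra_days_without, threshold_days=365):
    return [cond for cond, days in extra_days_without.items()
            if days >= threshold_days]

# Hypothetical patient: estimated survival gained if each condition
# were removed, holding the others fixed.
patient = {"COVID-19": 400, "heart disease": 500, "high cholesterol": 10}
print(causes_of_death(patient))  # ['COVID-19', 'heart disease']
```

The hard part, of course, is not the rule but estimating those counterfactual survival times; the point is only that once a threshold is defined, the attribution becomes mechanical.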

This is not important merely out of curiosity. It is important because it allows lawmakers to make reasonable decisions with regard to social distancing, etc., and further allows the citizens of this country to be informed enough to respond to those decisions appropriately. For instance, if the flu kills 30,000 people a year by however you define it, and COVID-19 ends up killing 40,000 this year (by the same definition), then these extreme measures are nonsensical. But if it ends up, by that same definition, that 500,000 die (or would have died without intervention), then it would appear that most of these measures have been reasonable.

Notice we go back to the word ‘reasonable.’ That word, in many ways, defines our decision making, doesn’t it? It is at the crux of our court system, i.e., guilty beyond a reasonable doubt.

We all need to use reason. Have the lawmakers used reason with these restrictions? There are at least two problems in answering. Record keeping/assessment is one of them. That is a tough nut to crack, and even our best methods are going to leave debate along with uncertainty. But the other problem is something they could have and should have easily resolved: how precisely do they define a COVID-related death? Nobody seems to know, because nobody seems to be defining it.

It’s all about probability (well, mostly)

There was a book written several years ago by astrophysicist Mario Livio called Is God a Mathematician? Of course, an astrophysicist (or any scientist) is almost certain to think that God is a mathematician, as opposed to, say, an English major or somebody who studies rocks.

It seems kind of natural to think of the world in terms of mathematics. If we explain some of the mysteries of the universe by the equation E = mc², few will take a second look. But if we try to explain some of the mysteries of the world by Shakespeare’s “to be or not to be,” there will be quite a few heads turned.

As a mathematician myself, I am biased to a mathematics-based model of the world.  That said, one branch of mathematics that has not gotten as much hype as perhaps it should have is that of probability.  Despite my liking of mathematics, I have always had a particular affinity for probability.  Probability is narrow where mathematics is wide.

How does this relate to our current crisis? Well, people are divided as to how quickly we should open up certain aspects of America. Should restaurants be open now? What about in states that have not had as big an issue as others? Should hairdressers be allowed to conduct business?

Although it seems that the belief system is mostly partisan (Republicans want to end this lockdown, and Democrats want it extended), it does not fall completely along party lines.

Rules/laws/ordinances are for the most part based on probability. The U.S. is the “land of the free,” but what exactly does that mean? There are restrictions on what we can say. We cannot go into a movie theatre and yell ‘fire’ (unless there really is one).

There are speed limit laws.  By imposing a law, are our freedoms being interfered with?  Some would say ‘yes’. 

Regarding speed limits (and other restrictions) society generally has a balance between risk and reward.  For instance, if the speed limits were 10 miles per hour (even on highways), there is a pretty good chance that many people would be rioting in the streets over their freedoms being infringed upon. 

If you are really not concerned about dying, you might be willing to go 150 miles per hour on a highway. Or, if given the chance and told you had a 50% chance of dying if you took a spaceship to Mars, you might take it. Some would think it is worth the risk; of course, many would not.

We all have different thresholds regarding risk.  If you could take a rocket to the moon and had a one in a million chance of dying, would you take it?  I probably would.  If the chance was one in a thousand, I probably would not.  Thus, my threshold is somewhere between 1 in a million and 1 in a thousand (of not making it alive) for going to the moon.

With regard to COVID-19, every governor has his or her own discretion about what risks should be taken. Some say that, much like when we are skiing, we take risks, accepting not only the chance that we will hurt ourselves, but also that somebody else might hurt us. But we take that risk. The argument goes that we should likewise be able to take the risk of going to a restaurant. If somebody does not want to go out, that is up to them.

But unlike the skiing example, this is not an apples-to-apples comparison. When you are skiing, you take risks for yourself and alongside others who are willing to take the same risk. If you go to a restaurant, you take risks along with others while you are there, but the problem is that you might bring something home with you.

Now, I am not saying one way or another how I feel. I do believe that in certain cases the restrictions are too strong. I am making the general point that each person in authority, for the most part anyway, is making decisions not with the intent to infringe on anybody’s rights, but with probabilities in mind. As to whether some also have ulterior motives to purposely infringe on somebody’s rights, I don’t know their hearts.

But my point is that everyone has a different threshold in the many facets of life, and this guides their decision making.  Just like in the rocket to the moon example.

Another way of counting

https://www.news18.com/news/world/25000-missing-deaths-tracking-the-true-toll-of-the-coronavirus-crisis-in-11-countries-2587973.html

My previous post was regarding another way of counting. I figured it was a matter of time before somebody posted an article about it. The above article uses a method that I was espousing. Keep in mind, I was not suggesting that the current method was overstating or understating the number of deaths due to COVID-19. I was simply advocating for a “backup” system: a method that might ‘tease out’ the number of deaths from COVID-19. This article implies that (for 11 countries, anyway) the number of COVID-19 deaths is highly understated.

Let’s keep a few things in mind. First, we really do not know that this information is accurate. Like it or not, many people (or entities) have an agenda, and we can’t necessarily take an article (even if it is a reputable newspaper) as gospel.

Secondly, this method has its drawbacks, and is largely dependent on the volatility of the citizens in that area.

Let me give an illustration. Remember Whoville, the town created by Dr. Seuss? Well, let’s assume that 10,000 people live there, that they live fairly healthy lives, do not do drugs, do not take crazy risks on the highway, do not get in fights in bars, etc. A great majority live to be in their 80s or 90s. Let’s assume that typically 100 die per year. (For the sake of illustration, let’s assume the birth rate is about the same as the death rate, so that the number of people living there at any time is about 10,000.) Perhaps over a five-year stretch, the number of deaths is always between 90 and 110. We would call this a non-volatile situation, and one that is generally easy to model. If 200 died one year, we would know that something is up, and if it was in the midst of COVID-19, then we can be pretty sure that roughly 100 died from COVID-19.
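Whoville’s stable history makes the arithmetic trivial. A sketch, with the five-year history made up to match the made-up numbers above:

```python
# Excess-death estimate for a non-volatile town: the expected count comes
# from a stable history, and anything above it is attributed to the event.
history = [100, 95, 108, 92, 105]     # deaths in the five prior years (made up)
expected = sum(history) / len(history)
observed = 200                        # deaths in the pandemic year
excess = observed - expected
print(f"expected ~{expected:.0f}, observed {observed}, excess ~{excess:.0f}")
```

The whole method rests on `expected` being trustworthy, which is exactly the volatility question discussed next.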

Getting back to the 11 countries, the accuracy depends a lot on the volatility. For example, did any of them go through an economic crisis, whereby there may be a number of reasons for a higher death rate: not taking care of one’s health, suicide, murder, taking unnecessary risks, etc.? To truly ‘tease out’ the number of deaths that would be expected requires a model. And as we have seen, models can be drastically off (at least it looks that way). That said, I think this article makes a good point. And I will reiterate what I said last time: why don’t the media and experts give both methods when reporting the death toll, and let us decide which one might be more accurate? Studies have shown, by the way (maybe I will make it a post someday), that very often the average of two or more models is best!

Update on COVID-19

It looks like the modelers way overshot with their original predictions. Before we get into that, let me explain some terms in the statistical world. First of all, there is the term “statistics,” which is a very broad term, encompassing just about everything in the, well… statistical… world. A subset of statistics is ‘probability,’ i.e., “the probability of rain is 40%.” Related to probability is a word I think should get a little more hype: inference. To make an inference is to reach a conclusion based on the evidence you have (along with some common sense, or intuition).

When a modeler makes a statement, “we believe that 200,000 Americans will die from COVID-19,” they are making an inference. It appears (fortunately) that the toll will fall well short of that mark. Does that mean the inference is “wrong”? George Box is known for the expression, “All models are wrong, but some are useful.” What exactly did he mean? That, by and large, none are going to be perfect. But some will be more accurate than others.

Getting back to the over-prediction of COVID-19 deaths: certainly, there have been mitigating factors, such as social distancing. But I believe many experts made their predictions of a few hundred thousand with those factors in mind. So, they are still off the mark. They are being maligned by many people for making a poor prediction. Is this fair? Really, it is hard to say. Sometimes things happen that are really hard to envision.

Take a football team, say the San Francisco 49ers last year, when they went 13-3 and came within a play or two of winning the Super Bowl. The year before, they went 4 and 12. What was the prediction for them going into last year? Well, everybody might have had a different model. Some may have thought their coach, Kyle Shanahan, was only good as an offensive coordinator and would fail as a head coach. Some maybe thought he had learned from his mistakes, his players would play hard for him, and he would be a “boy genius.” Many were in the middle of that spectrum. What would Jimmy Garoppolo do for a full year? Was he overrated? How good was their draft? What about injuries? There are a million things that go into a prediction. Many prognosticators figured they would be improved, perhaps to as much as .500, maybe 9 and 7. I do not think the greatest clairvoyant could have seen anything close to what they ultimately did. Does that mean everybody who had some kind of a model (a lot of us have very “loose” models) had a poor model? Not really. We can make a bad prediction from what is still a relatively strong model.

Were the models (or modelers) that over-predicted what would happen with COVID-19 poor? Unless we know what went into the models (and only the modelers hold that information), we will never know. But I do believe that the general public has been overly tough on these forecasters, not quite understanding that, much like predicting Frisco to go 8 and 8 with the info one had, 200,000 deaths might have been an appropriate prediction given the information.

COVID-19: Another way of looking at the numbers

The question has come up quite often in the media and elsewhere regarding the exact death toll due to COVID-19.

There are many cases (likely a majority) where a person’s death has been attributed to COVID-19. Is it possible that some of these people were already dying and the virus just hastened their deaths (i.e., they might have died within weeks or months anyway)? Should these be counted? After all, perhaps a lot of older people in general are dying due in part to the flu, but the flu is never credited.

In light of that, I would like to propose another way to estimate: a way that I have not seen or heard anybody mention (though I am sure advocates are out there).

We should be able to estimate the number of COVID-19 deaths by subtracting the EXPECTED TOTAL deaths for some area from the actual TOTAL deaths for that area. So, for example, assume that a small city in New York has had x1, x2, …, x50 deaths over the last 50 months of March. A good mathematician can project what should happen this year, allowing for anything that might be relevant. A time trend factor might be used; it can be done per capita, etc. If one then projects 130, and there are 150, and we believe we have allowed for every other reasonable factor, then we can deduce that 20 died from COVID-19. Now, we might miss big on any given city. But if we do this for 100 cities, we should be pretty close.
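A minimal version of that projection, with a straight-line trend fit by hand (the 50 years of March deaths are simulated to make the projection come out to the 130 used above; I obviously do not have a real register):

```python
# Project this March's expected deaths from a linear trend over the past
# 50 Marches; the excess over the projection estimates COVID-19 deaths.
def fit_line(ys):
    """Least-squares slope and intercept for y against x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

past = [80 + i for i in range(50)]    # simulated: deaths drift up by 1 per year
slope, intercept = fit_line(past)
projected = slope * 50 + intercept    # expected deaths for year 51
observed = 150
print(f"projected {projected:.0f}, observed {observed}, "
      f"estimated COVID-19 deaths {observed - projected:.0f}")
```

A real version would bring in per-capita adjustment, seasonality, and so on, but the skeleton (project, then subtract) is the same.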

Now, this misses how many people are affected, since it only tries to tease out the COVID-19 deaths. I am not even saying that it is a better way than how they are doing it now. But it would complement the findings that are currently out there.

As it is now, many people are skeptical of the real count. As of this writing, the worldwide death toll is a little over 114,000. But again, how many are there really? Some people believe it might be half of that. My method would, to some extent, serve as a check on the general findings.