How Product Managers Test Experiments Part I

Jul 21, 2023

In product management, making informed decisions is key to creating successful products. The art of testing experiments plays a pivotal role in this journey, where data-driven insights illuminate the path to innovation. In this article series, we will unravel how product managers test experiments, exploring sampling techniques that reveal valuable insights from smaller data sets and statistical methodologies that empower them to make confident choices.

This is the first article in a series whose end goal is to walk through the use cases of different comparative experiments. In Part 1, we lay the theoretical foundation needed to make good decisions when designing experiments for our respective products.

Disclaimer: Throughout this article, I assume a Product Manager with computational skills. This does not mean that all Product Managers run these experiments themselves. In some cases, a Data Analyst, a Data Scientist, or another computationally inclined member of the team does this.

Outline

Introduction
Sampling Techniques
Estimation and Hypothesis Testing
Conclusion

Sampling Techniques

Sampling is the process of selecting a subset of units from a population (the exhaustive list of the items of interest). The goal is to make inferences from the sample rather than from the entire population. In statistics, we expect a sample (of size n) to be representative of the population (of size N), which can be assessed with different techniques, for its inferences to be acceptable.

There are several ways to select samples, broadly grouped into:

Probability Sampling Technique
Non-probability Sampling Technique

Probability Sampling Techniques

This family of techniques is commonly known as Random Sampling and takes several forms:

Simple Random Sampling: In this approach, every member of the population has an equal chance of being selected. This technique works best when the population is homogeneous (its members share the same attributes).
Systematic Random Sampling: This technique builds on Simple Random Sampling. It selects the first unit at random and then takes every k-th unit until the required sample size is reached. What is k? In theory, k is the sampling interval, calculated by dividing N by n (k = N/n).
Stratified Random Sampling: Unlike Simple and Systematic sampling, this approach first splits the population (N) into homogeneous groups (strata) and then performs Simple Random Sampling within each group. Stratification is very useful when the population is heterogeneous.
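
The three probability techniques above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; the users list, the "plan" attribute, and the function names are hypothetical and exist only for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member of the population has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random start within the first k members, then every k-th member."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval k = N / n
    start = rng.randrange(k)      # random start in [0, k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, key, fraction, seed=None):
    """Split into strata by key, then simple random sample within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population: 1,000 users, the first 300 on a "pro" plan.
users = [{"id": i, "plan": "pro" if i < 300 else "free"} for i in range(1000)]

srs = simple_random_sample(users, 100, seed=1)
sys_sample = systematic_sample(users, 100, seed=1)
strat = stratified_sample(users, key=lambda u: u["plan"], fraction=0.1, seed=1)
```

One caveat on the design: systematic sampling can mislead when the list order has a repeating pattern that lines up with k, so it is worth checking how the population is ordered before using it.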

Non-probability Sampling Techniques

Non-probability sampling is a sampling technique used in statistics where the selection of individuals or items from a population is not based on randomization or known probabilities.

Convenience Sampling: Convenience sampling involves selecting individuals who are readily available and accessible to the researcher. It is an easy method but may introduce bias since the sample may not accurately represent the entire population.
Purposive Sampling: Purposive sampling involves deliberately selecting individuals who possess specific characteristics or expertise relevant to the research objective. The researcher uses their judgment to handpick participants based on their knowledge, expertise, or unique characteristics. A good example is a teacher asked to select a student to represent their school in a competition.
Snowball Sampling: Snowball sampling relies on existing participants to refer or nominate additional individuals who meet the desired criteria. This sampling method is often used when the population is hard to reach or identify, such as in studies involving marginalized or hidden populations. You can see this as a referral method.
Quota Sampling: Quota sampling involves selecting individuals to fulfill pre-defined quotas based on specific characteristics such as age, gender, or occupation. The researcher sets these quotas to ensure the sample represents certain population segments. However, the selection of participants within the quotas may not be random.
Cluster Sampling: In cluster sampling, the target population is divided into clusters (groups), and a random sample of clusters is selected. Then, all individuals or elements within the selected clusters are included in the sample. Note that, because the clusters are chosen at random, cluster sampling is normally classified as a probability technique rather than a non-probability one. It differs from stratified sampling, where the target population is divided into mutually exclusive strata (subgroups) and a random sample is taken from each stratum.
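
Cluster sampling, the last technique in the list above, lends itself to a short sketch. The example below assumes one-stage cluster sampling (everyone in a chosen cluster is kept); the city names, user IDs, and the cluster_sample function are made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sampling: pick whole clusters at random, keep everyone in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # random choice of cluster names
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical clusters: users grouped by city.
users_by_city = {
    "Lagos": ["u1", "u2", "u3"],
    "Accra": ["u4", "u5"],
    "Nairobi": ["u6", "u7", "u8", "u9"],
    "Cairo": ["u10"],
}

chosen_cities, sample = cluster_sample(users_by_city, n_clusters=2, seed=7)
```

Unlike stratified sampling, only some groups end up in the sample, so the result is only as representative as the chosen clusters are.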

Depending on your scenario, you will often find yourself combining several of these sampling techniques.

Estimation and Hypothesis Testing

Estimation

After selecting the appropriate sampling technique, product managers use statistical estimation methods to draw conclusions about the population from the data collected in the sample. Estimation involves calculating point estimates and interval estimates.

Point Estimate: A point estimate is a single value that serves as an estimate of an unknown parameter in the population. For example, if we are interested in estimating the average satisfaction score of users for a new feature, the sample mean can be used as the point estimate.
Interval Estimate: An interval estimate provides a range of values (a confidence interval) within which the true population parameter is expected to fall at a stated level of confidence. Commonly used confidence levels are 90%, 95%, and 99%. For the same data, a higher confidence level produces a wider interval. A wider confidence interval indicates more uncertainty, while a narrower interval suggests higher precision in the estimate.

Use cases that involve human life or a high level of risk typically use 99% confidence intervals (CI).
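
To make point and interval estimates concrete, here is a minimal sketch that computes a confidence interval for a mean. It assumes the normal approximation is acceptable for the sample at hand (for small samples a t-based interval would be wider), and the satisfaction scores are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for a mean: point estimate +/- z * standard error."""
    point_estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)

# Hypothetical satisfaction scores (1 to 5) from 40 sampled users.
scores = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4] * 4

for level in (0.90, 0.95, 0.99):
    estimate, (low, high) = mean_confidence_interval(scores, level)
    print(f"{level:.0%} CI: point estimate {estimate:.2f}, interval [{low:.3f}, {high:.3f}]")
```

Running the loop shows the same point estimate at every level, with the interval widening as the confidence level rises from 90% to 99%.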

Hypothesis Testing

Hypothesis testing is a crucial step in the experimentation process. It helps product managers make data-driven decisions by determining whether there is a significant difference between groups or if an observed effect is due to chance.

Null Hypothesis (H0): The null hypothesis represents the status quo or the absence of any significant effect. It assumes that any observed difference or relationship in the sample is purely due to chance.
Alternative Hypothesis (H1): The alternative hypothesis contradicts the null hypothesis and states that there is a significant effect or difference in the population. This is usually the claim the researcher wants to support.
Statistical Significance: During hypothesis testing, product managers set a threshold, known as the significance level (alpha), typically at 0.05. If the p-value obtained from the test is less than or equal to the significance level, the result is considered statistically significant. It suggests that the null hypothesis can be rejected in favor of the alternative hypothesis.
Type I and Type II Errors: In hypothesis testing, there are two types of errors. A Type I error (a false positive) occurs when the null hypothesis is rejected although it is, in fact, true. A Type II error (a false negative) happens when the null hypothesis is not rejected although it is actually false. Product managers aim to minimize both types of errors, but the trade-off between them often depends on the specific context and consequences.
Test Statistic: A test statistic is a numerical value calculated from the sample data that is used to decide whether to reject or fail to reject the null hypothesis. It is either compared to a critical value or converted to a p-value, which is then compared to the significance level.
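
The pieces above (H0, H1, alpha, test statistic, p-value) fit together as in the following sketch of a two-sided z-test for two proportions, a common setup for comparing conversion rates. The conversion counts are hypothetical, and the function is an illustration rather than a prescribed method.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for H0: both variants have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)               # shared rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                                   # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value
    return z, p_value, p_value <= alpha                    # True means "reject H0"

# Hypothetical A/B test: control converts 200 of 2,000 users, variant 260 of 2,000.
z, p_value, reject_h0 = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Lowering alpha makes a Type I error less likely but, for the same sample size, makes a Type II error more likely, which is the trade-off described above.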

The choice of test statistic depends on the nature of the data and the specific hypothesis being tested. Different statistical tests are used for different types of data and research questions. Some common test statistics include:

t-statistic: The t-statistic is commonly used for hypothesis testing when dealing with small sample sizes and when the population standard deviation is unknown. It is often employed in scenarios involving means or averages.
Z-score: The Z-score is used when working with large sample sizes and known population standard deviation. It is mainly applied in situations where the statistic's sampling distribution is approximately normal.
F-statistic: The F-statistic is used in the analysis of variance (ANOVA), where it assesses whether the means of multiple groups are significantly different, and in related tests that compare variances between groups.
Chi-square statistic: The chi-square statistic is employed in testing relationships between categorical variables. It is commonly used in chi-square tests for independence and goodness-of-fit tests.

In theory, the z-test is used when the population variance is known and the sample size is large, while the t-test is used when the variance is unknown and the sample is small. In practice, however, the t-test is a safe default in most cases, because the t-distribution converges to the standard normal (z) distribution as the sample size increases.
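
As a rough illustration of the t and z relationship, the sketch below computes Welch's t-statistic (a variant of the two-sample t-test that does not assume equal variances) and its approximate degrees of freedom. Because the samples are large, the p-value is taken from the normal distribution as an approximation; an exact p-value would come from the t-distribution, which the standard library does not provide. The session-length data is synthetic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and its approximate (Welch-Satterthwaite) degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    v_a = variance(sample_a) / n_a        # squared standard error of mean A
    v_b = variance(sample_b) / n_b        # squared standard error of mean B
    t = (mean(sample_b) - mean(sample_a)) / sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

# Synthetic session lengths in minutes: the variant is shifted up by 0.5.
control = [10 + (i % 5 - 2) for i in range(500)]
variant = [x + 0.5 for x in control]

t, df = welch_t(control, variant)
# With hundreds of degrees of freedom, the normal approximation is close to the t result.
approx_p_value = 2 * (1 - NormalDist().cdf(abs(t)))
```

With small samples the degrees of freedom are low, the t-distribution has heavier tails than z, and the normal approximation used here would understate the p-value.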

Interpreting Results

Once hypothesis testing is completed, product managers analyze the results to make informed decisions. Statistical significance does not necessarily imply practical significance, and effect size should also be considered. The effect size measures the magnitude of the observed difference and provides insight into the practical importance of the findings.
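
One common effect-size measure for a difference in means is Cohen's d, the difference divided by the pooled standard deviation. The sketch below is a minimal version with made-up rating data; the conventional reading of d (roughly 0.2 small, 0.5 medium, 0.8 large) is a rule of thumb, not a substitute for product judgment.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(sample_a, sample_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_b) - mean(sample_a)) / sqrt(pooled_var)

# Hypothetical feature ratings (1 to 5) from two small groups.
control = [3, 4, 4, 5, 3, 4, 5, 4]
variant = [4, 4, 5, 5, 4, 5, 5, 4]

d = cohens_d(control, variant)
print(f"Cohen's d = {d:.2f}")
```

A very large experiment can make a tiny d statistically significant, which is why the effect size is worth reporting alongside the p-value.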

Conclusion

Product managers play a critical role in the testing of experiments, employing various sampling techniques to collect representative data. Estimation and hypothesis testing allow them to draw meaningful insights from the data and make informed decisions about product enhancements or new features. By using robust statistical methods, product managers can confidently validate their hypotheses and drive product improvements that cater to the needs and preferences of their users.

In the subsequent article of this series, we shall be walking through some use cases. I have a preference for social media data for our experiments and would love to see your suggestions for some social experiments you would like to test!

Shall we find answers to some hypotheses together, please?

Why not!