- You conduct an A/A test (or its equivalent, such as having a Variant that is behaviorally identical to the Control group).
- You notice that some metrics show statistically significant differences, which seems counter-intuitive.
How is this to be reconciled?
Bear in mind that when two (or more) groups are drawn from a homogeneous pool of users, the groups will never be identical in every respect (indeed, in almost any respect). This shows up in the metrics: of the many metrics defined over the users, almost none will have exactly the same value across the groups, even though every group was sampled at random from the same pool. And some metrics go beyond merely differing: they display a statistically significant difference.
The reason is that every group of users, even one chosen at random, has its own idiosyncrasies, and these can manifest as a significant difference in one or more metrics. In practice, the fraction of metrics that show statistical significance is typically close to the error percentage, i.e. (100 - confidence level): at a 95% confidence level, roughly 5% of metrics will appear significant purely by chance.
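This is easy to see in a simulation. The sketch below (illustrative only; the metric distributions, group sizes, and a large-sample z-test are assumptions, not part of the original setup) draws a Control and a Variant group from the *same* distribution for many independent metrics, and counts how many differences come out "significant" at alpha = 0.05. The count hovers around 5% of the metrics, even though no real difference exists.

```python
import math
import random

random.seed(0)

def two_sample_p(a, b):
    """Two-sided p-value for a two-sample z-test (large-sample normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    # Two-sided tail probability via the normal CDF (math.erf).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

ALPHA = 0.05      # 95% confidence level -> 5% error percentage
N_METRICS = 200   # independent metrics, identical in both groups (pure A/A)
N_USERS = 500     # users per group (assumed sizes, for illustration)

significant = 0
for _ in range(N_METRICS):
    control = [random.gauss(0, 1) for _ in range(N_USERS)]
    variant = [random.gauss(0, 1) for _ in range(N_USERS)]  # same distribution: A/A
    if two_sample_p(control, variant) < ALPHA:
        significant += 1

print(f"{significant} of {N_METRICS} metrics significant "
      f"({significant / N_METRICS:.1%}); expected about {ALPHA:.0%}")
```

Re-running with different seeds moves the count up and down, but it stays in the neighborhood of ALPHA * N_METRICS, which is exactly the behavior described above.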
- You must ask yourself what the point of an A/A test is. An A/A test is a meta-test: it is not meant to test the features but to test the setup and assumptions of the experiment.
- If the number of metrics showing statistical significance in an A/A test is far above or far below the error percentage, the setup of the experiment is faulty: there is some inherent bias in the way users are grouped (i.e. the randomizer is not functioning well).
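"Far above or far below" can itself be made precise: under a healthy setup, the number of significant metrics out of m independent metrics is roughly Binomial(m, alpha). A minimal sketch of that sanity check (the counts 30 and 200 are hypothetical example numbers, not from the text):

```python
import math

def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed count k."""
    pmf = lambda i: math.comb(n, i) * p**i * (1 - p) ** (n - i)
    pk = pmf(k)
    return min(1.0, sum(pmf(i) for i in range(n + 1) if pmf(i) <= pk + 1e-12))

# Hypothetical A/A result: 30 of 200 metrics "significant" at alpha = 0.05,
# versus the ~10 expected. A tiny p-value here flags a faulty randomizer.
p = binom_two_sided_p(30, 200, 0.05)
print(f"p = {p:.4g}")
```

A very small p-value from this check is evidence that the grouping is biased, whereas an unremarkable one (say 8 or 12 significant metrics out of 200) is consistent with a working randomizer.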