Data is only an opinion.
Not that our interpretation of data is an opinion. That’s already well understood.
All data is an opinion. Not a fact.
Let’s back up. In the natural, physical world, we experience the color of the sky. We feel the heft of stone in our hand. We also enjoy the positive emotion of a stroll through the park. We can describe the sky as blue, the weight as 12 ounces, and our feeling as tranquil.
We can share these opinions because there is an agreement on what this data means.
Sky blue is a portion of the visual spectrum marked by a certain wavelength. A standards body set an ounce as a measure. When we feel twelve of them, we can cite that number. And as we grow up, we express a feeling and our parents give us the name to describe it.
Those are all opinions. Opinions based on a common unit of measure, a counting system, and the meaning of words.
This is not a conspiracy. This is a needed agreement in our social contract to collaborate, to learn, and to share.
Collapsing the wave function.
Collapsing the wave function is an expression used by quantum physicists. It suggests that when we observe a phenomenon, we eliminate possibilities. A quantum particle, for example, can be in more than one place at the same time.
Collapsing the wave function is closely related to what is often called the observer effect. It states that once we measure something, all the other possibilities are no longer available.
Therefore, humans are the root cause of what we consider an objective description of the universe. This conclusion parallels our premise that “data is just an opinion.”
The fingerprints of humankind are always in the data.
Humans created the pound and the kilogram. They are not a byproduct of nature. We agree on the use of the Imperial or Metric measuring system to do business. Standardized weight measurements did not exist before the third or fourth millennium BC. They did not exist until farmers and traders wanted a uniform way to sell their goods.
This realization is fundamental to understanding AI.
We think of analytics, or AI, as an outcome based on data. We predict a future incident based on events of the past. We train the algorithm to find the key factors from prior failures.
To get to those factors, we populate a table full of measurements taken over time. Measurements such as temperature, vibration, sound, and output yield. When we map the unexpected numbers coincident with a failure, we make a connection.
Put plainly, when these things happen, this other thing is likely to occur. That is predictive analytics in its simplest form.
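To make the idea concrete, here is a minimal sketch in Python. The readings, the thresholds, and the function names are all made up for illustration; a real model would learn its thresholds from far more data. It simply compares how often a failure followed unusual readings with how often it followed normal ones.

```python
# Hypothetical history: (temperature in °C, vibration in mm/s, failed afterwards?)
history = [
    (70, 2.1, False),
    (72, 2.3, False),
    (91, 4.8, True),
    (69, 2.0, False),
    (93, 5.1, True),
    (90, 2.2, False),
    (71, 4.9, False),
    (95, 5.4, True),
    (92, 4.5, False),
]

# Assumed limits for what counts as an "unexpected" reading.
TEMP_LIMIT = 85
VIBRATION_LIMIT = 4.0


def is_unusual(temp, vibration):
    """True when both readings are above their assumed limits."""
    return temp > TEMP_LIMIT and vibration > VIBRATION_LIMIT


def failure_rate(rows):
    """Share of rows that were followed by a failure."""
    if not rows:
        return 0.0
    return sum(1 for _, _, failed in rows if failed) / len(rows)


unusual = [row for row in history if is_unusual(row[0], row[1])]
normal = [row for row in history if not is_unusual(row[0], row[1])]

rate_when_unusual = failure_rate(unusual)
rate_when_normal = failure_rate(normal)

print(f"Failure rate after unusual readings: {rate_when_unusual:.2f}")
print(f"Failure rate after normal readings:  {rate_when_normal:.2f}")
```

Notice how many human choices sit inside even this tiny example: which measurements to record, which units to use, and where to draw the line for "unusual."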
Let’s explore what kinds of data we generate.
Putting on the “first principles” thinking hat, we find three different kinds of data. They can be expressed in either numbers or words, but ultimately numbers play the principal role.
1. Counting data.
Using a numerical system, we account for what is in the inventory. We count the units, then affix a number. This is shorthand for expressing all of the available items.
Jerry picked an apple. Jerry picked another apple. Then another apple. Jerry has three apples.
From here we can apply mathematical principles such as addition, subtraction, multiplication, fractions, etc.
Basically, when we convert to the symbolic language of numbers, further operations can ensue.
2. Measuring data.
This broad category shares one principle: measuring requires starting with a fixed point.
Measuring length is possible because we have notches on a yardstick. What is fixed is the convention on the separation of the notches.
Distance is measured in a similar fashion. However, the focus is not on the length of an object but on the separation between objects.
Even machine or application data stems from people. Engineers decide how sensors are programmed and how applications write logs. Without these interventions, there is no data.
Time can also be measured.
With each tick of the stopwatch the number of seconds increases. We measure the lap around the track in seconds, minutes, or hours. But the space between the ticks was something humans devised. Interestingly, the latest definition of a second arrived in 1967. The scientific community agreed on the number of oscillations of the cesium atom that make up a second.
The good news is that whether we choose seconds or seasons, the measurement resolves to a number. That means mathematical principles can be applied.
3. Label data.
Let’s go back to our friend Jerry. He picked tree fruit that was red, tasted a particular way, and generally came in a uniform shape. We called them “apples” because that is the generally accepted name for the fruit.
Names of things are just labels for things. The same object can have many labels.
You can be a mother, a daughter, and a sister all at the same time. You can hold all three labels. Those names are not mutually exclusive, and the person remains exactly as she is in time and space.
Verbs are labels for an action taken. Run, jog, and trot are labels for a person’s forward locomotion.
We can see that language is really an expression of our understanding of what we experience. It is useful because there is a general convention on the definitions. But words are still a function of human understanding and not a fundamental property of physics.
Words can turn into numbers.
In the early days of sentiment analysis, we counted the number of words that had a positive salience to them. Then we compared them to the words that had a negative salience. If there were more positive words in a product review, then the post was considered positive.
To the degree that the net sum is significantly above neutral, we refer to the review as “very positive”. It is numerically higher relative to neutral (zero).
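Here is a minimal sketch of that early word-counting approach in Python. The word lists and the cutoff for “very positive” are made up for illustration; real sentiment lexicons are much larger, and someone had to decide which words belong on which list.

```python
import re

# Hypothetical word lists; a real lexicon would hold thousands of entries.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "broken", "slow"}


def sentiment_score(review):
    """Net sum: positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", review.lower())
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positive - negative


def label(score, strong=3):
    """Turn the net sum into a label; `strong` is an assumed cutoff."""
    if score >= strong:
        return "very positive"
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    if score <= -strong:
        return "very negative"
    return "negative"


review = "Great phone, I love it. Excellent camera, but slow charging."
score = sentiment_score(review)
print(score, label(score))
```

In this review, three positive words and one negative word give a net sum of two, so the review lands on the positive side of zero.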
The data science principles of Natural Language Processing have matured far beyond counting positive and negative words. The proximity of words also carries value, since it, too, conveys context.
Now that we have data expressed as numbers, we can do all kinds of math.
This is the secret that data scientists don’t talk about. It is not a forbidden fruit for the “rest of us.” Instead, it is simply too long an explanation. Not many are willing to go this far down the rabbit hole.
If AI is based on data, and data is based on opinions, then is AI based on opinions?
Let’s unpack the word “opinion”. An opinion is a decision or conclusion based on observations. It is a conclusion built from logic.
AI uses mathematical logic to calculate the highest probability of a condition (an opinion). For example, the chance of rain is 85%. It bases its score on the observations derived from the data fed into the algorithm.
So AI is based on opinions.
But who cares about philosophy on the true nature of data, when we have real problems to solve?
While philosophy might be fun, the rubber meets the road when we apply data to an action. Attention is required.
These outcomes have far-reaching effects. For example, false positives can result in over-medication, delays in production, or unproductive worry.
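One way to keep false positives in view is to count them. The sketch below, in Python, uses made-up predictions and outcomes; the function name is ours and not part of any library. It computes the share of truly negative cases that the model flagged anyway.

```python
def false_positive_rate(predictions, actuals):
    """Share of truly negative cases that were wrongly flagged as positive."""
    pairs = list(zip(predictions, actuals))
    false_positives = sum(1 for pred, actual in pairs if pred and not actual)
    true_negatives = sum(1 for pred, actual in pairs if not pred and not actual)
    negatives = false_positives + true_negatives
    if negatives == 0:
        return 0.0
    return false_positives / negatives


# Hypothetical data: what the model predicted versus what actually happened.
predicted = [True, True, False, True, False, False, True, False]
actual = [True, False, False, True, False, False, False, False]

rate = false_positive_rate(predicted, actual)
print(f"False positive rate: {rate:.2f}")
```

Here the model raised the alarm four times, and two of those alarms were wrong. How much of that is tolerable is, once again, a human decision.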
Providing quality outputs means that we must ensure quality inputs first. The expression “garbage in, garbage out” fits here.
Consider this framework for taking responsibility for high quality data outcomes.
A. Begin with the end in mind.
Define the outcome you seek, then determine the key indicators which will ensure that. The data will follow. This choice is the first stage in your checklist for quality.
B. Check the source.
Inspecting the origin of the data matters. Do the sensors have a high degree of accuracy? Was there bias in how the data was labeled? Was there business logic based on a faulty conclusion from another report?
Checkpoints across data lineage are a systematic way of remaining in control.
C. Invite outside inspection.
Your group has a particular way of seeing the world. A third party offers an outsider’s view of the situation. They will come with blind spots, but they will not be your blind spots.
Build mechanisms to verify the process is sound. When in doubt, ask an accountant why auditors matter.
Your opportunity is best served with critical thinking.
Know the difference between data and a fact. A fact is something known or proved to be true. Data is generally considered the best means to arrive at that proof.
However, even data that comes from emotionless machines is not an indelible aspect of the universe. Those machines are devices designed by people to achieve a consistent result. While they are useful, they are not perfect.
Critical thinking can identify an opinion that has an impact today. It can also recognize a result that survives the test of time.