The scientific community has spent a long time trying to answer one question: exactly how many human proteins are there? The answer would give us a deeper understanding of how our genetic blueprint (DNA) translates into who we are. It would also give us insight into how our DNA produces our individual characteristics, our phenotype.
There is a great blog post on the topic, as well as a recent article by Neil Kelleher and colleagues published in Nature Chemical Biology. Both pieces try to answer how proteins constitute the phenotype. Essentially, it comes down to the seemingly simple question of how many different forms of proteins there are.
To even begin to answer this question, we must look at how many genes in our DNA are protein-coding. DNA is a very complex molecule. Though it contains very long sequences, not all of them are transcribed and translated into a functional protein. The current estimate is that our DNA contains around 20,000 protein-coding genes. Taken at face value, this would mean there are about the same number of proteins in the human body. This seems straightforward, but it’s not.
One needs to consider that our transcription machinery (DNA->RNA) can produce different splice variants of the same protein-coding gene. Additionally, various post-translational modifications (PTMs), such as phosphorylation and glycosylation, strongly influence the function or activity of specific proteins. There are also many additional sources of protein variability in biological systems. The question then becomes philosophical. If there is so much potential variability in a single protein, can we still talk about one protein? Or should we talk about thousands of different protein variants?
In the article mentioned above, Prof. Kelleher and colleagues focus on proteoforms rather than proteins. A proteoform is defined as an individual molecular form of an expressed protein. So, if we aim to understand the importance of proteins and protein networks in biology, the relevant question to ask is not so much how many proteins there are, since protein is a broadly defined term, but how many proteoforms are expressed.
This question needs to be looked at from different angles. First, how many proteoforms are theoretically possible? Second, how many of those are actually expressed? And third, should we count each different variant as a separate proteoform? The article gives the example of the human histone H4 protein, which is estimated to have at least 98,304 theoretically possible proteoforms, based purely on PTMs. However, only 75 of those have been reported so far. The big gap between theoretical possibilities and actually observed proteoforms likely comes from the fact that nature is not so free-spirited after all, and that there is a high degree of control over possible proteoform diversity. All in all, current estimates put the number of actual distinct proteoforms in the range of single-digit millions rather than the theorized trillions.
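To get a feel for why the theoretical number grows so quickly, here is a minimal Python sketch of the counting argument. It assumes every modification site varies independently, so the total is simply the product of the number of states per site. The site layout below is hypothetical: it was picked only because 3 x 2^15 happens to equal 98,304, and it is not the site list used in the article.

```python
from math import prod


def count_ptm_combinations(states_per_site):
    """Upper bound on proteoforms if every site varies independently."""
    return prod(states_per_site)


# Hypothetical layout (an assumption, not taken from the article):
# 15 sites that are either modified or unmodified, plus one site
# that can be in any of 3 states.
hypothetical_sites = [2] * 15 + [3]

theoretical = count_ptm_combinations(hypothetical_sites)
observed = 75  # number of reported H4 proteoforms quoted in the text

print(f"Theoretical combinations: {theoretical:,}")
print(f"Share actually observed:  {observed / theoretical:.2%}")
```

Under these assumptions, the 75 reported forms make up well under 0.1% of the theoretical space. Adding just one more two-state site would double the total again, which is how estimates across the whole proteome climb so high.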
We come to the part where we have to ask ourselves: Are we comfortable making these estimates? Are we ready to know what we don’t know?
The diversity described here can’t yet be fully captured, because current technology is not advanced enough. Mass spectrometry-based proteomics allows us to see deeper into the proteome than ever before, and scientists are identifying thousands of protein groups. However, this is still far from covering the complete set of distinct proteoforms.
Protein analysis technologies are constantly improving. These advances give us unprecedented insight into the mysterious world of proteins and how they define biology.
We may never have a full map of all the human proteoforms that occur in nature, but we shouldn’t be discouraged. As scientists, we should focus on expanding our knowledge of biological systems with the information that is available, while also staying humble. We should never assume that we completely understand nature and biology.