Unique function words characterize genomic proteins
Scaiewicz A; Levitt M
Proc Natl Acad Sci U S A 2018 [Jun]; 115(26): 6703-6708
PMID: 29895692
Between 2009 and 2016 the number of protein sequences from known species
increased 10-fold from 8 million to 85 million. About 80% of these sequences
contain at least one region recognized by the conserved domain architecture
retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological
function, but CDART often matches the same region of a protein with two or more
profiles. Such synonyms complicate estimates of functional complexity. We perform
full-linkage clustering of redundant profiles by finding maximum disjoint
cliques; each cluster is then replaced by a single representative profile to give what
we term a unique function word (UFW). From 2009 to 2016, the number of sequence
profiles used by CDART increased by 80%; the number of UFWs increased more slowly,
by 30%, indicating that the number of UFWs may be saturating. The number of
sequences matched by a single UFW (sequences with single domain architectures)
increased as slowly as the number of different words, whereas the number of
sequences matched by a combination of two or more UFWs in sequences with multiple
domain architectures (MDAs) increased at the same rate as the total number of
sequences. This combinatorial arrangement of a limited number of UFWs in MDAs
accounts for the genomic diversity of protein sequences. Although eukaryotes and
prokaryotes use very similar sets of "words" or UFWs (57% shared), the
"sentences" (MDAs) are different (1.3% shared).
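
The clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a greedy heuristic that repeatedly pulls out the largest clique it can find, and it assumes a redundancy graph in which an edge means two profiles match the same protein region. The profile names (`p1` to `p6`), the choice of the alphabetically first member as representative, and the collapsing of consecutive identical words in `architecture` are all hypothetical choices made for illustration.

```python
def build_adjacency(nodes, edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def grow_clique(seed, adj, allowed):
    """Greedily extend a clique from `seed` using only nodes in `allowed`."""
    clique = {seed}
    candidates = adj[seed] & allowed
    while candidates:
        # Prefer the candidate connected to the most other candidates.
        pick = max(sorted(candidates), key=lambda n: len(adj[n] & candidates))
        clique.add(pick)
        candidates &= adj[pick]  # keep only nodes adjacent to every member
    return clique


def disjoint_clique_clusters(nodes, edges):
    """Partition profiles into disjoint cliques, largest found first.

    Full linkage: every pair inside a cluster must be redundant.
    Greedy heuristic only; it does not guarantee a maximum clique.
    """
    adj = build_adjacency(nodes, edges)
    remaining = set(nodes)
    clusters = []
    while remaining:
        best = max(
            (grow_clique(seed, adj, remaining) for seed in sorted(remaining)),
            key=len,
        )
        clusters.append(sorted(best))
        remaining -= best
    return clusters


def architecture(hits, to_ufw):
    """Map ordered profile hits to UFWs, collapsing consecutive repeats
    (an assumption: synonyms hitting one region count as one word)."""
    words = []
    for profile in hits:
        word = to_ufw[profile]
        if not words or words[-1] != word:
            words.append(word)
    return tuple(words)


# Hypothetical redundancy graph: p1, p2, p3 are mutually redundant,
# p3 also overlaps p4, and p4 overlaps p5; p6 has no synonyms.
profiles = ["p1", "p2", "p3", "p4", "p5", "p6"]
overlaps = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]

clusters = disjoint_clique_clusters(profiles, overlaps)
# Representative = alphabetically first member (illustrative choice).
to_ufw = {p: cluster[0] for cluster in clusters for p in cluster}

sda = architecture(["p3", "p2"], to_ufw)        # a single word
mda = architecture(["p2", "p1", "p4"], to_ufw)  # a combination of words
print(clusters, sda, mda)
```

Single-linkage clustering would merge `p1` through `p5` into one group through the `p3`-`p4` edge. Requiring cliques keeps `p1, p2, p3` and `p4, p5` apart, which is the point of full linkage: no two profiles share a word unless they are directly redundant.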