A novel machine learning method can identify past infections using unprecedentedly small datasets retrieved from T-cell receptors and has the potential to enhance the understanding of how the immune system recognises pathogens.
In science, discoveries are often limited by data collection and processing methods, as large datasets can be expensive, inconvenient, and time-consuming to collect.
Fortunately, a new machine learning method, named “MotifBoost”, developed by researchers at the Institute of Industrial Science at The University of Tokyo, can analyse data from T-cell receptors (TCRs) to identify the pathogens responsible for previous infections.
This approach interprets data from shorter sequences in the T-cell receptors, which reduces the amount of data needed for analysis. The researchers’ efforts may result in a better understanding of how the immune system identifies pathogens, which in the long run may lead to improved therapies and better health outcomes.
The ongoing pandemic has once again emphasised the importance of the immune system in guarding against unknown pathogens. The adaptive immune system comprises a variety of specialised cells, such as T-cells, which carry a range of unique receptors that recognise antigens belonging to different pathogens, even without prior infection.
Thus, the diversity of T-cell receptors is an important research focus for a better understanding of the human immune system. However, collecting data on the signalling between the receptors and the antigens they respond to is difficult, and prevailing computer-based models cannot generate useful insights unless they are given sufficiently large datasets.
MotifBoost solves this problem by scanning ultra-short segments, known as “k-mers”, in individual T-cell receptors. Unlike conventional methods, which scan longer amino acid chains, it scans segments of only three amino acids, yet it is efficient and produces accurate results. “Our machine learning methods trained on small-scale datasets can supplement conventional classification methods, which only work on very large datasets,” first author Yotaro Katayama said. The method builds on previous knowledge that TCRs produced by different people in response to the same pathogen closely resemble one another.
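To make the idea of a k-mer distribution concrete, the following is a minimal sketch of how segments of three amino acids could be counted across a set of receptor sequences and turned into relative frequencies. It is an illustration only, not the authors' code: the function name `kmer_distribution` and the two toy sequences are hypothetical and do not come from the study.

```python
from collections import Counter


def kmer_distribution(sequences, k=3):
    """Return the relative frequency of every k-mer found in the sequences.

    `sequences` is an iterable of amino-acid strings (one per receptor).
    Sequences shorter than k contribute nothing.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k along the sequence, one residue at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalise so that the frequencies sum to 1.
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy "repertoire" of two short sequences (not real data).
example_repertoire = ["CASSLG", "CASSPG"]
dist = kmer_distribution(example_repertoire)

for kmer, freq in sorted(dist.items()):
    print(kmer, round(freq, 3))
```

Each participant's repertoire is thereby reduced to one compact frequency table, regardless of how many receptor sequences were collected, which is one plausible reason such features remain usable on small datasets.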
In their study, the researchers used unsupervised learning techniques, which automatically group participants based on patterns already present in the data, and found that the k-mer distributions grouped participants according to whether they had previously been infected with cytomegalovirus (CMV).
Since unsupervised learning methods are given no information about which participants had previously been exposed to CMV, this outcome supports the reasoning that k-mer data are suitable indicators of various characteristics of a participant’s immune system (for example, whether they have been infected with a particular virus or not). The researchers then used the k-mer distribution data, labelled according to each participant’s past infection status, to train a supervised learning model. The model was taught to predict the infection status of unknown samples, and its accuracy was tested for two viruses, CMV and HIV.
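The supervised step described above can be sketched in a few lines. The article does not describe the classifier inside MotifBoost, so this sketch substitutes a deliberately simple nearest-centroid rule: it averages the k-mer frequency profiles of each labelled group and assigns a new sample to the closest average. All names (`kmer_freqs`, `train`, `predict`), the labels, and the toy sequences are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def kmer_freqs(sequences, k=3):
    """Relative k-mer frequencies for one sample (a list of sequences)."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def centroid(profiles):
    """Average several frequency profiles, treating missing k-mers as 0."""
    sums = defaultdict(float)
    for profile in profiles:
        for kmer, freq in profile.items():
            sums[kmer] += freq
    return {kmer: s / len(profiles) for kmer, s in sums.items()}


def sq_distance(a, b):
    """Squared Euclidean distance between two sparse profiles."""
    return sum((a.get(m, 0.0) - b.get(m, 0.0)) ** 2 for m in set(a) | set(b))


def train(labelled_samples):
    """Build one centroid per label from (sequences, label) pairs."""
    by_label = defaultdict(list)
    for sequences, label in labelled_samples:
        by_label[label].append(kmer_freqs(sequences))
    return {label: centroid(ps) for label, ps in by_label.items()}


def predict(model, sequences):
    """Return the label whose centroid is closest to the sample's profile."""
    profile = kmer_freqs(sequences)
    return min(model, key=lambda label: sq_distance(model[label], profile))


# Hypothetical toy training data: each sample is (sequences, infection label).
training = [
    (["CASSLGQ", "CASSLGT"], "positive"),
    (["CASSLGA"], "positive"),
    (["CAWRPDY", "CAWRPDN"], "negative"),
    (["CAWRPDE"], "negative"),
]
model = train(training)

print(predict(model, ["CASSLGE"]))
print(predict(model, ["CAWRPDK"]))
```

A real repertoire classifier would need a far more capable model and proper held-out evaluation; the point here is only the workflow of labelled k-mer profiles going in and a predicted infection status coming out.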
“We found that existing machine learning methods can suffer from learning instability and reduced accuracy when the number of samples drops below a certain critical size. In contrast, MotifBoost performed just as well on the large dataset, and still provided a good result on the small dataset,” said senior author Tetsuya J. Kobayashi.
This study may open avenues for the development of novel methods to identify viral infections and to check for immunity to specified viruses from T-cells.
Source: Katayama et al. (2022). Comparative study of repertoire classification methods reveals data efficiency of k-mer feature extraction. Frontiers in Immunology, 3660.