AI and assisted reproduction: a paper by Dr. Chávez Badiola
A few days ago we received the news that an article by Dr. Alejandro Chávez Badiola was accepted for publication in the journal RBM Online. It is titled Artificial Intelligence in the embryology laboratory: A review, and we share it with you below.
Introduction
Since the late 1970s, when the first “test-tube baby” was born in England, the field of reproductive endocrinology and infertility has made many advancements. However, despite many attempts to create prediction models, we still struggle to accurately predict the outcome of an in vitro fertilization (IVF) cycle.
Initially, prediction models were based on well-known statistical models (Bancsi et al., 2004; Hunault et al., 2002; Jurisica et al., 1998; van Weert 2008). More recently, the emerging technologies of time-lapse incubators and preimplantation genetic testing (PGT) were introduced in the field as important achievements, with the potential to produce a more objective method of selecting embryos with the best implantation probability.
However, at present, there is insufficient evidence to recommend the routine use of these techniques for the sole purpose of improving single embryo transfer (ET) live birth rates (Khosravi et al., 2019; Tiitinen et al., 2019). Within the last decade, machine learning (ML), more specifically convolutional neural networks (CNNs), have been used to assist with medical imaging in a variety of fields, such as ophthalmology (Abràmoff et al., 2016), dermatology (Esteva et al., 2017), radiology (Hosny et al., 2018), and pathology (Khosravi et al., 2018).
This technology has also been applied in the embryology laboratory, aiming to improve the selection of the single embryo with the best implantation potential to achieve the ultimate goal of fertility treatment: the birth of a healthy baby (Khosravi et al., 2019).
Since artificial intelligence (AI) has found a place in IVF, its potential use in nearly every aspect of infertility patient care has been investigated, including for identifying empty or oocyte-containing follicles; predicting embryo cell stages, blastocyst (BL) formation from oocytes, and live birth from BLs; assessing sperm morphology and human BL quality; improving embryo selection; developing optimal IVF stimulation protocols; and quality control (Curchoe and Bormann 2019; Bormann et al., 2021a). The goal of this review is to summarize recent advancements using AI technology in the embryology laboratory.
AI Learning Algorithms
AI is a general concept comprising diverse mathematical approaches with the capacity to make predictions based on complex pattern recognition by incorporating the processing power of computers (Malik et al., 2021). The selected algorithm(s) and the weight distribution attributed to its parameters define an AI model (Burkov 2019).
The selection of an ML model is determined by the intended task (e.g., classification vs regression vs ranking), the dataset’s characteristics (e.g., size, labeled/unlabeled data, structured vs unstructured data), and the planned learning approach (e.g., supervised, unsupervised). Based on these variables, scientists can choose among several different approaches to build algorithms or blocks (pipelines) of algorithms with different learning capabilities (i.e., shallow or deep learning).
Examples of learning algorithms include artificial neural networks (ANNs), support vector machines (SVM), and decision trees, among others (Jordan and Mitchell 2015). Selecting ML models is difficult, which explains why sometimes several architectures can be tested at once (Burkov 2019). Chavez-Badiola et al., presented one such example as a proof of concept when five different algorithms were trained and tested on two datasets to assess their generalization capabilities to predict embryo implantation.
This study presents an example of how this approach could guide scientists during the selection of a model toward clinical implementation (Chavez-Badiola et al., 2020a). In this study, however, the limited size of the datasets could explain the poor performance of ANNs, making a real comparison against ANNs potentially inadequate.
Several other studies have tested multiple architectures (Morales et al., 2008; Miyagi et al., 2019), including the description by VerMilyea et al. of how different model architectures and hyperparameters (i.e., loss function and optimization methods) were considered before building their final architecture (VerMilyea et al., 2020). Overall, results from these studies illustrate how different algorithms, even when trained on identical datasets, result in different performances, underlining the essential importance of a well-designed mathematical and computational approach.
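As a minimal sketch of what testing several algorithms on identical data can look like, the following compares three deliberately simple classifiers using the same cross-validation folds. Everything here is an illustrative assumption: the synthetic two-feature dataset, the function names, and the choice of classifiers are not taken from any of the cited studies, which used clinical data and far richer models.

```python
import random
from statistics import mean

def make_toy_data(n=200, seed=0):
    """Synthetic two-feature, two-class data (illustrative only, not clinical)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.5 * label  # class 1 is shifted away from class 0
        data.append(([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)], label))
    return data

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def majority_fit(train):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner

def centroid_fit(train):
    """Nearest-centroid classifier."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def one_nn_fit(train):
    """1-nearest-neighbour classifier."""
    return lambda x: min(train, key=lambda row: sq_dist(x, row[0]))[1]

def cross_validate(fit, data, k=5):
    """Mean accuracy over k folds; every algorithm sees the same folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train)
        scores.append(mean(1.0 if model(x) == y else 0.0 for x, y in folds[i]))
    return mean(scores)

data = make_toy_data()
algorithms = [("majority", majority_fit),
              ("nearest_centroid", centroid_fit),
              ("1-NN", one_nn_fit)]
results = {name: cross_validate(fit, data) for name, fit in algorithms}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Because the folds are shared, differences in the printed scores reflect the algorithms rather than the data split, which is the point the studies above make at a much larger scale.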
AI Algorithm Training and Validation
As problem complexity scales, most learning algorithms begin to show their inherent limits. One outstanding exception is ANNs. ANNs are designed to solve challenging classification problems and process large amounts of complex (non-linear) features simultaneously (Lancashire et al., 2009), which in turn tend to benefit from large training datasets. Disadvantages of ANNs include their tendency to overfit and the “black box” nature of their hidden layers (Tu 1996).
ANNs are a family of algorithms that includes CNNs, which stand out for image analysis due to their ability to perform numerical matrix analysis, in contrast with non-CNNs, which allow other information as input (e.g., age). As expected, CNNs have become a common recourse for embryo analysis based on static images and time-lapse videos, as confirmed by the recent number of publications describing their implementation as either a stand-alone solution (Chen et al., 2019; Bormann et al., 2020a) or part of a pipeline of algorithms allowing for efficient image analysis (Kragh et al., 2019; Chavez-Badiola et al., 2020b).
The next step after selecting a learning algorithm is its training. This involves adjusting the model to minimize the error of the output using the values of the data provided as a ground truth (i.e., training), and a second step where the trained model is exposed to “unseen” data to assess its performance (i.e., validation). The relevance of a high-quality dataset cannot be overstated, since problems related to training on suboptimal datasets are numerous. One example is training on an unbalanced dataset, which can lead to unreliable results (Chawla et al., 2004) and may have been the case in a study by Tran et al.
In this study, the high proportion of embryos with negative outcomes outweighed those with positive outcomes, resulting in a deeply unbalanced dataset, perhaps not representative of the problem, which in turn led to an almost unrealistic performance (area under the ROC curve of 0.93) (Tran et al., 2019; Kan-Tor et al., 2020a).
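One general way in which class imbalance can mislead is easiest to see with plain accuracy. The sketch below uses invented labels (90 negative and 10 positive outcomes) and a "model" that always predicts the negative class. It is not a reanalysis of Tran et al., whose reported metric was the area under the ROC curve; it only illustrates why a single headline figure on an unbalanced dataset deserves scrutiny.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical, heavily unbalanced outcomes: 90 negative, 10 positive.
y_true = [0] * 90 + [1] * 10
# A trivial model that predicts "negative" for every embryo.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The trivial model looks strong on accuracy while never identifying a single positive case, which balanced accuracy exposes.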
The size of a dataset is also relevant. However, encountering high-quality and large datasets is uncommon in the field of reproductive medicine due to a lack of standardization in data collection and storage, the routine use of manual annotations, and the challenges related to data sharing (Hickman et al., 2020; Curchoe 2021). There are, however, strategies to optimize a dataset’s size.
Examples include the recourse to data augmentation made by VerMilyea et al., where images in their training set were subjected to manipulations (e.g., rotations, reflections, jitter) (VerMilyea et al., 2020; Kanakasabapathy et al., 2021), allowing the training examples to multiply without a real increase in the size of the dataset. In this context, the use of synthetic data seems a promising tool to generate large, diverse, representative, and balanced datasets without the constraints of accessing analog clinical data.
Still, understanding its inherent challenges will become paramount to making best use of this attractive approach (Chen et al., 2021). Another proposed solution to approach a limited-sized dataset is a top-down feature extraction (Chavez-Badiola et al., 2020a), which relies on the use of customized feature extractors designed with knowledge of the problem, as opposed to CNNs, which require a lot of data to determine the feature extractors to use (bottom-up approach).
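The rotation and reflection manipulations mentioned above as data augmentation can be sketched on a toy image represented as a nested list. This is an assumption-laden illustration, not the pipeline of any cited study: real implementations operate on pixel arrays and typically add jitter and other transforms, and the helper names here are hypothetical.

```python
def rotate90(img):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in img]

def augment(img):
    """Return the 8 rotation/reflection variants of a square image."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants

# A tiny stand-in "image"; each number plays the role of a pixel value.
image = [[1, 2],
         [3, 4]]

for variant in augment(image):
    print(variant)
```

One labeled image yields eight training examples that share the same label, which is how augmentation multiplies examples without new data collection.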
In brief, most training data sets used in AI protocols are labeled data, i.e., supervised learning. Labeling is performed by humans and thus is very subjective. In addition, if clinical outcome data are used, humans are selecting the embryos for transfer. The requirement for heterogeneous diverse training data, including an ethnically and racially diverse population of patients, is essential. A balanced set of data is also important to eliminate bias in AI learning (Swain et al., 2020). Unsupervised learning is an attractive alternative that needs to be explored.
Clinical Training and Validation
Validation as a part of the training process should be separated from the validation of a system in a clinical setting (Curchoe et al., 2020). AI should be built to become robust enough to perform beyond its training dataset. But as described recently by Meseguer and colleagues, when a system is deployed in real life, specific conditions from new datasets, including the wide range of characteristics that surround clinical and laboratory procedures, may lead to an AI system’s suboptimal (Meseguer and Valera 2021) and sometimes even erratic performance, a common ML problem known as underspecification (D’Amour et al., 2020).
Most current AI models for embryo selection rely on expert human supervision (supervised learning). One notable exception is the study by Kanakasabapathy et al., where the authors present adaptive adversarial neural networks (AANNs), which use a form of unsupervised learning called adversarial learning. In this study, AANN performance was tested when using different microscopes on a variety of samples including human embryos, sperm and blood cells. The authors compared a supervised learning model against their AANN and showed how the latter managed to maintain performance despite profound variations in image quality, suggesting AANNs could overcome training bias and task-irrelevant feature information incorporated into the model.
By training neural networks to focus on relevant features alone, AIs might show better performance when deployed through different laboratory settings (Kanakasabapathy et al., 2021). Since this study only discriminates between blastocyst and non-blastocyst, its clinical application during the embryo selection process is still to be tested. It does, however, serve as an example of how AI training could be designed to become self-supervised.
Learning algorithms are attractive because they are expected to continuously improve performance as the available dataset grows. However, the brute force of a large dataset alone does not guarantee improved performance; if further training is not carefully undertaken, it risks performance degradation (Lavin et al., 2021) and the threat of data poisoning (Schwarzschild et al., 2020), whether intentional or not. Understanding the risks associated with further training is key to assessing a model’s robustness (e.g., internal validation, external validation). Moreover, continuously evaluating its performance after tuning according to individual practices through a standardized quality assurance process is paramount, or at least highly desirable when considering the clinical readiness of an AI system (Curchoe et al., 2020; Mahadevaiah et al., 2020).
AI Application in Assisted Reproductive Medicine
Both invasive and noninvasive methods are used to select competent, healthy gametes for combination during assisted reproductive technology (ART) procedures. Every stage of ART treatment (fertilization, embryo development, implantation, healthy clinical pregnancy) depends on high-quality, mature, genetically normal sperm and oocytes. Morphology of oocytes (cumulus oocyte complex, polar body, and ooplasm defects) and motility characteristics of sperm (swim up, gradient centrifugation or laminar flow microchannels on chip, and PVP challenge) combined with morphology (vacuoles, head shape, and midpiece and tail defects) are routinely used to select gametes for insemination.
Unfortunately, developmentally incompetent oocytes may exhibit the same morphology as competent ones. In addition, even high-powered microscopy, such as intracytoplasmic morphologically selected sperm injection (IMSI), cannot detect DNA fragmentation in sperm.
AI Application on Sperm
In reproductive urology, early AI applications focused on semen parameters, but the technology has advanced to include the development of automated sperm detection and semen analyses. AI technology for semen analysis, sperm viability, and DNA integrity has even been bridged with external hardware devices and smartphone (mobile) applications (Dimitriadis et al., 2019a; Kanakasabapathy et al., 2017).
Goodsen et al., classified single sperm as progressive, intermediate, hyperactivated, slow, or weakly motile using SVMs with 89.9% accuracy (Goodsen et al., 2017). Mirsky and colleagues employed interferometric phase microscopy along with SVM to develop a model to assess sperm morphology and classify sperm into “good” or “bad” morphology with over 88% accuracy (Mirsky et al., 2017). Thirumalaraju and colleagues used smartphone microscopy in conjunction with deep transfer learning to develop an inexpensive system that can accurately measure sperm morphology based on WHO 5th edition Kruger strict criteria (Thirumalaraju et al., 2019a).
Ovarian Stimulation Management
Infertility is a multifactorial disease, which makes diagnosis and treatment complicated. Liao et al. have shown that an ML-derived algorithm is useful to assist clinicians in making an efficient and accurate initial judgment on the condition of patients with infertility. In their study, over 60,000 infertile couples’ medical records were evaluated using a grading system that classified patients into 5 grades ranging from A to E. The worst grade, E, represented a 0.90% pregnancy rate, while the pregnancy rate in the A grade was 53.8%. The cross-validation results showed that the stability of the system was 95.9% (Liao et al., 2020).
Letterie et al., evaluated a computer decision support system for day-to-day management of ovarian stimulation during IVF following key decisions made during an IVF cycle: [1] stop stimulation or continue stimulation. If the decision was to stop, then the next automated decision was to [2] trigger or cancel. If the decision was to continue stimulation, then the next key decisions were [3] the number of days to follow-up and [4] whether any dose adjustment was needed (Letterie and Mac Donald 2020).
The authors used data derived from an electronic medical records system of a female population undergoing IVF cycles and oocyte cryopreservation to include the patients’ demographics, past medical history, and infertility evaluation, including diagnosis, laboratory testing for ovarian reserve, and any radiologic studies pertinent to a diagnosis of infertility. The four key decisions during the process of ovarian stimulation and IVF were compared to expert decisions across 12 providers; they were found to have a sensitivity of 0.98 for trigger and 0.78 for cycle cancellation.
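For readers less familiar with the metric, sensitivity in a comparison like the one above is the fraction of expert "yes" decisions that the system also called "yes". The sketch below shows the calculation, together with its counterpart specificity, on invented counts. The numbers are hypothetical and are not the data from Letterie and Mac Donald.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual positives that were detected (recall)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of actual negatives that were correctly rejected."""
    return true_neg / (true_neg + false_pos)

# Hypothetical agreement counts between a decision support system
# and expert decisions for one binary choice (e.g., trigger vs. not).
true_pos, false_neg = 45, 5    # expert said "yes": system agreed / disagreed
true_neg, false_pos = 30, 10   # expert said "no":  system agreed / disagreed

print(sensitivity(true_pos, false_neg))  # 0.9
print(specificity(true_neg, false_pos))  # 0.75
```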
AI Application on Oocytes
Controlled ovarian stimulation (COS) yields oocytes at various stages of meiotic maturity. Identification of MII (extruded polar body), MI (no polar body), GV (germinal vesicle indicative of prophase I), giant MII oocytes, and other abnormalities is primarily performed by embryologists; however, nuclear and cytoplasmic maturity cannot be assessed.
Noninvasive AI methods to evaluate oocyte competency could become an important selection and prediction tool to reduce the number of embryos created and wasted (of paramount importance in countries that restrict supernumerary embryos), to reduce the number of embryos for trophectoderm (TE) biopsy and PGT, and to prognose the success of an IVF cycle. In the case of donor egg cycles, a tool to objectively assess oocyte quality and subsequent fertilization potential may be very valuable to intended parents for psycho-social reasons.
Additionally, experimental and research procedures like in vitro maturation (IVM) of oocytes, somatic cell nuclear transfer and reprogramming, in vitro gametogenesis (IVG), and more would benefit from prediction and selection AI systems.
In 2011, Setti and colleagues performed a meta-analysis to identify the relationship between oocyte morphology and ICSI outcomes (Setti et al., 2011). Their study demonstrated that the presence of a large first polar body and a large perivitelline space and the inclusion of refractile bodies or vacuoles are associated with decreased oocyte fertilization.
In 2013, Manna et al. performed texture analysis of 269 oocyte images and tracked the corresponding embryo development (Manna et al., 2013). Texture features were used with a neural network to predict the outcome of a given cycle, meaning that multiple transfers were present in the data used, for an AUC of 0.80. In 2021, Targosz and colleagues tested 71 deep neural network models for semantic oocyte segmentation (Targosz et al., 2021). They trained their algorithm to classify the following oocyte morphologic features: clear cytoplasm, diffuse cytoplasmic granularity, smooth endoplasmic reticulum cluster, dark cytoplasm, vacuoles, first polar body, multi-polar body, fragmented polar body, perivitelline space, zona pellucida, cumulus cells, and the germinal vesicle. In this study, the top accuracy (ACC) reached about 85% for training patterns and 79% for validation.
In 2020, Kanakasabapathy and colleagues trained a CNN to predict fertilization potential (2PN or non-2PN pronuclear formation) from oocyte images and to identify oocytes with the highest fertilization potential >86% of the time (Kanakasabapathy et al., 2020a). Results from this study allow for the development of novel quality assurance tools used to monitor oocyte stimulation regimens, assess ICSI performance, maintain optimal fertilization and embryo culture conditions, and evaluate oocyte vitrification and warming procedures. This oocyte quality algorithm was helpful in identifying an association between oocyte morphology and subsequent embryo development (Sacha et al., 2021).
Dickinson and colleagues used deep CNNs to locate the first extruded polar body which allowed them to distinguish mature, metaphase II oocytes from metaphase I and germinal vesicle stage oocytes (Dickinson et al., 2020). Pinpointing the location of the extruded polar body also allowed this algorithm to identify the correct location on the oocyte to inject sperm for ICSI. In their study, over 14,000 images of MII oocytes were used for training, validation, and testing. The deep learning CNN was able to correctly identify the location of the polar body and the corresponding location for sperm injection for a test set of 3,888 oocytes with 98.9% accuracy with a 95% confidence interval (CI) ranging between 98.5% and 99.2% (Dickinson et al., 2020).
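Accuracy figures reported with a 95% confidence interval, such as the one above, can be sanity-checked with a standard binomial interval. The sketch below uses the Wilson score interval. Two assumptions are made here and are not stated in the source: the number of correct predictions (3,845, i.e., 98.9% of 3,888 rounded to a whole count) and the choice of the Wilson method, since the interval method actually used by the authors is not given.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Assumed count: 98.9% of a 3,888-image test set, rounded (not from the paper).
correct, total = 3845, 3888
low, high = wilson_interval(correct, total)
print(f"accuracy = {correct / total:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Under these assumptions the interval is close to the reported range, which suggests the published figures are internally consistent.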
AI Application on Pronuclear Stage Embryos
Normal fertilization follows a definite course of events. Oocytes show circular waves (Payne et al., 1997) of granulation within the ooplasm after ICSI. During this granulation phase, the sperm head decondenses and the second polar body is extruded. This is followed by the formation of the male pronucleus. At about the same time, the female pronucleus forms and is drawn toward the male pronucleus until apposition is achieved. Both pronuclei then increase in size, and their nucleoli move around and arrange themselves near the common junction. Only zygotes with two distinct pronuclei are considered normal and appropriate for transfer. It is critical that embryologists assess fertilization status correctly, as there is only a small window of time in which pronuclei can be properly counted.
Fertilization checks and embryo quality assessments require manual examination, status recording, and embryo development scoring. These processes are labor intensive and subjective. In 2019, Dimitriadis and colleagues described the development of a CNN that can distinguish between 2PN and non-2PN zygotes at 18 hours post-insemination with >90% accuracy (Dimitriadis et al., 2019b). This system can be used as an embryologist aid to help confirm the fertilization assessment of each oocyte. It can also be used to monitor individual embryologists performing ICSI in a clinical setting for advanced quality assurance to improve patient outcomes (Thirumalaraju et al., 2019b; Bormann et al., 2021a).
Several studies have shown that morphological features specific to the pronuclear stage embryo can be used to assess embryo quality and developmental potential. These grading systems factor in the size, shape, and alignment of pronuclei. They also factor in the number and distribution of nucleoli and the overall appearance of the cytoplasm (Scott and Smith 1998; Scott et al., 2000; Tesarik and Greco 1999). These morphological grading systems have also been shown to help embryologists select embryos with high implantation potential (Lan et al., 2003; Zollner et al., 2003). Manually scoring zygotes is a labor-intensive and subjective activity. As such, few practices continue to assess this critical stage of development. However, with the use of AI, these predictive features may be readily incorporated into an embryo selection algorithm (ESA).
In 2021, Zhao and colleagues used CNNs for segmentation of pronuclear-stage embryos. They examined the morphokinetic patterns of the zygote cytoplasm, zona pellucida, and pronuclei. Their manually annotated test set had precision of >97% for the cytoplasm, 84% for the pronuclei, and approximately 80% for the zona pellucida.
The authors concluded that their CNN system has the potential to be incorporated in a clinical practice for pronuclear-stage segmentation as a powerful tool with high precision, reproducibility, and speed (Zhao et al., 2021). Early parameters of zygotic (cytoplasmic movement) development, analyzed by AI-powered methods, have been shown to be predictive of BL development. Compared to human evaluation and prediction using morphological parameters, AI-based methods using cytoplasmic kinetics showed on average 10% higher accuracy (Coticchio et al., 2021).
AI Application on Cleavage Stage Embryos
Embryo transfers are generally performed at the cleavage or BL stage of development. Cleavage-stage embryos are generally selected for transfer based on only three features: blastomere cell count, percentage of overall cytoplasmic fragmentation and degree of asymmetry between blastomeres (Prados et al., 2012). These grades are made by visual examination of the embryos and have been shown to be highly subjective in nature.
The introduction of time-lapse imaging (TLI) technology has allowed for both automated and manual assessments of embryo development at precise times and under controlled environments (Azzarello et al., 2012; Cruz et al., 2012; Hlinka et al., 2012; Lechniak et al., 2008; Lemmen et al., 2008). However, most of the TLI algorithms have only shown promising results in identifying embryos with low developmental potential. The incorporation of TLI systems into standard manual embryo assessments did not improve overall clinical outcomes, nor did it decrease the amount of time embryologists spent assessing embryo morphology (Chen et al., 2017; Conaghan et al., 2013; Kaser et al., 2016; Kirkegaard et al., 2015).
Dimitriadis and colleagues demonstrated a fast and simple cohort embryo selection (CES) method for selecting cleavage-stage embryos that will develop into high-quality BLs. This study demonstrated the ability of embryologists to quickly identify high-quality cleavage-stage embryos when all embryos in the cohort were simultaneously compared in a single image. This method of selection outperformed traditional methods of cleavage-stage embryo ranking based on both morphology and adjunctive morphokinetic TLI parameters (Dimitriadis et al., 2017). This method is excellent at identifying high-quality embryos from a cohort; however, this method of selection is subjective and lacks consistency between operators.
Computer vision technology has been proposed as a solution to overcome the labor constraints and subjective nature of assessing and selecting embryos based on morphology and morphokinetic measurements. Kanakasabapathy and colleagues used deep learning CNNs to train and validate embryo assessments on day-3 embryo images based on embryo developmental outcomes recorded on day 5 of culture. This algorithm was trained to make the following day-5 developmental predictions: embryo arrest, morula, early BL, full BL, and high-quality BL. Using a test set of 748 embryos, the accuracy of the algorithm in predicting BL development at 70 hpi was 71.9% (CI: 68.4% to 75.2%) (Bortoletto et al., 2019; Kanakasabapathy et al., 2020b).
To evaluate the potential improvement in predictive power, Kanakasabapathy and colleagues also compared the accuracy of predictions by embryologists in identifying embryos that will eventually develop into BLs when presented with embryo morphology imaged on days 2 and 3 of development. Additionally, their performance was evaluated with and without the use of the Eeva three-category TLI algorithm that uses P2 (duration of the 2-cell stage) and P3 (duration of the three-cell stage) to predict BL development (VerMilyea et al., 2014). The neural network significantly outperformed the embryologists in identifying embryos that will develop into BLs correctly (P < 0.0001) and the overall accuracy in prediction, regardless of the evaluated methodology (P < 0.0001). This was the first AI-based system for predicting the developmental fate of cleavage-stage embryos (Kanakasabapathy et al., 2020b).
Bormann and colleagues described an early warning system for using cleavage-stage embryos and statistical process controls for detecting clinically relevant shifts due to laboratory conditions (Bormann et al., 2021a). This study presented a novel key performance indicator (KPI) for monitoring embryo culture conditions at the cleavage stage of development. This AI-based KPI predicted the percentage of cleavage-stage embryos that would develop into high-quality BLs on day 5 of development. When compared with 5 established cleavage-stage KPIs, this AI-based KPI for predicting high-quality BL formation had the highest association with ongoing pregnancy rates (R² = 0.906). This is the first AI-based cleavage-stage KPI demonstrated to detect changes in a culture environment that resulted in a shift in pregnancy outcomes.
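An association expressed as R², as in the KPI comparison above, can be illustrated with the coefficient of determination for a simple linear relationship, which equals the squared Pearson correlation. The sketch below is only a reminder of the arithmetic: the source does not describe how the reported value was computed, and the paired values are invented placeholders, not study data.

```python
def r_squared(x, y):
    """Squared Pearson correlation (R^2 of a simple linear fit of y on x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical monthly values: a KPI (%) and the ongoing pregnancy rate (%).
kpi = [38.0, 42.0, 45.0, 51.0, 55.0, 60.0]
pregnancy_rate = [30.0, 33.0, 33.5, 39.0, 41.0, 46.0]

print(round(r_squared(kpi, pregnancy_rate), 3))
```

A value near 1 means the KPI tracks the outcome almost linearly over the observed periods; it does not by itself establish causation.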
Carrasco et al., used 800 cleavage-stage embryo images with decision tree methods and statistical analysis of features to determine the implantation potential of cleavage-stage embryos (Carrasco et al., 2017). Wang et al. extracted features from textures from 206 micrographs of early embryos (2 hours of development) (Wang et al., 2018). SVM was used (10-fold cross validation) to achieve 77.7% accuracy and an AUC of 0.78 to predict the early embryo development stage (initial and days 1, 2, 3, and 4).
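Several results in this review are reported as an AUC. As a minimal sketch, the area under the ROC curve can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores below are invented for illustration and do not come from any cited study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = positive) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a frame of reference for values such as 0.78 or 0.93.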
Using CNNs, Meyer and colleagues were able to classify day 3 cleavage-stage embryo images as aneuploid or euploid with a high specificity and thus were able to sufficiently identify 85.5% of aneuploid embryos (Meyer et al., 2020). These results demonstrate the ability of CNNs to identify noninvasive markers for detecting genetically abnormal embryos. Collectively, these studies show that a variety of AI techniques can be utilized to extract unique features from cleavage-stage embryos, which may be used for classification, assessment ranking, or to aid in clinic decision-making.
Kelly and colleagues used CNNs to identify safe regions on a cleavage-stage embryo to perform laser-assisted hatching. This study utilized more than 13,000 annotated images of cleavage-stage embryos to develop an algorithm that identified the largest perivitelline space region or atretic/fragmented blastomeres. These regions of the cleavage-stage embryos were considered the safest at which to perform laser-assisted hatching. The AI-trained network was tested on almost 4,000 cleavage-stage images and had 99.4% accuracy with a 95% CI ranging between 99.1% and 99.6% (Kelly et al., 2020).
Embryo witnessing is a critical step in the embryo transfer process. Traditionally, embryo identification is performed by two embryologists to ensure the correct embryo has been selected for transfer. However, as gametes and embryos are moved from one dish to another during an ART cycle, the possibility of misidentification still exists. Bormann and colleagues used CNNs to classify images of embryos captured on Day 3 of development at 60 and 64 hours post insemination. The algorithm processed embryo images for each patient and produced a unique key that was associated with the patient ID at the initial evaluation.
At the later time, images were captured and CNNs were used to match the embryo morphology with the initial image. The accuracy of the CNN in correctly matching embryos at the different time periods on Day 3 was 100% (CI: 99.1% to 100%, n = 412) (Bormann et al., 2021b). This technology offers a robust witnessing step based on unique morphological features that are specific to each individual embryo.
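A result of 100% on a finite sample still carries uncertainty, which is why the interval above has a lower bound below 100%. As a sketch, when every one of n trials succeeds, the exact (Clopper-Pearson) two-sided lower bound reduces to (alpha/2)^(1/n). Whether the authors used this particular method is an assumption; the code only shows that the arithmetic is consistent with the reported figure for n = 412.

```python
def lower_bound_all_successes(n, alpha=0.05):
    """Exact two-sided lower confidence bound when all n trials succeed.

    Solves p**n = alpha / 2 for p (the Clopper-Pearson bound at x = n).
    """
    return (alpha / 2) ** (1 / n)

n_embryos = 412  # sample size reported in the text
bound = lower_bound_all_successes(n_embryos)
print(f"95% CI lower bound: {bound:.3f}")
```

The bound tightens as n grows, so the same 100% observed on a much smaller sample would support a far weaker claim.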
AI Application on Blastocyst Stage Embryos
A key question about BL assessment needs to be answered: When do we evaluate BLs? Since BL development is a dynamic process, do we evaluate and grade BLs when they are exhibiting the “best” appearance? Or should we evaluate them at a particular time? This question has yet to be answered by existing AI applications, which have utilized both fixed and flexible time-based methods of evaluation.
Another issue with BL assessment involves grading. For instance, the problem with using Gardner-type BL grading to assess embryo quality is that it is subjective and does not include quantitative parameters. It is a visual estimate of the number, size, and morphology of the inner cell mass (ICM) and TE cells. On the other hand, BL expansion can be easier to standardize if we use measurement tools and volume ratios. The quality of the ICM is estimated by the number and compaction of the cells. However, the minimum number of ICM cells necessary to develop into a viable human fetus is unknown. In addition, the ICM is a cocktail of pluripotent (epiblast) and primitive endoderm (hypoblast) cells. The size of the ICM alone does not indicate the composition of the cells within.
Assessing TE cells is more challenging, as the cell number, shape, nuclear content, and position in the expanding BL are not standardized. AI methods that use segmentation of the BL will enable us to objectively score TE complement. It is easier to judge the compaction of the ICM than it is to assess TE quality.
The bigger question is, do we need to assess BLs at a particular time point? We know that day-5 and day-6 BLs have different outcomes, even when using fresh or frozen ET cycles (Irani et al., 2018). This is especially important to consider when developing AI algorithms that use a single 2D BL image. We must consider the speed and timing of developmental events, particularly compaction and blastulation.
For successful implantation, both BL cell types (ICM and TE) are required. Since current BL grading systems are very simple, it is no surprise that they are not very informative when used to predict implantation. More complex and detailed BL grading systems correlate very well with implantation potential and ploidy assessment. In their recent paper, Zhan et al. converted alphanumeric BL grades into a numeric score for use in statistical analysis and correlations (Zhan et al., 2020a). By using AI, we might be able to strengthen the correlation between BL assessment and outcome in a more objective manner. Also, the predictive ability of AI BL applications based on early versus later developmental events needs to be explored.
Time-Lapse Microscopy (TLM) Image Analysis
AI algorithms can be applied to "raw" TLM images. In a recently described image analysis system (Tran et al., 2019), supervised AI training using previously labeled images was developed. The labels used included BL and morphokinetic annotations with positive or negative implantation results. One of the drawbacks of the system was its reliance on humans to create the labels, which introduces biased observations and scores. The other problematic practice was the use of non-viable, non-fertilized, or discarded material for negative training groups to increase the training data set. The rationale behind this was to establish a completely automatic system that would also be able to recognize these negative embryos. The question remains: will the developed algorithms perform equally well after removing the discarded group? And are they superior to the BL grading system (Kan-Tor et al., 2020b)?
In another recent study, a different approach was used to predict BL development. It used TLM data up to day 3 of embryo development. Two different AI algorithms were developed: an automatic morphokinetic data model (temporal) and a TLM embryo image model (spatial). Both models have comparable predictive power (~0.7). When the two models were combined, different weights were used to optimize BL prediction. Interestingly, more weight was given to the morphokinetic data than to the images. When compared to embryologists, the AI model performed better in terms of sensitivity and specificity (Liao et al., 2021). In another TLM study, BL prediction was accomplished by using morphokinetic TLM data from the first three days of development. Interestingly, by applying a self-improvement (reinforcement) strategy, the predictive power of the AI system improved (d’Estaing et al., 2021).
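One simple way to combine a temporal and a spatial model is a weighted average of their predicted probabilities. The sketch below assumes each model outputs a probability of BL formation between 0 and 1; the function name and the default weight of 0.6 are hypothetical and are not taken from Liao et al.

```python
def combine_predictions(p_temporal: float, p_spatial: float, w_temporal: float = 0.6) -> float:
    """Weighted average of two model probabilities.

    w_temporal is the weight on the morphokinetic (temporal) model; the
    image (spatial) model receives 1 - w_temporal. The default is an
    assumed value chosen only to show the temporal model weighted more.
    """
    if not 0.0 <= w_temporal <= 1.0:
        raise ValueError("w_temporal must be between 0 and 1")
    for p in (p_temporal, p_spatial):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return w_temporal * p_temporal + (1.0 - w_temporal) * p_spatial
```

In practice the weight would be tuned on a validation set, for example by choosing the value that maximizes a chosen performance metric, rather than fixed in advance.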
One unique approach to assessing BL quality is to evaluate a quantitative standard expansion assay (qSEA) utilizing AI. This measures the kinetics of BL expansion and correlates it with outcome, where faster-expanding BLs exhibit higher implantation potential (Huang et al., 2021). The following novel embryo parameters have been proposed by Bori et al. for inclusion in AI selection models: pronuclear kinetics, BL measurements, the size of the ICM, and the cell cycle length of the TE cells. Because the proposed model was developed on donor oocytes, the authors’ algorithm will need to be evaluated on the IVF patient population to verify its general applicability.
The same group presented a novel model utilizing AI to predict embryo implantation. Utilizing AI image analysis combined with the embryo proteomic profile of PGT euploid embryo spent culture media, the authors were able to demonstrate very high implantation prediction. Although the study is preliminary, it demonstrates the power of AI to combine different data points (proteins and morphology) (Bori et al., 2020).
Static Image Analysis of Blastocysts
The objective of a study by Khosravi et al. was to establish an AI deep learning model that can evaluate BL quality (Khosravi et al., 2019). In this AI-based prediction model, BL expansion was an important parameter, followed by ICM and TE quality. The precise time point used for the AI evaluation (110 hr) demonstrated the importance of embryo developmental kinetics for embryo prediction.
In a 2020 study by Bormann et al., a single image from the TLM image pool at 113 hr was used for analysis (Bormann et al., 2020a). A CNN system was used to classify BLs based on the presence of the cavity and the morphological quality of the ICM and TE. Similar to Khosravi et al., Bormann’s group demonstrated that the accuracy of this system for classifying BLs versus non-BLs was very high (91%). By using a genetic algorithm, the authors established a BL ranking system called the "BL score." The evaluation of the AI BL selection method, using implantation outcomes of the BLs selected by humans for transfer, showed over 50% positive outcomes. It will be necessary to perform a comparative prospective study to identify the (dis)agreement in BL selection for transfer between AI models and embryologists. The emerging question is how different the BL selection for ET is between embryologists using the Gardner BL grading system and AI model selection.
Bormann and colleagues demonstrated that the high degree of variability seen among embryologists making decisions on vitrification and embryo biopsy based on standard morphological assessments can be dramatically improved using deep neural networks (Bormann et al., 2020b). Souter et al. further demonstrated that deep-learning CNNs can be used to accurately identify which Day 3 assisted-hatched embryos met Day 5 criteria for trophectoderm biopsy and cryopreservation with 93.7% sensitivity and 96.3% specificity. This validation study was the first of its kind to demonstrate that an embryo decision-making algorithm could be successfully applied to embryos that had been artificially breached to promote premature herniation of trophectoderm cells for blastocyst stage biopsy (Souter et al., 2019).
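For readers less familiar with the two metrics quoted above, sensitivity is the fraction of truly positive cases the classifier flags, and specificity is the fraction of truly negative cases it correctly rejects. The sketch below computes both from paired labels; the function name and the toy labels in the usage note are illustrative and are not data from any of the cited studies.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded as 1 (positive) or 0 (negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)
```

With made-up labels `y_true = [1, 1, 1, 0, 0]` and `y_pred = [1, 1, 0, 0, 1]`, two of three positives are detected and one of two negatives is correctly rejected, giving a sensitivity of 2/3 and a specificity of 1/2.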
How many times will the AI choose a different BL for ET than the embryologist within the cohort of available embryos? There is a lot of disagreement among embryologists grading BLs, but how many times is the best BL chosen for ET? There are no standards for choosing an AI system for embryo evaluation. The choice depends on the type of data, the size of the data set, and the output queries (Fernandez et al., 2020). It will be helpful to compare multiple AI models on the same data set.
Other AI models do not use a specific time point for image analysis. In the model by VerMilyea et al., the "viability" of the embryos was categorized based on the embryologist-given Gardner BL grade, where a "3BB" BL was a cut-off for viable and non-viable classes using fetal heart measurements (VerMilyea et al., 2020). Using computer vision image processing and deep learning, the authors achieved an overall accuracy of over 60% and an average accuracy improvement of 24% over embryologist grading.
Numerous complex neural network architectures have been proposed for image recognition, and the performance of these architectures is highly dependent on the requested task. Thirumalaraju and colleagues compared the use of 8 different architectures to classify blastocyst stage embryo images captured on a variety of imaging platforms. This study showed that Xception performed best in learning categorical embryo data and was able to accurately classify blastocysts based on their morphological quality. In this study, Xception correctly classified >99.5% of the highest quality blastocysts, which is of critical clinical importance when identifying embryos suited for transfer (Thirumalaraju et al., 2021).
Automated Annotation of Blastocysts
One of the potentially confounding factors that can affect AI protocols is the fact that the morphokinetic annotations are done by humans and are subjective. It will be necessary to develop AI models that can recognize karyokinetic (nuclear) and cytokinetic abnormalities (direct divisions 1–3, cell fusion) for optimal automatic annotation.
Most ML methods for embryo assessment and selection have used "computer vision methods" utilizing visual data (TLM or microscopic images). CNN is a method of choice to process visual information. It can be used for automatic cell annotation (Malmsten et al., 2020), cell detection and tracking (Leahy et al., 2020), blastocyst stage identification and witnessing (Kanakasabapathy et al., 2020c), embryo grading and selection, and BL and implantation prediction (Louis et al., 2021).
Furthermore, Dimitriadis and colleagues used an AI implantation prediction model as a novel and unbiased morphology-based evaluation tool to assess the competencies of embryologists selecting embryos and performing vitrification and warming, and of embryologists and physicians performing embryo transfers (Dimitriadis et al., 2021). It is important to note that these studies were done on retrospective data under experimental settings. The clinical application of AI still requires prospective studies.
Implantation Prediction
In a recent study, Fitz and colleagues sought to determine whether embryologists could improve their ability to select euploid embryos with the highest implantation potential with the aid of an AI-trained implantation algorithm. In this two-part study, embryologists from 5 separate laboratories were asked to select the top embryo for transfer from an image set of 2 embryos (n=200 image sets). Next, they were provided with the same image set and a notation of which embryo was predicted to implant using AI. Embryologists were told that the AI-implantation algorithm had a 75% accuracy, which could be incorporated into their embryo selection decision. All 14 embryologists participating in this study improved their ability to select the top-quality embryo when incorporating AI, with a mean percent improvement of 11.1% (range 1.4% to 15.5%) (Fitz et al., 2021). One limitation of this study is its retrospective nature.
In studies using AI to predict embryo implantation potential on static or TLM images, secondary factors such as laboratory conditions or other human factors have not been analyzed or included in the models. Culture conditions and human expertise are important factors that influence embryo development and quality. For achieving a useful and objective prediction, these factors will need to be included in models. In addition, we know that successful implantation and live birth depend on other factors not inherent to the embryo. Predicting implantation solely on embryo quality is an incomplete assessment. The focus of AI embryo prediction models should be the ranking of the embryos within the patient cohort rather than on implantation prediction. The variation in success rates among IVF centers and labs prevents the establishment of universal AI models for implantation prediction (Zaninovic and Rosenwaks, 2020).
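The difference between ranking within a cohort and predicting implantation can be made concrete with a short sketch. Here each embryo carries a model score, and only the order within a patient's own cohort is used; the absolute value is not interpreted as an implantation probability. The input layout of (patient, embryo, score) tuples, the identifiers, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_within_cohorts(scored: Iterable[Tuple[str, str, float]]) -> Dict[str, List[str]]:
    """Group (patient_id, embryo_id, score) records by patient and order each
    cohort from highest to lowest score. Ties keep their input order."""
    cohorts = defaultdict(list)
    for patient_id, embryo_id, score in scored:
        cohorts[patient_id].append((score, embryo_id))
    return {
        patient_id: [
            embryo_id
            for _, embryo_id in sorted(members, key=lambda m: m[0], reverse=True)
        ]
        for patient_id, members in cohorts.items()
    }


# Hypothetical scores, not data from any cited study.
example = [
    ("P1", "e1", 0.42),
    ("P1", "e2", 0.61),
    ("P2", "e3", 0.30),
    ("P1", "e3", 0.15),
]
ranking = rank_within_cohorts(example)
```

In this example the top-ranked embryo for patient P2 has a lower score than two of P1's embryos, which is acceptable under a ranking objective because scores are only compared among embryos from the same patient.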
How do we use AI-based models in the clinical lab setting and within lab workflows in a prospective way? First, we need to evaluate AI models in parallel with standard lab embryo selection practice. Second, we need to perform prospective studies of embryo selection by machine and human.
AI for Non-Invasive Ploidy Screening
PGT for aneuploidy (PGT-a) remains the most objective way to assess an embryo. However, its invasive nature, cost, and the assumption of diagnostic accuracy limit a more widespread use. It is no surprise that noninvasive approaches for embryo selection including time-lapse morphokinetic evaluation (Campbell et al., 2013), morphology assessment (Capalbo et al., 2014; Zhan et al., 2020b), and AI systems (Pennetta et al., 2018; Meyer et al., 2020) have aimed to compare PGT-a outcomes against their findings. However, it is still difficult to find studies presenting AI systems for embryo ranking that are trained against ploidy status as their ground truth.
The first published study of this kind was most likely by Chavez-Badiola et al., in which the authors trained and tested an AI model called ERICA to rank embryos based on its ability to predict euploidy, using a single static BL image as the only source of information. Following training and validation on 1,231 images from 3 IVF centers, the ERICA device showed significantly better predicting capabilities (70% overall accuracy for euploidy prediction) than chance and the embryologists involved in the study. It is important to acknowledge that, despite the embryologists’ seniority and experience, conclusions on the device’s superiority cannot be drawn based on a comparison against the performance of only two embryologists. As the authors acknowledge, a larger testing set, as well as a larger number of embryologists with different levels of experience and seniority, would be required to confirm the study’s results. At this point, however, the results are encouraging enough to suggest that ERICA has the potential to assist embryologists and clinicians during embryo selection in a noninvasive fashion (Chavez-Badiola et al., 2020b).
We can anticipate that other similar full-paper publications will follow shortly, presenting new approaches aimed at embryo selection based on ploidy. These studies will perhaps target time-lapse sequences (Barnes et al., 2020) and incorporate omics (Bori et al., 2021), patient and cycle characteristics (Jiang et al., 2021), noninvasive chromosome screening tests (Chavez-Badiola et al., 2020c), as well as new AI approaches. Building high-quality datasets from diverse settings, while managing hype (VerMilyea et al., 2019) and expectations, are challenges that will remain.
Conclusion
AI has long been utilized in other industries and has recently found a place in medical imaging; however, it is just beginning to make an impact on the clinical practice of reproductive medicine, a field familiar with rapid advancements and open to using new technologies to achieve the ultimate goal of a healthy baby.
Since there are over 2 million IVF cycles performed annually throughout the world, and with IVF being a globally registered medical procedure, one can only hope that the data collected over the years will help to develop AI systems that are widely applicable across clinics and independent of differences in protocols and populations. Barriers to achieving this include health record privacy terms, paper records, and variations in electronic medical record systems.
AI systems developed thus far for the field of reproductive medicine have focused primarily on the use of embryo imaging and have been summarized here. However, AI has the potential to assist in other areas of reproductive medicine as well, including endometrial receptivity, uterine function, fertility impact of diseases such as endometriosis and adenomyosis, recurrent implantation failure, and recurrent pregnancy loss (Curchoe 2021).
In summary, AI is a promising tool that has the potential to resolve many longstanding challenges in the field of reproductive medicine, to assist clinicians in decision-making, and to help achieve the ultimate goal of a healthy live-born baby. However, at present, AI has not established its role in the world of reproductive medicine, and it is important to keep in mind that its use in improving outcomes is not, as of yet, proven in the literature. Further studies, ideally randomized controlled trials, are required to identify the indicated uses of this very promising tool.
You can read Dr. Alejandro Chávez Badiola's article in the following PDF.