Baekseok Arts University Library

Learning to Design Protein and DNA Libraries [electronic resource]

Material type: Dissertation file (foreign)

Last processed: 20240214101125

ISBN: 9798380380782

DDC: 004

Author: Busia, Akosua.

Title/Author: Learning to Design Protein and DNA Libraries [electronic resource]

Publication: [S.l.] : University of California, Berkeley, 2023

Publication: Ann Arbor : ProQuest Dissertations & Theses, 2023

Physical description: 1 online resource (106 p.)

Note: Source: Dissertations Abstracts International, Volume: 85-03, Section: B.

Note: Advisor: Listgarten, Jennifer; Jordan, Michael.

Dissertation note: Thesis (Ph.D.)--University of California, Berkeley, 2023.

Use restriction note: This item must not be sold to any third party vendors.

Abstract: Using next-generation sequencing, it is now possible to screen up to billions of protein or DNA sequences in parallel for a property of interest. Consequently, high-throughput sequencing has vastly accelerated the rate of biological discovery for both basic scientific inquiry and for engineering novel enzymes, therapeutics, antibodies, regulatory elements, and beyond. In such high-throughput sequencing-based screens and selections, the quality of the starting sequence library greatly influences the overall chance of successfully identifying sequences with the desired property. Generalizable in silico methods for designing high-quality sequence libraries promise to reduce wet lab experimental burden and improve the speed with which new, functional sequences can be discovered. Machine learning, in particular, provides a useful set of tools for implementing such methods, as it is well-suited to analyzing the large quantities of data produced by high-throughput sequencing. In this dissertation, we will discuss several aspects of machine learning-guided library design, and propose solutions to challenges posed by existing technologies.

First, we introduce a framework for machine learning-guided library design, and showcase its ability to design diverse, functional libraries in a gene therapy context. Specifically, we (i) outline a modeling approach for predicting the property selected for in a high-throughput sequencing-based selection experiment that explicitly accounts for uncertainty in the observed sequencing data, and (ii) describe a novel machine learning-guided design procedure that optimally trades off between a library's average predicted property values and its sequence diversity. We use these methods to design a clinically-relevant adeno-associated virus (AAV) peptide insertion library. AAVs hold tremendous promise as delivery vectors for clinical gene therapy, and packaging is a general prerequisite for delivering genetic material to a target tissue. Standard diversified libraries for engineering effective AAV delivery vectors contain a high proportion of variants that are unable to assemble or package their genomes, which often limits the effectiveness of downstream selections for desired properties such as efficient infection of human tissues. Using our machine learning-guided design framework, we systematically design effective starting libraries that are as diverse as possible whilst being biased towards variants that are able to assemble and package the viral genome efficiently. Specifically, we design a library of peptide insertions into the AAV capsid that achieves five-fold higher packaging fitness than the standard insertion library (known as the "NNK" library) with negligible reduction in diversity. We further demonstrate the general utility of our designed library on a downstream task to which our design approach was agnostic: infection of primary human brain tissue. Compared to the standard NNK library, our machine learning-designed library contains approximately 10-fold more variants that successfully infect the human brain.

Next, we highlight a key shortcoming of the above predictive modeling approach, namely its extremely limited ability to share information across related but non-identical reads, that prevents it from making effective use of sequencing data in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. MBE is based on a new perspective of differential sequencing analysis that uses sound theoretical principles from the density ratio estimation field in machine learning, is easy to implement, and can trivially make use of advances in modern-day machine learning classification architectures or related innovations. We evaluate MBE empirically, both in simulation and on real experimental data, and show that it improves accuracy compared to current ways of performing sequencing-based differential analyses, including the previous section's predictive modeling approach. The greater flexibility of our new approach enables effective analysis across a broader range of common experimental setups than can currently be achieved, thereby expanding the set of biological applications for which one can learn accurate predictive models to guide library design.

Finally, we highlight some remaining challenges for machine learning-guided library design, including research opportunities into combining multiple sources of biological information in the design process. In summary, this dissertation presents a number of machine learning techniques that can be brought to bear on the problem of designing improved starting libraries for biological screens and selection experiments. The insights from this work provide further motivation for researchers to combine laboratory experiments with tools from machine learning to efficiently engineer novel functional protein and DNA sequences.
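
Illustrative note: the abstract ties MBE to density ratio estimation through classification. The toy sketch below shows that general idea only and is not the dissertation's implementation. A classifier is trained to separate reads from a pre-selection library and a post-selection library; with equal numbers of reads per class, the logit of the Bayes-optimal classifier equals the log density ratio (the log enrichment) of a sequence. The per-site probabilities, the logistic-regression model, and all function names are assumptions chosen for illustration.

```python
import itertools
import math
import random

ALPHABET = "ACGT"
# Hypothetical per-site letter probabilities (illustration only).
PRE_PROBS = [[0.25, 0.25, 0.25, 0.25]] * 4   # library before selection
POST_PROBS = [[0.55, 0.25, 0.15, 0.05],      # library after selection
              [0.05, 0.55, 0.25, 0.15],
              [0.15, 0.05, 0.55, 0.25],
              [0.25, 0.15, 0.05, 0.55]]


def sample(rng, site_probs, n):
    """Draw n sequences with independent per-site letter probabilities."""
    return ["".join(rng.choices(ALPHABET, weights=w)[0] for w in site_probs)
            for _ in range(n)]


def features(seq):
    """Indices of the active one-hot features (one per position)."""
    return [pos * len(ALPHABET) + ALPHABET.index(ch)
            for pos, ch in enumerate(seq)]


def fit_logistic(xs, ys, dim, epochs=200, lr=1.0):
    """Logistic regression on one-hot features, full-batch gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for idx, y in zip(xs, ys):
            z = b + sum(w[j] for j in idx)
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j in idx:
                grad_w[j] += err
        b -= lr * grad_b / len(xs)
        w = [wj - lr * gj / len(xs) for wj, gj in zip(w, grad_w)]
    return w, b


def true_log_ratio(seq):
    """Exact log(p_post(seq) / p_pre(seq)) for the simulated libraries."""
    return sum(math.log(POST_PROBS[p][ALPHABET.index(c)]
                        / PRE_PROBS[p][ALPHABET.index(c)])
               for p, c in enumerate(seq))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
pre = sample(rng, PRE_PROBS, 1500)
post = sample(rng, POST_PROBS, 1500)
xs = [features(s) for s in pre + post]
ys = [0.0] * len(pre) + [1.0] * len(post)   # equal class sizes: logit = log ratio
w, b = fit_logistic(xs, ys, dim=4 * len(ALPHABET))

# Compare the classifier logit with the true log enrichment on every 4-mer.
all_seqs = ["".join(t) for t in itertools.product(ALPHABET, repeat=4)]
est = [b + sum(w[j] for j in features(s)) for s in all_seqs]
truth = [true_log_ratio(s) for s in all_seqs]
corr = pearson(est, truth)
print(f"correlation between logit and true log enrichment: {corr:.3f}")
```

Because the classifier scores sequences through shared features rather than per-sequence counts, it can assign an enrichment estimate to sequences that were never observed, which is the kind of information sharing across related reads that the abstract describes. With unequal read counts per library, the logit would additionally need to be offset by the log ratio of the class sizes.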