[dipl] Perustieteiden korkeakoulu (School of Science) / SCI
Permanent URI for this collection: https://aaltodoc.aalto.fi/handle/123456789/21
Recent Submissions
Now showing 1 - 20 of 5424
- Statistical modelling of genetic background of haemoglobin deferral in blood donation
School of Science | Master's thesis (2025-02-24), Karttunen, Krista
Blood donation and blood supply are an essential part of modern health care and of national preparedness and readiness. In Finland, the Finnish Red Cross Blood Service, a non-profit organization, is centrally responsible for blood donation and the production of blood products. The health of the donor and of the patient receiving the blood products is one of the most important principles. Each blood donation results in the donor losing about 250 mg of iron, which can lead to iron deficiency. To maintain the donor's health and the quality of blood products, a health questionnaire and a haemoglobin measurement are conducted before each donation. If the donor's haemoglobin is lower than the threshold values, the donor is not allowed to donate. A low haemoglobin level results in a 90-day deferral, or in some cases the deferral remains until the cause of the low haemoglobin is determined. Even a short-term deferral is known to affect the donor's motivation to donate in the future. If individuals known to be prone to low haemoglobin donated less frequently than the current recommendations allow, the number of haemoglobin deferrals could be decreased. Factors affecting haemoglobin deferral have been studied, but there is still little information on the impact of single nucleotide polymorphisms on haemoglobin deferrals. The aim of this thesis was to model the effect of single nucleotide variants associated with iron deficiency anaemia and iron metabolism disorders on haemoglobin deferral, alongside other commonly used variables, and to statistically compare the performance of these models. This study used data from the Finnish Red Cross Blood Service Biobank, including genotyped donors and their donation history. More complex models did not predict haemoglobin (Hb) deferral better than simpler models. The SNP 17:58358769 variant was positively associated with haemoglobin deferral in all models used. Interestingly, the Cox proportional hazards model performed worse than the other models in the prediction task, but detected a difference in the risk of haemoglobin deferral between pre- and postmenopausal women for two variants (SNP 1:169549811 and SNP 22:37066896), which the other models did not detect. Variables derived from donation history predict Hb deferral well, but genetic variables can provide additional information for the prediction models.
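For reference, a Cox proportional hazards model of deferral risk can be fitted with the lifelines library; the sketch below is purely illustrative, and the column names (donation-history features and an SNP genotype dosage) are hypothetical, not taken from the thesis data.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200

# Hypothetical donor-level data: time to the next low-Hb deferral (or censoring),
# an event indicator, donation-history covariates, and an SNP genotype dosage (0/1/2).
df = pd.DataFrame({
    "days_to_event": rng.integers(30, 730, size=n),
    "deferred": rng.integers(0, 2, size=n),
    "donations_2y": rng.integers(0, 10, size=n),
    "days_since_last": rng.integers(30, 400, size=n),
    "snp_17_58358769": rng.integers(0, 3, size=n),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days_to_event", event_col="deferred")
cph.print_summary()  # hazard ratios per covariate, e.g. for the SNP dosage
```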
- Vascular Analysis from Retinal Images Using Machine Learning Methods
School of Science | Master's thesis (2025-02-20), Pienimäki, Petra
The growing use of deep learning in medical image segmentation has enabled significant advantages in the field. Retinal image segmentation can help detect early signs of eye diseases such as glaucoma, diabetic retinopathy, and retinal macular degeneration. These diseases are related to microvascular changes in the retina, and their early diagnosis can prevent disease progression and preserve vision. The microvascular changes affect the arteriolar-to-venular ratio (AVR), which is around 2/3 for a healthy eye. AVR is derived by dividing the central retinal artery equivalent (CRAE) of the main arteries by the central retinal vein equivalent (CRVE) of the main veins in the region of interest (ROI), which lies near the optic disc in the retina. In this thesis, several U-Net and U-Net-R architectures are applied to retinal image segmentation to calculate the AVR. The process involves locating the optic disc to define the ROI, preprocessing the retinal images, and training the networks to segment the vessels into arteries and veins. From these segmented images, the diameters of the arteries and veins are calculated using skeletonisation and dilation techniques to derive CRAE and CRVE and thereby the AVR. These values are compared with AVR results calculated from ground-truth images. The best U-Net model and the best U-Net-R model were both implemented with the DiceCE loss function, with Dice scores of 0.779 and 0.806, respectively. The performance of the proposed AVR calculation method is presented, together with an evaluation of its limitations and future potential.
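The two quantities reported above are simple ratios; a minimal sketch of the Dice score and the final AVR computation (assuming CRAE and CRVE have already been derived from the measured vessel diameters) could look like this.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())

def avr(crae: float, crve: float) -> float:
    """Arteriolar-to-venular ratio: central retinal artery equivalent
    divided by central retinal vein equivalent (around 2/3 for a healthy eye)."""
    return crae / crve

# Example with dummy values (not from the thesis):
print(avr(crae=150.0, crve=225.0))  # ~0.667
```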
- Optimizing adaptive eyewear lens transition properties
School of Science | Master's thesis (2025-02-24), Sjöholm, Annika
Presbyopia is an age-related ocular disorder in which a person's accommodation ability decreases with age, making it difficult to focus on nearby objects. It usually starts to develop after the age of 45 and affects almost two billion people worldwide. Presbyopia occurs because the lens thickens as lens proteins accumulate without breaking down, which decreases the power of the crystalline lens. Treatments for presbyopia include reading glasses, multifocal eyewear, contact lenses, and intraocular lens surgery. However, these methods have their challenges. Multifocals can have a limited reading area or increase the risk of falling, especially among elderly people. One solution to these problems would be adaptive eyewear with liquid-crystal-filled lenses that adapt according to the wearer's gaze point. Pixieray Ltd is a company that develops adaptive eyewear with eye-tracking technology. An important part of this technology is the transition of the lens, that is, when the optical power of the lens turns on or off. The lens transition can be made faster or slower with different settings controlled via the drive voltage of the lens. This thesis aims to find the optimal transition time based on user experience. The users compared different transitions and an intentional delay occurring before the transition. These delays are relevant to the latency that the future product will have. The results showed that the preferences of the 20 users varied considerably. However, everyone agreed that a faster transition is better than a slower one. Opinions on which transition was the fastest differed slightly between the users, although the majority thought that a transition lasting 200 ms was the fastest. Some users would have preferred an even faster transition. A delay of 400 ms was more acceptable than an 800 ms delay. All users were pleased with the OFF transition, even though we did not have any other OFF transition to compare it with. Future research and tests will be needed to make the ON transition even faster for the end product.
- Detecting age-related changes in brain activity during picture naming using machine learning
School of Science | Master's thesis (2025-02-22), Valkeavirta, Venla
Language production becomes more challenging with age, weakening the communication skills of the elderly. The age-related behavioral changes have mainly been linked to phonological and phonetic processing, while lexical-semantic processing is thought to be relatively well preserved with age. Picture naming is a widely used task in studying language production since it includes all the core processing stages of word production. Although the effects of aging on behavioral performance have been well studied, the neural basis of the age-related changes in language production remains largely unknown. In this thesis, magnetoencephalography (MEG) was combined with machine learning to identify age-related changes in the neural dynamics of word production. MEG data, recorded from 25 young and 25 old healthy adults during a picture-naming task, were analyzed using a temporal decoding method in which logistic regression was applied to MEG data segments separately at each time point. Furthermore, the model weights were projected into source space to visualize the decoding patterns as brain activations. The subjects' age group was significantly decodable from averaged evoked responses, with the classifier reaching an accuracy of 0.89 around 225 ms after image presentation and the peak accuracy occurring within a time window linked to lexical-semantic processing. Additionally, certain stimulus attributes associated with visual and conceptual features were decodable from the MEG data; however, no significant age-related differences were found in these decoding tasks. These findings indicate that the most pronounced age-related neural differences in word production could occur during lexical-semantic processing, challenging the conclusions of earlier behavioral and neuroimaging studies that have suggested that aging primarily affects phonological rather than lexical-semantic processing.
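Temporal decoding of this kind amounts to fitting a separate classifier at every time point; a minimal scikit-learn sketch with hypothetical array shapes (not the thesis pipeline) is shown below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical evoked MEG data: (subjects, sensors, time points)
# and a binary age-group label per subject (0 = young, 1 = old).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 306, 120))
y = np.repeat([0, 1], 25)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fit and score a separate classifier at each time point.
accuracy = np.array([
    cross_val_score(clf, X[:, :, t], y, cv=5).mean()
    for t in range(X.shape[2])
])
print(accuracy.argmax(), accuracy.max())  # peak decoding latency and accuracy
```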
- Multimodal Tumour Type and Subtype Classification with Deep Learning
School of Science | Master's thesis (2025-02-24), Huttunen, Antti
Cancer was the second most common cause of death in Finland in 2023. Since early diagnosis and identifying the tissue of origin are crucial for cancer treatment and prognosis, finding new, efficient, and minimally invasive diagnostic methods is important. Tumour types in different tissues are characterised by distinct patterns of somatic mutations, which have been proven helpful in tumour type prediction. These mutations and patterns can be detected from a blood sample by examining circulating tumour DNA (ctDNA), enabling the use of minimally invasive and accurate computational diagnostic tools for early-stage cancer detection. This thesis aims to develop and evaluate the use of deep learning for tumour-type prediction from somatic mutation data. The thesis investigates the performance of the Mutation-Attention (MuAt) deep learning model and compares its original trinucleotide-based embedding with alternative approaches, including DNABERT embeddings and one-hot encoding of individual nucleotides. Additionally, chromatin state information is integrated into the MuAt model using the EpicVAE variational autoencoder to evaluate the impact of epigenetic information on model performance. The experiment utilises tumour site DNA data from 24 tumour types and 2578 patients from the database of the PCAWG project. The results show that the original MuAt embedding approach outperforms the new approaches in 10-fold cross-validation, with average and best validation accuracies of 0.882 and 0.919, respectively. Moreover, adding the chromatin state information increased the average validation accuracy by only 0.002. Further study of the DNABERT embedding spaces with Uniform Manifold Approximation and Projection (UMAP) shows the low quality of the embedding spaces and DNABERT's limited ability to distinguish mutated sequences from reference sequences. The results also highlight the importance of genomic position information in prediction. The research demonstrates the effectiveness of the original MuAt model pipeline design, underlining the importance of careful design when embedding the mutation data. Even though the tested approaches and the MuAt model have limitations, this research provides a solid foundation for further studying the use of DNABERT with mutation data, new embedding approaches, and the development of computational methods for tumour type prediction.
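One of the compared representations is a plain one-hot encoding of individual nucleotides; a minimal sketch of such an encoding for a single-base substitution and its trinucleotide context (illustrative only, not the MuAt implementation) is shown below.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot_sequence(seq: str) -> np.ndarray:
    """One-hot encode a nucleotide sequence into a (len(seq), 4) matrix."""
    encoding = np.zeros((len(seq), len(NUCLEOTIDES)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        encoding[i, NUCLEOTIDES.index(base)] = 1.0
    return encoding

# A single-base substitution with one flanking base on each side (trinucleotide
# context): reference ACA mutated to AGA at the middle position.
ref = one_hot_sequence("ACA")
alt = one_hot_sequence("AGA")
mutation_features = np.concatenate([ref.ravel(), alt.ravel()])  # shape (24,)
print(mutation_features.shape)
```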
- Forecasting Prescription Medication Utilization: A Comparative Study of SARIMA, Prophet, XGBoost and LSTM Models
School of Science | Master's thesis (2025-02-12), Ripatti, Vili
Predictive analytics in healthcare is essential for improving patient outcomes, managing limited resources, and addressing the growing complexities of drug demand and supply. Accurate forecasting of prescription drug utilization enables healthcare providers and policymakers to make informed decisions, reduce costs, and ensure timely availability of medications. This study evaluates the performance of four widely used forecasting models, SARIMA, Prophet, XGBoost, and LSTM, on 97 cardio-related ATC codes using a dataset from the Swedish National Board of Health and Welfare (2009–2023). The effectiveness of the models was compared using the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) as evaluation metrics. The results indicate that SARIMA consistently outperforms the other models, accurately capturing seasonal patterns, abrupt shifts, and long-term trends, and securing first place in three quarters of the cases. LSTM demonstrated strong performance in handling datasets with nonlinear dependencies but occasionally struggled with abrupt trend shifts. XGBoost delivered moderate results, particularly for simpler datasets, yet struggled with high volatility, leading to reduced accuracy in unstable time series. Prophet, while robust to missing data, was less effective in capturing complex temporal dynamics, leading to higher errors in datasets with irregular trends. These findings underscore the importance of aligning model selection with dataset characteristics to optimize forecasting outcomes. By highlighting the strengths and limitations of these models, this research contributes to the growing body of knowledge on predictive modeling in healthcare, offering valuable insights for p
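As a reference for the evaluation setup, a seasonal ARIMA fit and the three error metrics can be sketched as follows; the series, the train/test split, and the SARIMA orders are assumptions, not those used in the thesis.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)

# Hypothetical monthly dispensing counts for one ATC code (illustrative only).
y = pd.Series(
    1000 + 50 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 20, 120),
    index=pd.date_range("2009-01-01", periods=120, freq="MS"),
)
train, test = y[:-12], y[-12:]

# Seasonal ARIMA with a yearly cycle; the (p,d,q)(P,D,Q,s) orders are assumptions.
fit = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
forecast = fit.forecast(steps=12)

rmse = np.sqrt(mean_squared_error(test, forecast))
mae = mean_absolute_error(test, forecast)
mape = np.mean(np.abs((test - forecast) / test)) * 100
print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  MAPE={mape:.2f}%")
```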
- The Interplay of Somatic Mutation Profiles and Germline Variations in Breast Cancer
School of Science | Master's thesis (2025-02-21), Pitko, Iiris
Mutational signatures in cancer reflect somatic mutational processes active throughout an individual's lifetime and provide insights into cancer etiology. While some signatures, arising from environmental and endogenous processes, are well characterized, many remain poorly characterized and new signatures are frequently identified. Germline genetic factors, such as polygenic risk scores (PRS) combining genetic effects across the genome, affect cancer predisposition, but their role in shaping somatic mutational profiles remains largely unexplored. Understanding the interplay between germline predispositions and somatic mutational processes could provide new insights into cancer development. This thesis explores the relationship between somatic mutational profiles and PRSs. Using data from 218 breast cancer patients of Finnish ancestry provided by the iCAN flagship project, we evaluated the associations between somatic mutational signatures and PRSs for breast cancer and 30 common blood biomarkers using Spearman's correlation and beta regression. Mutational signature profiles of single base substitutions (SBS) were constructed from tumor whole-exome sequencing data, and the PRSs were computed in previous publications. We observed a total of 20 associations (p
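The association testing can be illustrated with SciPy's Spearman correlation; the variables below are simulated stand-ins for per-patient signature exposures and PRS values, not the iCAN data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical per-patient values: relative exposure of one SBS signature
# (a proportion in [0, 1]) and a breast-cancer polygenic risk score.
signature_exposure = rng.beta(2, 5, size=218)
prs = rng.normal(size=218)

rho, p_value = spearmanr(signature_exposure, prs)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```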
- Evaluating the Use of Retrieval-augmented Generation for Enhancing Online Courses
School of Science | Master's thesis (2024-11-24), Pasquarelli, Leonardo
Providing sufficient and adequate teaching assistance to students in programming education for online courses requires substantial resources, especially considering the growing enrolment numbers. To tackle the problems of scalable course assistance, we developed a chat bot specific to the Web Software Development (WSD) course at Aalto, using a novel technology called retrieval-augmented generation (RAG), which harnesses large language models (LLMs) and augments the produced answer with search results from an external data source: in our case the course material, vectorised and embedded into a vector database. Our evaluations include a benchmark in which we compare the faithfulness and relevancy of answers generated by 54 different configurations, determined by the LLM, the embedding model, the chunk size and number of chunks, and the retrieval mode. The 28 questions used were mainly collected from course participants taking the WSD course. The findings suggest that, in the context of this experiment, larger chunk sizes work better, a vector-only retrieval mode produces better results, the choice of LLM in itself had a mild effect on the answer quality, and text-embedding-3-large and all-MiniLM-v6 performed significantly better than RoBERTa. Furthermore, we conducted an in-person user survey (N = 14) in which students were required to work on course tasks with the assistance of our chat bot and a search functionality. The goal was to assess satisfaction with RAG compared against a search functionality, as well as search performance when using RAG compared against a search functionality. The findings suggest that users perceive both assistants as useful or highly useful, and that the bot produces factually correct results. The preference towards a specific assistant, and the performance, depended on various factors, including the exercise type.
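In outline, RAG retrieves the most similar material chunks and prepends them to the prompt; the sketch below uses cosine similarity over precomputed chunk embeddings, with `embed` and `generate` as hypothetical stand-ins for the embedding model and the LLM, not the course bot's actual implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call (e.g. an OpenAI or sentence-transformers model)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def answer(question: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 4) -> str:
    """Retrieve the k most similar course-material chunks and augment the prompt."""
    q = embed(question)
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    prompt = (
        "Answer the student's question using only the course material below.\n\n"
        f"Course material:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```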
- The effect of language on perceived ability to understand machine learning concepts
School of Science | Master's thesis (2024-12-30), Loukamo, Linda
Amid accelerating globalization, access to the most recent research and relevant information has become language-dependent and often reliant on a person's English skills. The era-defining conquest of generative AI and the near cost-free, easy access to AI products such as ChatGPT have cemented English as key to understanding technological evolution and the mechanisms behind it, such as Machine Learning. This, however, has become a bottleneck for furthering digital literacy and ML literacy, which are needed to ensure safe and ethical use, development, and policing of AI globally. Given this, the objective of this study is to clarify the role of language in conveying information about this complicated technology and its effect on the test subjects' self-perceived ability to understand Machine Learning concepts. This study finds that while there is a general preference for English due to content availability and familiarity, mother-tongue learning and content availability cannot be dismissed, as they foster better understanding among those who are not proficient in ML, in addition to bringing about more engagement among all mother-tongue users. This means there are observable benefits to further translating difficult Machine Learning concepts when the goal is to communicate the material successfully, increase the spread of information, and promote digital literacy in the era of GenAI, even if the bulk of specialist-level material remains in English.
- Towards Carbon-aware: Power Monitoring in Cloud Computing
School of Science | Master's thesis (2024-12-29), Cao, Nhut
The escalating adoption of cloud computing has not only brought significant value in terms of technology but also sparked critical concerns regarding its environmental footprint. Together with the advantages cloud services provide, there are issues with energy consumption management and its impact on the planet. The vast data centers required to support cloud services consume significant amounts of energy, much of which is derived from non-renewable sources, contributing to the greenhouse effect. Additionally, the manufacturing, operation, and disposal of hardware result in substantial ecological impacts, including resource depletion and electronic waste. The thesis delves into the energy efficiency and sustainability of cloud computing environments by analyzing data collected from monitoring tools deployed in a customized cluster, aiming to examine the role of monitoring tools in managing computing systems. The research centers on comprehending the functionality of monitoring tools and assessing their performance in this context. By scrutinizing key metrics such as CPU time, CPU cycles, and energy consumption patterns, the study provides insights into the factors that influence overall system performance and costs. The findings underscore the potential of monitoring tools such as Kepler to optimize resource allocation and enhance energy efficiency. The study also identifies limitations in the current state of monitoring tools and emphasizes the necessity for further development to capture a more representative set of system metrics. The analysis highlights the significance of exploring innovative approaches to sustainable cloud computing, such as developing and committing to energy-efficient architectures, optimizing cooling systems, and implementing effective methods to achieve sustainability.
- Generative AI Agent for Autonomous Match-3 Gameplay through Real-Time Image-Based Decision-Making
School of Science | Master's thesis (2024-12-31), Liu, Rongzhi
Advances in artificial intelligence create more possibilities for autonomous gaming agents in video games. This thesis therefore focuses on building a generative AI agent for the Match-3 game. It combines Large Language Models (LLMs) with Android Debug Bridge (ADB) commands. As a result, the agent can monitor the game screen in real time and then use specific prompts to enable the Large Language Model to analyze the image and execute appropriate actions. Unlike traditional AI agents that need pre-training or particular data, this method adapts to different Match-3 games with minimal setup. We evaluated the agent across multiple games and found that it achieves performance levels similar to real players in certain aspects. The main contribution of this work is showing a new approach to agent development that uses LLMs for decision-making and ADB for action execution. This highlights the agent's ability to adapt quickly to different games.
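The observe-decide-act loop described above can be sketched with standard ADB commands; `ask_llm_for_move` is a hypothetical stand-in for the prompt-based image analysis, not the thesis's implementation.

```python
import subprocess

def capture_screen(path: str = "screen.png") -> str:
    """Grab the current game screen from a connected Android device via ADB."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path

def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 200) -> None:
    """Perform a swipe gesture (a Match-3 tile move) via ADB."""
    subprocess.run(
        ["adb", "shell", "input", "swipe",
         str(x1), str(y1), str(x2), str(y2), str(duration_ms)],
        check=True,
    )

def ask_llm_for_move(image_path: str) -> tuple[int, int, int, int]:
    """Hypothetical multimodal-LLM call that analyses the screenshot and
    returns swipe coordinates for the next move."""
    raise NotImplementedError

# One iteration of the observe-decide-act loop (requires a connected device
# and a real LLM call in place of the stub above):
# screenshot = capture_screen()
# x1, y1, x2, y2 = ask_llm_for_move(screenshot)
# swipe(x1, y1, x2, y2)
```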
- Finding structures in document embeddings using sequence segmentation
School of Science | Master's thesis (2024-12-31), Tran, Duong
Large language models (LLMs) have been revolutionary tools in natural language processing, which has resulted in their increasing use in important fields like healthcare or law. However, the general impossibility of interpreting the text embeddings of LLMs due to their "black-box" nature might have harmful consequences, introducing risks of biases and misinformation during the document-handling process. While current research has focused on understanding embeddings at the word, sentence, or paragraph level, there is a gap in analyzing document-level embeddings. Therefore, the purpose of this thesis is to investigate and uncover semantic structures in document embeddings using sequence segmentation algorithms. We introduce the perspective of splitting documents into chunks, embedding the separate chunks, and stacking these chunk embeddings to obtain a sequential representation as the final document embedding. For the embedding task, we use Jina and OpenAI models that are capable of handling longer input sequences of up to 8192 tokens, presenting an approach that can be scaled to arbitrarily large documents. We apply sequence segmentation algorithms, such as dynamic programming and randomized segmentation, to the sequential document embeddings to identify semantic boundaries. We discuss an entropy-based framework for segmentation comparison to evaluate the similarity between segmentation results and ground-truth boundaries, which are chosen to be the chapter divisions of the documents. In our extensive experiments, we recognize the impact of chunk size on embedding and segmentation quality, revealing the trade-off between granularity and contextual inclusion in each chunk. Results show that embeddings from both models tend to match the ground truth best with the smallest chunk size of 128 tokens, demonstrating the effect of fine-grained embedding on retaining local context, especially for medium-sized documents like doctoral theses. We also show that segmentation results usually do not exactly match the ground truth but capture the semantic structures of documents well. These results are promising but leave interesting questions open for further research; for example, exploring different notions of ground truth would be valuable.
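A dynamic-programming segmentation over stacked chunk embeddings can be sketched as follows; the within-segment cost function (squared deviation from the segment mean) is a simple choice for illustration and not necessarily the one used in the thesis.

```python
import numpy as np

def segment(embeddings: np.ndarray, k: int) -> list[int]:
    """Split a sequence of chunk embeddings into k contiguous segments with
    dynamic programming, minimising the within-segment squared deviation from
    the segment mean. Returns the start index of each segment."""
    n = embeddings.shape[0]

    # cost[i, j]: cost of a single segment covering chunks i..j (inclusive).
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            seg = embeddings[i:j + 1]
            cost[i, j] = ((seg - seg.mean(axis=0)) ** 2).sum()

    # best[s, j]: minimal cost of covering the first j chunks with s segments.
    best = np.full((k + 1, n + 1), np.inf)
    prev = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(s - 1, j):
                c = best[s - 1, i] + cost[i, j - 1]
                if c < best[s, j]:
                    best[s, j], prev[s, j] = c, i

    # Backtrack the segment start indices.
    bounds, j = [], n
    for s in range(k, 0, -1):
        j = prev[s, j]
        bounds.append(j)
    return bounds[::-1]

# Example with random chunk embeddings (illustrative only).
rng = np.random.default_rng(0)
print(segment(rng.normal(size=(30, 8)), k=4))
```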
- Fluid Interfaces and Fixed Patterns: Understanding LLM Behavior in Educational Contexts
School of Science | Master's thesis (2024-12-31), Kucheria, Aayush
As Large Language Models (LLMs) emerge as potential tutoring agents, they promise more fluid, adaptive educational interactions than traditional intelligent tutoring systems. However, the extent to which LLM behavior actually aligns with human tutoring patterns remains poorly understood. This thesis examines this tension between fluid interfaces and fixed behavioral patterns in AI tutoring. Drawing on constructivist learning theory and an analysis of historical constraints in educational technology, we investigate how LLMs process and respond in the tutoring task compared to human teachers. Through systematic analysis of the CIMA dataset, we compare action distributions and response patterns between human tutors and three state-of-the-art LLMs (GPT-4o, Gemini Pro 1.5, and LLaMA 3.1 405B) in language-teaching dialogues. Rather than evaluating performance or effectiveness, we focus on understanding fundamental differences in how artificial and human tutors structure their teaching interactions. Our results reveal systematic deviations in LLM behavior from human tutoring patterns, particularly in action selection and response adaptation to student behavior. These findings suggest that while LLMs enable more fluid interaction, they may develop fixed behavioral patterns distinct from human teaching strategies. This research contributes to both the theoretical understanding of AI tutoring behavior and the practical development of more effective educational technologies, while raising important questions about the nature of machine teaching and learning.
- Model-Agnostic Personalized Federated Learning using Adaptive Client Selection
School of Science | Master's thesis (2024-12-31), Dang, Phi
Personalized Federated Learning (pFL) addresses the challenges of heterogeneous and decentralized data by enabling client-specific model training without sharing raw data. This thesis introduces a novel method for model-agnostic pFL that leverages adaptive client selection to improve the personalization of client models. Assuming cluster-based distributions of local datasets, the proposed algorithms iteratively select and incorporate the most beneficial candidate datasets to optimize each client's model. Two main methodologies are presented: one tailored to parametric models using gradient-based updates and another designed for non-parametric models using a generalized optimization approach. Experimental evaluations on synthetic datasets and the Fashion-MNIST benchmark demonstrate significant improvements in both classification and regression metrics, including accuracy and mean squared error, when compared to baseline models and established methods. The results highlight the potential of adaptive collaboration in achieving robust personalization while maintaining privacy.
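As a rough illustration of adaptive client selection (not the algorithms proposed in the thesis), a client could greedily add the candidate datasets that most improve its own validation loss; the sketch below uses a linear least-squares model as a stand-in for local training.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares fit; stands in for local (gradient-based) training."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def val_loss(params: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of a linear model on the client's validation split."""
    return float(np.mean((X @ params - y) ** 2))

def select_clients(own, candidates, max_rounds: int = 3) -> list[int]:
    """Greedily add the candidate dataset whose inclusion most improves the
    client's own validation loss; stop when no candidate helps."""
    X_tr, y_tr, X_val, y_val = own
    chosen: list[int] = []
    best = val_loss(fit(X_tr, y_tr), X_val, y_val)
    for _ in range(max_rounds):
        scores = []
        for idx, (Xc, yc) in enumerate(candidates):
            if idx in chosen:
                continue
            Xa = np.vstack([X_tr] + [candidates[i][0] for i in chosen] + [Xc])
            ya = np.concatenate([y_tr] + [candidates[i][1] for i in chosen] + [yc])
            scores.append((val_loss(fit(Xa, ya), X_val, y_val), idx))
        if not scores:
            break
        loss, idx = min(scores)
        if loss >= best:
            break
        best, chosen = loss, chosen + [idx]
    return chosen

def make(n: int, w: np.ndarray):
    """Generate a synthetic regression dataset with coefficient vector w."""
    X = rng.normal(size=(n, 3))
    return X, X @ w + rng.normal(0, 0.1, n)

w_own = np.array([1.0, -2.0, 0.5])
own = (*make(30, w_own), *make(20, w_own))
candidates = [make(40, w_own), make(40, np.array([3.0, 3.0, 3.0])), make(40, w_own)]
print(select_clients(own, candidates))  # indices of the selected candidate datasets
```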
- Version-Sensitive Network Traffic Classification for Kubernetes Applications
School of Science | Master's thesis (2024-12-31), Hirvensalo, Aleksi
Network traffic classification is a crucial area in cybersecurity and network management, enabling effective monitoring and analysis of data flows. However, existing methods often lack the granularity needed to identify subtle differences, such as those between application versions, limiting their utility in dynamic, real-life scenarios. Despite the growing importance of detailed traffic analysis, there has been no research into fine-grained classification methods to differentiate between application versions. Current techniques fall short of addressing the subtle nuances in network behavior introduced by different versions of the same application. Furthermore, there are no suitable datasets that would allow exploring version-sensitive network traffic classification. This thesis introduces a novel, version-sensitive framework for network traffic classification. The framework is designed to detect and distinguish subtle changes between application versions by integrating machine learning and fingerprinting mechanisms. The methodology involves data collection, fingerprint generation, and classification using a custom experimental setup within a Kubernetes environment. The proposed framework demonstrates the ability to accurately classify and differentiate application versions, achieving an accuracy rate of 95.9%, even in dynamic network scenarios. Additionally, the research contributes to the field by publishing a new dataset, which provides a foundation for future studies on fine-grained traffic analysis. This research underscores the potential for enhancing network security and management through advanced traffic classification techniques. By paving the way for more adaptive and precise systems, this work contributes a significant step forward in the development of fine-grained network traffic analysis tools.
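At its core, such a framework reduces to a supervised classifier over per-flow fingerprint features; the generic sketch below uses invented features and labels, not the dataset published with the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical per-flow features (packet sizes, inter-arrival statistics, ...)
# labelled with the application version that produced the traffic.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.integers(0, 3, size=1000)  # three application versions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```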
- Prediction of Protein Solution Viscosity Using Graph Neural Networks
School of Science | Master's thesis (2024-12-30), Nguyen, Trang
The significance of highly concentrated protein solutions has been recognized in a wide variety of disciplines, especially in the healthcare and pharmaceutical industries. Successful application of effective protein compositions can lead to efficient industrial production and provide comfort and convenience for patients. However, predicting the viscosity of protein solutions remains a major challenge due to its dependence on complex molecular interactions, which are difficult to analyze and interpret. Early attempts to experimentally measure protein solution viscosity have yielded results with high precision, although at the expense of significant time and effort. Computer-based simulations, albeit less expensive, still suffer from long execution times and are inadequate for handling large amounts of data. Recent advances in machine learning have introduced new techniques to address these challenges and provide more optimal solutions. They are able to deliver accurate predictions while reducing the reliance on extensive experimental work. However, the majority of existing models require manual selection of meaningful features, which restricts their generalizability. Recognizing the need for a reliable viscosity prediction method, this thesis implements a machine learning model that automatically learns the underlying attributes of protein solutions and predicts their viscosity as a function of concentration. Consisting of language- and graph-based components, the model combines both sequential and structural information in order to gain a comprehensive understanding of protein solution behavior. It achieves satisfactory results within the scope of the training data and demonstrates the computational power of such models in tackling complex problems in pharmaceuticals, among other fields.
- MCLB: Multi-Cluster Load Balancer - Hosting-Agnostic, Kubernetes-Controlled, Self-Service Load Balancing for Hybrid Cloud
School of Science | Master's thesis (2024-12-27), Mikkola, Kimmo
Today, a notable share of hosting is conducted within cloud providers' environments. The tools and solutions offered by major cloud providers are often highly sophisticated and user-friendly, making it easier to address complex challenges. For users operating entirely outside the cloud or in a hybrid cloud mode, effectively utilizing cloud provider tools is often challenging. This challenge arises partly because cloud providers have strong incentives to encourage dependencies on their environments, promoting vendor lock-in. The purpose of this thesis is to address the challenges associated with serving traffic from multiple Kubernetes clusters on a self-service basis, particularly in dedicated hosting environments. The thesis aims to propose an approach to multi-cluster load balancing that improves system resiliency and increases development velocity for organizations running Kubernetes on hybrid cloud or on dedicated hosting. The thesis focuses on developing a self-service solution to enable traffic routing across multiple Kubernetes clusters in a hybrid setup, aiming to seamlessly integrate dedicated hosting and cloud provider environments. We evaluate multiple service mesh solutions against requirements deduced from scenarios and use cases, and conclude that none of the candidates satisfy our requirements; we therefore develop a custom solution. This thesis proposes MCLB, a hosting-agnostic solution that allows external load balancers to be controlled through Kubernetes on a self-service basis, while avoiding the operational overhead of a complete service mesh solution. We benchmark MCLB against authentic scenarios to demonstrate the validity of the proposed solution and its fulfilment of the specified requirements. Furthermore, we evaluate MCLB under realistic production loads to showcase its performance and to identify areas for further improvement. We demonstrate that, using MCLB, it is possible to achieve multi-cluster Kubernetes where load balancer configuration is handled on a self-service basis through Kubernetes while keeping memory overhead minimal. Furthermore, MCLB remains hosting-agnostic, thus avoiding cloud vendor lock-in.
- Bayesian Non-Negative Matrix Factorization with Applications in Genetics
School of Science | Master's thesis (2024-11-27), Wojnicki, Mikolaj
Genome-wide association studies (GWASs) have shown that pleiotropy and polygenicity are common genetic phenomena, especially for complex traits. They are often caused by various interactions between biological pathways, which can be difficult to decipher, therefore limiting the medical innovation that could be achieved through genetic studies. To improve our understanding of the underlying interactions and the effects of each genetic variant, latent factor and cluster analysis methods have been used. One such technique is Bayesian non-negative matrix factorization (bNMF), which has shown promising results but has not yet been thoroughly investigated for this application. This thesis assesses the validity, reliability, and effectiveness of using bNMF to cluster GWAS association coefficients. It shows that the method can produce valuable clusters that can be interpreted as disease risk factors and that are distinct from those obtained with truncated singular value decomposition. However, contrary to previous assumptions, bNMF is sensitive to random initialization. It also often suffers from poorly chosen hyperparameter values, some of which appear to be misunderstood in the literature. To address these shortcomings, I developed a more reliable initialization method and guidelines for the selection of hyperparameters for applications in GWAS statistics clustering. My results can make the method more popular and reliable for these applications, which can lead to a better understanding of how genetic variants affect phenotypes and ultimately to improved diagnosis and treatment in medicine.
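For reference, the factorization itself can be illustrated with scikit-learn's non-Bayesian NMF applied to a variants-by-traits matrix of association coefficients; the positive/negative stacking used here to keep the input non-negative is a common convention and an assumption, not necessarily the preprocessing used in the thesis.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical GWAS association coefficients: variants x traits (can be negative).
Z = rng.normal(size=(200, 30))

# Stack the positive and negative parts so the input matrix stays non-negative.
V = np.hstack([np.clip(Z, 0, None), np.clip(-Z, 0, None)])

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)  # variant memberships in latent clusters
H = model.components_       # cluster loadings on the (signed) traits

print(W.shape, H.shape)     # (200, 5), (5, 60)
```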
- Using log sequence representations and transfer learning in classification and novelty detection
School of Science | Master's thesis (2024-12-11), Wang, Taige
The goal of this study is to experiment with a method that uses log sequence representations and transfer learning to perform multiple downstream tasks in analysing logs, including classification and novelty detection. The proposed method is evaluated on two log datasets with distinct properties, where the transfer learning approach is compared against traditional time-series models used as baselines. The results show that the proposed method is comparable to or outperforms the baseline methods in the classification task, and it outperforms the baseline methods in scenarios with limited training data. The study also provides suggestions on the choice of methods depending on the properties of the log data, as well as approaches to improve performance in terms of data-loading mechanisms and fine-tuning. In addition, the proposed method serves as a simplified approach to detecting novelties in the log data, which indicates great potential for the proposed method. In conclusion, this study shows that the proposed method is a promising approach for analysing logs in both classification and novelty detection tasks.
- Fast Evaluation of Neighborhood-Based Features from Point Cloud Data on the GPU
School of Science | Master's thesis (2024-12-16), Kononen, Aleksi
Distance measurements can be collected with high spatial accuracy using light detection and ranging (LiDAR) technology by utilizing the time of flight of a laser pulse reflected from the environment. Modern laser scanner measurement systems are used for a variety of applications ranging from autonomous driving to infrastructure digitization due to their high accuracy and data collection rate. However, the rate of data acquisition presents a significant challenge for efficient data analysis, especially within real-time performance constraints. To obtain maximal computational performance from modern processors, implementations must facilitate a sufficient degree of parallel execution. This thesis considers the use of the specialized hardware of the graphics processing unit (GPU) for point cloud analysis and presents novel parallelized methods for computing local geometric descriptors and for evaluating local context-based template models for small object detection. To maximize hardware utilization, several optimization techniques targeting different aspects of the hardware are developed and evaluated both separately and in combination. Depending on the method, the different techniques are shown to reduce memory use by 71–86% and runtime by 26–58%, and to improve computational performance by 29–81%, with the best performance obtained by combining techniques. The template method is used to detect rails and is shown to reach 96.9% precision while achieving a speedup by a factor of 12 over the state of the art. The developed methods satisfy real-time performance requirements, exceeding the rate of data collection by factors of 54 and 16.
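Local geometric descriptors of this kind are commonly derived from the eigenvalues of each neighbourhood's covariance matrix; the CPU-side NumPy sketch below shows the standard linearity, planarity, and sphericity features, not the thesis's GPU implementation or its exact descriptor set.

```python
import numpy as np
from scipy.spatial import cKDTree

def geometric_descriptors(points: np.ndarray, k: int = 20) -> np.ndarray:
    """Per-point linearity, planarity, and sphericity computed from the
    eigenvalues of the k-nearest-neighbour covariance matrix."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    features = np.empty((len(points), 3))
    for i, neighbours in enumerate(points[idx]):
        cov = np.cov(neighbours, rowvar=False)
        l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]  # l1 >= l2 >= l3
        features[i] = [(l1 - l2) / l1, (l2 - l3) / l1, l3 / l1]
    return features

# Example on random points (a GPU version would evaluate neighbourhoods in parallel).
pts = np.random.default_rng(0).normal(size=(1000, 3))
print(geometric_descriptors(pts)[:3])
```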