Early Prediction of Severe Dengue Using Machine Learning

Abstract: Dengue remains a major public health challenge in tropical regions such as Sri Lanka, where early identification of severe dengue is critical to reducing mortality and optimizing resource allocation. This study developed and evaluated machine learning models to predict severe dengue progression within 1, 2, and 3 days of hospital admission. Since clinical data contain noise, several preprocessing steps were performed before fitting any model, including feature selection and handling of class imbalance. Multiple algorithms, including Random Forest, Extra Trees, Gradient Boosting, XGBoost, LightGBM, CatBoost, and Support Vector Machines, were trained using stratified cross-validation and hyperparameter tuning. Model performance was evaluated using standard classification metrics, with an emphasis on sensitivity to ensure early detection of high-risk patients. The results demonstrated that the Extra Trees model performed best for the 2-day window, with a recall of 93.4% and an F1-score of 0.937, outperforming the 1-day and 3-day windows. SHAP analysis revealed plasma leakage (USS), platelet count, white blood cell count, and liver function markers (AST/ALT) as the most influential predictors, aligning with established dengue pathophysiology. This study uniquely evaluates severe dengue prediction across multiple early time windows (days 1-3), finding that the 2-day window achieves the best sensitivity. Unlike prior research focusing on diagnosis or single timepoints, the proposed framework uses routinely collected hospital data to provide practical clinical decision support for timely interventions in resource-limited, dengue-endemic regions.
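
As a minimal illustration of the two metrics the abstract above reports, the sketch below computes recall (sensitivity) and F1-score from true and predicted labels. The function name `recall_and_f1` and the toy labels are hypothetical and are not taken from the study.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Return (recall, f1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Recall: share of actual positive cases that the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of predicted positive cases that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Toy labels (hypothetical): 1 = severe dengue, 0 = non-severe.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
```

A false negative here is a missed positive case, which is why recall is the natural metric to emphasize when sensitivity matters most.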

What She Signs They Mine - Quantifying the Gaps in FemTech Privacy Policies

Abstract: Femtech applications have revolutionized women’s healthcare, yet transparency failures are inevitable in their data privacy practices. While recent investigations highlight privacy violations in reproductive health applications, existing automated assessment frameworks lack domain-specific adaptations to evaluate femtech privacy practices. This study explores privacy policy transparency in femtech applications through domain-specific risk weighting and an XLNet-based multidimensional evaluation. The methodology compares Data Safety Section declarations with privacy policy content across 68 femtech applications listed on the Google Play Store. The comparison is supported by a custom-defined three-tier transparency assessment framework encompassing Information Coherence Analysis, Privacy Disclosure Assessment, and Privacy Protection Evaluation across comparable features. The XLNet-based classification system achieves substantial performance improvements, with F1-scores ranging from 0.792 to 0.845 across pre-defined privacy dimensions. The transparency assessment reveals discrepancies: privacy policies disclose health data monetization more often than Data Safety Section declarations do (8.8% versus 2.9%), while 17.6% of applications exhibit critical security gaps through inadequate encryption and deletion capabilities. This research advances femtech privacy assessment through domain-specific risk weighting and multidimensional evaluation, establishing a foundation for ethical data handling and fostering trust and accountability.

Mining Strategic Business Insights from Online Reviews - A Case Study in the Southern Coast of Sri Lanka

Abstract: This study investigates tourist perceptions of eight southern beaches in Sri Lanka using Google Reviews. With the increasing influence of online platforms in travel decision-making, analyzing review content provides valuable insights into tourist experiences and preferences. The research employs transformer-based models from Hugging Face for sentiment analysis and topic modeling, offering a modern, data-driven approach to textual review interpretation. Word clouds and bigram visualizations are used to highlight common positive and negative expressions associated with each beach. The findings reveal cleanliness, natural beauty, surfing opportunities, crowds, and local service quality as key themes associated with the southern coastline of Sri Lanka. Sentiment patterns vary across beaches, with some consistently rated positively while others receive mixed feedback. This analysis offers practical insights for tourism stakeholders to improve destination management and marketing strategies. The study demonstrates the effectiveness of modern NLP techniques in understanding tourist experiences and provides a scalable framework for future analyses centred on user responses.

Comparing the Performance of LLMs in RAG-based Question-Answering - A Case Study in Computer Science Literature

Abstract: Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. Thus, the increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) in diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI’s trending GPT-3.5 on RAG-supported QA tasks within the computer science literature. Evaluation metrics include accuracy and precision for binary questions and, for long-answer questions, ranking by a human expert, ranking by Google’s AI model Gemini, and cosine similarity. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Regarding open-source LLMs, Mistral AI’s Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b reports the shortest average latency in generating responses, whereas LLaMa2-7b-chat by Meta reports the highest average latency. This research underscores the fact that open-source LLMs, too, can go hand in hand with proprietary models like GPT-3.5 when given better infrastructure.
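
To make the cosine-similarity metric mentioned in the abstract above concrete, here is a minimal sketch that compares a generated answer with a reference answer. It assumes simple word-count vectors, since the abstract does not say how the texts were vectorised; the function name `cosine_similarity` and the example strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts.

    Assumption: lowercase whitespace tokenisation. The study may have
    used a different representation (for example, embeddings).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))

    # An empty text has no direction, so treat similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference and generated answers.
reference = "retrieval augmented generation reduces hallucination"
generated = "retrieval augmented generation can reduce hallucination"
score = cosine_similarity(reference, generated)
```

With word counts the score lies between 0 and 1, where higher values mean the generated answer shares more vocabulary with the reference.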

Comparison of Machine Learning Models to Classify Documents on Digital Development

Abstract: Automated document classification is a trending topic in Natural Language Processing (NLP) due to the extensive growth in digital databases. However, a model that fits well for a specific classification task might perform poorly on another dataset due to differences in context. Thus, training and evaluating several models is necessary to optimise the results. This study employs a publicly available document database on worldwide digital development interventions categorised under twelve areas. Since digital interventions are still emerging, utilising NLP in the field is relatively new. Given the exponential growth of digital interventions, this research has a vast scope for improving how digital-development-oriented organisations report their work. The paper examines the classification performance of Machine Learning (ML) algorithms, including Decision Trees, k-Nearest Neighbors, Support Vector Machine, AdaBoost, Stochastic Gradient Descent, Naive Bayes, and Logistic Regression. Accuracy, precision, recall and F1-score are utilised to evaluate the performance of these models, while oversampling is used to address the class-imbalanced nature of the dataset. Deviating from the traditional approach of fitting a single model for multiclass classification, this paper investigates the One vs Rest approach to build a combined model that optimises the performance. The study concludes that the amount of data is not the sole factor affecting the performance; features like similarity within classes and dissimilarity among classes are also crucial.

Publications