For detailed instructions, see also the Project 1 PDF description available on the ML course website.
Introduction
The increasing prevalence of Cardiovascular Diseases (CVDs), such as heart attacks, poses a significant threat worldwide. With adults living longer, diseases of the heart and circulatory vessels are prevalent in the older population. However, technologies like machine learning can facilitate early detection and prevention of CVDs. This project is an endeavour to leverage machine learning in predicting the likelihood of a person developing a CVD based on their personal lifestyle factors.
The Task
Participants will engage in various phases of a data science project, from exploratory data analysis to feature processing and engineering. They will also implement machine learning techniques on the data, evaluate their models, generate predictions, and report their findings. By the end of this project, participants are expected to have a full-fledged machine learning solution ready for the challenge at hand.
Problem Statement
Utilising data from the Behavioral Risk Factor Surveillance System (BRFSS), participants are tasked with determining the risk of a person developing CVDs, specifically Myocardial Infarct or Coronary Heart Disease (MICHD). Given a vector of features detailing the health-related data of an individual, predict if the situation will lead to MICHD or not. This will involve applying binary classification techniques discussed during the lectures.
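As an illustration of what such a binary classifier might look like, here is a minimal sketch of logistic regression trained by gradient descent, with predictions mapped to the {+1,-1} label format used in this project. It is written in plain Python so that it does not presume which libraries are permitted. The function names, the learning rate, the iteration count, and the toy data are assumptions made for this example only, and a practical solution on the full dataset would need a vectorised implementation.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_regression(xs, ys, lr=0.5, n_iters=200):
    """Gradient descent on the mean logistic loss.

    xs: list of feature lists; ys: labels in {-1, +1}.
    Returns the weights, with the bias term stored first.
    """
    rows = [[1.0] + list(x) for x in xs]   # prepend a bias feature
    targets = [(y + 1) / 2 for y in ys]    # map {-1, +1} to {0, 1}
    w = [0.0] * len(rows[0])
    for _ in range(n_iters):
        grad = [0.0] * len(w)
        for row, t in zip(rows, targets):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row))) - t
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wi - lr * gj / len(rows) for wi, gj in zip(w, grad)]
    return w


def predict(xs, w, threshold=0.5):
    """Return one label in {-1, +1} per input row."""
    preds = []
    for x in xs:
        p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
        preds.append(1 if p >= threshold else -1)
    return preds


# Toy data, invented for illustration: the label depends on two features.
rng = random.Random(0)
x_demo = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y_demo = [1 if x[0] + 0.5 * x[1] > 0 else -1 for x in x_demo]

w_demo = fit_logistic_regression(x_demo, y_demo)
preds_demo = predict(x_demo, w_demo)
accuracy = sum(p == y for p, y in zip(preds_demo, y_demo)) / len(y_demo)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

The `threshold` parameter is exposed because the probability cut-off that works best for a given dataset is not necessarily 0.5; it is one of the choices worth validating.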
Dataset
The dataset originates from the BRFSS, a system of health-related telephone surveys that collect state data about U.S. residents regarding their health behaviours, chronic health conditions, and use of preventive services. Specifically, respondents were classified as having MICHD if a health care provider had told them they had the condition, or if they reported having had a heart attack or angina. The complete dataset is available for participants and can be accessed from the competition arena at EPFL Machine Learning Project 1. For deeper insights into the dataset’s background, refer to this longer description. Note that in-depth medical knowledge isn’t necessary to excel in this machine learning challenge.
Submission Process/Baseline
- Grading: Project 1 constitutes 10% of the final grade. The grading covers the code and the report; the competition score is provided for feedback. There’s also a second project, Project 2, which will count for 30%.
- Logistics: Teams of three students will work on Project 1. A diverse skill set and interdisciplinary backgrounds enhance a team’s effectiveness.
- Deliverables:
- Code: All methods and code must be contained in a GitHub classroom repository. Only specific Python libraries are allowed for this project. Additional methods and feature engineering can be explored, but no external libraries, code, or data are permissible in this project.
- Report: A maximum 2-page PDF report summarising the findings. The report should cater to beginners in ML, highlighting significant findings and ensuring reproducibility.
- Competition: To benchmark and get feedback on the models, participants can submit their predictions on aicrowd.com. There’s a limit of 5 submissions per day. The competition rank will not affect the project grading.
- Final Submission: The final submission is made online and includes the report (2-page PDF) and the complete, executable Python code with a link to the GitHub repository. Ensure reproducibility and document the system well in both the report and the code. Submit your project here before the deadline.
File descriptions
- x_train.csv - Training set inputs consisting of 321 features for 328135 individuals.
- y_train.csv - Training set binary labels in format {+1,-1} for the 328135 individuals. The labels are in the column named '_MICHD' (Myocardial Infarct / Coronary Heart Disease).
- x_test.csv - Test set inputs for 109379 individuals, in the same format as x_train.csv. No labels are provided for the test set.
- sample-submission.csv - Sample submission file in the correct format. In this example, the predictions are all set to -1, which is ‘healthy’.
The zip file contains all of the above files and can be downloaded from the resource section or from the GitHub repository.
Further information on the semantics of the features, labels, and weights can be found in the project description PDF file.
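To make the file formats above more concrete, the following sketch reads labels from a file shaped like y_train.csv and writes a constant -1 submission, as the sample submission does. It uses only the Python standard library. The helper names and the header names `Id` and `Prediction` are placeholders assumed for this example, and the real header should be copied from sample-submission.csv. The demo at the end runs on small invented files in a temporary directory, not on the actual dataset.

```python
import csv
import os
import tempfile


def read_labels(path, column="_MICHD"):
    """Read the {-1, +1} labels from a y_train.csv-style file."""
    with open(path, newline="") as f:
        return [int(float(row[column])) for row in csv.DictReader(f)]


def constant_baseline(n, label=-1):
    """Predict the same label for everyone, as the sample submission does."""
    return [label] * n


def write_submission(path, ids, predictions, header=("Id", "Prediction")):
    """Write one (id, prediction) row per individual.

    The header names are placeholders; copy the real ones from
    sample-submission.csv.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    if any(p not in (-1, 1) for p in predictions):
        raise ValueError("predictions must be -1 or +1")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(zip(ids, predictions))


# Demo on tiny invented files (hypothetical ids and labels).
demo_dir = tempfile.mkdtemp()

labels_path = os.path.join(demo_dir, "y_demo.csv")
with open(labels_path, "w", newline="") as f:
    f.write("Id,_MICHD\n0,-1\n1,1\n2,-1\n")
demo_labels = read_labels(labels_path)

submission_path = os.path.join(demo_dir, "submission_demo.csv")
demo_ids = [100, 101, 102]
write_submission(submission_path, demo_ids, constant_baseline(len(demo_ids)))

with open(submission_path, newline="") as f:
    demo_rows = list(csv.reader(f))
```

Validating the predictions before writing is a deliberate choice here: checking locally that every value is -1 or +1 is a cheap way to catch a malformed file before uploading it.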
Rules
Each participant is allowed to make 5 submissions per day. If you participate as a team, the whole team gets 5 submissions, not 15 as the rules page states. Failed submissions (e.g. wrong submission file format) do not count.