The material in this repository was presented at a training workshop in Kigali, Rwanda, from July 15 to July 19, 2019. The training was organized by the African Institute for Mathematical Sciences (AIMS) as part of the mentorship for the winners of the Big Data Challenge.
The goal of the course is to introduce participants to the use of Python to perform data science tasks such as data ingestion, data analysis and machine learning, with a focus on processing large-scale datasets. This course differs from regular online courses in that it uses real-life datasets and case studies to challenge participants with real-world data science problems instead of toy problems. The course has five main components, as follows:
- Introduction to Python: the focus here is to provide participants with skills in Python programming which they can utilize in the rest of the course.
- Python for Data Science: here, the course provides a tour of the essential Python tools for data science so that the participants are familiar with them.
- Big Data Processing with PySpark: this component introduces tools for handling large-scale data. The focus is on Apache Spark as a distributed data processing engine.
- Machine Learning in Python: the course introduces participants to essential Python libraries for ML: scikit-learn and TensorFlow.
- Case Studies: in order to go beyond hello world and toy problems, the case studies challenge participants with real-life data science problems.
The materials are organised into folders by day. All the code lives in the src folder. Because the PowerPoint files are large, they are not included in the repository; instead, you can find up-to-date PowerPoint slides here. Some datasets are also not included in the repository. All the code uses Python 3.
In the Big Data Analytics with Python course, we will use the Python programming language to interact with data. To ensure that participants get the most out of the course, we require that you have basic skills in Python. To this end, I have suggested course materials that you should complete in preparation for the course.
Below are two links to free Python courses. You need only do one of the courses, but you can do both if you wish. Each will take less than 5 hours of your time. Once you finish the course(s), you will have the prerequisite Python knowledge to get the most out of the 5-day course.
We will use GitHub for tracking our code and submitting exercises. As such, it's important that you familiarize yourself with GitHub. Refer to the links below for GitHub training materials.