Author: Paul Gitonga Njoki
Client: With all major corporations developing original video content, Microsoft wants to join in and has chosen to open a new movie studio, but the company has no experience in movie production. Microsoft has tasked me with determining what steps it should take to enter this field. I was given several data files to evaluate, and my findings form the basis of my recommendations to the head of Microsoft's new movie studio on how to succeed in movie development.
METHOD: CRISP-DM
I will follow the CRISP-DM process for this task. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:
- Business understanding – To venture into movie production.
- Data understanding – the data came from top movie websites and was provided in advance.
- Data preparation – cleaning the data, removing unwanted columns, removing outliers, and converting columns to preferred data types.
- Modeling – visualization with matplotlib.
- Evaluation.
- Deployment.
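The data preparation phase listed above can be sketched roughly as follows. This is a minimal illustration using pandas on a toy table, not the project's actual cleaning code: the column names (`title`, `studio`, `domestic_gross`, `notes`) and the `prepare` helper are assumptions, and the 1.5 × IQR rule is just one common way to flag outliers, chosen here for the example.

```python
import pandas as pd

# Toy stand-in for one of the provided files; all column names are assumptions.
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "studio": ["X", "Y", "X", "Z", "Y", "X"],
    "domestic_gross": ["1,000", "1,200", "900", "1,100", "950", "50,000"],
    "notes": [None] * 6,
})

def prepare(df):
    """Hypothetical helper: drop a column, fix a dtype, filter outliers."""
    # Remove a column that is not needed for the analysis.
    out = df.drop(columns=["notes"])
    # Strip thousands separators and convert the text column to numbers.
    out["domestic_gross"] = pd.to_numeric(
        out["domestic_gross"].str.replace(",", "", regex=False)
    )
    # Keep rows within 1.5 * IQR of the quartiles (one possible outlier rule).
    q1, q3 = out["domestic_gross"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["domestic_gross"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].reset_index(drop=True)

clean = prepare(raw)
print(clean)
```

On the toy table, the single very large gross value falls outside the IQR bounds and is dropped, while the remaining rows keep a numeric `domestic_gross` column.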
Data Analysis Overview
In this analysis, I examine large data sets covering a wide range of movie genres. Reading the separate data files shows that they hold many kinds of information about each movie, such as the release date, the director, the studio, the average rating, domestic and foreign gross, and other details gathered from multiple movie websites. I used three different data sources in order to have the most comprehensive view of current movie performance.
- Rotten Tomatoes Data: The data set was provided in CSV format, with 1560 rows and 12 columns. By value counts, Drama is the most-produced genre, followed by Comedy.
- The Box Office Mojo Data: This was provided as a zipped CSV file, with 5 columns and a collection of 3387 movies. The data set was taken from the Box Office Mojo website and spans 2010 to 2018. According to the Mojo data, IFC is the studio with the most films.
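The row and column counts and the "most common genre or studio" figures quoted above are the kind of result that `shape` and `value_counts` give in pandas. The sketch below shows the idea on a small made-up table; the `genre` column name and the sample values are assumptions, not the real Rotten Tomatoes or Box Office Mojo data.

```python
import pandas as pd

# Made-up sample; the real files would be loaded with pd.read_csv instead.
movies = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Action", "Comedy", "Drama"],
})

# Number of rows and columns in the table.
rows, cols = movies.shape

# Count how often each genre appears, most frequent first.
genre_counts = movies["genre"].value_counts()
top_genre = genre_counts.idxmax()

print(rows, cols)
print(genre_counts)
print(top_genre)
```

The same pattern applied to a studio column would give the studio with the most films.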
I will start with a descriptive analysis of each data set. This lets me identify trends in the data that point to what makes a film successful. Most of this analysis is carried out by reviewing graphs of particular attributes.
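A descriptive pass of this kind might begin with summary statistics before any graphs are drawn. The following is a hedged sketch on invented values; the `studio` and `rating` column names are assumptions made for illustration only.

```python
import pandas as pd

# Invented sample values; column names are assumptions.
films = pd.DataFrame({
    "studio": ["X", "X", "Y", "Y", "Z"],
    "rating": [6.0, 8.0, 7.0, 5.0, 9.0],
})

# Overall summary statistics (count, mean, std, min, quartiles, max).
overall = films["rating"].describe()

# Average rating per studio, highest first.
by_studio = (
    films.groupby("studio")["rating"].mean().sort_values(ascending=False)
)

print(overall)
print(by_studio)
```

A grouped series like `by_studio` is also a convenient input for a bar chart, for example with `by_studio.plot(kind="bar")` when matplotlib is installed.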