Big Data Industry Process
Definition: The big data process is a set of activities (business understanding, data collection, data exploration, data preprocessing, data mining, model evaluation, and deployment) carried out together in order to extract hidden information from a mass of data.
Fig.1: General overview of big data process
Big data process activities:
From my experience in data science, I summarize the big data process in the following steps:
Step1: Understand the business
In this step, we aim to:
Define the problem and its scope clearly
Have a clear view of the goal
Map out the path to the objective
Page 1 – Big Data Industry Process – Adil ZEAARAOUI
Step2: Collect the data
Import and collect the data from different sources such as an RDBMS, a data lake store, a data warehouse, etc.
Step3: Understand and explore the data
Before any kind of development, we must first explore our dataset. The exploration consists of:
Explore features
Distinguish categorical features from numerical ones
Do statistical analysis: min, max, mean, standard deviation, variance, etc.
Visualize the data: missing values for each feature, unique values, how values are distributed, etc.
Identify the features that are important to the business
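The exploration tasks above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the toy dataset, the feature names (age, income, city) and the helper functions are hypothetical, and on a real project a dedicated library such as pandas would typically be used instead.

```python
import statistics

# Hypothetical toy dataset: one dict per record, None marks a missing value.
rows = [
    {"age": 25, "income": 3200.0, "city": "Paris"},
    {"age": 31, "income": None, "city": "Lyon"},
    {"age": 47, "income": 5100.0, "city": "Paris"},
    {"age": 38, "income": 4100.0, "city": None},
]

def is_numerical(values):
    """Treat a feature as numerical if every present value is an int or float."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )

def summarize(rows):
    """Build a per-feature summary: kind, missing count, unique count, statistics."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        if is_numerical(values):
            info.update(
                kind="numerical",
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                std=statistics.stdev(present),          # sample standard deviation
                variance=statistics.variance(present),  # sample variance
            )
        else:
            info["kind"] = "categorical"
        summary[name] = info
    return summary

summary = summarize(rows)
for name, info in summary.items():
    print(name, info)
```

The summary separates categorical features from numerical ones and reports the missing and unique values for each feature, which is usually enough to decide what the pre-processing step has to handle.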
Step4: Pre-process the data
This is the most important step in the big data process; it can take up to 90% of the whole process. Its purpose is to prepare the data before mining it. We must:
Correct wrong input values
Remove missing values
Fill the remaining missing values
Discretize continuous features
Remove correlated features
Normalize features if required
Remove outliers if necessary
Etc.
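Some of the pre-processing tasks listed above can be sketched as follows. The specific techniques are assumptions chosen for illustration, since the text does not prescribe any: median imputation for missing values, the 1.5 x IQR rule for outliers, min-max scaling for normalization, and equal-width bins for discretization. The income values and function names are hypothetical.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the present values (assumed strategy)."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def min_max_normalize(values):
    """Rescale the values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, n_bins=3):
    """Map each value to an equal-width bin index between 0 and n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical feature with one missing value and one obvious outlier.
incomes = [3200.0, None, 5100.0, 4100.0, 3900.0, 98000.0]

filled = fill_missing(incomes)
cleaned = remove_outliers(filled)
normalized = min_max_normalize(cleaned)
bins = discretize(cleaned, n_bins=3)

print(filled)
print(cleaned)
print(normalized)
print(bins)
```

The order of the operations matters: here the outlier is removed before normalization, otherwise a single extreme value would compress all the other values close to 0.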
Step5: Develop your model (data mining)
After building a clean and “ready to process” dataset, it is time to build our model:
Transform the dataset if required
Apply the machine-learning algorithm
Step6: Evaluate and deploy the model
Before deployment, we must validate the model and measure how accurate it is. So we must:
Evaluate and test the model
Review and enhance it
Deploy the model
Automate the system workflow
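As a minimal sketch of building a model and then evaluating it before deployment, the example below trains on one part of a dataset and measures accuracy on a held-out part. The text does not name an algorithm, so a 1-nearest-neighbour classifier is used here purely as a stand-in; the dataset, the labels and the function names are hypothetical.

```python
import math

def nearest_neighbor_predict(train, point):
    """Predict the label of `point` as the label of its closest training sample."""
    closest = min(train, key=lambda sample: math.dist(sample[0], point))
    return closest[1]

def accuracy(train, test):
    """Fraction of test samples whose predicted label equals the true label."""
    correct = sum(
        1 for features, label in test
        if nearest_neighbor_predict(train, features) == label
    )
    return correct / len(test)

# Hypothetical, already pre-processed dataset: (features, label) pairs.
dataset = [
    ((0.10, 0.20), "low"), ((0.15, 0.10), "low"),
    ((0.20, 0.25), "low"), ((0.05, 0.30), "low"),
    ((0.80, 0.90), "high"), ((0.85, 0.75), "high"),
    ((0.90, 0.95), "high"), ((0.75, 0.80), "high"),
]

# Hold out every fourth sample for testing; train on the rest.
test = dataset[::4]
train = [sample for i, sample in enumerate(dataset) if i % 4 != 0]

score = accuracy(train, test)
print(f"accuracy on held-out data: {score:.2f}")
```

Evaluating on samples the model has never seen is what makes the score meaningful; measuring accuracy on the training data itself would overstate how well the model will behave once deployed.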