Data Analysis Of Meteorological Data

What is Data Analysis?

Aleena Mishra
Oct 14, 2020

Data analytics is the science of analyzing raw data in order to draw conclusions from it. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption. Data analytics is a broad term that encompasses many diverse types of data analysis. Any type of information can be subjected to data analytics techniques to gain insight that can be used to improve things.

In this article, we analyze 10 years of weather data from Finland, a northern European country. The dataset used in this project is available on Kaggle.

What is meteorological data?

Meteorological data consists of physical parameters that are measured directly by instruments, including temperature, dew point, wind direction, wind speed, cloud cover, cloud layer(s), ceiling height, visibility, current weather, and precipitation amount. We can analyze this data with statistical methods and draw conclusions from it.

We carry out the analysis using statistics, specifically hypothesis testing. The question we examine is: "Do the apparent temperature and humidity, compared monthly across 10 years of data, indicate an increase due to global warming?" Hypothesis testing lets us decide whether to reject the null hypothesis or not. So what do we mean by hypothesis testing? A statistical hypothesis is an assumption about a population parameter, which may or may not be true. It is testable on the basis of observed data modeled as the realised values taken by a collection of random variables. Hypothesis testing refers to the formal procedures used by statisticians to reject or fail to reject statistical hypotheses. There are two types of statistical hypotheses.

  • Null hypothesis. The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.
  • Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.
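The choice between the two hypotheses can be sketched on made-up data. The snippet below is only an illustration under stated assumptions: the two samples are synthetic, the `decide` helper is a hypothetical name, and the 0.05 significance level is the conventional default rather than anything dictated by the data.

```python
import numpy as np
from scipy import stats


def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


rng = np.random.default_rng(0)
# Two synthetic samples with the same spread whose true means differ by 1.0
sample_a = rng.normal(loc=10.0, scale=1.0, size=200)
sample_b = rng.normal(loc=11.0, scale=1.0, size=200)

# Welch's t-test: H0 says both samples come from populations with equal means
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(decide(p_value))
```

A small p-value means the observed difference would be unlikely if only chance were at work, so H0 is rejected in favour of the alternative.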

If the result favours the null hypothesis, we fail to reject it; otherwise we reject it. The code is given below.

First we imported the required libraries for data cleaning and data visualization.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Then we read the data from the dataset.

df = pd.read_csv("/content/drive/My Drive/weatherHistory.csv")
df.head(10)
df.columns

Here we show the column names of the dataset and the number of non-null values each column contains, and then we count the null values.

# Non-null values per column
df.count()
# Null values per column
df.isnull().sum()

After that, we dropped the columns that are not needed for this analysis: 'Precip Type', 'Summary', and 'Daily Summary'.

df_new = df.drop(['Precip Type', 'Summary', 'Daily Summary'], axis=1)
df_new.head(10)
df_new.info()

As the 'Formatted Date' column is of string type, we converted it to datetime format using the to_datetime() function.

df_new['Formatted Date'] = pd.to_datetime(df_new['Formatted Date'], utc=True)
df_new.dtypes
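To see what utc=True does, here is a small sketch on two made-up timestamps. It assumes the date strings carry UTC offsets such as +0200 and +0100; the sample values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up timestamps with different UTC offsets (summer and winter time)
raw = pd.Series([
    "2006-04-01 00:00:00.000 +0200",
    "2006-12-01 00:00:00.000 +0100",
])

# utc=True converts every value to UTC, so the column gets one datetime dtype
parsed = pd.to_datetime(raw, utc=True)
print(parsed)
```

Each local time is shifted by its own offset, so midnight at +0200 becomes 22:00 UTC on the previous day.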

To perform resampling, we set 'Formatted Date' as the index using set_index().

def_new = df_new.set_index('Formatted Date')
def_new.head(3)

Then we resampled the data to monthly averages. To do this, we used the resample() function with the argument 'M' and took the mean of each month.

# 'M' is the month-end frequency (pandas 2.2 and later prefer the alias 'ME')
df_fin = def_new.resample('M').mean()
df_fin.head()
df_fin.shape
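The effect of resampling is easier to see on a tiny example. The sketch below uses made-up daily values and a hypothetical column name `temp`. It uses the alias "MS" (month start), which labels each month by its first day, instead of the month-end alias used above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for January and February 2020 (60 days)
days = pd.date_range("2020-01-01", "2020-02-29", freq="D")
daily = pd.DataFrame({"temp": np.arange(60, dtype=float)}, index=days)

# Group the rows by calendar month and average each group
monthly = daily.resample("MS").mean()
print(monthly)

# Keep only the rows for one calendar month, here February
february = monthly[monthly.index.month == 2]
print(february)
```

The 60 daily rows collapse to two monthly rows, and filtering on index.month then picks out a single month in the same way the April rows are selected in this article.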

After that, we performed a t-test on the two features we were given, apparent temperature and humidity, to check whether the null hypothesis can be rejected. If the p-value from the test is less than the threshold (set to 0.05), we reject the null hypothesis; otherwise we fail to reject it. We ran the test for the month of April, so we first extracted the rows that correspond to April.

df_1 = df_fin[df_fin.index.month == 4]
df_1.head()

To perform the t-test, we imported the scipy.stats library and calculated the p-value with the stats.ttest_ind() function.

import scipy.stats as stats
# Welch's t-test (equal_var=False does not assume equal variances)
ttest, p_value = stats.ttest_ind(a=df_1['Apparent Temperature (C)'], b=df_fin['Humidity'], equal_var=False)
p_value

Clearly, our p-value is less than the threshold, so we reject the null hypothesis: the difference between the two samples is unlikely to be due to chance alone. The test on its own does not show how the values change from year to year.

We can check this with some visualizations.

# Scatter plots of the April averages for each year
sns.set_style("ticks")
sns.set_context("poster")
fig, axes = plt.subplots(1, 2, squeeze=False, figsize=(20, 7))
sns.scatterplot(x=df_1.index, y=df_1['Apparent Temperature (C)'], ax=axes[0][0])
sns.scatterplot(x=df_1.index, y=df_1['Humidity'], ax=axes[0][1])

# Line plots of the same data
sns.set_style("ticks")
fig, axes = plt.subplots(1, 2, squeeze=False, figsize=(20, 7))
sns.lineplot(x=df_1.index, y=df_1['Apparent Temperature (C)'], ax=axes[0][0])
sns.lineplot(x=df_1.index, y=df_1['Humidity'], ax=axes[0][1])

From the scatter plots and line plots, we can see that the apparent temperature is increasing over the years, but there is a small deviation in the case of humidity.
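A trend read off a plot can also be checked numerically. The following is a minimal sketch of one way to do that, not the method used in this article: it fits a straight line with scipy.stats.linregress to made-up yearly April averages. The years, the 0.1 degree rise per year, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
years = np.arange(2006, 2017)  # 11 hypothetical years

# Made-up April averages: a rise of 0.1 per year plus small random noise
april_temp = 10.0 + 0.1 * (years - 2006) + rng.normal(0.0, 0.05, size=years.size)

# Least-squares line; the p-value tests H0 "the slope is zero"
fit = stats.linregress(years, april_temp)
print(f"slope = {fit.slope:.3f} per year, p-value = {fit.pvalue:.3g}")

# A positive slope with a small p-value points to an upward trend
increasing = fit.slope > 0 and fit.pvalue < 0.05
print(increasing)
```

With the real data, the same call could be applied to the year of each April row and its average apparent temperature or humidity.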

The source code above is available on GitHub.

“I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com”
