How to explore your csv file with pandas?

  Pandas, Python, Uncategorized, Worknotes

A comprehensive understanding of the overall situation of the data is the first step in data analysis. The following sample code shows you how to do this simply with pandas. Use of functions columns, keys(),axes,dtypes,info(),describe(),describe(include=object),isna().sum(),nunique(),value_counts(),len,shape,nsmallest(), nlargest(),sample(),head and tail.

Read data


#dowonload https://github.com/ziwangdeng/Data/blob/main/Vancouver_weather2010to2019_v00.csv
import pandas as pd
df=pd.read_csv('Vancouver_weather2010to2019_v00.csv')

Check column names


df.columns

Index(['Unnamed: 0', 'Longitude (x)', 'Latitude (y)', 'Station Name',
       'Climate ID', 'Date/Time (LST)', 'Year', 'Month', 'Day', 'Time (LST)',
       'Temp (°C)', 'Temp Flag', 'Dew Point Temp (°C)', 'Dew Point Temp Flag',
       'Rel Hum (%)', 'Rel Hum Flag', 'Precip. Amount (mm)',
       'Precip. Amount Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
       'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)',
       'Visibility Flag', 'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx',
       'Hmdx Flag', 'Wind Chill', 'Wind Chill Flag', 'Weather'],
      dtype='object')

df.keys()
Index(['Unnamed: 0', 'Longitude (x)', 'Latitude (y)', 'Station Name',
       'Climate ID', 'Date/Time (LST)', 'Year', 'Month', 'Day', 'Time (LST)',
       'Temp (°C)', 'Temp Flag', 'Dew Point Temp (°C)', 'Dew Point Temp Flag',
       'Rel Hum (%)', 'Rel Hum Flag', 'Precip. Amount (mm)',
       'Precip. Amount Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
       'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)',
       'Visibility Flag', 'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx',
       'Hmdx Flag', 'Wind Chill', 'Wind Chill Flag', 'Weather'],
      dtype='object')

Refer to the 2 dimensions of a DataFrame as axes


df.axes

[RangeIndex(start=0, stop=88392, step=1),
 Index(['Unnamed: 0', 'Longitude (x)', 'Latitude (y)', 'Station Name',
        'Climate ID', 'Date/Time (LST)', 'Year', 'Month', 'Day', 'Time (LST)',
        'Temp (°C)', 'Temp Flag', 'Dew Point Temp (°C)', 'Dew Point Temp Flag',
        'Rel Hum (%)', 'Rel Hum Flag', 'Precip. Amount (mm)',
        'Precip. Amount Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
        'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)',
        'Visibility Flag', 'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx',
        'Hmdx Flag', 'Wind Chill', 'Wind Chill Flag', 'Weather'],
       dtype='object')]

Check data type of each column


df.dtypes
Unnamed: 0               int64
Longitude (x)          float64
Latitude (y)           float64
Station Name            object
Climate ID               int64
Date/Time (LST)         object
Year                     int64
Month                    int64
Day                      int64
Time (LST)              object
Temp (°C)              float64
Temp Flag               object
Dew Point Temp (°C)    float64
Dew Point Temp Flag     object
Rel Hum (%)            float64
Rel Hum Flag            object
Precip. Amount (mm)    float64
Precip. Amount Flag     object
Wind Dir (10s deg)     float64
Wind Dir Flag           object
Wind Spd (km/h)        float64
Wind Spd Flag           object
Visibility (km)        float64
Visibility Flag        float64
Stn Press (kPa)        float64
Stn Press Flag          object
Hmdx                   float64
Hmdx Flag              float64
Wind Chill             float64
Wind Chill Flag        float64
Weather                 object
dtype: object

Display all columns and their data types


df.info()

RangeIndex: 88392 entries, 0 to 88391
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           88392 non-null  int64  
 1   Longitude (x)        88392 non-null  float64
 2   Latitude (y)         88392 non-null  float64
 3   Station Name         88392 non-null  object 
 4   Climate ID           88392 non-null  int64  
 5   Date/Time (LST)      88392 non-null  object 
 6   Year                 88392 non-null  int64  
 7   Month                88392 non-null  int64  
 8   Day                  88392 non-null  int64  
 9   Time (LST)           88392 non-null  object 
 10  Temp (°C)            58173 non-null  float64
 11  Temp Flag            7 non-null      object 
 12  Dew Point Temp (°C)  58159 non-null  float64
 13  Dew Point Temp Flag  21 non-null     object 
 14  Rel Hum (%)          58157 non-null  float64
 15  Rel Hum Flag         23 non-null     object 
 16  Precip. Amount (mm)  0 non-null      float64
 17  Precip. Amount Flag  25 non-null     object 
 18  Wind Dir (10s deg)   58046 non-null  float64
 19  Wind Dir Flag        15 non-null     object 
 20  Wind Spd (km/h)      58164 non-null  float64
 21  Wind Spd Flag        16 non-null     object 
 22  Visibility (km)      58180 non-null  float64
 23  Visibility Flag      0 non-null      float64
 24  Stn Press (kPa)      58160 non-null  float64
 25  Stn Press Flag       20 non-null     object 
 26  Hmdx                 3166 non-null   float64
 27  Hmdx Flag            0 non-null      float64
 28  Wind Chill           2454 non-null   float64
 29  Wind Chill Flag      0 non-null      float64
 30  Weather              26060 non-null  object 
dtypes: float64(15), int64(5), object(11)
memory usage: 20.9+ MB

Summarizes some statistical data for all numerical columns


df.describe()

Sisplay some descriptive statistics of object columns


df.describe(include=object)

Check how many missing data in each column


df.isna().sum()
Unnamed: 0                 0
Longitude (x)              0
Latitude (y)               0
Station Name               0
Climate ID                 0
Date/Time (LST)            0
Year                       0
Month                      0
Day                        0
Time (LST)                 0
Temp (°C)              30219
Temp Flag              88385
Dew Point Temp (°C)    30233
Dew Point Temp Flag    88371
Rel Hum (%)            30235
Rel Hum Flag           88369
Precip. Amount (mm)    88392
Precip. Amount Flag    88367
Wind Dir (10s deg)     30346
Wind Dir Flag          88377
Wind Spd (km/h)        30228
Wind Spd Flag          88376
Visibility (km)        30212
Visibility Flag        88392
Stn Press (kPa)        30232
Stn Press Flag         88372
Hmdx                   85226
Hmdx Flag              88392
Wind Chill             85938
Wind Chill Flag        88392
Weather                62332
dtype: int64

Check how many unique values in each column


df.nunique()
Unnamed: 0               744
Longitude (x)              1
Latitude (y)               1
Station Name               1
Climate ID                 1
Date/Time (LST)        88392
Year                      11
Month                     12
Day                       31
Time (LST)                24
Temp (°C)                379
Temp Flag                  1
Dew Point Temp (°C)      393
Dew Point Temp Flag        1
Rel Hum (%)               84
Rel Hum Flag               1
Precip. Amount (mm)        0
Precip. Amount Flag        1
Wind Dir (10s deg)        37
Wind Dir Flag              1
Wind Spd (km/h)           64
Wind Spd Flag              1
Visibility (km)           31
Visibility Flag            0
Stn Press (kPa)          578
Stn Press Flag             1
Hmdx                      11
Hmdx Flag                  0
Wind Chill                16
Wind Chill Flag            0
Weather                   59
dtype: int64

Examine how often specific values occur in a column


df['Weather'].value_counts().head(15)

Rain                 5491
Cloudy               5171
Mostly Cloudy        4303
Mainly Clear         4165
Clear                2408
Fog                  1148
Rain Showers          926
Rain,Fog              826
Snow                  349
Drizzle,Fog           238
Moderate Rain         236
Moderate Rain,Fog     200
Smoke                 114
Drizzle               105
Rain,Drizzle,Fog       55
Name: Weather, dtype: int64

Check how many rows in the data


len(df)
88392

Check the shape of the table


df.shape
(88392, 31)

Check n smallest values of a column


df.nsmallest(5, 'Stn Press (kPa)')

Check n largest values of a column


df.nlargest(5, 'Stn Press (kPa)')

Randomly select fraction of rows


df.sample(frac=0.01)

Randomly select n rows


df.sample(n=5)

Select first n rows


df.head(5)

Select last n rows


df.tail(5)