National Climatic Data Center Weather Data using SQL Server

25 Mar 2017

The following analysis is conducted using historical weather data from 2000-2017 within the surrounding Boston area provided by the National Climatic Data Center. I began by using the SQL Server Data Tools GUI to import the .csv files I downloaded from the NCDC servers into a table I called weather_table:

The .csv files I used for this data set can be downloaded here.


Create table weather_table (
[STATION] varchar(50),
[STATION_NAME] varchar(50),
[WEATHER_DATE] date,
[MDPR] int,
[MDSF] int,
[DAPR] int,
[DASF] int,
[PRCP] int,
[SNWD] int,
[SNOW] int,
[TAVG] int,
[TMAX] int,
[TMIN] int,
[TOBS] int
);

Upon reading the accompanying documentation for the weather data and inspecting the imported rows, I have noticed there were a substantial number of rows with missing data as denoted with the value -9999 since apparently not all the data were recorded. I decided to update all the -9999 values in these rows as null.


Update weather_table set -- convert all -9999 values into NULLS
tavg = case when tavg = -9999 THEN NULL else tavg end, 
tmin = case when tmin = -9999 THEN NULL else tmin end, 
tmax = case when tmax = -9999 THEN NULL else tmax end;

Sample Queries

1) Return past weather data from one and two years ago from today’s date:


SELECT 
    station AS 'Station', station_name AS 'Station Name', CONVERT(varchar(8), weather_date, 1) AS 'Date', -- give past weather data from 1 and 2 years ago
CASE 
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(SUM((tmin+tmax)/2) AS varchar(50)) ELSE 'No data' END AS 'Average Temperature',
CASE 
    WHEN tmin IS NULL THEN 'No data' ELSE CAST(tmin AS varchar(50)) END AS 'Minimum Temperature', --CAST() isto bypass printing string restriction in a INT only column
CASE 
    WHEN tmax IS NULL  THEN 'No data' ELSE CAST(tmax AS varchar(50)) END AS 'Maximum Temperature', 
CASE 
    WHEN prcp IS NULL  THEN 'No data' ELSE CAST(prcp AS varchar(50)) END AS 'Precipitation', tavg, tmin, tmax
FROM 
    weather_table 
WHERE 
    weather_Date in (DATEADD(year,-1,CONVERT(datetime,CONVERT(nvarchar(11),GETDATE()))), DATEADD(year,-2,CONVERT(datetime,CONVERT(nvarchar(11),GETDATE()))))
GROUP BY 
    station_name, weather_date, tavg, tmin, tmax, station, prcp
ORDER BY 
    station_name, weather_date

Output (105 rows):

The case statement for Average Temperature, Minimum Temperature, and Maximum Temperature returns the string value in the event tavg, tmin, and tmax have null values but prioritizes to show numerical values if they exist. I included the tavg, tmin, and tmax fields in the tables is to give an accurate representation of what the data appeared after importing. Additionally, I decided to omit the Precipitation field from this point on because the rows mostly contained either 0 or null values.

2) Return the coldest day on average of a specified date:


SELECT 
    station_name, station, CONVERT(VARCHAR(8), weather_date, 1) AS 'Date', tavg, tmin, tmax, -- coldest day on avg by date
CASE 
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(CAST(SUM((tmin+tmax)/2.0) AS decimal(16,2)) AS varchar(50)) ELSE 'no data' END AS 'Average Temperature',
CASE 
    WHEN tmin IS NULL THEN 'No data' ELSE CAST(tmin AS varchar(50)) END AS 'Minimum Temperature', --CAST() is to bypass printing string restriction in a INT only column
CASE 
    WHEN tmax IS NULL  THEN 'No data' ELSE CAST(tmax AS varchar(50)) END AS 'Maximum Temperature'
FROM 
    weather_table WHERE weather_date = '1/16/04' and ((tmin IS NOT NULL and tmax IS NOT NULL) or tavg IS NOT NULL) -- filter out rows that are null tavg or a null tmin and tmax
GROUP BY 
    station_name, station, tavg, tmin, tmax, weather_date  
ORDER BY 
    CASE WHEN (tavg IS NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN (SUM((tmin + tmax)/2.0)) ELSE tavg END;

Output (20 rows):

Notice how the tavg column display null values since the column has no recorded data from the sourced weather data. I decided for the Average Temperature column to return the average of tmin and tmax in the event tavg is missing and use tmin and tmax as proxy variables to provide an accurate substitute for tavg values. In the event the tavg values are present, prioritize and return those values first. This is the what my case statement is trying to accomplish.

3) Return the hottest day on average from last year:


SELECT 
    TOP 1 station_name, station, weather_date, tavg, tmin, tmax,
CASE 
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NOT NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(tavg AS varchar(50))
    WHEN (tavg IS NULL and tmin IS NOT NULL and tmax IS NOT NULL) THEN CAST(CAST(SUM((tmin+tmax)/2.0) AS decimal(16,2)) AS varchar(50)) END AS 'Average Temperature'
FROM 
    weather_table 
WHERE 
    station_name like 'Boston%' and WEATHER_DATE >= (DATEADD(dd,-365,CONVERT(datetime,CONVERT(nvarchar(11),GETDATE())))) 
GROUP BY 
    station_name, station, tavg, tmin, tmax, weather_date 
ORDER BY 
    'Average Temperature' DESC

Output (1 row):