This post describes a Python-based workflow to plot and analyze large amounts of hydrological data, and the possibilities this kind of visualization opens up. It continues some of the parts of this post and offers a more detailed version of my 2016 AGU fall meeting poster.
The general issue when working with many and/or large files of hydrological time series (or similar datasets) is that tools such as the well-known LibreOffice or MS Office aren't really meant to handle large files, or lots of files simultaneously. They are nice and easy for visualizing a single dataset or a few very small ones, but as soon as file sizes start to get measured in megabytes, or different data types and timeframes need to be combined, the fun starts.
[Figure: the five example time series plotted naively, straight from the raw files]
So what can we see in the plot above?
Apparently, there are five time series in there, but we can only see three. Those three series have different lengths and different ranges: 'B' starts at the end of 1899 and ends somewhere around 1940, while 'D' and 'F' show something that happened around 1900.
What should we see in the plot above?
Two daily precipitation time series, 'C' and 'F' (double-digit millimeters), one daily river stage series, 'B' (triple-digit centimeters), and two monthly groundwater level series, 'D' and 'E' (triple-digit meters), all in the time from 1975 to 2010.
Now, of course this is easily fixable, and that is probably the best course of action when you just need this one plot right now, but if you encounter tasks like this more often, a better tool might be worth using.
What I am describing here is a Python workflow that builds one large dataframe out of standardized time series, read from a folder full of CSV files, which can then be used to plot the data or for further analysis.
Unfortunately, WordPress does not allow the upload of text files, hence the code (and the example dataset at the end) is embedded in this post. Just copy the text and save it as .py and .csv files respectively.
In order to use these files, you need a Python installation and the modules pandas, matplotlib and numpy.
To begin: a few words of warning.
Like many (most?) folks in the natural sciences, I never formally learned programming but suddenly found myself in a situation where I was simply expected to do it. The result of this mechanism leads to about 32,100,000 results on Google when searching for "scientist bad code". Now, I don't claim that the following code is bad, but the fact that I wouldn't even know how to identify bad code ("it runs, so it can't be bad, since it obviously does work…") is probably telling. But just keeping this code hidden, so that everyone who wants to replicate something has to reinvent the wheel, is not exactly a solution either. So if you find anything terrible in these lines, please do leave a comment!
With that out of the way, back to our plotting problem. First, let's have a look at the following script, which takes a bunch of CSV files and produces some other CSV and H5 files:
#This script reads a directory full of CSV files with groundwater, river water and precipitation
#data, identifies their type, standardizes them and outputs three large CSV and HDF files.
#The folder must contain files for all three types and no other CSV files, else it will fail
#or need some changes.

#import the helper modules shown below, plus the usual suspects
from read_CSV_timeseries import readcsvts
from standardize_timeseries import standardizer
from standardize_timeseries_PRECIP import PRECIP_standardizer
import glob
import pandas as pd
import matplotlib.pyplot as plt

#Build some empty dataframes to be filled with the respective datasets
dfdates = pd.date_range('01/01/2000', '01/01/2013', freq='MS')
Header = pd.MultiIndex.from_product([[], [], [], []],
                                    names=['name', 'location', 'number', 'datatype'])
df_GW = pd.DataFrame([], index=dfdates, columns=Header)
df_RW = pd.DataFrame([], index=dfdates, columns=Header)
df_PRECIP = pd.DataFrame([], index=dfdates, columns=Header)
#an accumulator for the raw precipitation data, so we don't overwrite it in the loop
df_raw_PRECIP = pd.DataFrame(index=dfdates)

#get a list of all CSVs in the directory and loop over all the files
csv_list = glob.glob("*.csv")
for filename in csv_list:
    #turn the data from the CSV into a dataframe and grab some information
    #about the measurement station
    csvoutput = readcsvts(filename)
    timeseries = csvoutput['timeseries']
    name = csvoutput['name']
    location = csvoutput['location']
    datatype = csvoutput['datatype']
    number = csvoutput['number']
    #note: use == for string comparison, not "is"
    if datatype == 'Groundwater':
        Stand_GW = standardizer(timeseries, name, location, datatype, number)
        #add it into our large groundwater dataframe
        df_GW = pd.concat([df_GW, Stand_GW], axis=1)
    if datatype == 'Riverwater':
        Stand_RW = standardizer(timeseries, name, location, datatype, number)
        df_RW = pd.concat([df_RW, Stand_RW], axis=1)
    if datatype == 'Precipitation':
        timeseries = timeseries.resample('MS').sum()
        #Assuming we are looking at a small region, it makes sense to average the
        #precipitation measuring stations. Hence we only collect the raw data here,
        #and standardize the mean below.
        df_raw_PRECIP = pd.concat([df_raw_PRECIP, timeseries], axis=1)

#average the stations and standardize the mean series
df_mean_PRECIP = df_raw_PRECIP.mean(axis=1).to_frame('level')
df_PRECIP = PRECIP_standardizer(df_mean_PRECIP, name, location, datatype, number)

#Plot the dfs, as a quick visual control
df_PRECIP.plot()
df_GW.plot()
df_RW.plot()
plt.show()

#Export it all as CSV files, to have easily readable files with all our standardized
#data in case we want to open them with a spreadsheet application
df_GW.to_csv('df_GW.csv', sep=';')
df_PRECIP.to_csv('df_PRECIP.csv', sep=';')
df_RW.to_csv('df_RW.csv', sep=';')

#What we really want, though, are HDF files. They are binary and not trivial to read
#outside of python, so the CSVs serve as a fallback.
df_GW.to_hdf('df_GW.h5', 'df_GW', format='fixed')
df_PRECIP.to_hdf('df_PRECIP.h5', 'df_PRECIP', format='fixed')
df_RW.to_hdf('df_RW.h5', 'df_RW', format='fixed')
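To make the four-level column header concrete, here is a tiny standalone sketch of what one station's slice of the big dataframes looks like (station name and values invented for illustration):

import pandas as pd

#one station's four-level column header: name, location, number, datatype
header = pd.MultiIndex.from_product([['Well-A'], [450.0], ['1234'], ['Groundwater']],
                                    names=['name', 'location', 'number', 'datatype'])
dates = pd.date_range('2000-01-01', periods=3, freq='MS')
df = pd.DataFrame([[0.3], [-1.2], [0.8]], index=dates, columns=header)
print(df)
#concatenating many such one-column frames along axis=1 yields the big dataframes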
Besides the obvious modules, this also needs read_CSV_timeseries, standardize_timeseries and standardize_timeseries_PRECIP, shown below:
def readcsvts(filename):
    """
    Reads in a CSV file containing a hydrologic time series.
    Fitted to read the testdata provided with it.
    """
    #import the needed modules
    import csv
    import pandas as pd

    #since filename is kinda generic, change it to current_csv
    current_csv = filename

    #Now open the file with the standard python csv reader and import the
    #first 10 rows to analyze the header
    infile = open(current_csv, 'r')
    table = []
    rows = 0
    for row in csv.reader(infile):
        rows += 1
        if rows < 10:
            table.append(row)
    infile.close()

    #Extract the name of the station:
    #Since rows get turned into strings of lists, we can use the brackets and quotes
    #as markers: the name comes directly after the ; and ends with ']
    StationNameString = str(table[0])
    NameStart = StationNameString.find(';') + 1
    NameEnd = StationNameString.find(']') - 1
    StationName = StationNameString[NameStart:NameEnd]

    #Grab various important information from the header and find the line
    #in which the data starts
    linecounter = 0
    while linecounter < min(9, len(table)):
        #must be turned into a string, otherwise I can't search it for substrings
        tester = str(table[linecounter])
        if 'Waterlevel' in tester:
            Data_type_str = 'Groundwater'
        if 'Stage' in tester:
            Data_type_str = 'Riverwater'
        if 'Precip' in tester:
            Data_type_str = 'Precipitation'
        if 'Location' in tester:
            LocStart = tester.find(';') + 1
            #strip the trailing '] of the stringified list
            location = float(tester[LocStart:-2])
        if 'Number' in tester:
            NumberStart = tester.find(';') + 1
            Number = tester[NumberStart:-2]
        if 'date' in tester:
            DataStart = linecounter
        linecounter += 1

    #Read in the data from the file
    hydroTS = pd.read_csv(current_csv, sep=';', skiprows=DataStart,
                          usecols=['date', 'level'], index_col='date',
                          dayfirst=True, parse_dates=True)
    #For reasons I don't understand, the data sometimes gets read as object,
    #but I need float, so...
    hydroTS.level = hydroTS.level.astype(float)

    #get the lowest and the largest date
    Mini_date = str(hydroTS.index.min())
    Maxi_date = str(hydroTS.index.max())

    #Find the frequency of the data from the timedelta between the first two dates.
    #Its .days attribute gives an int with the number of days between two
    #measurements: 1 = daily timeseries; 28-31 = monthly
    datedelta = hydroTS.index[1] - hydroTS.index[0]
    if datedelta.days == 1:
        TS_freq = 'D'
    else:
        TS_freq = 'MS'

    #build an empty date range with the same start, end and frequency as the
    #original data. If its length differs from the real data's length,
    #there must be a gap in the data.
    testdates = pd.date_range(Mini_date, Maxi_date, freq=TS_freq)
    if len(testdates) != len(hydroTS):
        Data_type_str = 'Invalid'
        Errorstate = 1
        print('there is a gap in the data')
    else:
        Errorstate = 0
        print('there is no gap in the data')
    #This is just a very basic error check, serving as a reminder that, depending
    #on the nature of the data, it makes sense to implement some more checks for
    #missing or erroneous data, which mostly depends on how the input is formatted.
    #The Austrian data I am mostly working with, for example, has the nasty habit
    #of marking missing values with the German word for gap, which necessitates a
    #lot of "fun" with umlauts. See this rant for more on that:
    #https://climatefootnotes.com/2016/08/02/imagine-pythons-and-data-analysis/

    output = {'Error': Errorstate, 'name': StationName, 'location': location,
              'number': Number, 'datatype': Data_type_str, 'timeseries': hydroTS}
    return output
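As a quick check that the reader works, a minimal usage sketch (the filename sample_GW.csv is hypothetical; any file formatted like the samples at the end of this post will do):

from read_CSV_timeseries import readcsvts

#read one station file and inspect the metadata and the first values
csvoutput = readcsvts('sample_GW.csv')
print(csvoutput['name'], csvoutput['datatype'], csvoutput['location'])
print(csvoutput['timeseries'].head())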
def standardizer(timeseries, name, location, datatype, number):
    """
    Takes a timeseries and standardizes it in a simple way, but takes into
    account some ideas from indices such as the SPI or SGI.
    """
    #import stuff
    import pandas as pd

    #Groundwater is monthly data in our example, whereas river stages are daily.
    #Find out the frequency and resample if needed. The timedelta's .days attribute
    #gives the days between two measurements: 1 = daily; 28-31 = monthly
    datedelta = timeseries.index[1] - timeseries.index[0]
    if datedelta.days == 1:
        #turn the daily stages into monthly means (a monthly sum of river
        #stages would not be meaningful)
        timeseries = timeseries.resample('MS').mean()

    #calculate a simple standardization: for each calendar month, subtract that
    #month's mean and divide by that month's standard deviation
    monthly_STS = []
    for month in range(1, 13):
        current_TS = timeseries[timeseries.index.month == month]
        current_STS = (current_TS - current_TS.mean()) / current_TS.std()
        monthly_STS.append(current_STS)

    #Now we put the months together into yearly data again
    standardized_TS = pd.concat(monthly_STS).sort_index()

    #give it a header and return the standardized time series
    dfHeader = pd.MultiIndex.from_product([[name], [location], [number], [datatype]],
                                          names=['name', 'location', 'number', 'datatype'])
    standardized_TS.columns = dfHeader
    return standardized_TS
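For what it's worth, pandas can express the same per-month z-scoring in a single groupby-transform; a minimal sketch that should be equivalent in spirit to the loop above:

def standardize_by_month(timeseries):
    #z-score each calendar month against that month's own mean and stdev
    grouped = timeseries.groupby(timeseries.index.month)
    return grouped.transform(lambda x: (x - x.mean()) / x.std())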
def PRECIP_standardizer(timeseries, name, location, datatype, number):
    """
    Takes a timeseries and standardizes it in a simple way, by taking into
    account some ideas from indices such as the SPI or SGI. This module is just
    done for the sake of demonstration; for real data, drought indices such as
    the standardized groundwater index SGI are probably a better idea.
    """
    #This script is licenced under the BSD 2-Clause License
    #Copyright (c) 2016, Johannes Haas

    #import stuff
    import pandas as pd

    #Precipitation data always seems to be daily data. Turn it into monthly sums.
    monthly = timeseries.resample('MS').sum()

    #average the precipitation over different time spans (1, 3 and 12 months),
    #always starting from the monthly sums, not from an already-averaged series
    STS_list = []
    for window in [1, 3, 12]:
        #the rolling mean leaves NaNs at the start of the series, drop them
        averaged = monthly.rolling(window).mean().dropna()
        #for each calendar month, subtract that month's mean and divide by
        #that month's standard deviation
        monthly_STS = []
        for month in range(1, 13):
            current_TS = averaged[averaged.index.month == month]
            current_STS = (current_TS - current_TS.mean()) / current_TS.std()
            monthly_STS.append(current_STS)
        #put the months together into yearly data again
        STS_list.append(pd.concat(monthly_STS).sort_index())

    #Put the different standardization periods back together
    standardized_TS = pd.concat(STS_list, axis=1)

    #give it a header and return the standardized time series.
    #This assumes that we are going to average the precipitation over a small
    #catchment, hence the location and number are set by hand here.
    dfHeader = pd.MultiIndex.from_product([['Catchment-A'], ['0'], ['XXX'],
                                           ['STS1', 'STS3', 'STS12']],
                                          names=['name', 'location', 'number', 'datatype'])
    standardized_TS.columns = dfHeader
    return standardized_TS
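To make the averaging step concrete, here is a toy demonstration of what the rolling means do (values invented for illustration):

import pandas as pd

dates = pd.date_range('2000-01-01', periods=6, freq='MS')
rain = pd.Series([10.0, 0.0, 20.0, 40.0, 0.0, 50.0], index=dates)

#the 3-month window averages each month with its two predecessors;
#the first two entries are NaN and get dropped
print(rain.rolling(3).mean().dropna())
#-> 10.0, 20.0, 20.0, 30.0 for March through June 2000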
Essentially, the whole thing reads in the CSV files, grabs some information about them from their headers, standardizes them, and puts everything together into large dataframes that get saved as CSV and HDF files.
Instead of implementing a standardization that is fitted to a certain kind of hydrological time series, such as the SPI [1] or SGI [2], this uses the simple (value − mean of time series) / standard deviation of time series approach. It borrows one idea from the SPI, though: the mean and standard deviation are not calculated over the whole time series, but only over the values of a given calendar month. Also, for the precipitation, averaging periods of one, three and twelve months are implemented on top of that.
The benefit of this standardization is that every time series now spans a range of roughly −3 to +3 standard deviations, no matter whether the original series was dozens of millimeters of precipitation or a groundwater level fluctuating around 400 meters above sea level.
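For example (numbers invented for illustration): if all the January values of a groundwater series have a mean of 400 m and a standard deviation of 0.5 m, a January reading of 401 m becomes (401 − 400) / 0.5 = +2, directly comparable to a precipitation series standardized the same way.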
This is easily demonstrated by reading in the three dataframes, putting them all together and plotting them:
import pandas as pd
import matplotlib.pyplot as plt

#read the three HDF files back in (the key is all that's needed for fixed-format files)
df_GW = pd.read_hdf('df_GW.h5', 'df_GW')
df_RW = pd.read_hdf('df_RW.h5', 'df_RW')
df_PRECIP = pd.read_hdf('df_PRECIP.h5', 'df_PRECIP')

#put them all together and plot
plot_df = pd.concat([df_GW, df_PRECIP, df_RW], axis=1)
plot_df.plot()
plt.show()
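If the HDF route fails, the CSV fallback can be read back in as well; a sketch, assuming the four header rows of the MultiIndex were written out as above:

import pandas as pd

#the four header rows hold the name/location/number/datatype MultiIndex,
#the first column holds the dates
df_GW = pd.read_csv('df_GW.csv', sep=';', header=[0, 1, 2, 3],
                    index_col=0, parse_dates=True)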
With a small number of the example files provided below, this already yields quite a usable plot, but with a real dataset (an aquifer in Austria, with data from ehyd.gv.at in this case) the result isn't exactly usable:
[Figure: all standardized time series of the real Austrian dataset in one plot]
It is probably possible to make out some general oscillation pattern in this plot, but for the things I'm interested in, such as whether groundwater and surface water are related, or whether either of them reacts to precipitation, this does not help a lot.
To find out about the relations between the different time series, we can look at the correlations between them, using the plot_df.corr() method. For a simple case with only a handful of time series, a textual output is often preferred, but when looking at a large dataset, a graphical representation makes more sense:
#to make it visually easier to tell the different datasets apart, let's add some
#empty spacer columns between them (their correlation is NaN, which shows up as blank bands)
plot_df.insert(6, 'spacer', 0)
plot_df.insert(10, 'spacer2', 0)

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(plot_df.corr())
fig.colorbar(cax)
plt.show()
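matshow labels the axes with plain column indices; if station names are wanted instead, something along these lines (a sketch, to be inserted before plt.show()) should pull them from the first level of the column MultiIndex:

#label both axes with the station names (level 0 of the column MultiIndex);
#the plain 'spacer' columns are kept as-is
labels = [c[0] if isinstance(c, tuple) else c for c in plot_df.columns]
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90, fontsize=6)
ax.set_yticklabels(labels, fontsize=6)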
For the real data, this results in a correlation matrix which is much more visually pleasing, especially with a very simple legend added.
[Figure: correlation matrix of the real dataset]
In this case, the data is also sorted by distance to the stream, which requires an additional df_GW = df_GW.sort_index(level='location', axis=1) before concatenating the groundwater, precipitation and river datasets.
In the case of this real dataset, a few things that would have been lost in a sea of numbers before are now easily visible:
Many of the groundwater wells correlate with each other and with the rivers in the area, and, to a smaller degree, with the three- to nine-month precipitation averages. And there is a subset of five wells that does not correlate with anything besides themselves.
Besides giving a quick, general overview of the system, this can also serve as a starting point for investigating the causes behind these correlations. Especially the subset of five "outliers" seems worth investigating.
Another option is to use this tool to pick out years or periods of interest, or to produce a series of images to be turned into an animated film:
startdate = 1976
enddate = 2010

for date in range(startdate, enddate):
    strdate = str(date)
    #grab one year's worth of data via partial string indexing
    matrixslice = plot_df.loc[strdate]
    fig = plt.figure()
    ax = fig.add_subplot(111)
    #note the fixed minimum and maximum; otherwise every matrix might get
    #a different color scale
    cax = ax.matshow(matrixslice.corr(), vmin=-1, vmax=1)
    fig.colorbar(cax)
    plotname = '%s.png' % strdate
    plt.savefig(plotname, dpi=300, format='png')
    plt.close()
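The resulting yearly PNGs can then be strung together into a film with any tool of choice; one possible ffmpeg invocation (untested against this exact setup, assuming files named 1976.png, 1977.png and so on) would be: ffmpeg -framerate 2 -start_number 1976 -i %d.png -pix_fmt yuv420p correlations.mp4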
[Animation: yearly correlation matrices of the real dataset]
The workflow shown herein offers three key benefits:
- The automated handling and plotting of file sizes or numbers of files that would require a lot of handiwork in spreadsheet applications.
- A visually pleasing way to look at a large number of time series, enabling us to identify correlations and outliers as a starting point for further investigations.
- A way to add a time component, making it possible to visualize and investigate how correlations change over the years.
[1] SPI: McKee, Thomas B., Doesken, Nolan J., Kleist, John: The relationship of drought frequency and duration to time scales, Proceedings of the 8th Conference on Applied Climatology, 179–183, 1993
[2] SGI: Bloomfield, J. P., Marchant, B. P.: Analysis of groundwater drought building on the standardised precipitation index approach, Hydrology and Earth System Sciences 17(12), 4769–4787, 2013
To keep file sizes small, the sample files are only provided as headers. They can easily be populated with dates in the form DD.MM.YYYY, values separated by semicolons, and data of different magnitudes, such as "groundwater levels", generated with LibreOffice's RAND()*1000 function (I assume it works the same in Excel).
Alternatively, the modules provided above can simply be changed to read in any CSV data.
Name:;the name of the location
Location:;450 #distance to the river in this case
Number:;1234 #an alternative way to identify a measurement, besides its name
Type:;Waterlevel [m asl]
date;level #the scripts use this as an identifier to find the start of the data
01.01.2000;339.4
...
Name:;the name of the location
Location:;0
Number:;2345
Type:;Stage [cm]
date;level
01.01.2000;3
Name:;the name of the location
Location:;500
Number:;3456
Type:;Precip
date;level
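Alternatively, a few lines of Python can generate a complete synthetic sample file directly; a minimal sketch (station name and values invented) for a groundwater file:

#generate one synthetic groundwater sample file in the format described above
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', '2012-12-01', freq='MS')
levels = 400 + np.random.randn(len(dates))  #levels fluctuating around 400 m asl

with open('sample_GW.csv', 'w') as f:
    f.write('Name:;Well-A\n')
    f.write('Location:;450\n')
    f.write('Number:;1234\n')
    f.write('Type:;Waterlevel [m asl]\n')
    f.write('date;level\n')
    for d, v in zip(dates, levels):
        f.write('%s;%.2f\n' % (d.strftime('%d.%m.%Y'), v))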