This post describes a Python-based workflow to plot and analyze large amounts of data and the possibilities of this visualization tool. It continues some parts of this post and offers a more detailed version of my 2016 AGU Fall Meeting poster.

The general issue when working with many and/or large files of hydrological time series (or similar datasets) is that tools such as the well-known LibreOffice or MS Office aren’t really meant to handle large files, or many files at once. They are fine for visualizing a single dataset or a few very small ones, but as soon as file sizes are measured in megabytes, or different data types and time frames need to be combined, the fun starts.

Futile attempt to simply plot a small, diverse dataset in LibreOffice

So what can we see in the plot above?

Apparently, there are five time series in there, but we can only see three. Those three series have different lengths and different ranges – with ‘B’ starting at the end of 1899 and ending somewhere around 1940 and ‘D’ and ‘F’ showing something that happened around 1900.

What should we see in the plot above?

Two daily precipitation time series, ‘C’ and ‘E’ (double-digit millimeters), one daily river stage series ‘B’ (triple-digit centimeters) and two monthly groundwater level series ‘D’ and ‘F’ (triple-digit meters), all in the time from 1975 to 2010.

Now, of course this is easily fixable, and fixing it by hand is probably the best course of action when you just need this one plot right now. But if you encounter tasks like this more often, a better tool might be worth using.

What I am describing here is a Python workflow that builds one large dataframe of standardized time series from a folder full of CSV files, which can then be used to plot the data or for further analysis.

Unfortunately, WordPress does not allow the upload of text files, hence the code (and the example dataset at the end) is embedded in this post. Just copy the text and save it as .py and .csv files respectively.
In order to use these files, you need a Python installation and the modules pandas, matplotlib and numpy.

To begin: a few words of warning.

Like many (most?) folks in the natural sciences, I never formally learned programming but suddenly found myself in a situation where I was simply expected to do it. The result of this mechanism shows in the roughly 32,100,000 results Google returns when searching for “scientist bad code”. Now I don’t claim that the following code is bad, but the fact that I wouldn’t even know how to identify bad code (it runs, so it can’t be bad, since it obviously does work…) is probably telling. But keeping this code hidden, so that everyone who wants to replicate something has to reinvent the wheel, is not exactly a solution either. So if you find anything terrible in these lines, please do leave a comment!

With that out of the way, back to our plotting problem. At first, let’s have a look at the following script that takes a bunch of CSV files and produces some other CSV and H5 files:

#This script reads a directory full of CSV files with groundwater, river water and precipitation data, identifies their type, standardizes them and outputs three large CSV and HDF files.
#The folder must contain files for all three types and no other CSV files, else it will fail or need some changes.
 
#import various things
from read_CSV_timeseries import readcsvts
 
from standardize_timeseries import standardizer
 
from standardize_timeseries_PRECIP import PRECIP_standardizer
 
import glob
import pandas as pd
import matplotlib.pyplot as plt
 
#Build some empty dataframes to be filled with the respective datasets
dfdates = pd.date_range('01/01/2000', '01/01/2013', freq = 'MS')
Header = pd.MultiIndex.from_product([[], [], [], []], names = ['name', 'location', 'number', 'datatype'])
df_GW = pd.DataFrame([], index = dfdates, columns = Header)
df_RW = pd.DataFrame([], index = dfdates, columns = Header)
df_PRECIP = pd.DataFrame([], index = dfdates, columns = Header)
 
#get a list of all csvs in the directory
csv_list = glob.glob("*.csv")
 
#collect the raw (monthly) precipitation series here, to be averaged after the loop
df_raw_PRECIP = pd.DataFrame(index = dfdates)

#Loop over all the files
for filename in csv_list:
    #turn the data from the CSV into a dataframe and grab some information about the measurement station.
    csvoutput = readcsvts(filename)
    timeseries = csvoutput['timeseries']
    name = csvoutput['name']
    location = csvoutput['location']
    datatype = csvoutput['datatype']
    number = csvoutput['number']

    if datatype == 'Groundwater':
        Stand_GW = standardizer(timeseries, name, location, datatype, number)
        #add it into our large groundwater dataframe
        df_GW = pd.concat([df_GW, Stand_GW], axis = 1)

    if datatype == 'Riverwater':
        Stand_RW = standardizer(timeseries, name, location, datatype, number)
        df_RW = pd.concat([df_RW, Stand_RW], axis = 1)

    if datatype == 'Precipitation':
        timeseries = timeseries.resample('MS').sum()
        df_raw_PRECIP = pd.concat([df_raw_PRECIP, timeseries], axis = 1)
        #Assuming we are looking at a small region, it makes sense to average the precipitation measuring stations. Hence we only collect the raw data here and standardize the mean below.
 
df_mean_PRECIP = df_raw_PRECIP.mean(axis = 1)
#standardize the mean of all precipitation stations; the metadata of the last read station is passed along, but the function sets its own catchment header anyway
df_PRECIP = PRECIP_standardizer(df_mean_PRECIP, name, location, datatype, number)
 
#Plot the dfs, as a quick visual control
df_PRECIP.plot()
df_GW.plot()
df_RW.plot()
plt.show()
 
#Export it all as a CSV file to have an easily readable file with all our standardized data, in case we want to open it with a spreadsheet application
df_GW.to_csv('df_GW.csv', sep = ';')
df_PRECIP.to_csv('df_PRECIP.csv', sep = ';')
df_RW.to_csv('df_RW.csv', sep = ';')
 
#What we really want is an HDF file. However, it is binary and not trivial to read, so the CSVs as a fallback might be a good idea.
df_GW.to_hdf('df_GW.h5', 'df_GW', format = 'f')
df_PRECIP.to_hdf('df_PRECIP.h5', 'df_PRECIP', format = 'f')
df_RW.to_hdf('df_RW.h5', 'df_RW', format = 'f')


Besides the obvious modules, this also needs read_CSV_timeseries, standardize_timeseries and standardize_timeseries_PRECIP, shown below:

def readcsvts(filename):
    """
    Reads in a CSV file containing a hydrologic time series.
    Fitted to read the testdata provided with it.
    """
    #import the needed modules
    import pandas as pd
 
    #since filename is kinda generic, change it to current_csv
    current_csv = filename
 
    #Now open the file with the standard python csv reader
    import csv
    infile = open(current_csv, 'r')
    table = []
 
    #initialize a counter for the rows and keep the first few rows to analyze the header
    rows = 0
    for row in csv.reader(infile):
        rows += 1
        if rows < 10:
            table.append(row)
    infile.close()
 
    #Extract the name of the station:
    #Since strings get imported in square brackets, we can use those as markers
    StationNameLine = table[0]
    StationNameString = str(StationNameLine)
    NameStart = StationNameString.find(';')+1
    NameEnd = StationNameString.find(']')-1
    #find the position of the ;, the name comes directly after it, and ends with ']
    StationName = StationNameString[NameStart:NameEnd]
 
    #Grab various important information from the header and find the line in which the data starts
    linecounter = 0
    while linecounter < 9:
        tester_list = table[linecounter]
        tester = str(tester_list)   #must be turned into a string, otherwise I can't search for substrings in it
        if 'Waterlevel' in tester:
            Data_type_str = 'Groundwater'
 
        if 'Stage' in tester:
            Data_type_str = 'Riverwater'
 
        if 'Precip' in tester:
            Data_type_str = 'Precipitation'
 
        if 'Location' in tester:
            location = table[linecounter]
            LocString1 = str(location)
            LocStart = LocString1.find(';')+1
            LocEnd = LocString1.find('\n')-1
            location = LocString1[LocStart:LocEnd]
            location = float(location)
        if 'Number' in tester:
            Number = table[linecounter]
            NumberString1 = str(Number)
            NumberStart = NumberString1.find(';')+1
            NumberEnd = NumberString1.find('\n')-1
            Number = NumberString1[NumberStart:NumberEnd]
 
        if 'date' in tester:
            DataStart = linecounter
 
        linecounter += 1
 
    #Read in the data from the csv_list.
    hydroTS = pd.read_csv(current_csv, sep = ';', skiprows = DataStart, usecols = ['date', 'level'], index_col = 'date', dayfirst = True, parse_dates = True)
 
    #For reasons I don't understand, the data sometimes gets read as object, but I need float, so...
    hydroTS.level = hydroTS.level.astype(float)
 
    #get the lowest and the largest date
    Mini_date = str(hydroTS.index.min())
    Maxi_date = str(hydroTS.index.max())
 
    #Find the frequency of the data
    firstdate = hydroTS.index[0]
    seconddate = hydroTS.index[1]
    datedelta = seconddate - firstdate
    #This is a pandas timedelta. Seems like it defaults to days, but just to make sure:
    dayinterval = datedelta.days
    #Results in an int with the number of days between two days. 1=daily timeseries; 28-31=monthly
 
    if dayinterval == 1:
        TS_freq = 'D'
    else:
        TS_freq = 'MS'
 
    #build an empty date range with the same start, end and frequency as the original data and get its length
    testdates = pd.date_range(Mini_date, Maxi_date, freq = TS_freq)
    testlength = len(testdates)
 
    #Get the length of the real data and compare it. If it differs, there must be a gap in the data.
    Rlength = len(hydroTS)
    lengthdifference = testlength - Rlength
 
    if lengthdifference != 0:
        Data_type_str = 'Gap'
        Errorstate = 1
        print('there is a gap in the data')
    else:
        Errorstate = 0
        print('there is no gap in the data')
    #This is just a very basic error check, serving as a reminder that, depending on the nature of the data, it makes sense to implement some more checks for missing or erroneous data, which mostly depends on how the input is formatted. The Austrian data I am mostly working with, for example, has the nasty habit of marking missing values with the German word for gap, which necessitates a lot of "fun" with umlauts. See this rant for more on that: https://climatefootnotes.com/2016/08/02/imagine-pythons-and-data-analysis/
 
    output = {'Error': Errorstate, 'name': StationName, 'location': location, 'number': Number, 'datatype': Data_type_str, 'timeseries': hydroTS} #'Mini_date':Mini_date, 'Maxi_date':Maxi_date}
 
    return output
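
As a quick sanity check of the read-in logic, here is a self-contained sketch (file name and values are made up) that writes a small file in the sample format from the end of this post and then reads the data part the same way readcsvts does:

```python
import pandas as pd

#a minimal file in the sample format (made-up values)
sample = (
    "Name:;Testwell\n"
    "Location:;450\n"
    "Number:;1234\n"
    "Type:;Waterlevel [m asl]\n"
    "date;level\n"
    "01.01.2000;339.4\n"
    "01.02.2000;340.1\n"
)
with open('sample_well.csv', 'w') as f:
    f.write(sample)

#the header is four lines long, so skiprows = 4 turns the 'date;level' line into the column header
hydroTS = pd.read_csv('sample_well.csv', sep = ';', skiprows = 4,
                      usecols = ['date', 'level'], index_col = 'date',
                      dayfirst = True, parse_dates = True)
print(hydroTS['level'].tolist())  # [339.4, 340.1]
```

Note the dayfirst = True: without it, dates like 01.02.2000 silently end up as January 2nd instead of February 1st.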


def standardizer(timeseries, name, location, datatype, number):
    """
    Takes a timeseries and standardizes it in a simple way, but takes into account some ideas from indices such as the SPI or SGI.
    """
    #import stuff
    import pandas as pd
 
    #Groundwater is monthly data in our example, whereas river stages are daily. find out the frequency and resample if needed
    firstdate = timeseries.index[0]
    seconddate = timeseries.index[1]
    datedelta = seconddate - firstdate
    #This is a pandas timedelta. Seems like it defaults to days but it's not obvious from the documentation, so just to make sure:
    dayinterval = datedelta.days
    #Results in an int with the number of days between two days. 1=daily timeseries; 28-31=monthly
 
    if dayinterval == 1:
        #daily levels (river stages here) are averaged to monthly values; summing levels would make no physical sense
        timeseries = timeseries.resample('MS').mean()
 
    #standardize each calendar month separately: for every value, subtract the long-term mean of its calendar month and divide by that month's standard deviation
    monthly = timeseries.groupby(timeseries.index.month)
    standardized_TS = (timeseries - monthly.transform('mean')) / monthly.transform('std')
 
    #give it a header and return the standardized time series
    dfHeader = pd.MultiIndex.from_product([[name], [location], [number], [datatype]], names = ['name', 'location', 'number', 'datatype'])
 
    standardized_TS.columns = dfHeader
 
    return standardized_TS


def PRECIP_standardizer(timeseries, name, location, datatype, number):
    """
    Takes a timeseries and standardizes it in a simple way, by taking into account some ideas from indices such as the SPI or SGI.
    This module is just done for the sake of demonstration, for real data, drought indices such as the standardized groundwater index SGI are probably a better idea.
    """
    #This script is licenced under the BSD 2-Clause License
    #Copyright (c) 2016, Johannes Haas
    #import stuff
    import pandas as pd
 
    #Precipitation data always seems to be daily data.
    #Turn it into monthly data.
    timeseries = timeseries.resample('MS').sum()
 
    #Loop over the different averaging periods (1, 3 and 12 months); each rolling mean is built from the plain monthly series, not from an already averaged one
    monthly_base = timeseries
    for period in range(0, 3):
        if period == 1:
            timeseries = monthly_base.rolling(3).mean()[2:]
        if period == 2:
            timeseries = monthly_base.rolling(12).mean()[11:]

        #standardize each calendar month separately, as in the standardizer above
        monthly = timeseries.groupby(timeseries.index.month)
        STS = (timeseries - monthly.transform('mean')) / monthly.transform('std')

        #keep the result of each averaging period
        if period == 0:
            STS1 = STS
        if period == 1:
            STS3 = STS
        if period == 2:
            STS12 = STS
 
    #Put the different standardization periods back together
    standardized_TS = pd.concat([STS1, STS3, STS12], axis = 1)
 
    #give it a header and return the standardized time series
    #This assumes that we are going to average the precipitation over a small catchment, hence the location and number are set by hand here.
    dfHeader = pd.MultiIndex.from_product([['Catchment-A'], ['0'], ['XXX'], ['STS1', 'STS3', 'STS12']], names = ['name', 'location', 'number', 'datatype'])
 
    standardized_TS.columns = dfHeader
 
    return standardized_TS


Essentially, the whole thing reads in the CSV files, grabs some information about them from their headers, standardizes them, and puts everything together into large dataframes that get saved as CSV and HDF files.

Instead of implementing a standardization fitted to a certain kind of hydrological time series, such as the SPI [1] or SGI [2], this uses the simple (value − mean of time series) / standard deviation of time series approach. It borrows one idea from the SPI: the mean and standard deviation are not calculated over the whole time series, but only over the values of a given calendar month. For the precipitation, different averaging periods are implemented as well.

The benefit of this standardization is that every time series now spans a range of roughly −3 to +3 standard deviations, no matter whether the original series was dozens of millimeters of precipitation or a groundwater level fluctuating around 400 meters above sea level.
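The core of this standardization fits in a few lines of pandas; here is a self-contained sketch on synthetic data (the station values are made up):

```python
import pandas as pd
import numpy as np

#synthetic monthly 'groundwater level' around 400 m asl, with a seasonal cycle and noise
dates = pd.date_range('1975-01-01', '2010-12-01', freq = 'MS')
rng = np.random.default_rng(42)
level = 400 + 2 * np.sin(2 * np.pi * np.asarray(dates.month) / 12) + rng.normal(0, 0.5, len(dates))
ts = pd.Series(level, index = dates)

#for each value: subtract the long-term mean of its calendar month and divide by that month's standard deviation
monthly = ts.groupby(ts.index.month)
standardized = (ts - monthly.transform('mean')) / monthly.transform('std')

#the result fluctuates around 0, roughly between -3 and +3, regardless of the original units
print(float(standardized.min()), float(standardized.max()))
```

The same two groupby/transform lines work unchanged on a dataframe with one column per station.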

This is easily demonstrated by reading in the three dataframes, putting them all together and plotting them:

import pandas as pd
import matplotlib.pyplot as plt
df_GW = pd.read_hdf('df_GW.h5', 'df_GW')
df_RW = pd.read_hdf('df_RW.h5', 'df_RW')
df_PRECIP = pd.read_hdf('df_PRECIP.h5', 'df_PRECIP')
plot_df = pd.concat([df_GW, df_PRECIP, df_RW], axis = 1)
plot_df.plot()
plt.show()
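The reason this “put them all together” step is painless is that pd.concat with axis = 1 aligns everything on the date index (an outer join), filling non-overlapping periods with NaN. A tiny sketch with two made-up series:

```python
import pandas as pd

#two monthly series with only partly overlapping time ranges (made-up values)
a = pd.Series(1.0, index = pd.date_range('2000-01-01', periods = 6, freq = 'MS'), name = 'a')
b = pd.Series(2.0, index = pd.date_range('2000-04-01', periods = 6, freq = 'MS'), name = 'b')

#concat aligns on the union of both date indices; missing months become NaN
combined = pd.concat([a, b], axis = 1)
print(combined.shape)  # (9, 2): Jan-Sep 2000, two columns
```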

With a small number of the example files provided below, this can already produce quite a usable plot, but with a real dataset (an aquifer in Austria, with data from ehyd.gv.at in this case) the result isn’t exactly usable:

Plot of 20 Groundwater wells, 5 SPI averaging periods and 3 rivers

It is probably possible to make out some general oscillation pattern in this plot, but for the things I’m interested in, such as whether groundwater and surface water are related, or whether either of the two reacts to precipitation, this does not help a lot.

To find out about the relations between the different time series, we can look at their correlations, using the plot_df.corr() method. For a simple case with only a handful of time series, a textual output is often preferred, but when looking at a large dataset, a graphical representation makes more sense:

#to make it visually easier to tell the different datasets apart, let's add some empty spacer columns

plot_df.insert(6, 'spacer', 0)
plot_df.insert(10, 'spacer2', 0)

fig=plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(plot_df.corr())
fig.colorbar(cax)
plt.show()

For the real data, this results in a correlation matrix which is much more visually pleasing, especially with a very simple legend added.

Visualization of a groundwater basin as a correlation matrix, with the groundwater wells sorted by their distance to the river

In this case, the data is also sorted by the distance to the stream, which requires an additional df_GW = df_GW.sort_index(axis = 1, level = 'location') before concatenating the groundwater, precipitation and river datasets.
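In current pandas, sorting the columns of such a multi-indexed dataframe by one header level looks like this (station names and distances are made up):

```python
import pandas as pd
import numpy as np

#a small frame with the same four-level column header as above, stations in no particular order
header = pd.MultiIndex.from_tuples(
    [('well-B', 900, '2', 'Groundwater'),
     ('well-A', 100, '1', 'Groundwater'),
     ('well-C', 450, '3', 'Groundwater')],
    names = ['name', 'location', 'number', 'datatype'])
df_GW = pd.DataFrame(np.zeros((3, 3)), columns = header)

#sort the columns by the 'location' level, i.e. by distance to the river
df_GW = df_GW.sort_index(axis = 1, level = 'location')
print(list(df_GW.columns.get_level_values('name')))  # ['well-A', 'well-C', 'well-B']
```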

In the case of this real dataset, a few things that would have been lost in a sea of numbers before are now easily visible:

Many of the groundwater wells are correlated with each other and with the rivers in the area, as well as, to a smaller degree, with the three- to nine-month precipitation, and there is a subset of five wells that are not correlated with anything besides themselves.

Besides giving a quick, general overview of the system, this can also serve as a starting point to investigate the causes of these correlations. Especially the subset of five “outliers” seems worth investigating.

Another option is to use this tool to pick years or periods of interest, or to produce a series of images that can be turned into an animated film:

startdate = 1976
enddate = 2010
date = startdate
while date < enddate:
	strdate = str(date)
	matrixslice = plot_df.loc[strdate]
	fig=plt.figure()
	ax = fig.add_subplot(111)
	cax = ax.matshow(matrixslice.corr(), vmin = -1, vmax = 1) #note the fixed minimum and maximum; otherwise every matrix might get a different color scale
	fig.colorbar(cax)
	plotname = '%s.png' % strdate
	plt.savefig(plotname, dpi=300, format = 'png')
	plt.close()
	date = date + 1
Development of correlations in an aquifer over time.

The workflow shown herein offers three key benefits:

  • The automated handling and plotting of files or numbers of files that would require a lot of handiwork in spreadsheet applications.
  • A visually pleasing way to look at a large number of time series, enabling us to identify correlations and outliers as a starting point for further investigations.
  • A way to add a time component, to visualize and investigate the change of correlations over time.

[1] SPI: McKee, Thomas B., Doesken, Nolan J., Kleist, John: The relationship of drought frequency and duration to time scales, Proceedings of the 8th Conference on Applied Climatology, 179–183, 1993
[2] SGI: Bloomfield, J. P., Marchant, B. P.: Analysis of groundwater drought building on the standardised precipitation index approach, Hydrology and Earth System Sciences 17(12), 4769–4787, 2013

To keep file sizes small, the sample files are only provided as headers. They can easily be populated with dates in the form DD.MM.YYYY, followed by a semicolon and data of different magnitudes, such as “groundwater levels” generated with LibreOffice’s RAND()*1000 function (which I assume works the same in Excel).

Alternatively, the modules provided above can simply be changed to read in any CSV data.

Name:;the name of the location
Location:;450                #distance to the river in this case
Number:;1234                 #an alternative way to identify a measurement, besides its name
Type:;Waterlevel [m asl]
date;level                   #the scripts use this as an identifier to find the start of the data
01.01.2000;339.4
...
Name:;the name of the location
Location:;0
Number:;2345
Type:;Stage [cm]
date;level
01.01.2000;3
Name:;the name of the location
Location:;500
Number:;3456
Type:;Precip
date;level