How the f*ck do I handle Time Series in Python? - A Complete Tutorial
Anything that is observed or measured at many points in time forms a time series.
Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics.
Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month.
Time series can also be irregular without a fixed unit of time or offset between units.
How you mark and refer to time series data depends on the application, and you may have one of the following:
Timestamps, specific instants in time.
Fixed periods, such as the month January 2007 or the full year 2010.
Intervals of time, indicated by a start and end timestamp. Periods can be thought of as special cases of intervals.
Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time (e.g., the diameter of a cookie baking each second since being placed in the oven).
This tutorial does not deal with the last category. pandas provides many built-in time series tools and data algorithms. You can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular- and fixed-frequency time series. Some of these tools are especially useful for financial and economics applications, but you could certainly use them to analyze server log data, too.
1 . Date and Time ⌚
The Python standard library includes data types for date and time data, as well as calendar-related functionality. We can start by importing the datetime class from the datetime module.
from datetime import datetime
We can get the current (Year, Month, Day, Hour, Minute, Second, Micro-Second) using the following command:
datetime.now()
A sample output for this is:
2023-12-22 18:09:15.920165
If we want just the year or the month or so on, we can use one of the following commands:
now = datetime.now()
now.year
now.month
now.day
now.hour
now.minute
now.second
You can format datetime objects and pandas Timestamp objects (which we will look at later) as strings using str or the strftime method, passing a format specification:
stamp = datetime(2011, 1, 3)
str(stamp)
The output of this will be a string, instead of a datetime object:
'2011-01-03 00:00:00'
We can also use strftime alternatively:
stamp.strftime('%Y-%m-%d')
The output will be a string:
'2011-01-03'
For strftime, we use special symbols, which are given as follows:
Symbol | Description |
%Y | Four-digit year |
%y | Two-digit year |
%m | Two-digit month |
%d | Two-digit day |
%H | Hour (24-hour clock) |
%I | Hour (12-hour clock) |
%M | Two-digit minute |
%S | Two-digit second |
%w | Weekday as an integer, where 0 is Sunday and 6 is Saturday |
%U | Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0” |
%W | Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0” |
%z | UTC time zone offset as +HHMM or -HHMM; empty if time zone naive |
With the help of these codes, it is also possible to convert strings to datetime objects.
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
The output of this will be:
2011-01-03 00:00:00
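As a minimal sketch of how the format codes work in both directions, the snippet below parses a few strings and formats them back out. The list name `datestrs_us` and its month/day/year layout are made up for illustration.

```python
from datetime import datetime

# Hypothetical input strings in month/day/year order
datestrs_us = ['7/6/2011', '8/6/2011']

# Parse each string with an explicit format specification
parsed = [datetime.strptime(s, '%m/%d/%Y') for s in datestrs_us]

# Format the parsed values back into ISO-style strings
iso_strings = [d.strftime('%Y-%m-%d') for d in parsed]
print(iso_strings)
```

The same format string can be reused for every value in the list, which is why strptime suits data with one known layout.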
datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas). We can start by importing it.
from dateutil.parser import parse
We can now parse common date formats easily. For example,
parse('Jan 31, 1997 10:45 PM')
The output will be a datetime object.
datetime.datetime(1997, 1, 31, 22, 45)
And when we print this, we will get:
1997-01-31 22:45:00
In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this:
parse('6/12/2011', dayfirst=True)
The datetime object hence created will be:
datetime.datetime(2011, 12, 6, 0, 0)
dateutil is capable of parsing most human-intelligible date representations, including A.M. and P.M. for time.
pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The to_datetime function parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly:
import pandas as pd
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)
This gives the following output:
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It can also handle missing data easily. Just like we had NaN for numbers, we have NaT for timestamps, which stands for Not a Time.
idx = pd.to_datetime(datestrs + [None])
The output of this is:
DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
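One way to detect the NaT entries afterwards is `pd.isna`, sketched below. The second half uses the `errors='coerce'` option of `to_datetime`, which the text above does not cover; the string 'not a date' is a made-up bad value.

```python
import pandas as pd

datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
idx = pd.to_datetime(datestrs + [None])

# pd.isna flags the NaT entry, just as it flags NaN in numeric data
missing = pd.isna(idx)
print(missing)

# errors='coerce' turns unparseable strings into NaT instead of raising
coerced = pd.to_datetime(['2011-07-06', 'not a date'], errors='coerce')
print(coerced)
```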
2 . Basics of Time Series ⏲️
A basic kind of time series object in pandas is a Series indexed by timestamps, which is often represented external to pandas as Python strings or datetime objects:
from datetime import datetime
import numpy as np
import pandas as pd
# Creating a list of datetime objects
dates = [
datetime(2011, 1, 2),
datetime(2011, 1, 5),
datetime(2011, 1, 7),
datetime(2011, 1, 8),
datetime(2011, 1, 10),
datetime(2011, 1, 12)
]
ts = pd.Series(np.random.randn(6), index=dates)
This creates the following Series object:
2011-01-02 0.151310
2011-01-05 -0.562340
2011-01-07 -1.189788
2011-01-08 0.182071
2011-01-10 -0.634554
2011-01-12 -0.265802
dtype: float64
Scalar values from a DatetimeIndex are pandas Timestamp objects. A Timestamp can be substituted anywhere you would use a datetime object. It also understands how to do time zone conversions and other kinds of manipulations. (Before pandas 2.0, a Timestamp could additionally store frequency information.)
A time series behaves like any other pandas.Series when you are indexing and selecting data based on label:
ts[pd.Timestamp('2011-01-02')]
This will output: 0.15130951956196476
As a convenience, you can also pass a string that is interpretable as a date:
ts['2011-01-02']
ts['2011/01/02']
This will output 0.15130951956196476.
Slicing with date strings (or datetime objects) works as well:
ts['2011-01-08':]
As expected, this will output:
2011-01-08 0.182071
2011-01-10 -0.634554
2011-01-12 -0.265802
dtype: float64
Because most time series data is ordered chronologically, you can also slice with timestamps not contained in a time series to perform a range query.
As before, you can pass a string date, a datetime, or a Timestamp. In older pandas versions, slicing in this manner produces views on the source time series, like slicing NumPy arrays: no data is copied, and modifications to the slice are reflected in the original data. With Copy-on-Write (the default from pandas 3.0), modifying the slice no longer changes the original, so it is safer not to rely on this behavior.
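A small sketch of such a range query follows. It rebuilds the six-date series from above, but with `np.arange` values instead of random ones (an assumption made so the result is reproducible).

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# Neither endpoint exists in the index; both still work as slice bounds
subset = ts['2011-01-06':'2011-01-11']
print(subset)
```

Only the observations on January 7, 8, and 10 fall between the two bounds, so those three rows are returned.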
There is an equivalent instance method, truncate, that slices a Series between two dates:
ts.truncate(after='1/9/2011')
This keeps only the observations up to January 9, 2011 (the string '1/9/2011' is parsed month-first) and drops everything after that date:
2011-01-02 0.151310
2011-01-05 -0.562340
2011-01-07 -1.189788
2011-01-08 0.182071
dtype: float64
All of this holds true for DataFrame as well, indexing on its rows. In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
This is the resultant Series:
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
We can tell that the index is not unique by checking its is_unique property:
dup_ts.index.is_unique
This outputs False here because the timestamp 2000-01-02 appears three times; it would be True if every timestamp were unique.
Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:
grouped = dup_ts.groupby(level=0)
This will create a groupby object. Now, let's apply an aggregation method to it.
grouped.mean()
The output of this will be:
2000-01-01 0.0
2000-01-02 2.0
2000-01-03 4.0
dtype: float64
Let's count the number of values corresponding to each date.
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it’s often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, Pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges. For example, you can convert the sample time series to be fixed daily frequency by calling resample.
resampler = ts.resample('D')
The string 'D' is interpreted as daily frequency. Resampling is a big topic in its own right; section 4 below covers it in more detail.
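To see what the fixed daily frequency looks like, one option is to call `asfreq()` on the resampler, which applies no aggregation. This is a sketch using the six-date series from earlier, with `np.arange` values swapped in for the random ones so the result is reproducible.

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# One row per calendar day; days without an observation become NaN
daily = ts.resample('D').asfreq()
print(daily)
```

The result spans every day from January 2 to January 12, and the five days with no observation hold NaN.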
Pandas allows us to generate a range of dates. pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:
index = pd.date_range('2012-04-01', '2012-06-01')
index will look like:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:
pd.date_range(start='2012-04-01', periods=20)
This will generate only 20 timestamps:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
dtype='datetime64[ns]', freq='D')
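The freq='D' in the output means that we are generating daily timestamps. We can also generate hourly timestamps or one timestamp per second. The most common aliases are listed below; note that pandas 2.2 and later prefer the lowercase spellings h and s for hourly and secondly, and warn about the uppercase ones: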
Alias | Offset |
D | Daily |
H | Hourly |
min | Minutely |
S | Secondly |
W | Weekly |
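The aliases can be passed to `date_range` through the `freq` argument. The sketch below also shows two variations not listed in the table, a numeric multiplier ('90min') and a weekday anchor ('W-WED'). The start date and period counts are arbitrary.

```python
import pandas as pd

# Every 90 minutes, starting at midnight
every_90_min = pd.date_range('2000-01-01', periods=4, freq='90min')
print(every_90_min)

# Weekly, anchored on Wednesdays
wednesdays = pd.date_range('2000-01-01', periods=3, freq='W-WED')
print(wednesdays)
```

Because 2000-01-01 is a Saturday, the weekly range starts on the first Wednesday after it, 2000-01-05.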
There are other aliases as well, but we will not be discussing them here. Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
The output of this will be:
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
One useful frequency class is “week of month,” starting with WOM. This enables you to get dates like the third Friday of each month:
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
rng is a DatetimeIndex containing the third Friday of each month:
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
'2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
dtype='datetime64[ns]', freq='WOM-3FRI')
“Shifting” refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified.
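A minimal sketch of a naive shift, using a small made-up series. The percent-change line is one common use of shift, shown only as an illustration.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, 4.0, 8.0],
               index=pd.date_range('2000-01-31', periods=4, freq='D'))

# Positive shift moves values later in time; the first slot becomes NaN
forward = ts.shift(1)

# Negative shift moves values earlier; the last slot becomes NaN
backward = ts.shift(-1)

# Period-over-period change computed from the shifted series
pct_change = ts / ts.shift(1) - 1
print(pct_change)
```

In both directions the index stays the same; only the values move, and data shifted past either end is discarded.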
Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time or UTC, which is the successor to Greenwich Mean Time and is the current international standard. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time and five hours behind the rest of the year.
In Python, time zone information has traditionally come from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information; Python 3.9 and later also include the standard-library zoneinfo module for the same data. This is especially important for historical data because daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments. In the United States, the DST transition times have been changed many times since 1900!
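As a rough sketch of how this looks in pandas, a naive series can be localized to UTC with `tz_localize` and then converted with `tz_convert`. The sample dates and values are made up, and the snippet assumes the time zone database is available on the system.

```python
import pandas as pd

# Naive (no time zone) timestamps at noon on three July days
rng = pd.date_range('2012-07-01 12:00', periods=3, freq='D')
ts = pd.Series(range(3), index=rng)

# Attach a time zone to the naive timestamps
ts_utc = ts.tz_localize('UTC')

# Convert to New York time; the underlying instants are unchanged
ts_ny = ts_utc.tz_convert('America/New_York')
print(ts_ny.index[0])
```

July falls in daylight saving time, so noon UTC is displayed as 08:00 in New York, matching the four-hour offset mentioned above.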
3 . Period Frequency Conversion 📅
The Pandas library in Python offers the ability to work with time series data efficiently. Period and PeriodIndex objects represent fixed-frequency intervals of time. They can be converted from one frequency to another using the asfreq method. The asfreq method allows you to change the frequency of a period or a series of periods while specifying whether the new periods align with the start or end of the original periods.
Consider an annual period, '2007', with a frequency of 'A-DEC' (ending in December). Using the asfreq method:
p = pd.Period('2007', freq='A-DEC')
p.asfreq('M', how='start') # Convert to monthly at the start
# Output: Period('2007-01', 'M')
p.asfreq('M', how='end') # Convert to monthly at the end
# Output: Period('2007-12', 'M')
This demonstrates the conversion of an annual period to monthly periods, either at the start or end of the year.
For a fiscal year ending in a month other than December, such as 'A-JUN' (ending in June):
p = pd.Period('2007', freq='A-JUN')
p.asfreq('M', 'start')
# Output: Period('2006-07', 'M')
p.asfreq('M', 'end')
# Output: Period('2007-06', 'M')
When converting from higher to lower frequency, like from monthly to an annual period, Pandas determines the super-period depending on where the sub-period "belongs." For instance:
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')
# Output: Period('2008', 'A-JUN')
Entire PeriodIndex objects or time series can be converted similarly:
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.asfreq('M', how='start') # Converting to monthly, starting each year
ts.asfreq('B', how='end') # Converting to business day frequency, ending each year
The examples illustrate how you can convert Periods, PeriodIndex objects, and time series to different frequencies while specifying the alignment at the start or end of periods. This flexibility is particularly useful when working with time series data at different granularities or frequencies.
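To make the effect on a whole series visible, here is a small sketch that inspects the converted index. It uses the plain 'Y' annual alias rather than 'A-DEC' (an assumption chosen because 'Y' is accepted across pandas versions) and `np.arange` values instead of random ones.

```python
import numpy as np
import pandas as pd

# Four annual periods, 2006 through 2009
rng = pd.period_range('2006', '2009', freq='Y')
ts = pd.Series(np.arange(len(rng)), index=rng)

# Each year is replaced by its first month
monthly_start = ts.asfreq('M', how='start')

# Each year is replaced by its last month
monthly_end = ts.asfreq('M', how='end')

print(monthly_start.index)
print(monthly_end.index)
```

The values are untouched; only the index labels change, from years to the first or last month of each year.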
4 . Frequency Conversion 📊
Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling. pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you call resample to group the data, then call an aggregation function. Let's create a time series.
rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
Let's apply resample to this.
ts.resample('M').mean()
The output of this will be:
2000-01-31 0.381562
2000-02-29 0.071821
2000-03-31 -0.243939
2000-04-30 0.182343
Freq: M, dtype: float64
resample is a flexible and high-performance method that can be used to process very large time series.
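When the data is indexed by periods rather than timestamps, the rules are stricter:
In downsampling, each source period must fall entirely within one target period (monthly to annual, for example).
In upsampling, each target period must fall entirely within one source period.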
4.1. Downsampling
Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you’re aggregating doesn’t need to have a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:
Which side of each interval is closed
How to label each aggregated bin, either with the start of the interval or the end
Let's create a minute time series.
rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)
This will produce the following time series.
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int32
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:
ts.resample('5min', closed='right').sum()
The output of this will be:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int32
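This is very similar to groupby. By default, each five-minute bin is closed on the left, so closed='right' is what moves the 00:05 value into the bin that starts at 00:00. The resulting time series is labeled by the timestamps from the left side of each bin. By passing label='right' you can label them with the right bin edge: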
ts.resample('5min', closed='right', label='right').sum()
The output of this will be following:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T, dtype: int32
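For comparison, the sketch below runs the same aggregation with the default settings next to the closed='right', label='right' variant. It rebuilds the twelve-minute series with the 'min' alias.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Default for '5min': bins closed and labeled on the left edge
left_closed = ts.resample('5min').sum()

# Bins closed and labeled on the right edge
right_closed = ts.resample('5min', closed='right', label='right').sum()

print(left_closed)
print(right_closed)
```

With the defaults, the 00:00 bin holds minutes 0 through 4. With closed='right', the 00:00 observation sits alone in the bin ending at 00:00, which is why an extra bin appears.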
4.2. Open-High-Low-Close (OHLC) Resampling
In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By using the ohlc aggregate function you will obtain a DataFrame having columns containing these four aggregates, which are efficiently computed in a single sweep of the data:
ts.resample('5min').ohlc()
The output of this is the following:
open high low close
2000-01-01 00:00:00 0 4 0 4
2000-01-01 00:05:00 5 9 5 9
2000-01-01 00:10:00 10 11 10 11
4.3. Upsampling
When converting from a low frequency to a higher frequency, no aggregation is needed. Let’s consider a DataFrame with some weekly data:
Colorado Texas New York Ohio
2000-01-05 -1.552653 -0.040085 -0.521509 0.483905
2000-01-12 -0.070931 -0.732778 0.771743 0.535050
You can create this DataFrame with the following command:
frame = pd.DataFrame(
    np.random.randn(2, 4),
    index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
    columns=['Colorado', 'Texas', 'New York', 'Ohio']
)
When you use an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:
df_daily = frame.resample('D').asfreq()
The output of this will be the following:
Colorado Texas New York Ohio
2000-01-05 -1.552653 -0.040085 -0.521509 0.483905
2000-01-06 NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.070931 -0.732778 0.771743 0.535050
Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:
frame.resample('D').ffill()
The output of this will be the following:
Colorado Texas New York Ohio
2000-01-05 -1.552653 -0.040085 -0.521509 0.483905
2000-01-06 -1.552653 -0.040085 -0.521509 0.483905
2000-01-07 -1.552653 -0.040085 -0.521509 0.483905
2000-01-08 -1.552653 -0.040085 -0.521509 0.483905
2000-01-09 -1.552653 -0.040085 -0.521509 0.483905
2000-01-10 -1.552653 -0.040085 -0.521509 0.483905
2000-01-11 -1.552653 -0.040085 -0.521509 0.483905
2000-01-12 -0.070931 -0.732778 0.771743 0.535050
You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:
frame.resample('D').ffill(limit=2)
The output of this will be the following:
Colorado Texas New York Ohio
2000-01-05 -1.552653 -0.040085 -0.521509 0.483905
2000-01-06 -1.552653 -0.040085 -0.521509 0.483905
2000-01-07 -1.552653 -0.040085 -0.521509 0.483905
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.070931 -0.732778 0.771743 0.535050
Notably, the new date index need not overlap with the old one at all:
frame.resample('W-THU').ffill()
The output of this will be:
Colorado Texas New York Ohio
2000-01-06 -1.552653 -0.040085 -0.521509 0.483905
2000-01-13 -0.070931 -0.732778 0.771743 0.535050
This was all about the basics of time series in Pandas.