Introduction
A common need that exists in most of the programming languages is parsing strings into dates and converting dates into strings. Mastering this process is of vital importance for data analysts/data scientist because we get data from many different sources and countries where format changes, and for dates this is typically a headache. For instance, “31st of January of 2021” can be written as:
- 01/31/21: US format
- January the 31, 2021: US format
- 31/01/21: typically European format
- 31.01.21: typically German format
- 31-01-21: typically English format
- 31/JAN/21: variation of English format
- 31 January 2021: variation of English format
Conversely, in many occasions we have to extract specific parts of the date like day of week, calendar week, month name, day name, etc. or we need to produce a long string containing date parts embedded with text like in this example “today is Monday of week 23 (2021)”.
In this recipe we are going to learn how we can do this transformations in base R (we will not cover lubridate nor the new kid on the block: clock
How to get current Date and Time
Base R comes with 2 main functions to report current data and time:
Sys.Date() # [1] "2021-04-20"
Sys.time() # [1] "2021-04-20 10:44:18 CEST"
class(Sys.Date()) # [1] "Date"
class(Sys.time()) # [1] "POSIXct" "POSIXt"
Notice the inconsistency in the case: Date()
starts with uppercase
while time()
starts with lowercase.
Sys.Date()
returns and object of the class Date
with the current
date in the current time zone. Sys.time()
returns and object of class
POSIXct
. They can be formatted following below instructions. Both
classes store the number of seconds since 01/01/1970, so their
underlying data type is numeric (it’s a simple signed number). POSIXct
has more precision as it stores hours, minutes seconds and timezones:
t <- Sys.time()
typeof(t) # [1] "double"
d <- Sys.Date()
typeof(d) # [1] "double"
But, there is another data type or class for datetimes: POSIXlt
. This
class creates a named list with all the date and time components like
day, minutes, seconds… etc. We’ll see this class a bit later.
How to parse a Character String into a Date
Default date format
A date is a data type or class that exists within R. When you need to write a date into a report, document, csv file, etc. you convert it into a string. Also, to enter a date into the R terminal you produce a string that has to be converted into an R date type.
mydate_chr <- "2021-04-15" # we create a date as string
class(mydate_chr) # [1] "character"
mydate_date <- as.Date(mydate_chr) # we convert it into a date with default format yyyy-mm-dd
class(mydate_date) # [1] "Date"
Ok, we’ve cast a string into a date with default format, but what is the
default format?. As per the documentation (?as.Date()
), if we do not
indicate any format then the parameter tryFormats
comes into play,
whose default value is c("%Y-%m-%d", "%Y/%m/%d")
. Let’s test it!!
as.Date("2021/04/01") # It works: [1] "2021-04-01"
as.Date("2021-04-01") # It works: [1] "2021-04-01"
as.Date("2021.04.01") # Error: character string is not in a standard unambiguous format
as.Date("2021 04 01") # Error: character string is not in a standard unambiguous format
We’ve just seen that the two default formats work but not the other two, as expected. We’ll try with a character vector as input with multiple formats:
# First attempt
mydates <- c("2021/04/01", "2021/04/02", "2021-04-03", "2021.04.04") # Format with slashes "/" wins
as.Date(mydates) # [1] "2021-04-01" "2021-04-02" NA NA
# Second attempt
mydates <- c("2021-04-01", "2021/04/02", "2021-04-03", "2021.04.04") # Format with dashes "-" wins
as.Date(mydates) # [1] "2021-04-01" NA "2021-04-03" NA
What’s happened here? format()
tries all the patterns in tryFormats
with the 1st element of the vector and when it finds one that works then
uses this for the rest of the elements. Notice that in the second
attempt I’ve only changed the first date and it changed the behavior of
as.Date()
.
Customized date format
What if the default format is not convenient for us. Then, we use the
format
argument of the function as.Date()
mydates <- c("01/01/2021", "02/02/2021") # Changed to dd/mm/yyyy
as.Date(mydates, format = "%d/%m/%Y") # Works: [1] "2021-01-01" "2021-02-02"
mydates <- c("01/JAN/21", "02/JAN/21") # Format with slashes wins "/"
as.Date(mydates, format = "%d/%b/%y") # Works: [1] "2021-01-01" "2021-01-02"
mydates <- c("Thu, 15 of April, 2021") # Format with slashes wins "/"
as.Date(mydates, format = "%a, %d of %B, %Y") # Works: [1] "2021-04-15"
There are plenty of conversion specifications that you can query on
the documentation of strptime()
or
here.
Adding time and timezones
In order to parse a datetime string, we need the function strptime()
:
strptime("2021-02-15 12:05:59"
, format = "%Y-%m-%d %H:%M:%S")
strptime("2021-02-15 12:05:59"
, format = "%Y-%m-%d %T") # %T = %H:%M:%S
strptime("2021-02-15 12:05:59 PM"
, format = "%Y-%m-%d %I:%M:%S %p") # %p = AM/PM
strptime("2021-02-15 12:05:59 PM +0350"
, format = "%Y-%m-%d %I:%M:%S %p %z") # %z = timezone
strptime("2021-02-15 12:05:59 PM -0600"
, format = "%Y-%m-%d %I:%M:%S %p %z") # %z = timezone
As we said at the beginning, there is a class called POSIXlt
that
stores a date and time in a name list. See below"
x <- strptime("2021-02-15 12:05:59"
, format = "%Y-%m-%d %H:%M:%S")
class(x) # [1] "POSIXlt" "POSIXt"
typeof(x) # list
As we see, strptime()
constructs a POSIXlt
object. We can extract
its components individually:
x$sec # 59
x$min # 5
x$hour # 12
x$mday # 15
x$mon # 1 (it goes from 0 to 11: do 1+x$mon)
x$year # 121 (years from 1900: do 1900+x$year)
x$wday # 1 (day of week 0-6 -Sun to Sat-)
x$yday # 45 (day of year 0-364 -or 365 in a leap year-)
x$isdst # 0 (daylight savings: 0 not in place)
x$zone # CET (Central European Time)
x$gmtoff # The offset in seconds from GMT. NA is unknown
A POSIXlt
it’s a bit more complex class than a POSIXct
but it stores
time in a human-like way.
How to convert Dates into Strings
Extracting parts of Dates
Here we can use the function format()
. Let say, we want to know the
week number of Joe’s birthday, we can do:
Joe_bday <- as.Date("1975/03/28")
format(Joe_bday, format = "Week was %V") # [1] "Week was 13"
We can also check other date elements, like:
format(Joe_bday, format = "That day it was %A (%a)") # [1] "That day it was Friday (Fri)"
format(Joe_bday, format = "It was the day %u of the week") # [1] "It was the day 3rd of the week 5"
format(Joe_bday, format = "It was %Cth century") # [1] "It was 19th century (really ??)"
As can you see, %C doesn’t calculate the century correctly. It’s defined in the documentation as “Century (00–99): the integer part of the year divided by 100”. So you must add 1. I warned you!
Additionally, there is a set of methods to assist on this matter, namely:
weekdays(Joe_bday) # [1] "Friday"
weekdays(Joe_bday, abbreviate = T) # [1] "Fri"
months(Joe_bday) # [1] "March"
months(Joe_bday, abbreviate = T) # [1] "Mar"
quarters(Joe_bday) # [1] "Q1"
julian(Joe_bday) # [1] "1970-01-01" (number of days since 1970-01-01)
How to extract parts of Times
If you’ve reached this point, the rest it’s going to be a piece of cake for you:
Joe_bday <- strptime("1975-03-28 06:01:21 PM +0600"
, format = "%Y-%m-%d %I:%M:%S %p %z")
format(Joe_bday, format = "Hour: %H or %I %p") # [1] "Hour: 13 or 01 PM"
format(Joe_bday, format = "Minute: %M") # [1] "Minute: 01"
format(Joe_bday, format = "Second: %S") # [1] "Second: 21"
format(Joe_bday, format = "Timezone: %z") # [1] "Timezone: +0600"
And that’s all! If you’ve read up to here, you’ll feel much more comfortable dealing with dates and times in R. We haven’t seen how to operate with them (addition and subtraction of 2 dates). This will be another R recipe.
I’d be very happy to hear from your. You can get in touch with me on:
- Twitter: @jlarellanoh
- Email: r.data.science.101@gmail.com