Question :
dataset <- structure(list(PLACA = structure(c(5L, 5L, 5L, 4L, 1L, 2L, 3L,
7L, 6L, 8L), .Label = c("DSF9652", "EFR9618", "EQW6597", "ERB1522",
"EWM3539", "LOC1949", "LQQ5554", "OQT5917"), class = "factor"),
COD_REV = c(113195L, 113196L, 113197L, 113303L, 80719L, 80720L,
80722L, 113318L, 80788L, 113386L), DATA = structure(1:10, .Label = c("2016-01-14 12:13:00.000",
"2016-01-18 18:48:00.000", "2016-01-18 19:00:00.000", "2016-01-25 11:46:00.000",
"2016-01-25 19:20:00.000", "2016-01-25 19:28:00.000", "2016-01-25 19:33:00.000",
"2016-01-25 20:56:00.000", "2016-01-26 21:28:00.000", "2016-01-27 13:50:00.000"
), class = "factor"), KM_ATUAL = c(52100L, 52100L, 52100L,
110676L, 62300L, 31144L, 165022L, 41021L, 155646L, 55030L
), KM_MEDIA = c(0L, 42L, 40L, 20L, 17L, 18L, 120L, 100L,
10L, 38L)), .Names = c("PLACA", "COD_REV", "DATA", "KM_ATUAL",
"KM_MEDIA"), row.names = c(NA, -10L), class = "data.frame")
I have the above dataset and would like to group the boards to see how many visits the same customer did. So I need to calculate the difference between the dates and km_atual of visits, to compare with the field KM_media_dia and see the difference between these values.
I can not figure out the difference between the dates. This was my attempt so far:
library(tidyverse)
# Carregando os datasets
dataset <- read_csv2("dados_atuais.csv")
dataset_revisao_km <- dataset %>%
# selecionar apenas colunas importantes
select(CPF, PLACA, COD_REV, DATA, KM_ATUAL) %>%
arrange(DATA) %>%
group_by(PLACA) %>%
mutate(ORDEM_REVISAO = row_number()) %>%
# manter apenas placas com mais de uma revisao
filter(n() > 1) %>%
mutate(DIFERENCA_KM = KM_ATUAL - lag(KM_ATUAL)) %>%
# filtrar fora a primeira revisao da placa
filter(ORDEM_REVISAO > 1)
Answer :
The R
requires data with dates to be correctly specified so that it can perform calculations that may be required. One of the best ways to do this is with the lubridate
package:
library(lubridate)
dataset$DATA <- ymd_hms(dataset$DATA)
Note that I just replaced the DATA
column with its equivalent in ymd_hms
(YearMonthDay_HourMinuteSecond), as it was in the original dataset. From there it was only to calculate the difference in days between the equal plates, using the function difftime
:
dataset %>%
group_by(PLACA) %>%
filter(n() > 1) %>%
mutate(DiferencaDias=difftime(DATA, lag(DATA), units="days")) %>%
na.omit()
# A tibble: 2 x 6
# Groups: PLACA [1]
PLACA COD_REV DATA KM_ATUAL KM_MEDIA Diferenca
<fct> <int> <dttm> <int> <int> <time>
1 EWM3539 113196 2016-01-18 18:48:00 52100 42 4.27430555555556
2 EWM3539 113197 2016-01-18 19:00:00 52100 40 0.00833333333333333
Note that in the data set only the EWM3539 board appears more than once. As she appears 3 times, it does not make sense to speak on her first visit, since there is no difference of days. Therefore, we remove this information through na.omit
.