data

Phases of progression of a full lunar eclipse by Jean Beaufort

Say you have a table in your DB and and the way it’s set up is as a static “current state” rather than as a timestream. The following is example data from a table called towns

towns has an id, a name, an integer (0 or 1) representing blocked status and a planning stage string.

In this kind of setup, when a status parameter changes, it changes forever and the previous status is lost.

Here’s how you can make a daily status record so you can look back and see the history of an item’s status.

1. Create a stored function to figure out the status

If status is very obvious, you may not need this step. However, as data gets more complicated, you may find that status isn’t straight-forward. For the towns table above, the following conditions apply to work out the status:

The town can be blocked or not blocked.
The possible states for planning_stage are “pending”, “audited” or “complete”

The possible statuses follow (these examples are all made up so don’t worry too much about it making sense)

The stored function is going to take the town_id and return a string with the status. We’re also going to add a status “UNKNOWN STATUS” to catch cases that don’t fit the requirements and a “NO SUCH TOWN” status in case the town_id passed in is bogus.

drop function if exists get_town_status;
DELIMITER //
CREATE FUNCTION `get_town_status`(input_town_id int(10)) RETURNS varchar(31) CHARSET utf8
    DETERMINISTIC
BEGIN
  DECLARE town_status varchar(31) charset utf8;
  DECLARE town_exists, is_blocked, is_pending, is_audited, is_complete int;

select town_id,
       blocked,
       if(planning_stage="pending",1,0),
       if(planning_stage="audited",1,0),
       if(planning_stage="complete",1,0)
into town_exists, is_blocked, is_pending, is_audited, is_complete
from towns
where town_id=input_town_id;

set town_status = case
  when town_exists is null then "NO SUCH TOWN"
  when is_blocked=1 and is_pending=1 then "Blocked pending approval"
  when is_blocked=0 and is_pending=1 then "Ready for audit"
  when is_blocked=1 and is_audited=1 then "Blocked during audit"
  when is_blocked=0 and is_audited=1 then "Ready for certification"
  when is_blocked=1 and is_complete=1 then "Refused"
  when is_blocked=0 and is_complete=1 then "Certified"
  else "UNKNOWN STATUS"
end;

RETURN town_status;
END //
DELIMITER ;

We can then use it like

select get_town_status(123456789), get_town_status(36972), get_town_status(63406);

Which returns “NO SUCH TOWN”, “Blocked pending approval”, “Ready for certification”

2. Create a status record table

CREATE TABLE town_status (
   town_id int(10) unsigned NOT NULL,
   town_status varchar(31) CHARACTER SET utf8 DEFAULT NULL,
   recorded_at date NOT NULL,
   PRIMARY KEY (town_id,recorded_at)
) 
;

This sets up a link between the town and the status on a particular date. The primary key is a combination of the town_id and the recorded_at date, meaning you can only have one record for each town on any given day.

You’ll note that although town_id is an id from a foreign table, it doesn’t have a foreign key constraint. This is because even if the town is deleted, you will still want to enter an historic record of its final stage.

The initial population of the table will simply add today’s values for all towns to the status table.

insert into town_status (select id, get_town_status(id), curdate() from towns);

3. CREATE A current status view

The current view will show the latest recorded status for each town. This view will self-join the table town_status finding the latest_date for each town.

This view will be useful if you want to quickly get the current recorded status of all your towns (for example if calculating it using get_town_status is spendy).

create view current_town_status as (
    select ts.*
    from town_status ts
    join (
        select town_id, max(recorded_at) as latest_date
        from town_status
        group by town_id) x 
    on x.town_id = ts.town_id and x.latest_date = ts.recorded_at
)
;

4. make a daily event to store the status each day

We’ll make an event to recur at 00:15 every day, starting tonight. This event will record the status for each town if it has changed since the most recently stored value. Two caveats apply: 1) we’ll need to add a new row if the town is a new entry into the towns table and 2) we’ll also need to record if a town disappears from the towns table.

The left join and null check in the first insert statement will take care of new towns: if the town is not already on the town_status table, it will be added with its status.

To manage the second case, where the town has been deleted from the towns table, we check to see if there are any towns on the town_status table but not on the towns table, where the last recorded status was not “NO SUCH TOWN”. If any examples are found, then add a “NO SUCH TOWN” status for this town_id for today.

#Nightly event to keep track of town status changes
drop event if exists record_town_status;
delimiter |
create event record_town_status
on schedule
every 1 day
STARTS (TIMESTAMP(CURRENT_DATE) + INTERVAL 15 MINUTE)
do
begin

insert ignore into town_status (
    select t.id, get_town_status(t.id), curdate()
    from towns t
        left join current_town_status ts on t.id=ts.town_id
    where (ts.town_status is null or get_town_status(t.id) <> ts.town_status)
)
;

insert ignore into town_status (
    select ts.town_id, "NO SUCH TOWN", curdate()
    from current_town_status ts
             left join towns t on t.id = ts.town_id
    where t.id is null and ts.town_status <> "NO SUCH TOWN"
)
;

END |
delimiter ;

Over time, the town_status table will build and you will be able to see a history of events. E.g.

select * from town_status where town_id=43624;

Might return something like

43624 Blocked pending approval 2020-06-01
43624 Ready for audit 2020-06-30
43624 Ready for certification 2020-07-01
43624 Certified 2020-07-06

And then you’ll be able to things like analyse how long it typically takes to move from stage to stage.

The same principles apply for statuses that need to be recorded more or less frequently, you would just need to adjust the time unit in the status table and the event timing. Beware unchecked growth where statuses change frequently and your initial data table is very large.

(FYI: My DB is MySQL 5.7 so doesn’t support window functions)

I am within 3 weeks of first meeting Google Data Studio (DS). I am head of data at a small firm that was subsumed into a larger firm last year. Now that the initial merge issues are clearing, we’re getting down to the brass tacks of making our very different technologies work together. They’re very google-oriented – lots of BigQuery & Go and the data bods use R. I’m more MySQL and Python myself. We didn’t have a regular BI tool, hence why I’m getting up to speed with DS atm. DS gives a lot of bang for your buck (it’s free!), but has some interesting quirks.

I’ve had a few other battles with DS already, hopefully I’ll get to expand on those in further articles. Some of these are real head-scratchers; things you would think are such obvious requirements that the lack of them certainly makes me think my dearth of experience is the issue. Such a one seems to be the inability to filter values by time. An initial internet search turned up a couple of hits, but nothing really concrete:

So! How I can filter a set of timestamped results to show only values from the past hour?

This is the data I’m working with: the first column is a timestamp incrementing every minute and second column is an integer.

I’ve inserted a Chart > Scorecard and I want to sum up the values in column 2 if the time in column 1 falls within the past hour.

DS has grand tools if I want to filter by the date:

The image shows a configurable date picker – showing is a fixed start date of Jan 1st 2020 and an end date of now.

But it seems only to let you set the date, the time of day is not adjustable. Minimum “Unit” granularity is “day”; even when my data type is Date & Time or Date Hour Minute.

Okay, so now what? I’ll try to make a different kind of filter – based on the time value

A screenshot of a filter creation from DS

So, it starts out looking hopeful, I can choose the time metric and only include times that are later than a date and time I pick! But on closer inspection, that’s a fixed time, not a relative one, so no dice. Clicking around, nothing else in the filter allows relative comparison, so I’ll have to think of something else.

Hmmm. What about if I set an is_in_past_hour boolean? I’ll make a new calculated field. Oh, and by the way, DS doesn’t do IF statements. It’s case or nuffink.

My first attempt – apparently this is hideous. Have you seen all the limitations of case statements?

After a bit of jiggery-pokery, including

case when datetime_diff(current_datetime(),the_time,HOUR)=0 then true else false end

erroring with Invalid formula – Invalid input expression. – WHEN conditions must compare a dimension or metric with a literal value.

I ended up ditching the case statement altogether and made an hours_before_now variable instead. I will filter results so that only results where this variable is 0 are included.

datetime_diff(current_datetime(),the_time,HOUR)

Filtering the hours_before_now variable to 0

Phew. All done. But hang on, why are the results still not matching what I’m expecting?

Rounding.

It’s rounding half hours up or down.

Le sigh.

So I switch all the references in my filters and variables from hours to minutes and check there are fewer than 60 of them.

Change the filter to allow < 60 minutes

Finally, this works. I’m getting the same value in the DS report as this simple wee query in MySQL:

select sum(my_value) 
from my_table 
where the_time >= date_add(now(), interval -1 hour)
;

Time (out) for a coffee.

Catchloops

Data and knitting

Setting up a daily status record in MySQL

1. Create a stored function to figure out the status

2. Create a status record table

3. CREATE A current status view

4. make a daily event to store the status each day

Google Data Studio: filtering times to the past hour (you’d think it’d be easy)