Experienced data scientists with a technical background should feel free to skim until Whizbang 3.0.
Arnold is a data scientist at Ploopli. He develops a new program called Whizbang 1.0 that queries and creates a variety of tables in a database. The Whizbang data pipeline has two steps:
whizbang.r
# Create a table with a random sample of users
query <- '
CREATE TABLE arnold.wb_proto_user_sample AS (
SELECT *
FROM ploopli.user_hits
WHERE hash(userid)%42 = 1
)
'
ploopli_run_database_query(query)
# Create 3 features for each user: n_hits, min_datetime, max_datetime
query <- '
CREATE TABLE arnold.wb_proto_user_sample_features AS (
SELECT userid,
count(userid) as n_hits,
min(date_time) as min_datetime,
max(date_time) as max_datetime
FROM arnold.wb_proto_user_sample
GROUP BY userid
)
'
ploopli_run_database_query(query)
After testing his prototype, Arnold is fired up. Whizbang is going to take the world by storm. Arnold shares Whizbang with his boss, who takes a sip of coffee, looks at the code, and sprays coffee all over his VT-100 out of sheer amazement. His boss recommends Arnold work with Barb immediately to develop Whizbang 2.0.
Barb takes a look at the code and comes back almost immediately. She can’t run the code because it has Arnold’s dataset hard-coded in all of the queries.
It’s clear that if Arnold and Barb are to collaborate on the development of Whizbang 2.0, they will need to separate parameters like tablenames and variables from the logic of their queries.
Arnold and Barb put their heads together and think:
“Wouldn’t it be nice if we could each define our own configuration and then dynamically generate the queries when we run Whizbang?”.
Arnold discovers the mustache templating language and realizes that he can use the whisker R package to template the queries in Whizbang. Now Whizbang contains templated queries like:
arnold_config.r
config <- list(dataset_id = 'arnold')
whizbang.r
# Create a table with a random sample of users
query <- '
CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample AS (
SELECT *
FROM ploopli.user_hits
WHERE hash(userid)%42 = 1
)
'
ploopli_run_database_query(whisker.render(query,config))
# Create 3 features for each user: n_hits, min_datetime, max_datetime
query <- '
CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_features AS (
SELECT userid,
count(userid) as n_hits,
min(date_time) as min_datetime,
max(date_time) as max_datetime
FROM {{{dataset_id}}}.wb_proto_user_sample
GROUP BY userid
)
'
ploopli_run_database_query(whisker.render(query,config))
When Arnold and Barb now run Whizbang 2.0, their tables happily live in their own datasets.
Code that separates configuration from logic is easy to share and means the code only changes if its logic changes.
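To make the pattern concrete, here is a minimal, self-contained sketch of the config-plus-template approach. The table name is shortened for illustration; only whisker.render() and the triple-mustache syntax are the real mechanics:

```r
library(whisker)  # CRAN package implementing mustache templating in R

# Each collaborator defines only their own configuration...
config <- list(dataset_id = "barb")

# ...while the query logic stays shared and unchanged
template <- "CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample AS (SELECT 1)"

# whisker.render() substitutes the config values into the template;
# {{{...}}} inserts the value without HTML-escaping it
query <- whisker.render(template, config)
# query is now "CREATE TABLE barb.wb_proto_user_sample AS (SELECT 1)"
```

Swapping in Arnold's config (dataset_id = "arnold") produces his query from the identical template, which is exactly why the shared code never has to change.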
Whizbang 2.0 is successfully deployed and starts making waves around the globe. Orders for VT-100s rise as bosses everywhere spray coffee in disbelief. Euphoric with the success of Whizbang 2.0, Arnold’s boss calls a meeting:
Multiple teams at Ploopli need Whizbang to create daily random samples of users. Further, users are now assigned a region code. The number of region codes is currently 3, but will grow as Ploopli grows. Whizbang needs to create separate daily sample tables for each region.
Arnold and Barb put their heads together and think:
“Wouldn’t it be nice if we could get region codes dynamically via a database query and then dynamically run Whizbang for each region on a particular date?”.
Barb finds the condusco R package. With condusco, she creates a pipeline called create_daily_region_user_sample and repeatedly runs it for each of the regions she gets from a query.
whizbang/create_daily_region_user_sample.r
create_daily_region_user_sample <- function(config){
# Create a table with a random sample of users
query <- '
CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_{{{region}}}_{{{date_suffix}}} AS (
SELECT *
FROM ploopli.user_hits
WHERE hash(userid)%42 = 1
AND DATE_FORMAT(date_time, "%Y%m%d") = "{{{date_suffix}}}"
)
'
ploopli_run_database_query(whisker.render(query,config))
}
barb_run.r
run_pipeline_dbi(
create_daily_region_user_sample,
#invocation query
'
SELECT "barb" as dataset_id,
DATE_FORMAT(CURRENT_DATE(), "%Y%m%d") as date_suffix,
region
FROM (
SELECT region
FROM ploopli.user_hits
GROUP BY 1
)
',
dbi_connection_to_ploopli
)
The run_pipeline_dbi function runs the query provided to it, called the invocation query, and receives a result. For each row in the result, it calls the create_daily_region_user_sample pipeline, passing that row to the pipeline as a one-row data frame.
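Conceptually, the dispatch described above can be sketched in a few lines of base R. This is a simplified illustration, not condusco's actual implementation (condusco additionally parses JSON fields and handles the database connection, which are omitted here):

```r
# Simplified sketch of run_pipeline_dbi's core loop: given the result of
# the invocation query, call the pipeline once per row, passing that row
# along as a one-row data frame.
run_pipeline_sketch <- function(pipeline, invocation_result) {
  for (i in seq_len(nrow(invocation_result))) {
    pipeline(invocation_result[i, , drop = FALSE])
  }
}

# Example: a stand-in invocation-query result with three regions
invocation_result <- data.frame(
  dataset_id  = "barb",
  date_suffix = "20240101",
  region      = c("us", "eu", "apac"),
  stringsAsFactors = FALSE
)

seen <- character(0)
run_pipeline_sketch(
  function(config) seen <<- c(seen, config$region),
  invocation_result
)
# seen is now c("us", "eu", "apac"): one pipeline call per region
```

Because the scalar columns (dataset_id, date_suffix) are recycled across every row, each pipeline invocation sees the full configuration it needs, plus the row-specific region.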
When Arnold and Barb are ready to put the pipeline into production, they don’t have to change the pipeline definition. They just change their invocation query, changing the dataset_id to ‘prod’. When new regions are added, the pipeline automatically creates new tables for those regions.
With condusco, Arnold and Barb can iteratively execute any pipeline function, not just queries.
“condusco lets you iteratively run the same process for each of the results of a query”
Having discovered condusco, Barb and Arnold decide they want to expand their feature generation pipeline. Ploopli users visit dozens of categories of Ploopli products. Arnold has an idea: for each user, dynamically generate one feature per top product category.
Arnold can accomplish this with condusco, leveraging whisker together with a condusco feature that converts any field containing valid JSON into an object before passing it to the pipeline for execution. Since whisker can template repeated fields in JSON, Arnold can dynamically generate a set of features for each user. Let's take a look:
whizbang/create_top_n_features.r
create_top_n_features <- function(config){
query <- '
CREATE TABLE {{{dataset_id}}}.user_features_top_{{{n}}}_{{{fieldname}}} AS (
SELECT
{{#top_features}}
SUM(CASE WHEN {{{fieldname}}} = "{{{val}}}" THEN 1 ELSE 0 END) AS n_hits_{{{val}}},
{{/top_features}}
userid
FROM {{{source_table}}}
GROUP BY userid
);
'
ploopli_run_database_query(whisker.render(query,config))
}
arnold_run.r
run_pipeline_dbi(
create_top_n_features,
# invocation query
'
SELECT
"arnold" as dataset_id,
"product_category" as fieldname,
10 as n,
"[" || GROUP_CONCAT("{ ""val"": """ || product_category || """ }") || "]" AS top_features,
"arnold.user_sample" as source_table
FROM (
SELECT
product_category,
count(product_category) as n_hits
FROM ploopli.user_hits
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
)
GROUP BY 1,2,3,5
',
dbi_connection_to_ploopli
)
In his invocation query, Arnold creates a JSON string called top_features representing an array of 10 objects, each with one attribute called val. Each val attribute contains one product_category. When he calls create_top_n_features, condusco converts the JSON string into an object before passing it to create_top_n_features, where the whisker.render() function can then expand the object into the templated query.
“condusco automatically converts JSON strings into objects for use by the pipeline”
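Here is a small standalone example of the two pieces working together: a JSON array like the one Arnold's GROUP_CONCAT builds (with made-up category values), parsed into a list the way condusco would, then expanded by whisker's section syntax. The category values and the shortened template are illustrative only:

```r
library(jsonlite)  # fromJSON(); condusco performs this conversion for you
library(whisker)

# A JSON string shaped like the one Arnold's GROUP_CONCAT builds
top_features_json <- '[ {"val": "books"}, {"val": "music"} ]'

# Parse into a list of one-attribute objects, as condusco would
config <- list(
  fieldname    = "product_category",
  top_features = fromJSON(top_features_json, simplifyVector = FALSE)
)

# whisker repeats the {{#top_features}}...{{/top_features}} section body
# once per element, emitting one SUM(...) column per category value
template <-
'{{#top_features}}SUM(CASE WHEN {{{fieldname}}} = "{{{val}}}" THEN 1 ELSE 0 END) AS n_hits_{{{val}}},
{{/top_features}}userid'

cat(whisker.render(template, config))
```

Inside the section, {{{val}}} resolves against each array element while {{{fieldname}}} falls through to the enclosing config, which is how one template yields a column per category.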
Bonus
Since Arnold created the create_top_n_features pipeline with a fieldname option, it can now be used to create the top n features for any field in the source dataset, not only product_category. Multiple users can now call the create_top_n_features pipeline for similar feature-generation tasks without having to reinvent the wheel.
Arnold and Barb started out with a prototype designed for one user and gradually added functionality: they separated configuration from logic, templated their queries, and invoked their pipelines dynamically from query results.
Many coffee-splattered VT-100s later, Arnold and Barb are able to consider releasing a library of generalized pipelines that can be used across Ploopli.