--- title: "Introduction" author: "Roland Stevenson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## A Story For experienced data scientists with a technical background, feel free to skim before starting at Whizbang 3.0. ### Whizbang 1.0: A prototype Arnold is a data-scientist at Ploopli. He develops a new program called Whizbang 1.0 that queries and creates a variety of tables in a database. The Whizbang data pipeline has two steps: **whizbang.r** ```r # Create a table with a random sample of users query <- ' CREATE TABLE arnold.wb_proto_user_sample AS ( SELECT * FROM ploopli.user_hits WHERE hash(userid)%42 = 1 ) ' ploopli_run_database_query(query) # Create 3 features for each user: n_hits, min_datetime, max_datetime query <- ' CREATE TABLE arnold.wb_proto_user_sample_features AS ( SELECT userid, count(userid) as n_hits, min(date_time) as min_datetime, max(date_time) as max_datetime FROM arnold.wb_proto_user_sample ) ' ploopli_run_database_query(query) ``` After testing his prototype, Arnold is fired up. Whizbang is going to take the world by storm. Arnold shares Whizbang with his boss, who takes a sip of coffee, looks at the code, and sprays coffee all over his VT-100 out of sheer amazement. His boss recommends Arnold work with Barb immediately to develop Whizbang 2.0. ### Whizbang 2.0: Templated queries Barb takes a look at the code and comes back almost immediately. She can't run the code because it has Arnold's dataset hard-coded in all of the queries. It's clear that if Arnold and Barb are to collaborate on the development of Whizbang 2.0, they will need to separate parameters like tablenames and variables from the logic of their queries. Arnold and Barb put their heads together and think > "Wouldn't it be nice if we could each define our own configuration and then dynamically generate the queries when we run Whizbang?". Arnold discovers the mustache templating language and realizes that he can use the [whisker R package](https://github.com/edwindj/whisker) to template the queries in Whizbang. Now Whizbang contains templated queries like: **arnold_config.r** ```r config <- data.frame( dataset_id = 'arnold' ) ``` **whizbang.r** ```r # Create a table with a random sample of users query <- ' CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample AS ( SELECT * FROM ploopli.user_hits WHERE hash(userid)%42 = 1 ) ' ploopli_run_database_query(whisker.render(query,config)) # Create 3 features for each user, n_hits, min_datetime, max_datetime query <- ' CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_features AS ( SELECT userid, count(userid) as n_hits, min(date_time) as min_datetime, max(date_time) as max_datetime FROM {{{dataset_id}}}.wb_proto_user_sample ) ' ploopli_run_database_query(whisker.render(query,config)) ``` When Arnold and Barb now run Whizbang 2.0, their tables happily live in their own datasets. > Code that separates configuration from logic is easy to share and means the code only changes if its logic changes. ### Whizbang 3.0: Query-driven pipelines Whizbang 2.0 is successfully deployed and starts making waves around the globe. Orders for VT-100s rise as bosses everywhere spray coffee in disbelief. Euphoric with the success of Whizbang 2.0, Arnold's boss calls a meeting: Multiple teams at Ploopli need Whizbang to create daily random samples of users. Further, users are now assigned a region code. The number of region codes is currently 3, but will grow as Ploopli grows. Whizbang needs to create separate daily sample tables for each region. Arnold and Barb put their heads together and think > "Wouldn't it be nice if we could get region codes dynamically via a database query and then dynamically run Whizbang for each region on a particular date?". Barb finds the [condusco R package](https://github.com/ras44/condusco). With condusco, she creates a pipeline called `create_daily_region_user_sample` and repeatedly runs it for each of the regions she gets from a query. **whizbang/create_daily_region_user_sample.r** ```r create_daily_region_user_sample <- function(config){ # Create a table with a random sample of users query <- ' CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_{{{region}}}_{{{date_suffix}}} AS ( SELECT * FROM ploopli.user_hits WHERE hash(userid)%42 = 1 AND DATE_FORMAT(date_time, "%Y%m%d") = "{{{date_suffix}}}" ) ' ploopli_run_database_query(whisker.render(query,config)) } ``` **barb_run.r** ```r run_pipeline_dbi( create_daily_region_user_sample, #invocation query ' SELECT "barb" as dataset_id, DATE_FORMAT(CURRENT_DATE(), "%Y%m%d") as date_suffix, region FROM ( SELECT region FROM ploopli.user_hits GROUP BY 1 ) ', dbi_connection_to_ploopli ) ``` The `run_pipeline_dbi` function runs the query provided to it, called the **invocation query**, and receives a result. For each row in the result, it calls the `create_daily_region_user_sample` pipeline and passes that row to the pipeline as a dataframe. When Arnold and Barb are ready to put the pipeline into production, they don't have to change the pipeline definition. They just change their invocation query, changing the dataset_id to 'prod'. When new regions are added, the pipeline automatically creates new tables for those regions. With condusco, Arnold and Barb can iteratively execute *any pipeline function*, not just queries. > "condusco lets you iteratively run the same process for each of the results of a query" ### Whizbang 3.1: Query-driven query generation Having discovered condusco, Barb and Arnold decide they want to expand their feature generation pipeline. Ploopli users visit dozens of categories of Ploopli products. Arnold has an idea to dynamically generate certain features: - Using the top 10 visited categories of products for all users, generate the number of hits for each category of products as a feature for each user Arnold can accomplish this with condusco, leveraging whisker and condusco's feature that will convert any field that is valid JSON into an object before passing it to the pipeline for execution. Since whisker can template repeated fields in JSON, Arnold can dynamically generate a set of features for each user. Let's take a look: **whizbang/create_top_n_features.r** ```r create_top_n_features <- function(config){ query <- ' CREATE TABLE {{{dataset_id}}}.user_features_top_{{{n}}}_{{{fieldname}}} AS ( SELECT {{{#top_features}}} SUM(CASE WHEN {{{fieldname}}}="{{{val}}}" THEN 1 ELSE 0 END )as n_hits_{{val}}, {{{/top_features}}} userid FROM {{{source_table}}} GROUP BY userid ); ' ploopli_run_database_query(whisker.render(query,config)) } ``` **arnold_run.r** ```r run_pipeline_gbq_dbi( create_top_n_features,' SELECT "arnold" as dataset_id, "product_category" as fieldname, 10 as n, "[" || GROUP_CONCAT("{ ""val"": """ || product_category || """ }") || "]" AS top_features, "arnold.user_sample" as source_table FROM ( SELECT product_category, count(product_category) as n_hits FROM ploopli.user_hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ) GROUP BY 1,2,3 ', dbi_connection_to_ploopli ) ``` In his invocation query, Arnold creates a JSON string called `top_features` representing an array of 10 objects with one attribute called `val`. Each of the `val` attributes contains one product_category. When he calls `create_top_n_features`, condusco converts the JSON string into an object before passing it to `create_top_n_features`, where the `whisker.render()` function can now expand the object into the templated query. > "condusco automatically converts JSON strings into objects for use by the pipeline" **Bonus** Since Arnold created the `create_top_n_features` pipeline with a fieldname option, it can now be used to create the top n features for *any field* in the source dataset, not only product_category. Multiple users can now call the `create_top_n_features` pipeline for similar feature generation tasks without having to reinvent the wheel. ### Epilogue Arnold and Barb started out with a prototype designed for one user and gradually added functionality to: - separate logic from configuration - call a pipeline iterately for each record in the results of an invocation query - develop pipelines that dynamically generate queries using JSON objects > Many coffee-splattered VT-100s later, Arnold and Barb are able to consider releasing a library of generalized pipelines that can be used across Ploopli.