Christian Llave

As a data scientist with a rich background in digital marketing, I apply data-driven solutions to real-world challenges in media. With experience in independent roles, I've managed end-to-end processes—from raw data to production models—ensuring impactful contributions for my stakeholders.


Projects

Rebuilding evaluation calculations for recommendation systems

A lot of models work. Evaluation metrics tell us which ones work well.

Evaluation metrics describe how well a model predicts on unseen data, and are used in fine-tuning. In the Python package "implicit", the evaluation module returned errors for common metrics, such as Normalised Discounted Cumulative Gain (NDCG). After some research, the accepted solution was to perform the evaluation manually.

Goal:

Create my own version of the evaluation module.

Considerations for the approach:

I'm particularly interested in the NDCG metric, because it accounts for the position and relevance of a recommended item. Starting with NDCG also allows for the easier addition of other metrics (e.g. Precision, Recall, and AUC) because they make use of the components of NDCG.

Result: I chose NDCG as the starting metric.

For reference, I'm following this paper, which describes each metric.

Loops have their place in evaluating recommendation outputs, as the metrics are often expressed in iterations and positions that determine the rank of an item. Cython is said to handle loops more efficiently than regular Python, which may be why it was used in the original evaluation module. However, I found that setting up a C compiler added overhead at the prototyping and tuning stages.

Result: As an alternative, I chose a vectorised approach using NumPy arrays, range logic, and SciPy sparse matrices instead of loops.

The recommended item needs to be associated with at least two properties (relevance, position in the recommendation). Three methods came to mind:

  • Batched loops on either the sparse matrix or long-form user interaction dataframe
    • This is straightforward and made efficient with Cython. However, without Cython, it may run into performance issues.
    • This method is outside the scope of this project.
  • Use the explode method on Pandas on a long-form user interaction dataframe
    • Maintains the order and position of recommendations.
    • Can work with a small number (K) of recommendations for each user, as using this method increases the size of the data K times.
    • Can cause trouble for those working with limited memory.
  • Compare sparse matrices directly
    • The sparse property of the matrix representation allows for memory efficiency.
    • Using the position as data values preserves the order of recommendation.

Result: Produced a new dataframe where a positive position value indicates a relevant recommendation.
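
To make the sparse-matrix comparison concrete, here is a minimal sketch on made-up users, items, and top-3 recommendation lists (this is illustrative, not the actual module's code). The positions stored as data values survive the element-wise multiplication only where a recommendation was relevant:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative data: 2 users, 5 items, top-3 recommendations per user.
# The recommendation matrix stores 1-based positions as data values so the
# rank of each recommended item is preserved.
rec_positions = csr_matrix(
    ([1, 2, 3, 1, 2, 3],            # position of the item in the top-K list
     ([0, 0, 0, 1, 1, 1],           # user index
      [4, 1, 3, 0, 2, 4])),         # item index
    shape=(2, 5),
)

# Ground-truth (validation) interactions: 1 where the user actually engaged.
truth = csr_matrix(
    ([1, 1, 1, 1],
     ([0, 0, 1, 1],
      [1, 3, 2, 0])),
    shape=(2, 5),
)

# Element-wise multiplication keeps the position only where the
# recommendation was relevant; everything else stays zero (and sparse).
relevant_positions = rec_positions.multiply(truth)
print(relevant_positions.toarray())
# [[0 2 0 3 0]
#  [1 0 2 0 0]]  -> hits at ranks 2 and 3 for user 0, ranks 1 and 2 for user 1
```

Counting the non-zero entries of this matrix is also what makes the later addition of Precision and Recall straightforward.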

Array operations within Pandas were not as straightforward as doing math on int or float values. To regain access to element-wise array operations:

  1. Access the underlying NumPy array of the Pandas Series via the pandas.Series.values attribute.
  2. Use numpy.stack to convert the 1D array of arrays into a 2D array.
  3. Apply the mathematical operations element-wise.
  4. Re-assign the result to the Pandas DataFrame column using the list function.

Result: We can then calculate the NDCG components (delta function, discounted gain, discounted cumulative gain); a short sketch of the pattern follows below.
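
A minimal sketch of that pattern on made-up data, using illustrative column names and the standard log2 discount (the exact component formulas follow the referenced paper):

```python
import numpy as np
import pandas as pd

# Illustrative frame: one row per user, each cell holding an array of the
# ranks (1-based positions) at which relevant items were recommended.
df = pd.DataFrame({
    "user_id": [101, 102],
    "relevant_ranks": [np.array([2, 3, 0]), np.array([1, 2, 0])],  # 0 = padding
})

# 1. Treat the Series as a NumPy object via .values
# 2. np.stack turns the 1D array of arrays into a 2D array
ranks = np.stack(df["relevant_ranks"].values)

# 3. Element-wise math: discounted gain = relevance / log2(rank + 1),
#    with padded zeros contributing nothing.
relevance = (ranks > 0).astype(float)
denom = np.log2(ranks + 1)
gains = np.divide(relevance, denom, out=np.zeros_like(relevance), where=denom > 0)

# 4. Re-assign the per-user arrays back to the DataFrame column with list()
df["discounted_gain"] = list(gains)
df["dcg"] = gains.sum(axis=1)
print(df[["user_id", "dcg"]])
```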

For the annotated code: Read more

Instead of a fixed K-sized list, IDCG is based on the number of relevant items for each user. This means the list of relevant items for each user has varying sizes. The size is at most K, but could be less if the user has fewer relevant items. Intuitively, representing the relevant items per user would call for a jagged array (arrays with different sizes); however, this is not supported in NumPy as of writing. This leads us to padding the rest of the array with zeros to keep a uniform list size. Zero is used as it does not interfere with the necessary calculations.

Result: Padded arrays create the range over which we can reuse the discounted gain function to calculate IDCG.
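
A small sketch of the padding idea on hypothetical relevant-item counts, assuming K = 3; the zero padding drops out of the discounted-gain sum, and NDCG then follows as DCG / IDCG:

```python
import numpy as np

K = 3
# Hypothetical number of relevant items per user (capped at K).
n_relevant = np.array([2, 3, 1])

# Build the ideal ranks 1..n_relevant for each user, padded with zeros up to K.
# A rank grid of shape (n_users, K) compared against n_relevant gives the mask.
rank_grid = np.arange(1, K + 1)                   # [1, 2, 3]
ideal_ranks = np.where(rank_grid <= n_relevant[:, None], rank_grid, 0)

# Reuse the discounted-gain formula: padded zeros contribute nothing to the sum.
denom = np.log2(ideal_ranks + 1)
ideal_gains = np.divide(
    (ideal_ranks > 0).astype(float), denom,
    out=np.zeros((len(n_relevant), K)), where=denom > 0,
)
idcg = ideal_gains.sum(axis=1)
print(ideal_ranks)   # [[1 2 0] [1 2 3] [1 0 0]]
print(idcg)          # IDCG per user; NDCG = DCG / IDCG
```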

Outcome:

This project resulted in an NDCG evaluation module for recommendation systems that:

  • is more easily generalisable, as it avoids the overhead of a C compiler.
  • is particularly useful for Python coders who are not familiar with C.
  • can be more computationally efficient by avoiding Python loops.

Skills applied:

  • Translating Linear Algebra to code
  • Using Sparse Matrices:
    • Enabled a more memory-efficient execution while using fewer computational resources.
    • Matrix operations (ex. element-wise math) allowed me to compare predictions with the validation set without loops.
  • Array operations on DataFrames:
    • Provided an alternative to nested loops for tabular data.
    • Removed the need for the row-wise .apply method.
  • Using Padded Arrays:
    • Acted as a workaround for the lack of support for jagged arrays.
    • May be limited to use cases where the padding value doesn't interfere with calculations.

User churn model

Predicting inactivity using activity, and the lack thereof

Digital media rely on user activity to determine marketable audiences. Being able to predict churning users allows for retention efforts to be put in place. In this project, I defined churn as a 30-day streak of a logged-out state, or a 30-day streak of non-listenership.

My main considerations were:

  • Data leakage: Temporal splitting into 30, 60, 90 day windows of activity prevented this.
  • Representing user activity: To level the amount of potential activity for each user, I used a moving window of time spent listening.
  • Including or excluding churned users in training: Reactivation was uncommon, but the possibility of it happening warranted their inclusion. Training the model with churned users included improved performance.
  • Correct training metric: As it was important to identify actual churned users, I used the Precision-Recall AUC metric for tuning with Optuna (a small tuning sketch follows this list).
  • Feature selection: Absolute listenership is tied to activity, so I included changes in listenership, and directions of change as features to better inform the model.
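
As noted in the tuning point above, here is a minimal sketch of tuning against PR-AUC (approximated by average precision) with Optuna, on synthetic data and with a generic gradient-boosting classifier standing in for the production model; the real project used temporal splits rather than the random split shown here:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn dataset.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def objective(trial):
    model = GradientBoostingClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 200),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 6),
        random_state=0,
    )
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]
    # Average precision approximates the area under the precision-recall curve.
    return average_precision_score(y_val, scores)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```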

The resulting model predicted around 70% of users correctly, with fewer than 10% being missed churners. Users are then assigned churn predictions from the model, which inform the retention team's activation efforts.

Main Takeaways:

  • Temporal considerations: splitting the dataset by time windows, engineering features based on time windows, representing inactivity (gaps vs numbers), representing change over time
  • Data science: decision-making and experimenting on different models, representations, and modelling components.
  • Data analysis: presentation of the model's behaviour to identify trends in user behaviour relevant to churn, broken down by brand and changes in user activity.
  • Data engineering: Identifying feature importance in the working model justifies more effort into data quality for relevant fields in the warehouse.

Google Analytics Raw Data: Wrangling

Zero to Hero: From raw data to business-ready assets.

In the early days of this digital radio product, I focused on establishing big data pipelines to meet business needs for reporting, visualisation, and machine learning applications. Google Analytics as a platform displays aggregated data; however, the business required analyses involving event-level data. This prompted the need for the Raw Data, which scales up in volume relative to the brand's user activity. With the requirements in mind, I ensured all my contributions were compute-optimised for the large-scale nature of the data. This foundational initiative resulted in high-quality, usable data for querying, integration, and modelling.

A number of core challenges arose from the nature of the data and the needs of the business:

Using dataclasses and vectorised operations in Snowflake Python (SnowPark, similar to PySpark):

  1. Determine all the fields that will be extracted.
  2. Apply the extraction function to the nested field to extract the fields as columns.
  3. Partition the data by event types (audio engagement, general engagement, etc) and apply the function by partition.
This parallelised procedures and tasks, enabling smaller partitions (e.g. onboarding events) to be processed independently of larger partitions (e.g. audio engagement).
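
A rough sketch of that pattern, with hypothetical table, column, and field names; the real pipeline runs inside Snowflake against the project's own field specifications, and assumes an existing Snowpark session:

```python
from dataclasses import dataclass
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Hypothetical field specification: which keys to pull out of the nested
# event payload, and the column name to give each of them.
@dataclass
class FieldSpec:
    source_key: str
    alias: str

AUDIO_FIELDS = [
    FieldSpec("event_name", "EVENT_NAME"),
    FieldSpec("audio_duration", "AUDIO_DURATION"),
    FieldSpec("station", "STATION"),
]

def extract_partition(session: Session, event_type: str, fields: list[FieldSpec]):
    """Flatten the nested payload for one event-type partition."""
    events = session.table("RAW_EVENTS").filter(col("EVENT_TYPE") == event_type)
    extracted = [col("EVENT_PAYLOAD")[f.source_key].alias(f.alias) for f in fields]
    return events.select(col("USER_ID"), col("EVENT_TIMESTAMP"), *extracted)

# Each partition (audio engagement, onboarding, ...) can then be processed
# and persisted independently, e.g.:
# audio_df = extract_partition(session, "audio_engagement", AUDIO_FIELDS)
# audio_df.write.save_as_table("AUDIO_EVENTS_FLAT", mode="overwrite")
```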

The solution needed to be close to deterministic, since errors would flow downstream, so I went with a nearest-neighbour approach:

  1. Create an initial lookup table with the sessions, start and end times.
  2. Create pseudo-identifiers for each event.
  3. Map the events to the lookup table, and calculate each event's timestamp difference to the joined session's start and end timestamps.
  4. Take the closest session within an acceptable time difference.
  5. Iterate until no more mappable events are available.
Reducing the null values in sessions allowed for more complete calculations of session counts per user and session duration. Both are used mostly for reporting requirements.
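
As an illustration only, here is a pandas analogue of that nearest-neighbour mapping logic on made-up data; the production version ran against Snowflake tables:

```python
import pandas as pd

# Illustrative frames (real data lives in Snowflake; names are made up).
sessions = pd.DataFrame({
    "pseudo_id": ["a", "a", "b"],
    "session_id": [1, 2, 3],
    "start": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 12:00", "2024-01-01 10:00"]),
    "end":   pd.to_datetime(["2024-01-01 09:30", "2024-01-01 12:45", "2024-01-01 10:20"]),
})
events = pd.DataFrame({
    "event_id": [10, 11, 12],
    "pseudo_id": ["a", "a", "b"],
    "ts": pd.to_datetime(["2024-01-01 09:31", "2024-01-01 11:59", "2024-01-01 10:05"]),
})

TOLERANCE = pd.Timedelta(minutes=5)

# Candidate pairs: every event joined to every session of the same pseudo ID.
pairs = events.merge(sessions, on="pseudo_id")

# Distance to the session window: zero if the event falls inside it,
# otherwise the gap to the nearest boundary.
gap_before = (pairs["start"] - pairs["ts"]).clip(lower=pd.Timedelta(0))
gap_after = (pairs["ts"] - pairs["end"]).clip(lower=pd.Timedelta(0))
pairs["distance"] = gap_before + gap_after

# Keep the closest session per event, within the acceptable tolerance.
best = (pairs[pairs["distance"] <= TOLERANCE]
        .sort_values("distance")
        .drop_duplicates("event_id"))
print(best[["event_id", "session_id", "distance"]])
```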

Solution:

  1. Determine mappable and unmappable user IDs.
  2. Map 1:1 pseudo IDs with user IDs using a mapping table.
  3. Map 1:many pseudo IDs with user IDs using window functions on Snowflake SQL. This attributes the most recent user ID to the event with a missing user ID.
This imputed over 90% of null values for audio events, which attributes listening metrics to each user more accurately. This is primarily in preparation for machine learning applications.
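
A pandas analogue of the 1:many step, shown only to illustrate the "most recent known user ID" logic that the Snowflake SQL window functions implement; data and column names are made up:

```python
import pandas as pd

events = pd.DataFrame({
    "pseudo_id": ["p1", "p1", "p1", "p2", "p2"],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:05", "2024-01-01 09:10",
        "2024-01-01 10:00", "2024-01-01 10:03",
    ]),
    "user_id": ["u42", None, None, "u7", None],
})

# Order events in time, then carry the most recent known user ID forward
# within each pseudo ID (the window-function equivalent of a forward fill).
events = events.sort_values(["pseudo_id", "ts"])
events["user_id_imputed"] = events.groupby("pseudo_id")["user_id"].ffill()
print(events)
```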

Some show_name values were assigned to the wrong brand and a constant 'ERROR' string occupied the user ID field. Solution:

  1. Communicate the errors to the product team so they can send correct data for the succeeding records.
  2. Once the correctly associated fields start coming in, create a mapping table using the latest assignments, and apply the mapping.
  3. For constant values, apply the same 1:1 or 1:many mapping strategy.
For reporting, this solution attributes the correct performance metrics to the brands. For feature engineering, this assigns the correct degree and preference of listenership to the users.
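
A small illustration of the mapping-table step on made-up records; the real mapping was maintained in Snowflake:

```python
import pandas as pd

# Illustrative records: the most recent rows carry the corrected brand per show.
events = pd.DataFrame({
    "show_name": ["Morning Drive", "Morning Drive", "Late Nights"],
    "brand": ["WRONG_BRAND", "Rock FM", "Jazz FM"],
    "loaded_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-02-01"]),
})

# Mapping table built from the latest (corrected) assignment per show.
latest = (events.sort_values("loaded_at")
                .drop_duplicates("show_name", keep="last")
                .set_index("show_name")["brand"])

# Apply the mapping to all historical records, overriding stale values.
events["brand_corrected"] = events["show_name"].map(latest)
print(events)
```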

Main Takeaways:

  • Flattening, transforming, and imputing large-scale tables without default join keys.
  • Partitioning, parallel processing, vectorisation, and caching helped with computational efficiency.
  • Mapping functions support categorical imputation; however, it is best to communicate with developers to fix data issues on the backend.

Ad Request Optimisation

Increased profit margins by limiting ad requests to profitable observations.

Supply-Side Platforms manage ad space by deciding which instances of web traffic are worth showing ads to; however, not all traffic is profitable, and sending these instances to the auction (real-time bidding) can be costly. The data and potential target variables resembled a marketing funnel, where preceding steps inform the later ones. This project aimed to drop unprofitable traffic to save on costs, while retaining as much revenue as possible. As a solution in the digital ad industry, it had to deliver predictions in a fraction of a second.

As the data was mostly categorical, the solution made use of categorical features, and I explored three ensemble frameworks commonly used for this data type. Given the funnel-like nature of the data, I created three different architectures:

  • Simple: directly predict each candidate target variable using identified features. This was simple and effective.
  • Narrowing: have target variables learn only based on observations present in the preceding step.
  • Cascading: predict succeeding target variables using outputs from previous steps of the funnel as features. This mimicked the real-time bidding behaviour most closely (sketched below).
The delivered solution made use of the simple machine learning architecture to predict revenue. The result satisfied the company's data needs while incurring 20% lower opportunity costs than the previous model.
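
A rough sketch of the cascading idea on synthetic data, with generic scikit-learn gradient-boosting models standing in for the categorical-aware ensemble frameworks that were actually explored; stage names and features are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ad-request data: stage 1 = does the request get a bid,
# stage 2 = how much revenue it brings if it does.
X, bid = make_classification(n_samples=4000, n_features=12, random_state=0)
revenue = bid * np.abs(np.random.default_rng(0).normal(1.0, 0.5, size=len(bid)))

X_tr, X_te, bid_tr, bid_te, rev_tr, rev_te = train_test_split(
    X, bid, revenue, test_size=0.25, random_state=0
)

# Stage 1: predict the earlier funnel step.
stage1 = GradientBoostingClassifier(random_state=0).fit(X_tr, bid_tr)

# Stage 2: feed stage 1's predicted probability in as an extra feature,
# mimicking how earlier funnel outcomes inform the later ones.
def with_stage1(X_part):
    p = stage1.predict_proba(X_part)[:, [1]]
    return np.hstack([X_part, p])

stage2 = GradientBoostingRegressor(random_state=0).fit(with_stage1(X_tr), rev_tr)
pred_revenue = stage2.predict(with_stage1(X_te))
```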

Additionally, this project went through the production process, with the accompanying assets created along the way: a threshold seeker, hyperparameter tuning, modules, tests, a Docker image, an Airflow DAG, and deployment as a cloud service.

Main Takeaways:

  • Categorical handling: Exploratory analyses, transformations, imputation, and modelling.
  • Data engineering: Performing ETL on data extracted from a cloud source.
  • Data science: Decision-making and experimenting on different models, infrastructure, and methodologies.
  • Data analysis: Visualisation and presentation of the solution's value to key stakeholders.
  • Software engineering: Experiencing the process of putting code into production.

3-Step ETL Reporting Tool

Integrate 5 Brand Data Sources programmatically to populate reporting records.

Management makes use of spend, revenue, and eCommerce metrics to make marketing decisions. The stakeholders had a preference for tables that could be explored by date range and broken down by date. Another consideration was that the tool would be used by non-technical stakeholders.

The data needs were met by integrating brand assets (Facebook, Instagram, Shopify, Google Ads, and Google Analytics) into an interactive reporting tool. With non-technical stakeholders in mind, I set my own constraints on top of meeting the data needs:

  • They should not need to code or install Python to use the tool.
  • The data should be formatted in a way they recognise: tabular like Excel.
  • Ensure fool-proof API access and security.

Google assets had native integrations to the selected reporting tool. For the other channels, I used their respective APIs to extract, transform, and store information on a cloud asset. I then integrated all the data on the reporting tool, created calculated fields that combined metrics from different sources, and created the necessary visualisations.

Main takeaways:

  • One of the challenges at the time was dealing with APIs, which required a lot of reading up on documentation. One of my main takeaways was to make my code style match the demands of each API's documentation. This allowed me to follow the logic and ensure convenient points to edit my code should there be any changes in the documentation.
  • Dealing with JSON objects meant I had to transform column values that were non-singular, such as lists and dictionaries. Working on this project allowed me to develop ways to convert non-relational formats to relational ones in a vectorised fashion, which is more efficient than iterating over observations (a small example follows this list).
  • Having the end-user in mind gives a clear picture of the limitations in terms of user experience and infrastructure.
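
As referenced above, here is a small illustration of flattening a nested, list-valued API payload into a tidy table with vectorised pandas operations; the payload shape and field names are made up:

```python
import pandas as pd

# Illustrative API payload with nested and list-valued fields.
payload = [
    {"campaign": "summer_sale", "spend": 120.5,
     "actions": [{"type": "purchase", "value": 8}, {"type": "add_to_cart", "value": 21}]},
    {"campaign": "brand_awareness", "spend": 80.0,
     "actions": [{"type": "purchase", "value": 2}]},
]

# Flatten nested dictionaries into columns in one pass.
df = pd.json_normalize(payload)

# Explode the list column into rows, then flatten the remaining dicts.
df = df.explode("actions", ignore_index=True)
actions = pd.json_normalize(df["actions"].tolist()).add_prefix("action_")
tidy = pd.concat([df.drop(columns="actions"), actions], axis=1)
print(tidy)
```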

Exploring mental health incidents in New Zealand Police Data

Explore trends in mental health report misclassifications to the police as indicators of supplementary social services.

In 2021, the New Zealand Police expressed a need for a better approach to call-outs related to mental health issues. To ease the police force's workload in that regard, it is worth exploring existing trends in mental health-related reports.

Using the NZ Police's Demand and Activity data as the main source, I used R to group and visualise police reports related to mental health based on the activity's classification, time, and date.

This allowed trends of mental health reports to surface, showing which aspects of social services could be focused on to provide better mental health support.

Main takeaways:

  • There is value in being able to engineer additional columns from existing ones (feature engineering) even for exploratory analyses like this.
  • This was a good exercise in exhausting different cuts of the data to surface trends.
  • For exploratory tasks, there is value in starting from line items with a high proportion of observations as a filter when drilling down on the data.

Youtube API Data Extraction

Extract video and channel data from search queries, and store them to a database.

For my first Python project, I wanted to understand popular video topics in a certain niche and explore YouTube as a data source. One of the largest limitations of the project was the API rate limit, which compelled me to design an ETL process that avoids redundant extraction.

I used Python to extract data, Pandas to perform transformations, and SQLite to store the results. In its first iteration, the project allowed for visualisation and inspection of overarching video trends and the connectedness of channels. In future iterations, I aim to apply machine learning techniques as a form of feature engineering or predictive analytics.
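
A minimal sketch of the caching idea behind that ETL: store raw responses in SQLite keyed by video ID and only call the API for IDs not yet cached. The fetch_fn here is a stub standing in for the real YouTube Data API call:

```python
import json
import sqlite3

# Minimal caching layer: store raw API responses keyed by video ID so that
# repeat runs skip videos that were already fetched (saving API quota).
conn = sqlite3.connect("youtube_cache.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS video_raw (video_id TEXT PRIMARY KEY, payload TEXT)"
)

def fetch_if_missing(video_ids, fetch_fn):
    """fetch_fn is a placeholder for the actual YouTube API call."""
    cached = {row[0] for row in conn.execute("SELECT video_id FROM video_raw")}
    to_fetch = [v for v in video_ids if v not in cached]   # de-duplicate work
    for vid in to_fetch:
        payload = fetch_fn(vid)                            # one API call per new ID
        conn.execute(
            "INSERT OR REPLACE INTO video_raw VALUES (?, ?)",
            (vid, json.dumps(payload)),
        )
    conn.commit()

# Example usage with a stubbed fetch function:
fetch_if_missing(["abc123", "def456"], fetch_fn=lambda vid: {"id": vid, "title": "stub"})
```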

I intend to use this project as a foundational data source for me to explore other data projects and techniques.

Main takeaways:

  • Refactoring is useful for repeatable functions.
  • When the cost of API calls is high, storing API responses allows for experimentation on their contents.
  • There is value in making connections between the data available across API endpoints.
  • When result relevance is important, it's good to know if the API responses are already arranged by relevance to reduce the amount of data to store.
  • De-duplicating and batching allow for more efficient API querying by reducing the number of times calls are made.


Engagements

Talked about basic Categorical Variable Handling

For Christchurch Data Science Meetup

Covered the introduction to data types, exploratory analyses, encoding, feature engineering, and machine learning options for categorical variables.

20 March 2024

Experience

Data Scientist

MediaWorks

As the sole Data Scientist, I elevated the digital radio data from its raw and nascent state to a mature and analytics-ready asset. I pioneered the end-to-end development of scalable data pipelines, implementing methodical transformations to deploy machine learning (ML) models and stakeholder reports. With the large volumes and varieties of data, I developed ML models such as user activity and churn modelling using Python and Snowflake to guide business strategy.

June 2024 - Present
  • Pioneered and owned developing, deploying, and optimising big data pipelines for app and website data via Python and Snowflake. This results in a single source of truth for app and website data at a granular scale.
  • Increased user data quality by 90% using imputation models with Snowflake SQL.
  • Predicted the #1 song and artist on The Rock 2000 Countdown for 2024 using machine learning.
  • Fine-tuned a Retrieval-Augmented Generation chatbot model for financial reporting data.
  • Developed and automated ETL apps that connect to APIs via Python and Snowflake.
  • Developed bespoke engagement metrics to model listening behaviour more accurately compared to legacy methods.
  • Guided business strategy by forecasting user activity and churn using machine learning, and presented insights using automated interactive Tableau dashboards.
  • Modelled the database structure for scalable and compute-optimised reporting, visualisation, and machine learning.
  • Advise on digital analytics and marketing operations such as information gathering and integrations.
  • Spearheaded using GitHub and established guidelines within the Business Insights and Analytics team.

Intern Data Scientist

Ströer Labs NZ

Worked on the ad request optimisation project, which aimed to prevent unprofitable traffic from entering auctions, ultimately improving profit margins through cost savings.

November 2023 - February 2024
  • Created a model that reduced unprofitable advertising traffic down to 50%, reduced opportunity costs by 20%, and increased profits compared to the past model.
  • Experimented with various categorical encoding methods, machine learning techniques, and solution architectures in Python.
  • Performed ETL on advertising traffic data through pySpark as preparation for modelling.
  • Productionised the project by creating Python modules, test files to verify bespoke functions, a docker image to run the entire program, and an Airflow DAG file for scheduling.
  • Ran the project on a cloud-based machine learning platform.

Part-Time Google Specialist and Digital Analyst

Orba Shoes

Manage the creation, optimisation, and reporting of Google assets.

March 2023 - November 2023
  • Developed a multi-channel strategy for the brand using localised market research data on key buyer personas.
  • Audited and managed Google Assets, and optimised towards ad operations and SEO.
  • Planned, implemented, analysed, and optimised Google Ad operations through historical data analysis and A/B testing.
  • Migrated and managed Google Analytics properties to Google Analytics 4 by ensuring past data points and analyses were maintained.
  • Used Python to periodically extract campaign data from Facebook Ads and the eCommerce platform onto cloud storage.
  • Created a dashboard integrating 5 data sources.
  • Optimised copy and targeting for Google Ads.

Part-Time Research Assistant

University of Canterbury - Arts Digital Lab

Programmatically gather, analyse, and visualise sentiment on the Taniwha using Natural Language Processing techniques.

August 2023 - November 2023
  • Extracted text data from online sources using Python.
  • Used NLP via Python to extract, transform, and prepare text data for analyses.
  • Used machine learning to group documents and perform sentiment analysis.
  • Research, collate, and analyse information, and summarise findings in a variety of formats.

Senior Performance Marketing Manager

David & Golyat Management, Inc.

Provided end-to-end marketing consultancy services ranging from market analyses and brand audits to campaign management and optimisation.

September 2021 - January 2023
  • Conduct media planning using quantitative market analysis tools and brand audits to cover all stages of the consumer journey.
  • Manage and optimize Paid Search, Facebook, and Programmatic (Google) campaigns from end-to-end.
  • Reduced App Install costs by 30% and increased Programmatic clickthrough by 4x.
  • Implement improvements to planning, implementation, analyses, and optimization processes in the organization.
  • Reduced lead generation acquisition costs by 44% in 2 months through systematic audience segmentation, A/B Testing, and significance testing in RStudio.
  • Increased eCommerce Return on Ad Spend by 4.5x on Facebook Ads through consumer segmentation and increased Programmatic clickthrough by 9x by optimizing customer touchpoints.
  • Use Web Scraping to create sentiment analyses to provide insights to the creative direction.
  • Use Google Analytics to create comprehensive analyses on consumer profiles, acquisition channels, and website behavior leading to efficient follow-through campaigns.
  • Use Google Query Language to automate the weekly reporting process, reducing completion time by 60%.

Digital Project Manager

Viventis Search Asia

End-to-end management of recruitment marketing campaigns and change management towards the use of the Hubspot Suite for Marketing.

February 2020 - September 2021
  • Gained up to 30x return on ad spend (ROAS) for Recruitment Marketing campaigns using Facebook Ads for a Financial Services account with highly specialized roles, and 76x ROAS for an account in the BPO industry.
  • Handled end-to-end Facebook Ad and Google Ad B2B campaigns ⁠— covering setup, optimization, and reporting within a restricted set of KPIs and gained at least 70% sales accepted leads among signups.
  • Created a business case, implementation plan, and digital marketing strategy for B2B operations, and got executive buy-in.
  • Created a company-wide lead nurture strategy, lead qualification strategy, content plan, and reports dashboard aimed at accelerating pipeline conversions.
  • Increased the ease of sales enablement through the creation of training plans, automation, and other digital assets.
  • Produced creative assets for both Facebook Ads and Programmatic Ad B2B campaigns.
  • Increased ROAS using Google Tag Manager to set up custom events for remarketing and audience-building.
  • Reduced 47% of manual B2B marketing operations in 3 months by spearheading the technical setup and integration of the CRM and marketing automation tool.
  • Created an interactive dashboard using Google Data Studio to combine data from Facebook Ads, Google Analytics, and Google Sheets. Enabled auto-updating using Facebook Graph API and Google Query Language.
  • Optimized data management using marketing automation workflows to determine contact properties for segmentation, qualification, and sales enablement.
  • Optimized marketing operations management through KPI creation and reporting by establishing clear data points with the use of marketing automation.
  • Visualized Google Search Console data through Google Data Studio to identify SEO gaps and content opportunities.

Senior Digital Marketing Specialist

SQREEM Technologies

Managed global campaigns from end-to-end and modelled the data structure for the proprietary Ad Serving tool.

March 2019 - January 2020
  • Handled Facebook Ad and Programmatic Ad campaigns ⁠— covering setup, data-led optimization, and reporting within a restricted set of KPIs.
  • Doubled a multinational pharmaceutical company's sales, increased website traffic, and increased event attendance. This campaign yielded contracts with the client's 80+ global branches.
  • Reduced a major banking client’s Cost per Acquisition by 90% compared to historical performance.
  • Produced creative assets for both Facebook Ads and Programmatic Ad campaigns.
  • Reduced up to 60% of ad operations through automated ad setup by spearheading and modelling the development of the proprietary Ad Serving tool, which mapped data requirements and platform restrictions.
  • Analyzed gaps in ad operations to be automated and created a business case to demonstrate the projected increase in efficiency.
  • Worked with developers to create the minimum viable product, its expected functions and workflows.

Interim Creatives Director

SQREEM Technologies

Liaised high-level creative requirements to the team and contributed domain expertise for the proprietary Ad Creation tool.

February 2019 - December 2019
  • Liaised, evaluated, and managed creative requirements from the clients to the Digital Creatives Team to meet strict KPIs in the effort to optimize creative performance for the Digital Marketing Team.
  • Crafted and implemented a training plan for the team.
  • Troubleshot issues regarding interactive ad production.
  • Spearheaded developments of the proprietary Digital Creatives tool in the effort to optimize the workflow for both the Digital Marketing Team and Digital Creatives Team.

Digital Marketing Specialist

SQREEM Technologies

Managed global campaigns from end-to-end and designed creative materials.

July 2018 - March 2019
  • Handled Facebook Ad and Programmatic Ad campaigns ⁠— covering setup, data-led optimization, and reporting within a restricted set of KPIs.
  • Produced creative assets for both Facebook Ads and Programmatic Ad campaigns.
  • Produced creative studies, workflows, and developments to the company's proprietary creative tool.
  • Acted as the main guide for Digital Marketing team on the effective use of the proprietary creative tool.

Education

University of Canterbury

Master of Applied Data Science
  • GPA: 8.58/9
  • Graduating with Distinction
February 2023 - February 2024

Ateneo de Manila University

Bachelor of Science in Electronics and Communications Engineering
  • Second Honors (2nd Semester AY2015-2016)
  • Minor in Japanese Studies
2012 - 2017

Skills

Programming Languages & Tools
Other Programming Tools
  • Snowflake + SnowPark
  • SQL / SQLite
  • Apache Airflow
  • Google Cloud Platform APIs
  • Meta APIs
Commercial Competencies
  • Project Management
  • Data Management and Database Design
  • Digital Advertising and Marketing Consulting
  • Market Research
  • Campaign Management
  • Digital Analyses
  • Brand Audits

Interests

Besides data science work, I enjoy activities like:

  • Hiking and going to the gym: I like staying active, usually through a group fitness activity.
  • Playing story-driven video games: I like interesting narratives in interactive media as it allows for a creative way to learn more about the world.
  • Cooking: I cook for both practicality (through meal prep) and enjoyment (through the occasional indulgent treat).


Awards & Certifications

  • University of Canterbury (Dec 2023): Certificate of Excellence (Master of Applied Data Science)
  • DataCamp (Apr 2021):
    • Introduction to Python
    • Intermediate Python
  • Google Analytics Academy:
    • Google Tag Manager Fundamentals (Mar 2021)
    • Google Analytics for Beginners (Mar 2021)
    • Advanced Google Analytics (Apr 2021)
  • The Trade Desk:
    • Targeting and Data Management (Sep 2019)
    • Programmatic Principles (Sep 2019)
    • Optimization & Strategy (Sep 2019)
    • Omnichannel & Inventory Types (Sep 2019)
    • Connected TV Curriculum (Sep 2019)