Posts

Slop & Shop: Another vertical video, snap-scrolling, content consumption app

But for a cause!

A few weeks ago, I was hypnotically scrolling through a vertical video, snap-scrolling, content consumption app when I encountered a bug: every video in my infinite feed was a “sponsored” post trying to get me to buy something. That same week, there was a new app gaining attention: a vertical video, snap-scrolling, content consumption app in which every video was AI-generated. I paused for a moment and thought to myself, “Am I in hell?” I quickly decided to save that question for another day, then thought, “Why not an app that is… both?”

So here, I’m embarrassed to announce my latest and dumbest creation yet: Slop & Shop, a vertical video, snap-scrolling, content consumption app in which every video is an AI-generated advertisement that tries to get you to buy something. The app can only do two things: play AI-generated ads in an endless feed, and (plot twist!) encourage you to make our future less sloppy and less shoppy by donating to a few nonprofits committed to making high-quality, human-created knowledge available to the world, for free.

Play around with the app for a minute, then please tap one of the “Shop now”/“Get offer”/“Sign up” buttons to donate to worthy causes.

Want to donate without slopping and shopping? Check out the following non-profits:

Wikimedia Foundation (Wikipedia): For reliable human editing and review and ad-free knowledge. Donate or learn more.
Creative Commons (CC): For tools that enable ethical sharing and reuse of creative works. Donate or learn more.
The Internet Archive: For maintaining a permanent digital library and historical record. Donate or learn more.

...

Camera 3000: The Camera of the Future

An AI-powered camera app that captures everything in the scene except the important details.

I’m excited to share my new web app: Camera 3000: The Camera of the Future [1]. It’s an AI-powered camera app that captures everything in the scene except the important details, so the photo you get is eerily similar to reality, but not quite right. [2] It just barely “takes a photo.”

I was loosely inspired by the “What is a photo?” debate and the rampant overuse of AI. I’m sure there’s something deep to say about both of these, but I have nothing new to add to that discourse. In reality, I just thought it’d be fun to take those ideas to the extreme.

How it works: Given a photo, the app first uses the Gemini API to generate a detailed description of the image. Next, the Imagen API uses that description as a prompt to create a new, entirely-AI image of something that only sort of resembles the original scene.

How it’s made: In the spirit of overusing AI, I used Gemini/Canvas [3] to write most of the app’s code and followed its instructions to get the app live. In the spirit of over-overusing AI, I did almost everything it told me to do, too, including things I knew to be terrible, like publicly exposing my API key [4]. I’m sure there’s a lesson here somewhere. [5]

The app could use some polish: The error states are wonky, the UI is barebones, and I’ve given up on allowing a range of aspect ratios for the photo, but alas, it’s just a silly little novelty app. It’s functional, so just have fun with it.

Please be kind in your use of this app: I recognize that image generation can be used for malicious purposes, so the app will error out if a user attempts to generate harmful images. All of the AI generated images adhere to the SynthID digital watermarking standard too.

Please go try it out! The AI-everywhere future is already here, and it’s partially running on a publicly exposed API key. [6]

[1] It’s the “camera of the future” in the same way that Dippin’ Dots is the “ice cream of the future”: it’s not, and it’s not even good at being the camera / ice cream of the present. For what they are, Camera 3000 and Dippin’ Dots are both incredibly energy inefficient too. The parallels are unparalleled.
[2] This is somehow only the second-dumbest thing I’ve ever created. See my froyo map from 2016. Still ranking #1 on Google for “froyo map nyc,” despite the map being broken 🎉.
[3] I used Gemini/Canvas. I just learned about Firebase Studio last week, which would’ve made building this app 1000x easier.
[4] Don’t worry, I put some severe restrictions on the API key’s use. But still, please send your thoughts and prayers that I don’t wake up to a million-dollar GCP bill.
[5] In the spirit of over-over-overusing AI, I tried to use Gemini to write this blog post based on the app’s code, but the generated text was just awful. The number of times I prompted “make this sound less pompous” …
[6] Gemini wrote this closing sentence, though. Kinda funny, but still sounds a bit pompous. I’m sure there’s a lesson here too.

...
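For the curious, the two-step flow described under “How it works” might look roughly like the sketch below. This is not the app’s actual code: the post doesn’t show how it calls the Gemini or Imagen APIs, so both steps are passed in as plain functions, and every name here (`take_photo`, `CameraResult`, `fake_describe`, `fake_generate`) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: the real app calls the Gemini and Imagen APIs,
# but those calls are not shown in the post, so they are injected here.
DescribeFn = Callable[[bytes], str]   # photo bytes -> text description
GenerateFn = Callable[[str], bytes]   # text prompt -> generated image bytes


@dataclass
class CameraResult:
    description: str  # what the describing model said about the photo
    image: bytes      # the newly generated image


def take_photo(photo: bytes, describe: DescribeFn, generate: GenerateFn) -> CameraResult:
    """Describe the photo, then generate a new image from the description alone."""
    description = describe(photo)
    # The original pixels are never passed to the generator; only the text is.
    image = generate(description)
    return CameraResult(description=description, image=image)


# Fake models so the sketch runs without any network access or API key.
def fake_describe(photo: bytes) -> str:
    return f"a scene encoded in {len(photo)} bytes"


def fake_generate(prompt: str) -> bytes:
    return prompt.upper().encode("utf-8")


result = take_photo(b"raw-jpeg-bytes", fake_describe, fake_generate)
```

The only design point worth noting is that the generator never sees the original photo, which is why the output only sort of resembles the scene.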

Defining meaningful metrics for product teams

A checklist for developing better product metrics & KPIs

Defining clear KPIs is one of the most important things for a Product team, and (unfortunately) it’s incredibly easy to do poorly. A good metric not only helps evaluate your product’s performance, but also guides your team in building the right things to progress you toward your goal.

The framework described below is one I’ve used for a few years while working with teams that develop software products, and it has been very helpful for me, my Data Science teammates, and our Product partners in measuring the success or failure of the products we build.

Shared definitions for objectives, KPIs, & product metrics

Before diving in, let’s define a few relevant terms and how they relate to each other.

An objective is a description of the impact you want to have on the business. It should be qualitative, easy to understand, and somewhat obvious.

A key performance indicator (KPI) (aka success metric) is the one, top-level metric that measures if your product is driving the outcome outlined in the objective. This is the quantifiable top-level goal of your product, and should be a more quantitative statement of your objective. This is the metric you use to measure the performance of your product. Your success metric should be set longer-term and the definition of it shouldn’t change very much, if at all, from quarter to quarter.

A product metric measures the success of a specific feature. These are lower-level metrics than KPIs: product metrics are something you should monitor to make sure your feature is facilitating the user behaviors that ultimately contribute to your KPI. Product metrics are also helpful in reporting, to ensure users are making use of the feature you’ve built in the way that you intend.

A health metric is used to monitor the technical health of a product. These are usually more pertinent to Engineers, but important for Product leaders to monitor as well. Health metrics include things like page load time, API response time, etc.
I won’t focus on them here, but want to call out that they’re distinct from these other types of metrics.

You may have different names for some of these concepts, but the ideas behind them should be more or less the same. What’s most important is aligning on a common vocabulary for you and your team to use when talking about these ideas.

[Figure: a bad metric]

The “good metric” checklist

Since it’s so easy to come up with the wrong metric, I like to think through a few criteria while brainstorming to ensure we settle on a metric suitable to the product. The criteria below apply to both KPIs and Product Metrics.

Specific & sensitive: Metrics should be specific to the product or feature, and need to be explicitly and quantitatively defined. The metric should also be sensitive enough to measure the impact we expect to see.

Robust: To complement the sensitivity criterion above, we also need to make sure the metric is measuring only the effect of the product of interest, and that it isn’t reactive to things we expect to change but don’t control. Related to internal validity, we should try to avoid using a metric that can be significantly influenced by anything other than the product/feature we care about.

Measurable: This one is kind of obvious, but a metric must be something that we can actually measure. It’s not uncommon to ideate a bunch of “ideal” metrics that would perfectly measure the impact of your product, but that end up being impossible or infeasible to really capture.

Interpretable: Metrics should be easy to understand and agreed upon by those whose success is measured by the metric. There’s often a tradeoff between simplicity and accuracy, and I typically err on the side of simplicity. A metric that’s hard to understand provides none of the benefits listed in the section below.
Aligned: Objectives, KPIs, and Product Metrics are hierarchical: every product should have an objective, a KPI that quantitatively measures progress toward achieving the objective, and then multiple product metrics to evaluate the performance of individual features. A product’s KPI should also be aligned upward with higher-level company metrics. Any movement in these metrics should also be reflected in and contribute to those above them in the hierarchy.

This certainly isn’t an exhaustive list, but it captures some of the most important criteria. If your metric meets all five of these conditions, you should be in fairly good shape. If your metric doesn’t meet one or more of these criteria, you’ll likely need to ideate other metrics to use for your product.

Benefits of this approach

It takes a fair amount of effort to pick the right metric, and this is why it all matters!

Clarity of thought: Often what seems obvious to you is not so clear to others. Defining your goals & objectives in terms of hard numbers forces you to make your personal intuition clear to your team and to others at the company.

Opportunity for innovation: KPIs and product metrics represent a goal, but don’t mandate the path to get there. This frees up everyone on the team to think about new ways to move the needle rather than focusing on a prescribed solution.

Alignment on goals: When you set your metrics, you tell your business partners what return they should expect from their investment in your team. If expectations aren’t aligned, you’re able to pivot before it becomes a problem.

Insight for prioritization: You can use your pre-defined KPI to compare the potential impact of a handful of projects you’re considering working on.

Proof of success: It’s much easier to communicate your impact on the organization when your goals are quantifiable.

Indication of failure: By monitoring progress toward your goals, it’s easy to correct course or sunset a product when your efforts aren’t as effective as intended.
Notes on this framework & further reading

This is a work in progress! I’ve added to and generalized this framework over the past few years as I’ve worked with different types of teams (based on their domains, operating style, degree of data literacy, etc.), and will keep doing so in the future. There’s a lot I don’t touch on in this article, like leading & lagging indicators, reliability, defining targets for your metrics, balancing metrics, and more.

For further reading, I like:

Finding the metrics that matter for your product by the Product & Data team at Intercom
How to grow product with KPIs and How to prioritize work with KPIs by Ilya Leyrikh
A framework to define your product metrics by Zandre Coetzer
10 tips on how to choose the right key performance indicators by Roman Pichler

...

In defense of “nothing interesting”

A tribute to useful, but less interesting research findings

A few years ago, as a Data Scientist, I was presenting an analysis I’d been working on to co-workers. The presentation went fine and the work was well-received, but I could tell the group was a little underwhelmed. Towards the end of the presentation, one co-worker asked, “Did you find anything that surprised you? Anything we didn’t already know?” I had uncovered some new information, but most of what I’d found was well-aligned with what we already thought to be true. Still, I understood their sentiment.

Any Data Scientist or Researcher will tell you that the most common thing we find when analyzing a dataset is… nothing interesting. It happens constantly. Many of our findings corroborate what we and our business partners already thought to be true, even when we’ve asked the right question. This can be frustrating for Data Scientists, Researchers, and our partners, but finding “nothing interesting” is very different from finding “nothing useful,” and I’m a strong believer that finding nothing interesting after asking the right question is still worthy of celebration.

The Utility-Interest plane

Before diving in, I want to emphasize the difference between a result being interesting and a result being useful. All analytical results (and the questions that spawned them) will fall somewhere in the Utility-Interest plane.

A. Useful results that are also interesting are the holy grail. Findings from these analyses drum up tons of excitement with stakeholders and have the potential to create a huge impact.

B. Useful results that aren’t too interesting are less exciting, but are equally valuable! These are the only types of uninteresting results that are still defensible (the main topic of this post!). Useful, yet uninteresting results often arise when evaluating a hypothesis that everyone had assumed to be true, or when tackling a question that’d been answered through other methods in the past.

C. Useless results that aren’t very interesting are just a poor use of time.
These come from asking the wrong question, and one to which everyone already knew the answer. These won’t gain much traction with stakeholders, and the primary downside is just wasting your own time.

D. Useless, but interesting results are dangerous. Very dangerous! Useless, yet interesting results arise when finding an exciting answer to the wrong question. Stakeholders can latch onto these findings and invest their own time into addressing a topic that should be lower priority.

By finding “nothing interesting” in the data (i.e., a result in quadrant B) and presenting it to your stakeholders, you’re able to make decisions with more confidence, ask meaningful follow-up questions, and increase stakeholders’ trust in using data in the future.

Knowing when your intuition is right is just as important as knowing when it’s wrong

Asking a question of the data means you’re unsure about something: maybe a course of action to take, the reason behind something happening, or something else. Exploring a dataset and finding no surprises just means that, in this case, your intuition wasn’t too far off.

Even when you and your business partners have some intuition about a problem area, evaluating your hypotheses with data will let you know, without a doubt, if your hypotheses were true. Knowing when you’re right is just as important as knowing when you’re not, and by evaluating your hypotheses you’ve learned to either maintain or change course.

Asking meaningful follow-ups

Assuming you asked a worthwhile question of the data, finding “nothing interesting” will help inform what questions you should ask in the future. Any useful finding, whether interesting or not, gives you more confidence in the problem area and refines your area of focus, helping you to ask better questions going forward.
Reinforcing confidence in data

Findings that contradict our intuition can be hard to accept, especially when the findings tell us that not only was our intuition wrong, but that our actions or plans were too. By finding and presenting “nothing interesting,” you help build trust between your stakeholders and the data, making it easier for them to accept information from you in the future, especially when it’s counter to some of their beliefs.

What to do now

Ask the right questions of your data. Of course, the points above only hold true if you’ve asked the right question in the first place. Poor questions can sometimes lead to interesting answers, but the usefulness of these answers will be limited. My favorite way to refine a research question is to brainstorm with a cross-functional group of stakeholders (plus with this approach, you get stakeholder buy-in at the same time).

Celebrate “nothing interesting.” A finding doesn’t have to be interesting in order to be useful. Next time you find “nothing interesting,” remember to celebrate it.

Further reading

For resources on asking good questions, I really like Asking Great Questions as a Data Scientist by Kristen Kehrer, and How to solve a business problem using data by Laura Ellis. (Please let me know if you have any others!)

...

Moving beyond the Net Promoter Score

A guide to building a more meaningful metric

The Net Promoter Score is a widely-used survey question that companies use to measure customer satisfaction, loyalty, and growth. Proponents of NPS are drawn to it because it’s a single number that appears (on the surface, at least) to be linked to some significant indicators of performance.

NPS is a bad measure of success, though. It uses a poorly phrased question, a response scale that’s entirely too big, and an absurd method of calculation. There are other metrics you can use that will be more accurate, more interpretable, and much more predictive of satisfaction, loyalty, or growth.

Background

The standard Net Promoter Score (NPS) question asks, “How likely is it that you would recommend [company X] to a friend or colleague?” Respondents are given a scale ranging from 0–10, with 0 labeled with “Not at all likely,” and 10 labeled with “Extremely likely.”

Under the NPS methodology, respondents who submit a 9 or 10 are considered “promoters,” 7 or 8 are considered “passives,” and 0–6 are considered “detractors.” The Net Promoter Score for a group of respondents is defined as the percentage of respondents who are promoters minus the percentage of respondents who are detractors.

NPS = (# of promoters - # of detractors) / (total # of respondents)

Origin

NPS was originally proposed in a December 2003 Harvard Business Review (HBR) article by Fred Reichheld, a director at the Bain & Company management consultancy. Reichheld proposed the score as a “loyalty” metric, which he defines as the “willingness of someone… to make an investment or personal sacrifice in order to strengthen a relationship.”

Reichheld tested eight survey questions among 4,000 consumers, and tracked these consumers’ future purchases and referrals. He then measured the link between the survey responses and actual purchase and referral behaviors.
The NPS question (“How likely is it that you would recommend [company X] to a friend or colleague?”) was the first- or second-most predictive question in 11 of Reichheld’s 14 case studies, showing “the strongest statistical correlation with repeat purchases or referrals.”

Reichheld chose a 0-to-10 scale, where 10 meant “extremely likely” to recommend and 0 meant “not at all likely.” He claimed that this scale was “simple and unambiguous,” “divide[s] customers into practical groups deserving different… organizational responses,” was “intuitive to customers when they assign grades,” and was intuitive “to employees and partners responsible for interpreting the results and taking action.” I disagree with all of these. More on that below.

Reichheld then grouped the 11-point scale into three clusters: promoters, passives, and detractors. In his analysis, Reichheld found a strong correlation between companies’ net-promoter figures and their revenue growth rates.

I highly recommend reading the original article with a critical eye: It’s full of anecdotes and correlations that Reichheld frames as causal relationships. Every few paragraphs he attempts to argue that NPS is superior to some alternative measurement in terms of assessing loyalty, growth, etc., but fails to sufficiently justify his claims.

Why do companies use NPS?

NPS is popular. I mean very popular. Tons of companies ask customers the NPS question, and many use it to measure and assess their performance.

Proponents of NPS are drawn to it because it’s a single number that appears (on the surface, at least) to be linked to some significant KPIs. It’s also easy to measure and produces a statistic that changes easily over time. If you’re trying to get your organization to be more data-driven, then NPS is certainly better than nothing.

Why you shouldn’t use NPS

Now that we’ve gotten that out of the way, it’s time to discuss the many, many shortcomings of NPS.
The phrasing of the NPS question, the measurement scale it uses, and the method of calculation all go against the basic principles of survey science.

Question

The NPS question asks a respondent to rate the likelihood of a hypothetical future, but strong, reliable survey questions ask respondents about their past behaviors, which tend to be much more predictive than forward-looking hypotheticals. “Do you plan to begin a diet in the next 6 weeks?”, for example, is a very different question from “Did you begin a diet in the last 6 weeks?” The NPS question forces the respondent to predict an ideal, future self, as opposed to reporting on their actualized behaviors.

The HBR article also claims that the NPS question measures loyalty and growth. In reality, though, it fails to ask about either, and what it does ask about isn’t necessarily what users of NPS are attempting to measure. In many cases, survey questions should be phrased to directly measure the quantity of interest. An NPS-like question asking about actualized behavior would look more like, “In the last 6 weeks, have you referred [company X] to a friend or colleague?”

Reichheld promotes the NPS question as the most accurate one in predicting revenue growth rate. Proponents of NPS often fail to realize, however, that the NPS question was the most accurate from a set of 8 poorly phrased options in Reichheld’s study. In Reichheld’s findings, the NPS question wasn’t even the most accurate predictor in all industries: In database software and computer systems, for example, other questions were stronger predictors of revenue growth rate.

Scale

Responses collected from a large, 11-point scale are extremely noisy, and meaningful changes in ratings are hard to detect. On the NPS scale, the difference between a “6” and a “7” isn’t clear in the survey analysis, and the lack of labels on the intermediate (i.e., non-extreme) choices also makes the distinction very subjective to respondents.
The NPS scale is poorly calibrated, and so are the responses. A better scale would use a 3-option Yes/Maybe/No system or a similar scale with 5 options. For any survey question, the response scale and number of options should be crafted to the individual question, and an 11-option scale is likely always too big.

Method of calculation

The method of calculation is one of the stranger aspects of the NPS. The bucketing methodology that groups respondents into Promoters, Passives, and Detractors ends up hiding some improvements and exaggerating others.

Even if respondents were able to make meaningful distinctions between a “5” and a “6”, or between a “4” and a “5”, the bucketing method categorizes all of these into the “Detractor” group, and these changes aren’t reflected in the score. An extreme example is when a company with all “0” ratings improves to having all “6” ratings: this is a huge improvement, but the NPS methodology makes it so the score doesn’t change at all.

Some changes are also exaggerated by the method of calculation: the distinction between a “6” (detractor) and a “7” (passive), or between an “8” (passive) and a “9” (promoter), is exaggerated by the bucketing methodology.

The method of calculation (subtracting the percentage of detractor respondents from the percentage of promoter respondents) also produces a metric that’s difficult to interpret and hides important information. All three of the following response sets, for example, produce an NPS of +60:

[Figure: three response sets that each produce an NPS of +60]

Finding an alternative measurement

NPS alternatives

The most commonly used alternatives to NPS entail a rephrasing of the question and usage of a smaller scale. If you’re truly trying to measure growth or word-of-mouth promotion, I highly recommend Netflix’s retrospective phrasing of an NPS-style question. In its early days, Netflix asked subscribers, “In the last 6 weeks, did you recommend us to a friend or family member?” and gave respondents only a Yes/No scale to respond.
Netflix also paired this with another question: they asked new subscribers, “Were you recommended to us by a friend or family member?”

Other companies, like Vox Media’s Polygon, use a similar phrasing of the question with binary response options. These are both significant improvements over the standard NPS question and scale.

YouTube has a version that deviates less from the standard NPS question and scale, but still makes some significant improvements: They keep the standard NPS question, but instead use a smaller, labeled scale.

Some alternatives that are even better

In any survey, the quantity you’re trying to measure should dictate both the question you ask and the scale you use. The questions and scales you design to measure growth, loyalty, satisfaction, etc. should each be customized for a given use case. A few examples for different measurements are outlined below.

Growth: In the last 3 months, have you recommended [company X] to a friend, colleague, or family member? [Yes/No]
Loyalty: In the last 6 weeks, have you considered [canceling your subscription, switching to another provider, etc.]? [Yes/No]
Satisfaction: How satisfied are you with [company X]? [1 (Very dissatisfied), 2 (Dissatisfied), 3 (Neither), 4 (Satisfied), 5 (Very satisfied)]

What to do now

Identify what you’re trying to measure and write an appropriate question. If you want to measure word-of-mouth promotion or estimate future growth, then the “Growth” question above would be a good start. Measuring customer loyalty or customer satisfaction requires entirely different questions, so make sure to ask about the thing you’re trying to measure.

Pick a reasonable scale for your question. 3- or 5-option satisfaction scales and a Yes/No binary scale all capture accurate information and are easy for respondents to select from in a response.

Use a simple, logical method of calculation.
If your response scale has 5 or fewer options, then it’s easy enough to report on the entire response distribution. If you need to have a single number to use in further analysis, then you might like a top-box or top-two-box percentage. You could also use the average of the numeric-encoded values, although you lose some information this way. Whatever you do, pick something simpler and more interpretable than NPS.

Additional reading and references

Net Promoter Score Considered Harmful by Jared M. Spool
On Surveys by Erika Hall
Measuring the WeWork Member Experience by Tomer Sharon

...
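To make the calculation issues from this post concrete, here is a minimal Python sketch. The function names (`nps`, `top_two_box`) and all of the response sets are made up for illustration and are not taken from the original figure or from any library. I’m also assuming the score is reported on a -100 to +100 scale (the formula in the Background section multiplied by 100), to match the “+60” figures above.

```python
from typing import Sequence


def nps(ratings: Sequence[int]) -> float:
    """Net Promoter Score on a -100..+100 scale, from 0-10 ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # 0 through 6
    # Passives (7 or 8) only count toward the denominator.
    return 100 * (promoters - detractors) / len(ratings)


def top_two_box(ratings: Sequence[int], scale_max: int = 5) -> float:
    """Percentage of responses in the top two options of a 1..scale_max scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    top = sum(1 for r in ratings if r >= scale_max - 1)
    return 100 * top / len(ratings)


# Hidden improvement: moving every rating from 0 to 6 leaves NPS unchanged.
all_zeros = [0] * 100
all_sixes = [6] * 100

# Hypothetical response sets with very different shapes but the same NPS.
set_a = [10] * 60 + [8] * 40              # 60% promoters, 0% detractors
set_b = [9] * 80 + [0] * 20               # 80% promoters, 20% detractors
set_c = [10] * 70 + [7] * 20 + [3] * 10   # 70% promoters, 10% detractors

# Hypothetical answers to a 5-point satisfaction question.
satisfaction = [5, 4, 4, 3, 2, 1, 5, 4, 3, 5]

if __name__ == "__main__":
    print(nps(all_zeros), nps(all_sixes))        # -100.0 -100.0
    print(nps(set_a), nps(set_b), nps(set_c))    # 60.0 60.0 60.0
    print(top_two_box(satisfaction))             # 60.0
```

The top-two-box number is not a better statistic in any deep sense. It is just easier to read: it answers “what share of respondents chose Satisfied or Very satisfied?” directly, and with five or fewer options you can always report the full distribution alongside it.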