Vicki Boykis veekaybee

@veekaybee
veekaybee / normcore-llm.md
Last active April 30, 2024 11:43
Normcore LLM Reads

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought.

Foundational Concepts


Pre-Transformer Models

how to properly select from DuckDB

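-- Join each book to its reviews and to its first listed author.
-- The author id is extracted per row from the book's own authors field,
-- rather than from an uncorrelated subquery over the whole table.
SELECT review_text, title, description, goodreads.average_rating, goodreads_authors.name
FROM goodreads
JOIN goodreads_reviews
ON goodreads.book_id = goodreads_reviews.book_id
JOIN goodreads_authors
-- Assumes author_id is stored as text; add a CAST if it is numeric.
ON goodreads_authors.author_id = REGEXP_EXTRACT(goodreads.authors, '[0-9]+')
LIMIT 10;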

See synthesized write-up here

  • Do a quick performance check in 60 seconds
  • Use a number of different tools available in unix
  • Use flamegraphs of the callstack if you have access to them
  • The best performance wins come from eliminating unnecessary work, for example a thread stack in a loop, or eliminating bad config
  • Mantras: don't do it (eliminate); do it, but don't do it again (caching); do it less (polling); do it when they're not looking; do it concurrently; do it more cheaply
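As a small illustration of the caching mantra above, here is a minimal sketch using Python's standard `functools.lru_cache`. The function `slow_square` is a hypothetical stand-in for any expensive computation; it is not from the original talk or write-up.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    """Hypothetical stand-in for an expensive computation."""
    return n * n


def total_of_squares(values: list[int]) -> int:
    """Sum the squares; repeated inputs are served from the cache."""
    return sum(slow_square(v) for v in values)
```

Calling `total_of_squares([3, 3, 4, 3])` only runs the function body for the two distinct inputs; `slow_square.cache_info()` reports the hits and misses.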

Information retrieval is the practice of asking questions about large collections of documents.

  • It became especially popular for discovery in lawsuits
  • or, at AWS, for guiding you to the relevant products
  • One of the first recommenders was GroupLens, built for Usenet news

Collaborative Filtering: Involves running Ratings and Correlations through a CF engine.

  • The goal is to find a neighborhood of users
  • Recommendation Interfaces: Suggestion, top n
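The ratings, correlations, neighborhood, and top-n pieces above can be sketched roughly as follows. This is a toy user-based collaborative filter under stated assumptions: the `RATINGS` data, the Pearson similarity, and the weighted-average scoring are illustrative choices, not the specific engine the notes refer to.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
RATINGS = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 3, "d": 5},
    "cat": {"a": 1, "b": 5, "c": 2, "d": 1},
}


def pearson(a: dict, b: dict) -> float:
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * sqrt(
        sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0


def top_n(user: str, ratings: dict, k: int = 2, n: int = 3) -> list:
    """Recommend up to n unseen items from the k most similar users."""
    sims = [
        (pearson(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user
    ]
    # The neighborhood: the k most positively correlated users.
    neighborhood = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    scores, weights = {}, {}
    for sim, other in neighborhood:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Rank unseen items by similarity-weighted average rating.
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:n]
```

Here `ann` and `bob` rate their shared items in the same pattern, so `bob` lands in `ann`'s neighborhood and his extra item becomes her suggestion, while `cat` correlates negatively with both and gets no neighbors.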

Isolation forests versus decision trees

Isolation forest paper

  • Isolated (anomalous) points should have shorter paths, i.e. end up closer to the root of the tree
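To make the "closer to the root" intuition concrete, here is a simplified one-dimensional sketch. It is not the full algorithm from the paper (there is no subsampling and no path-length normalization); the helper names and the toy data are assumptions made for illustration.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Follow only the side of the split that contains the point.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: a tight cluster plus one obvious outlier.
DATA = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 50.0]
```

Comparing `average_path_length(50.0, DATA)` with `average_path_length(10.0, DATA)` shows the outlier being cut off after far fewer splits than a point inside the cluster.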

This book is all about patterns for doing ML. It's broken up into several key parts, building and serving. Both of these are intertwined, so it makes sense to read through the whole thing; there is a lot of good advice from seasoned professionals. The parts you can safely ignore are the ones that specifically use GCP. The other issue with the book is that it's very heavily focused on deep learning cases, and not all modeling problems require these. Regardless, let's dive in. I've included the stuff that was relevant to me in the notes.

Most Interesting Bullets:

  • Machine learning models are not deterministic, so building software around them calls for a number of techniques: setting random seeds during training, preferring stateless functions, freezing layers, checkpointing, and generally making sure that flows are as reproducible as possible.
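A minimal sketch of the random-seed idea, using only the standard library. `train_toy_model` is a hypothetical stand-in for a training step (it only draws initial weights); real frameworks have their own seeding calls, which are not shown here.

```python
import random


def train_toy_model(seed: int, n_weights: int = 3) -> list[float]:
    """Hypothetical 'training' step: random weight initialization only."""
    # A local, seeded generator avoids hidden global state, which also
    # keeps the function stateless from the caller's point of view.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
```

Two runs with the same seed produce identical weights, while a different seed produces a different starting point.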

Screen Shot 2023-02-01 at 12 05 27 PM

Algorithms find the best ways to do things, but they don't explain "how" they came to those conclusions.

Screen Shot 2023-02-01 at 12 07 02 PM

This is a common way to formulate ML problems, using target functions that we don't know but we want to approximate and learn.
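As a rough sketch of this formulation: we only observe samples of some unknown target function and pick an approximation that fits them. The linear hypothesis, the least-squares fit, and the sample data below are assumptions chosen for illustration, not taken from the book.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of g(x) = slope * x + intercept to observed samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of an "unknown" target (here secretly f(x) = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The learner never sees the target function itself, only `xs` and `ys`; the fitted `slope` and `intercept` are its approximation.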

veekaybee / largestreams.md
Last active August 9, 2023 01:34
Counting cumulative elements in large streams


An interview problem that I've gotten fairly often is, "Given a stream of elements, how do you get the median, or average, or sum of the elements in the stream?"

I've thought about this problem a lot and my naive implementation was to put the elements in a hashmap (dictionary) and then pass over the hashmap with whatever other function you need.

For example,

import typing
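A minimal sketch of the naive hashmap approach described above, assuming integer elements and a finite stream. The function name `summarize_stream` is hypothetical, and this is not the author's original code.

```python
from collections import Counter
from typing import Iterable


def summarize_stream(stream: Iterable[int]) -> dict:
    """Count elements in a hashmap, then derive sum, average, and median."""
    counts = Counter()
    for element in stream:
        counts[element] += 1

    n = sum(counts.values())
    if n == 0:
        raise ValueError("stream is empty")
    total = sum(value * count for value, count in counts.items())

    def value_at(rank: int) -> int:
        """Value at a 0-based position in sorted order, found via the counts."""
        seen = 0
        for value in sorted(counts):
            seen += counts[value]
            if rank < seen:
                return value
        raise IndexError(rank)

    # For even n the median averages the two middle values.
    median = (value_at((n - 1) // 2) + value_at(n // 2)) / 2
    return {"sum": total, "average": total / n, "median": median}
```

Memory grows with the number of distinct elements rather than the length of the stream, which is why this works for streams with many repeats but not for streams of mostly unique values.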
veekaybee / README.md
Last active January 7, 2024 18:58
whisper.ipynb

Using Whisper to transcribe audio

This episode of Recsperts was transcribed with Whisper from OpenAI, an open-source neural net trained on roughly 680,000 hours of audio. The model uses an encoder-decoder architecture: audio is split into 30-second chunks, converted to a log-Mel spectrogram, and passed into an encoder. A decoder is trained to predict the text caption corresponding to the encoder output, and the model supports transcription, timestamp-aligned transcription, and multilingual translation.
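The 30-second chunking step can be sketched in plain Python. This is a simplified illustration, not Whisper's own code: it assumes audio already resampled to 16 kHz and represented as a list of floats, and the function name `split_into_chunks` is hypothetical. The real pipeline is more involved.

```python
SAMPLE_RATE = 16_000  # assumes audio already resampled to 16 kHz
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples: list[float]) -> list[list[float]]:
    """Split audio into fixed 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks
```

A 45-second clip, for example, yields two chunks: one full, and one half audio and half padding.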

Screen Shot 2023-01-29 at 11 09 57 PM

The transcription process outputs a single string file, so it's up to the end-user to parse out individual speakers, or run the model [through a sec