Conversations between computers
SIRKULATOR overall plan
1 metadata import/export roughly working, incl. OAI provider
2 job scheduler/runner, with some examples: harvesting snl and wikidata descriptions for persons, reindexing, oai harvesting
3 items and circulation, incl borrowers/users/staff/bot agents
4 public facing website for catalog, search and explore, but no ordering yet
5 ncip/sip2 support
6 stresstest/integrationtest
* rename internal and external descriptions - descriptions? texts about? (+count)
* merge wikidata, wikipedia, snl, isbnforlag into a unified Publisher struct for import: do one letter every now and then! approx. 200 publishers?
* prevent jobs from locking tables (ex: try harvest_nb_links and update_snl_descriptions at the same time as a metadata import)
* Avoid both has_contributor with publisher (Forlag/utgiver) role and published_by relation, ex: 978-82-691039-2-2
* Avoid double main_entry+contributor, ex: 82-05-17893-3
* "Ok, saved" is not shown in the right language.
* internal and external descriptions (count)
* publisher publications (count) + pagination
* Resource Events, use for publishers (founded, merged, acquired, closed down etc)
* person: convert to character (ex apollon, in 978-82-419-1999-2)
* search and connect component - try with a web component
* hx-extension: hx-err-target + hx-swap-on-err (400/500 statuses)
* jobruns: filter select: on [name] [status]
* jobruns: pagination?
* hx-extension: run arbitrary js function after hx-swap
* oai harvester: archived_at not set, deleted upsert working?
* index tags (dewey terms, publication audience/genre etc)
* bluge analyzer tokenfilter strip '--' from dewey labels
* janitor job: if relation published_by.label not found in publisher.nameVariants - add it.
* save corporation
* save publication
* translated title: https://bibsys.alma.exlibrisgroup.com/view/sru/47BIBSYS_NETWORK?version=1.2&operation=searchRetrieve&recordSchema=marcxml&query=alma.isbn=9788203264450
* 653 subjects
* vocab/gender/gender.go -> func Options(lang) [2]string
* person and corporation page are the same, apart from PersonForm vs CorporationForm, and ResourceType of course
=> agent_page?
Next:
* delete identifier, add identifier
undo?
CREATE TABLE undo (
id INTEGER PRIMARY KEY,
q TEXT NOT NULL, -- ex: "INSERT OR IGNORE INTO link (from_id, type, id) VALUES ('a','b','c')"
at INTEGER NOT NULL
)
* Search and connect "published by"
Not sortable if length of table is 1
Connect resource dialog:
https://blog.benoitblanchon.fr/django-htmx-modal-form/
CREATE TABLE resource_event (
resource_id TEXT,
at INTEGER NOT NULL, -- timestamp, year, date,
type TEXT NOT NULL, -- what happened: prize won, birthdate etc
data JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (resource_id, at)
)
type ResourceEvent struct {
ResourceID string
At time.Time
Type string // merged_with, replaced_by, prize_nominee, prize_win
Data map[string]interface{}
}
type Resource struct {
Timeline []ResourceEvent
}
relations: merged_with (another publisher)
NEXT
* prevent double author relation to a publication. drop main-entry?
ex: https://bibsys.alma.exlibrisgroup.com/view/sru/47BIBSYS_NETWORK?version=1.2&operation=searchRetrieve&recordSchema=marcxml&query=alma.isbn=9788202722319
* search pagination
* store indexed_at timestamp
* parse identifiers (ean/isbn/issn)
* etl: persons without ID, create review? ex ISBN 8205267995
* import: bug with "already in catalog" when there are several ISBNs, ex ISBN 9788380043220 ISBN 9788380043466
* parse 245c to resource/relation
want = []sirkulator.Relation{
	{
		Type: "contributes_to",
		Data: map[string]interface{}{"role": "aut", "name": "Torbjørn Egner"},
	},
	{
		Type: "contributes_to",
		Data: map[string]interface{}{"role": "trl", "name": "Lars Fiske", "note": "Nynorsk"},
	},
}
et al. = m.fl.
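For the "parse identifiers" item above, a minimal sketch of normalizing ISBNs so that hyphenated and ISBN-10 forms compare equal to their ISBN-13 form (which might also help with the "already in catalog" check). NormalizeISBN and isbn13CheckDigit are hypothetical names, not taken from the project; the check-digit rules are the standard ISBN-10 and ISBN-13 ones.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isbn13CheckDigit computes the check digit for the first 12 digits of an ISBN-13.
func isbn13CheckDigit(first12 string) byte {
	sum := 0
	for i := 0; i < 12; i++ {
		d := int(first12[i] - '0')
		if i%2 == 1 {
			d *= 3 // every second digit has weight 3
		}
		sum += d
	}
	return byte('0' + (10-sum%10)%10)
}

// NormalizeISBN (hypothetical helper) strips hyphens and spaces, validates the
// check digit, and returns the ISBN-13 form of an ISBN-10 or ISBN-13.
func NormalizeISBN(s string) (string, error) {
	s = strings.ToUpper(strings.NewReplacer("-", "", " ", "").Replace(s))
	switch len(s) {
	case 10:
		sum := 0
		for i := 0; i < 10; i++ {
			var d int
			switch {
			case s[i] >= '0' && s[i] <= '9':
				d = int(s[i] - '0')
			case s[i] == 'X' && i == 9:
				d = 10 // X is only valid as the check digit
			default:
				return "", errors.New("isbn: invalid character")
			}
			sum += d * (10 - i)
		}
		if sum%11 != 0 {
			return "", errors.New("isbn: invalid ISBN-10 check digit")
		}
		first12 := "978" + s[:9]
		return first12 + string(isbn13CheckDigit(first12)), nil
	case 13:
		for i := 0; i < 13; i++ {
			if s[i] < '0' || s[i] > '9' {
				return "", errors.New("isbn: invalid character")
			}
		}
		if isbn13CheckDigit(s[:12]) != s[12] {
			return "", errors.New("isbn: invalid ISBN-13 check digit")
		}
		return s, nil
	}
	return "", errors.New("isbn: wrong length")
}

func main() {
	for _, in := range []string{"82-05-17893-3", "978-82-691039-2-2", "8205267995"} {
		out, err := NormalizeISBN(in)
		fmt.Println(in, "->", out, err)
	}
}
```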
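import (
	"errors"
	"io"
	"net/http"
)

// DownloadImage tries to download an image from the given urls,
// stopping as soon as one image is successfully fetched. It returns
// the image along with the url it was fetched from.
// * urls is assumed to be sorted according to priority.
func DownloadImage(urls []string) ([]byte, string, error) {
	for _, url := range urls {
		resp, err := http.Get(url)
		if err != nil {
			continue // try the next url
		}
		b, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil || resp.StatusCode != http.StatusOK {
			continue
		}
		if http.DetectContentType(b) != "image/jpeg" {
			continue // not a JPEG
		}
		return b, url, nil
	}
	return nil, "", errors.New("no image could be downloaded")
}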
package http
func Download(url string) ([]byte, error)
func DownloadTo(url string, w io.Writer) error
package etl
type WebTarget struct{
Name string
URL string
Ingestion func(body io.ReadCloser) (Ingestion, error)
}
type SPARQLTarget struct {
Name string
URL string
Query string
Ingestion func(graph rdf.Graph) (Ingestion, error)
}
{
Name: "britishlibrary",
URL: "https://bnb.data.bl.uk/sparql",
Query: `
PREFIX bibo: <http://purl.org/ontology/bibo/>
DESCRIBE ?book WHERE { ?book bibo:isbn13 "978-0-593-13677-5" . }
`,
Ingestion: func
}
var httpLookupsByID = map[string][]Target{
	"isbn": {
		{
			Name: "bibsys/sru",
			URL:  "https://bibsys.alma.exlibrisgroup.com/view/sru/47BIBSYS_NETWORK?version=1.2&operation=searchRetrieve&recordSchema=marcxml&query=alma.isbn=8273504166",
		},
		{
			Name: "open_library",
			URL:  "https://openlibrary.org/isbn/%s.json",
		},
		{
			Name: "worldcat",
			// assuming %s is filled in with fmt.Sprintf, the literal % in %3A must be escaped as %%
			URL: "https://www.worldcat.org/search?q=isbn%%3A%s",
		},
		{
			Name: "gcd",
			URL:  "https://www.comics.org/isbn/%s/",
		},
	},
	"issn": nil, // TODO
	"viaf": nil, // TODO
}
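A minimal sketch of how the lookup URLs above could be expanded for a given identifier, assuming the %s placeholder is meant to be filled in with fmt.Sprintf. LookupURL and the trimmed-down Target type are hypothetical. With Sprintf, a literal percent sign in a template (as in the worldcat URL) has to be written as %%.

```go
package main

import (
	"fmt"
	"net/url"
)

// Target mirrors the lookup entries in the notes (only the fields used here).
type Target struct {
	Name string
	URL  string // template with a single %s placeholder for the identifier
}

// LookupURL (hypothetical helper) fills the identifier into the target's URL template.
// QueryEscape leaves plain digits untouched and escapes anything unexpected.
func LookupURL(t Target, id string) string {
	return fmt.Sprintf(t.URL, url.QueryEscape(id))
}

func main() {
	targets := []Target{
		{Name: "open_library", URL: "https://openlibrary.org/isbn/%s.json"},
		// %% yields a literal percent sign, so %%3A becomes the encoded colon %3A.
		{Name: "worldcat", URL: "https://www.worldcat.org/search?q=isbn%%3A%s"},
	}
	for _, t := range targets {
		fmt.Println(t.Name, LookupURL(t, "9788203264450"))
	}
}
```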
LATER
- update existing by isbn? go fetch data and see what's new/different
- enrich resource jobs: getting description from wikidata, snl
green-background: #ccffd8
green-background-em: #abf2bc
red-background: #ffebe9
red-background-em: #fe8282
https://bibsys.alma.exlibrisgroup.com/view/oai/47BIBSYS_NETWORK/request?verb=GetRecord&identifier=oai:urm_publish:999921380896302201&metadataPrefix=marc21
https://www.niso.org/schemas/ncip/v2_02/ncip_v2_02.xsd
SELECT 'R' || lower(hex(randomblob(4))) || strftime('%s','now');
https://git.sr.ht/~mariusor/wrapper/tree/master/item/examples/main.go
https://github.com/thanos-io/thanos/blob/main/docs/contributing/coding-style-guide.md
Deliberate choices and priorities:
the expert system (intra) is
* not mobile-friendly, requires a regular large screen
* supports only modern browsers (tested in chrome and firefox)
the frontend is:
* mobile-friendly
* supports "all" browsers
GO HYGIENE FACTORS
https://dave.cheney.net/practical-go/presentations/gophercon-israel.html
EMAIL
=====
https://explained-from-first-principles.com/email/
DEB PACKAGING / SYSTEMD
======================
https://blog.knoldus.com/create-a-debian-package-using-dpkg-deb-tool/
https://mgdm.net/weblog/systemd/
https://old.reddit.com/r/golang/comments/rcebag/zero_downtime_restarts_and_deploys_using_systemd/
Backup
======
https://news.ycombinator.com/item?id=29209455
https://archive.md/1jHmP#selection-381.0-385.28
METRICS
========
Measuring the right and important things?
* Surrogate variable that is measured: the thing we measure because it is measurable
* Variable of true or greater interest: the thing we actually want to know about
* Measurement technique of surrogate variable: whether we have the ability to get the actual direct value, or whether it is rather inferred from other observations
* Artefactual influences: what are the things that can mess up the data in measuring it
* Certainty of "Normal Range": how sure are we that the value we read is representative of what we care about?
The actual value is not in the metric nor the alert, but in the reaction that follows. They're a great trigger point for more meaningful things to happen, and maintaining that meaningfulness should be the priority.
https://ferd.ca/plato-s-dashboards.html
DATA SOURCES
=============
https://data.norge.no/datasets/cdbe6acc-573f-48bc-9808-46bf538fcf30
https://bibliotekutvikling.no/kunnskapsorganisering/
https://www.oclc.org/content/dam/research/publications/2020/oclcresearch-transitioning-next-generation-metadata-a4.pdf
https://ns.editeur.org/thema/nb
https://www.editeur.org/files/Thema/1.4/Thema_v1.4_nb/Thema_v1.4.2_nb.html
Music:
Nordic literature
http://runeberg.org/authors/
http://runeberg.org/search.pl?born=1949
Comics:
http://www.minetegneserier.no/pls/htmldb/f?p=100:3:22511890183313::NO::P3_SERIER_ID:1339&cs=11AF7B0B0CF11D08FBA345BA65068C848
https://www.comics.org/publisher/1609/
https://beta.comics.org/series/49089/covers/
History
https://www.norgeshistorie.no/
Localization, languages Norwegian+English
https://phrase.com/blog/posts/internationalization-i18n-go/
https://angelika.me/2021/11/23/7-gettext-lessons-after-2-years/
DEV env live reload
https://news.ycombinator.com/item?id=28015798
NB LIST OF INTEGRATIONS FOR NORWEGIAN LIBRARIES
https://bibliotekutvikling.no/kunnskapsorganisering/krav-til-biblioteksystem-integrering-mot-nasjonale-tjenester/
OPERATIONS HOSTING
=============
https://specbranch.com/posts/one-big-server/
https://news.ycombinator.com/item?id=29471986
SECURITY
========
https://github.com/FiloSottile/age
STRIDE threat modeling
https://en.m.wikipedia.org/wiki/STRIDE_(security)
https://lobste.rs/s/onn8vc/simple_things_are_actually_hard_user
Authentication: https://news.ycombinator.com/item?id=29761728
https://www.youtube.com/watch?v=10Qj0eYqbuo
PRIVACY / ANONYMIZATION
=======================
https://www.youtube.com/watch?v=RNykMU7wF7s
Ontology metadata library rdf
========================
https://news.ycombinator.com/item?id=28710081
http://digitalcuration.umaine.edu/resources/shirky_ontology_is_overrated.pdf
https://news.ycombinator.com/item?id=29141800
SQL database modeling
=====================
Never delete rows referenced in other tables, use ON DELETE RESTRICT
All timestamps are stored as UTC
All application log timestamps are also UTC
slug timestamp IDs: https://news.ycombinator.com/item?id=34436625
Recursion:
https://news.ycombinator.com/item?id=28018058
https://nessuent.xyz/posts/2021-07-18_detecting_cycles.html
Store binary data (images) as blobs or in the file system?
https://news.ycombinator.com/item?id=14550060
window functions
https://medium.com/analytics-and-data/the-versatility-of-row-number-one-of-sqls-greatest-functions-53ec78e74096
https://learnsql.com/blog/get-to-know-the-power-of-sql-recursive-queries/
https://www.startdataengineering.com/post/6-concepts-to-clearly-understand-window-functions/
sql pro con
https://news.ycombinator.com/item?id=27791539
actual time vs record time
https://lukeplant.me.uk/blog/posts/life-on-the-diagonal-adventures-in-2d-time/
triggers and the like
https://brandur.org/fragments/code-database-vs-app
SQLITE
https://news.ycombinator.com/item?id=34346411
https://epilys.github.io/bibliothecula/notekeeping.html
https://zeroclarkthirty.com/2022-05-21-json-diffing-with-sqlite
https://lobste.rs/s/ts0vtk/sqlite_is_not_toy_database
https://news.ycombinator.com/item?id=26580614
https://news.ycombinator.com/item?id=26217754
https://antonz.org/sqlite-3-35/
https://news.ycombinator.com/item?id=26103776
https://sqlite.org/forum/info/dfd4739c57e02eea
https://dgl.cx/2020/06/sqlite-json-support
https://www.sqlite.org/sqlanalyze.html
https://old.reddit.com/r/sqlite/comments/ocsahk/db_encryption/
https://jcuenod.github.io/bibletech/2021/07/26/full-text-search-for-pdfs/
https://news.ycombinator.com/item?id=28050198
https://news.ycombinator.com/item?id=28259104
https://news.ycombinator.com/item?id=29727707
In prod
We've been using SQLite as our principal data store for 6 years. Our application services potentially hundreds of simultaneous users at once, each pushing 1-15 megabytes of business state to/from disk 1-2 times per second.
We have not had a single incident involving performance or data integrity issues throughout this time. The trick to this success is as follows:
- Use a single SqliteConnection instance per physical database file and share it responsibly within your application. I have seen some incorrect comments in this thread already regarding the best way to extract performance from SQLite using multiple connections. SQLite (by default for most distributions) is built with serialized mode enabled, so it would be very counterproductive to throw a Parallel.ForEach against one of these.
- Use WAL. Make sure you copy all 3 files if you are grabbing a snapshot of a running system, or moving databases around after an unclean shutdown.
- Batch operations if feasible. Leverage application-level primitives for this. Investigate techniques like LMAX Disruptor and other fancy ring-buffer-like abstractions if you are worried about millions of things per second on a single machine. You can insert many orders of magnitude faster if you have an array of contiguous items you want to put to disk.
- Just snapshot the whole VM if you need a backup. This is dead simple. We've never had a snapshot that wouldn't restore to a perfectly-functional application, and we test it all the time. This is a huge advantage of going all-in with SQLite. One app, one machine, one snapshot, etc...
DATA MODELING
https://minimalmodeling.substack.com/archive?sort=new
https://rtpg.co/2021/06/07/changes-checklist.html
https://news.ycombinator.com/item?id=27482243
https://lobste.rs/s/x0fk0a/simple_graph_graph_database_sqlite
https://johnnydecimal.com/
"status"-felt bør kunne representeres med en finite-state-machine, ellers så er det bedre å modellere med flere flags
Tags, keywords
https://news.ycombinator.com/item?id=33248391
https://twitter-thread.com/t/1534301374166474752
SEARCH INDEXING
https://spinscale.de/posts/2020-10-20-search-engines-and-libraries-overview.html
https://news.ycombinator.com/item?id=28187675
https://scribe.rip/p/what-every-software-engineer-should-know-about-search-27d1df99f80d
SCHEDULING / CRON
https://trstringer.com/systemd-timer-vs-cronjob/
https://wiki.archlinux.org/title/Systemd/Timers
https://github.com/alash3al/exeq/blob/main/internals/queue/job.go
https://www.fullstory.com/blog/why-errgroup-withcontext-in-golang-server-handlers/
PROGRAM STRUCTURE
http helpers: handler returning error STEAL THIS!
https://vladimir.varank.in/notes/2021/03/little-things-of-go-http-handlers/
discussion: https://old.reddit.com/r/golang/comments/mhf04c/little_things_of_go_http_handlers/
https://eli.thegreenplace.net/2021/a-comprehensive-guide-to-go-generate/
https://www.simplethread.com/20-things-ive-learned-in-my-20-years-as-a-software-engineer/
https://www.gobeyond.dev/
https://goatspeed.substack.com/p/putting-context-into-context
https://www.ardanlabs.com/blog/2019/09/context-package-semantics-in-go.html
https://millhouse.dev/posts/graceful-shutdowns-in-golang-with-signal-notify-context
https://www.youtube.com/watch?v=ZdXDjYsH83M&list=PLtoVuM73AmsIQv2wba8Hpl424XmWQZu5E&index=33
https://developer20.com/http-connection-livetime/
https://www.joeshaw.org/error-handling-in-go-http-applications/
https://ketansingh.me/posts/golang-x-sync/
https://lobste.rs/s/vzdoor/cult_go_test
https://pkg.go.dev/gotest.tools/v3/assert
https://freshman.tech/linting-golang/
https://blog.carlmjohnson.net/post/2021/how-to-use-go-embed/
https://github.com/benbjohnson/hashfs
CONFIG: https://bitfieldconsulting.com/golang/cuelang-exciting
ERRORS:
https://peter.bourgon.org/blog/2019/09/11/programming-with-errors.html
https://blog.carlmjohnson.net/post/2020/working-with-errors-as/
https://github.com/valyala/quicktemplate
or
https://github.com/benbjohnson/ego
SECURITY
https://news.ycombinator.com/item?id=30514560
https://purelymail.com/docs/security
https://news.ycombinator.com/item?id=26851037
https://mvsp.dev/mvsp.en/index.html
https://news.ycombinator.com/item?id=30499618
LOGGING
https://blog.kowalczyk.info/article/fc9203f7c72a4532b1ae51d018fef7b3/trade-offs-in-designing-versatile-log-format.html
https://presstige.io/p/Logging-HTTP-requests-in-Go-233de7fe59a747078b35b82a1b035d36
ANALYTICS METRICS REPORTS STATISTICS
https://www.robinlinacre.com/parquet_api/
https://www.robinlinacre.com/demystifying_arrow/
https://deepnote.com/@abid/Data-Science-with-DuckDB-9KKvj1EoQrmj6nj4Y2prkg#
parquet & duckdb!
https://news.ycombinator.com/item?id=31355050
https://news.ycombinator.com/item?id=29966238
https://old.reddit.com/r/Python/comments/y8tu99/analyzing_46_million_mentions_of_climate_change/
TESTING
https://www.clinicallyawesome.com/2021/10/go-reference-mutation-testing.html
https://earthly.dev/blog/property-based-testing/
LOCALIZATION / I18N
https://www.alexedwards.net/blog/i18n-managing-translations
WEB INSPIRATION
https://apps.npr.org/best-books/index.html
http://blog.apps.npr.org/2019/12/03/book-concierge.html
https://shepherd.com
https://fivebooks.com/
https://hackernewsbooks.com/
https://muan.co/
https://mxb.dev/blog/container-queries-web-components/
HTML / CSS
https://tdarb.org/blog/notice-box.html
https://tdarb.org/blog/craigslist-gallery.html
https://elisehe.in/2022/10/16/attribute-selectors
https://news.ycombinator.com/item?id=32972004
https://news.ycombinator.com/item?id=30512512
https://ishadeed.com/article/defensive-css/
https://www.joshwcomeau.com/css/custom-css-reset/
https://1linelayouts.glitch.me/
https://web.dev/patterns/layout/
https://www.matuzo.at/blog/html-boilerplate/
https://markodenic.com/css-tips/
https://markodenic.com/html-tips/
https://jdan.github.io/98.css/?ref=hn
https://docs.google.com/presentation/d/1hvnPpsJo44BTPfJx28CV95vqk_dt6na1awUbk0kmZYM/edit#slide=id.g3e31444916_1_13
https://emoji.muan.co/#
https://dohliam.github.io/dropin-minimal-css/
https://news.ycombinator.com/item?id=27388691
https://riggraz.dev/no-style-please/
https://engineering.kablamo.com.au/posts/2021/my-first-css
https://news.ycombinator.com/item?id=28116888
https://www.joshwcomeau.com/tutorials/css/
https://www.smashingmagazine.com/2021/07/hsl-colors-css/
support reader mode: https://www.ctrl.blog/entry/browser-reading-mode-metadata.html
https://www.joshwcomeau.com/css/designing-shadows/
https://www.gwern.net/Sidenotes
https://7guis.bradwoods.io/flight-booker/
https://github.com/argyleink/gui-challenges
tables: https://alistapart.com/article/web-typography-tables/
UI / UX
https://news.ycombinator.com/item?id=31502193
https://open-ui.org/components/datepicker.research
https://www.bbc.co.uk/gel/guidelines/how-to-write-useful-error-messages
ETL (postgres)
Dr. Martin Loetzsch did a great video, ETL Patterns with Postgres. He covers some really good topics:
- Instead of updating tables, build their replacements under a different name, then rename them. This makes updating heavy-to-compute tables instant. Works even for schemas: rebuild a schema as schemaname_next, rename the current one to schemaname_old, then rename schemaname_next to schemaname.
- Keep all the source data raw and disable WAL, you don't need it for ETL.
- Set memory limits high.
And lots of other good tips for doing ETL/DW in postgres. It's here: https://www.youtube.com/watch?v=whwNi21jAm4
I really appreciate having data in postgres. It's often easy to think that a specialised DW tool will solve all your problems, but that often fails to consider things like:
- Developer experience. Postgres runs very easily on a local machine, more specialized solutions often don't or are tricky to setup.
- Learning another tool costs time. A developer can learn postgres really well in the time it takes them to figure out how to use several more specialised tools. And many devs already know postgres because it's pretty much the default DB nowadays.
- Analytics queries often don't need to run at warp speed. Bigquery might give you the answer in a second but if postgres does it in a minute and it's a weekly report, who cares?
- Postgres is boring and has been around for many years now, it will probably still be here in 10 years so time spent learning it is time well spent. More niche systems will probably be superseded by fancier, faster replacements.
I would go so far as to say you don't necessarily need to split out your DW from your prod DB in every case. As soon as you start splitting out a DW to a separate server you need some way to keep it in sync, so you'll probably end up duplicating some business logic for a report, maintaining some ingestion app, shuffling data around S3 or whatever. Keeping your analytics in your prod DB (or just a snapshot of yesterday's DB) is often good enough and means you will be more likely to avoid gnarly business rules going out of sync between your app and your DW.
https://github.com/mara/mara-pipelines
Worker pools vs semaphore
==========================
For those who want a bit more context to this discussion - it's basically the difference between having N goroutines running all the time (worker pool), waiting for work to come in from outside requests, or having every outside request start a new goroutine, but limit that with a threadsafe count of no more than N at once (semaphore).
The nice thing about semaphores vs. a worker pool is that with a semaphore, you can make it look more like normal old "start a goroutine" code, using a specific function. (e.g. go specificTask(a, b, c)), rather than using a worker that has to be generic and passing it some kind of function to run (e.g. workerCh <- func() { specificTask(a, b, c) })
https://old.reddit.com/r/golang/comments/pzwppr/worker_pool_vs_semaphore/