The current algorithm is CPU intensive and blocks the event loop for
multiple seconds in my deployment. This is not acceptable, as other
requests cannot be answered during that time.
I do not have time to fully fix the issue here, but I did implement an
optimization for ALL_TIME reports:
Previously, the all-time report was generated for every timeFrame since
1970, which iterated over the listens many hundreds of times. We can
instead start the interval at the day of the first listen and thereby
skip 50+ years of calculations.
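The idea can be sketched as follows; the names here are illustrative, not the actual Listory code:

```typescript
// Hypothetical sketch: derive the report's start date from the first
// listen instead of the Unix epoch, so the interval loop skips the
// decades before the user's first listen.
interface Listen {
  playedAt: Date;
}

function reportStartDate(listens: Listen[]): Date {
  // Fall back to the epoch only when there are no listens at all.
  if (listens.length === 0) return new Date(0);
  // Start the interval at the earliest listen.
  const earliest = Math.min(...listens.map((l) => l.playedAt.getTime()));
  return new Date(earliest);
}
```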
This should help with failing health checks while the crawler is running.
Quick math: 10 users × 30 songs each × at least 3 queries per
song => 900 DB queries every minute.
With the default of 10 pool connections, this saturates the available
DB connections for some time, causing a slow UI and failing health checks.
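The estimate above as a back-of-the-envelope check; the per-song query count of 3 (artist, album, track lookups) and the pool size of 10 come from the text, everything else is illustrative:

```typescript
// Back-of-the-envelope sketch of the crawler's query load.
function queriesPerCrawl(
  users: number,
  songsPerUser: number,
  queriesPerSong: number,
): number {
  return users * songsPerUser * queriesPerSong;
}

const total = queriesPerCrawl(10, 30, 3); // 900 queries per crawler run
const perConnection = total / 10; // ~90 queries queued per pool connection
```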
Listory could miss some listens when the Spotify API misbehaves:
sometimes the listens added to the recently-played endpoint
show up out of order.
Because of our optimization to only retrieve listens newer than
lastRefreshTime, we would skip those out-of-order listens.
By always retrieving the maximum of 50 listens, we can be reasonably
sure that we get all of them.
We also had to improve the handling of duplicate listens, as we now
encounter a lot of them, courtesy of removing the lastRefreshTime
optimization.
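Since every crawl now re-fetches the same 50-listen window, most fetched entries already exist. One way to handle this, sketched below with illustrative names rather than the actual Listory schema, is to deduplicate on the (trackId, playedAt) pair before inserting:

```typescript
// Hypothetical duplicate filter: a listen is uniquely identified by
// which track was played at which time.
interface SpotifyListen {
  trackId: string;
  playedAt: string; // ISO timestamp from the API
}

function filterNewListens(
  fetched: SpotifyListen[],
  existingKeys: Set<string>, // keys of listens already in the DB
): SpotifyListen[] {
  return fetched.filter((listen) => {
    const key = `${listen.trackId}@${listen.playedAt}`;
    if (existingKeys.has(key)) return false;
    existingKeys.add(key); // also drops duplicates within the same batch
    return true;
  });
}
```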
When the Spotify crawler loop imported an artist multiple
times in parallel, the first insert would succeed, but the following
queries would throw the following exception:
QueryFailedError: duplicate key value violates unique constraint "IDX_ARTIST_SPOTIFY_ID"
This error could also occur for albums or tracks.