For repositories with daily updated data plots (including slightly changing background color gradients), I asked myself if there is some preferred format (or compression algorithm) to use, so that git can store them more efficiently, instead of having to re-write about 90% of them, all the time.
Is there any kind of image format which is more 'git-friendly' than others?
CodePudding user response:
The theory
Formats that are "Git friendly" will be formats that share long identical byte sequences, whether they are binary or text.
Now, a lossy binary format will probably change most bytes when you change even just the background colour gradients, whereas a more descriptive text-based format might not.
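To make that concrete, here is a minimal sketch, not a measurement of Git itself. It assumes a made-up SVG-like plot (`make_plot` is hypothetical), uses lossless DEFLATE via `zlib` as a stand-in for "a compressed binary format", and uses the shared prefix plus suffix as a crude proxy for the byte runs that delta compression could reuse.

```python
import zlib


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of bytes covered by the common prefix plus the common suffix."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    shortest = min(len(a), len(b))
    prefix = 0
    while prefix < shortest and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    # Never let the suffix overlap the prefix already counted.
    while suffix < shortest - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return (prefix + suffix) / longest


def make_plot(background: str) -> bytes:
    """Build a made-up SVG-like plot; only the background colour varies."""
    lines = [
        '<svg xmlns="http://www.w3.org/2000/svg">',
        f'<rect width="100%" height="100%" fill="{background}"/>',
    ]
    for i in range(500):
        lines.append(f'<circle cx="{i}" cy="{(i * 37) % 101}" r="2"/>')
    lines.append("</svg>")
    return "\n".join(lines).encode("ascii")


monday = make_plot("#f0f0f0")
tuesday = make_plot("#f0f0f1")  # one character of the background changes

text_shared = shared_fraction(monday, tuesday)
packed_shared = shared_fraction(zlib.compress(monday), zlib.compress(tuesday))

print(f"text form:       {text_shared:.1%} of bytes shared")
print(f"compressed form: {packed_shared:.1%} of bytes shared")
```

The text form shares everything except the one changed character, while the compressed streams diverge from the point of the change onwards. Real image formats and Git's actual delta algorithm will behave differently in detail, which is why the test below on your own files matters more than this toy.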
Testing things with your own files
I recommend this test to calculate the compressed size of different file formats in your actual use case.
- Before you start, take a sandbox or a clone, and aggressively compress it so we know further compression in later steps is not due to the images being added: run git gc --aggressive a few times, until du .git yields the same answer twice.
Now, for each file format you want to test, copy that sandbox into a new directory and do the following steps:
1. Add one set of images and aggressively compress the repo again by running git gc --aggressive a few times, until du .git yields the same answer twice.
2. Write down what du .git tells you: that's your baseline size.
3. Add and commit a new set of files, slightly changed in the way you describe in your question.
4. Now du .git tells you the size of just adding those files into the repo. On commit, Git does not (normally) try to apply delta compression or packing; it just adds a new blob for each file being committed, unless an identical blob already existed.
5. Again, run git gc --aggressive until the size is stable.
6. Now du .git tells you how much Git was able to compress those files, by whatever means it found, possibly delta compression. The size here minus the size at step 2 is your space cost for adding one new set of files.
By running the above procedure for different file formats for your images, you'll get an answer specific to your use case.
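If you want to script the repeated "compress until stable, then measure" part, a rough sketch could look like this. The helper names `dir_size` and `gc_until_stable` are my own, and `dir_size` sums file sizes in bytes rather than disk blocks, so its numbers will not match du exactly. It assumes git is on your PATH.

```python
import os
import subprocess


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under path (a rough `du`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def gc_until_stable(repo: str, max_rounds: int = 10) -> int:
    """Run `git gc --aggressive` until .git stops changing size; return that size."""
    git_dir = os.path.join(repo, ".git")
    previous = dir_size(git_dir)
    for _ in range(max_rounds):
        subprocess.run(
            ["git", "gc", "--aggressive", "--quiet"], cwd=repo, check=True
        )
        current = dir_size(git_dir)
        if current == previous:
            break  # same answer twice: treat the repo as fully compressed
        previous = current
    return previous
```

Call `gc_until_stable(sandbox)` once after committing the first set of images to get the baseline, call it again after committing the second set, and subtract the two values to get the per-update cost for that file format.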
Git LFS is probably your friend
PS: All that being said, I stand by @Nicolas Voron's answer: unless the size cost above is actually small for the file format you end up choosing, use Git LFS to avoid creating problems in the future when your repo gets too large to clone.
CodePudding user response:
Since Git is not designed to deal with binary files (despite the fact that it can), I recommend the excellent git-lfs extension (originally supported by GitHub).
With Git, the problem is not what you are versioning, but how you do it. Daily updated data plots will generate a huge amount of data over time, which will become a problem for cloning and fetching within a few years.
How to use it :
1. Download and install the Git command line extension. Once downloaded and installed, set up Git LFS for your user account by running:
   git lfs install
   You only need to run this once per user account.
2. In each Git repository where you want to use Git LFS, select the file types you'd like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at any time.
   git lfs track "*.psd"
   Now make sure .gitattributes is tracked:
   git add .gitattributes
   Note that defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history. To do that, use the git lfs migrate command, which has a range of options designed to suit various potential use cases.
3. There is no step three. Just commit and push to GitHub as you normally would; for instance, if your current branch is named main:
   git add file.psd
   git commit -m "Add design file"
   git push origin main
What it does :
Git LFS stores a pointer file in the git repo in lieu of the real large file. The pointer is swapped out for the real file at checkout (using smudge and clean). The smudge and clean filters are part of core Git and are designed to allow changing a file on checkout (smudge) and on commit (clean). Git LFS uses these techniques to replace the pointer files with the actual large files that are in use.
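To give an idea of how small such a pointer is, here is a sketch that builds the pointer text for some content, following the layout of the Git LFS v1 pointer format (a version line, a SHA-256 object id, and the size in bytes). The function name `lfs_pointer` and the placeholder bytes are mine; in practice Git LFS generates this for you through the clean filter.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Return the text of a Git LFS (spec v1) pointer for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Placeholder bytes standing in for one daily plot (not a real image).
fake_plot = b"\x89PNG placeholder bytes, not a real image"
print(lfs_pointer(fake_plot))
```

Whatever the size of the plot, the repo itself only gains these three short lines per version; the image bytes live in the LFS store.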
EDIT
As I commented under your question, you might consider a lossless image format like PNG so Git can optimise the delta over time, since two relatively close pictures in this format may have a close binary representation, which is not necessarily the case for a lossy compressed format (e.g. JPEG). It depends on your pictures and how much they vary each day, but since this is a plot, PNG should definitely do the trick.
Another recommendation is to keep the pictures in a submodule (unless it's a dedicated image-only repo), so the extra weight of the versioned images will not affect cloning and fetching for the whole repo.
