git-annex devblog (Joey devblog)
day 572 thinking please wait

Not a lot of coding the past few days, but a lot of skull sweat!

I've been working through the design for the import tree feature, and I think I finally have a design that I'm happy with. There were some very challenging race conditions, so import tree may only be safe to implement for a few remotes: S3 (with versioning enabled), directory, maybe webdav, and I hope adb. Work on this included finding equivalent race conditions in git's update of the worktree, which do turn out to exist if you go looking, but have much narrower time windows there.

And I'll be running a tutorial for people who want to learn about git-annex internals at the code level, to start development or be better able to design their own features. That's in Montreal, March 26th-27th (8 hours total), hosted at McGill University. There may be one or two seats left, so if you are interested in attending, please get in touch with me by email. Haskell is not a prerequisite.

git-annex devblog (Joey devblog)
day 571 survey results

The 2018 user survey is closed, so it's time for a look at the results. Several of the questions were also on the two past surveys, so we can start to look at historical trends as well.

Very similar numbers of people responded in 2018 as in 2015. The 2013 survey remains a high water mark in participation. My thoughts on the 2015 survey participation level mostly still stand, although there has been a consistent downwards trend in Debian popcon since 2015.

Also interesting that several people skipped the first question on the survey, perhaps because it was a fairly challenging question? And later questions saw much higher response rates this time than in either of the previous surveys, thanks to improvements in the survey interface.


v7 unlocked files are being used by 7% of users, pretty impressive uptake for a feature that has only been really finished for a couple of months. Direct mode is still used by 7% of users, while its v7 replacement of adjusted unlocked branches is only used by 1% so far. That's still some decent progress toward eliminating the need for direct mode.

command line vs assistant

Well that's plain enough isn't it? Although note that I myself have the assistant running in some repos all the time, but would of course vote "command line" since I interact with that much more.

Also notice that the number of people who apparently don't use git-annex but wanted to fill out the survey anyway was the same in 2013 and 2015, but has now declined.

operating system

Android users have more or less gone away since I deprecated the app. I hope the termux integration brings some back.

how git-annex is installed

Good to see the increase in using git-annex packages from the OS or a third-party package manager.

missing/incomplete ports

Good improvement here since 2015 with 60% now satisfied with available ports.

Worth noting that in 2013, 6% wanted a way to use git-annex on Synology NAS. That is possible now via the standalone linux tarball. This year, 2% wanted "Synology NAS (app store package)".

Also honorable mention to the anonymous person who rewrote git-annex in another language. You should release the code!

number of repositories

Increasingly users seem to have just a couple repositories or a large number, with the middle ground shrinking. A few percent have 200+ repositories now. The sense is of a split between casual users who perhaps clone one repository to a few places, and power users who are adding new repositories over time.

data stored in git-annex

Increasing growth in the high end with many users storing dozens of terabytes of data in git-annex and a couple storing more than 64 terabytes. And a bit of growth in the low end storing under 100 gb.

The total data stored in git-annex looks to be around 650-1300 terabytes now. It was around 150-300 terabytes in 2013. That doesn't count redundant data. And it could be off slightly if shared repositories were reported by multiple users.

(Compare with the Internet Archive, which was 15000 terabytes in 2016 but I think they keep two copies of everything, so call it 7000 terabytes of unique data.)

git level

The same question was asked in the git surveys so I have included those in the graph for comparison.

git-annex users trend more experienced than git users, which is not surprising. You have to know some stuff about git to understand why you'd want to use git-annex.

Notice that git knowledge level is generally going up over time in both surveys.

happiness with the software

A similar question on the git survey included for comparison.

There's a bimodal distribution to git-annex users' happiness, with more unhappy with it than with git, but also more so happy they gravitate toward extreme praise.

There seem to be more unhappy users in 2018 than in 2015 though. The 2018 results are very close to the 2013 results.

blocking problems

Notably 15% of users now find git-annex too hard to use, up from 5% in 2015. Which seems to correlate with some users being more unhappy with it. I don't think git-annex has gotten any harder to use, so this must reflect a change in expectations and/or demographics. (2013 had similar numbers to 2018.)

Very few complain about the documentation now, down to 3% from 13% in 2015, but 12% want to see more tutorials showing how to tie the features together.

And a staggering 21% picked a write-in, "no issues personally, but people don't see (or realize they need) the immense benefits it provides". Need to find better ways to market git-annex, essentially.

size of group using git-annex together

A similar distribution to 2015. One person said they're using git-annex in a group of 50+, and 5 reported groups larger than 10 people.

scientific data

A new high of 11% of respondents are using git-annex to store scientific data. (Other kinds of data it's used for seem more or less the same.)

Part of that growth is because of the companion 2018 git-annex scientific data survey which was promoted in some scientific communities, and so brought more scientists to the main survey.

The use for neuroscience is no surprise, but so much use for astronomy and physics is. And "other" in that pie chart includes statistics, social sciences, mathematics, education, linguistics, biomedical engineering, EE, and physiology -- wow!

survey reach

All participants in the science survey did go on to answer at least part of the main survey. So 37% of respondents to the main survey are scientists.

A full 27% of survey respondents have their name on the thanks page, many for financial support. Which is really great, but also speaks to the fraction of the git-annex user base who saw the survey, because I really doubt that a quarter of the users of any free software are financially supporting it.

As with any online survey, the results are skewed by who bothers to answer it. Still, a lot of useful information to mull over.

git-annex devblog (Joey devblog)
day 570 brrr

Started off the day with some more improvements and bug fixes for export remotes.

Then I noticed that there is no progress displayed for transfers to export remotes; it seems I forgot to wire that up. That really ought to be handled by the special remote setup code, the same way it is for non-export remotes. But it was not possible to do it there the way that export actions are structured.

I got sidetracked with how S3 prepares a handle to the server. That didn't work as well as it might have; most of the time each request to the remote actually prepared a new handle, rather than reusing a single handle. Though the http connection to the server did get reused, that still caused a lot of unnecessary work. I fixed that, and the fix also allowed me to restructure export actions in the way I need for progress bars.
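The general idea of preparing a handle once and reusing it across requests can be sketched as follows. This is an illustration in Python, not git-annex's Haskell code; `CachedHandle`, `open_handle`, and `request` are hypothetical names, and the dictionary stands in for whatever the real handle contains.

```python
import threading


class CachedHandle:
    """Prepare an expensive handle at most once and reuse it afterwards."""

    def __init__(self, prepare):
        self._prepare = prepare      # function that builds a new handle
        self._handle = None
        self._lock = threading.Lock()
        self.prepared = 0            # how many times a handle was built

    def get(self):
        # The lock keeps two threads from both preparing a handle.
        with self._lock:
            if self._handle is None:
                self._handle = self._prepare()
                self.prepared += 1
            return self._handle


def open_handle():
    # Hypothetical stand-in for the costly setup work
    # (credentials, configuration parsing, and so on).
    return {"bucket": "example"}


cache = CachedHandle(open_handle)


def request(key):
    # Every request shares the single cached handle.
    handle = cache.get()
    return (handle["bucket"], key)
```

However many times `request` is called, `cache.prepared` stays at 1, which is the behavior the fix was after.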

I've run out of time to finish adding the missing progress bars today, so I'll do it tomorrow.

Today's work was sponsored by Jake Vosloo on Patreon.

git-annex devblog (Joey devblog)
day 569 another week another release

Today's release is to fix a data loss bug that affects S3 remotes configured with exporttree=yes that got versioning=yes turned on after some unversioned data was stored in them. If you use the new versioning=yes feature with S3, please upgrade.

Also, there are only two days left to fill out the git-annex user survey if you have not already.

Today's work was sponsored by Jake Vosloo on Patreon.

git-annex devblog (Joey devblog)
day 568 release day

After a long struggle with the test suite, the new git-annex release is finally out today.

A few last-minute changes in the release include removing from the webapp since their webdav gateway is EOL at the end of the month, supporting armv7l in the android installation script, allowing installation with 64-bit git on Windows, and shortening the estimated time to completion display.

Today's work was supported by Trenton Cronholm on Patreon.

git-annex devblog (Joey devblog)
day 567 neither rain nor snow

Offline today due to weather, but there's lots of nice backlog to work on...

I've written down an external remote querying transition plan. If you maintain an external special remote that implements WHEREIS or GETINFO, please take a look as your code would need to be updated if this is done.

Ilya suggested making git annex testremote be able to test readonly remotes, and I implemented that.

There was a discussion in the forum about .git/annex/misctmp/ containing cruft left by an interrupted git-annex process. I was surprised to find half a gigabyte of old files on my own laptop due to this problem. I've put in a fix, so git-annex will clean up such temp files that were left behind by a previous interrupted git-annex process.
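One generic way to clean up leftover temp files is sketched below. It is only an assumption-laden illustration: it treats a file as stale when its modification time is older than a cutoff, which is not necessarily how git-annex decides that a file was left behind by an interrupted process. The function name `clean_stale_tmp` and its parameters are hypothetical.

```python
import os
import time


def clean_stale_tmp(tmpdir, max_age_seconds, now=None):
    """Delete regular files in tmpdir older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    The age-based test is an assumption made for this sketch.
    """
    now = time.time() if now is None else now
    # Collect entries first so nothing is deleted while scanning.
    with os.scandir(tmpdir) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        # Skip directories and symlinks; only plain files are candidates.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.unlink(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A cutoff like this has to be generous enough that it never removes files belonging to a process that is still running.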

git-annex devblog (Joey devblog)
day 566 stopping place

I said I was going to stop with the ByteString conversion, but then I looked at profiling, and I knew I couldn't stop there -- conversion between String and ByteString had become a major cost center.

So today, I converted all the code that reads and parses symlinks and pointer files to ByteString, so ByteString is now used all the way from disk to Key. Also put in some caching, so git-annex does not need to re-serialize a Key that it's just deserialized from a ByteString.
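The caching idea can be sketched like this. It is a Python illustration rather than git-annex's Haskell, with a deliberately simplified key format (backend, optional size, name); the `Key` class, its `_serialized` field, and the `parse_key` and `serialize_key` functions are hypothetical names for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Key:
    backend: str
    size: Optional[int]
    name: bytes
    # Serialized form remembered from parsing (hypothetical field);
    # excluded from equality and repr.
    _serialized: Optional[bytes] = field(default=None, compare=False, repr=False)


def parse_key(raw: bytes) -> Key:
    """Parse a simplified key such as b"SHA256-s1024--abcdef"."""
    fields, name = raw.split(b"--", 1)
    parts = fields.split(b"-")
    backend = parts[0].decode("ascii")
    size = None
    for part in parts[1:]:
        if part.startswith(b"s"):
            size = int(part[1:])
    key = Key(backend, size, name)
    # Keep the bytes we were given, so serializing is free later.
    key._serialized = raw
    return key


def serialize_key(key: Key) -> bytes:
    # Only build the serialized form when it is not already cached.
    if key._serialized is None:
        size = b"" if key.size is None else b"-s%d" % key.size
        key._serialized = key.backend.encode("ascii") + size + b"--" + key.name
    return key._serialized
```

A key that came from `parse_key` returns the original bytes from `serialize_key` without doing any work, while a key constructed directly is serialized once and then cached.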

There's still some ByteString to String conversion when generating FilePaths; to avoid that will need an equivalent of System.FilePath that operates on RawFilePath, and I don't think there is one yet? But the profiling does show improvement, it's more and more dominated by IO operations that can't be sped up, and less by slow code.

This really does feel like a stopping place now.

Updated benchmarks (compared to last git-annex release):

find on 10000 files, none present... 8% speedup
whereis on 1000 files............... 12% speedup
info on dir with 1000 files......... 7% speedup
local get ; drop of 1000 files...... 4% speedup
setting metadata in 1000 files...... 8% speedup
getting metadata from 1000 files.... 7% speedup
finding a single file out of 1000 that has a given metadata value... 8% speedup

