Manjaro has announced they are testing a new data collection tool for their system and, well, it’s started a bit of a controversy. So let’s talk about it.
Links:
- https://forum.manjaro.org/t/testers-needed-manjaro-data-donor/170163
- https://destinationlinux.net (My Podcast)
- https://www.phoronix.com/news/Manjaro-Linux-Data-Donor
- https://www.gamingonlinux.com/2024/11/manjaro-linux-want-your-system-info-with-their-new-data-collection-tool/
Transcription:
[0:00] Manjaro has announced that they are testing a new controversial tool because it’s a data collection tool called the Manjaro Data Donor. So let’s talk about that. Hi, I’m Michael. And if you’re new to the channel, I make tech videos with a fairly heavy focus on Linux. I’m also a marketing professional. I’ve worked in technical marketing. I’ve worked as a product manager in advertising, public relations, and more. So this topic is kind of in my wheelhouse in a lot of ways, whether it’s the
[0:29] tech side, the Linux side, and also in the marketing parts. So here is the forum announcement from Roman from the Manjaro team about the Manjaro Data Donor. And this is about getting testers for, you know, seeing how they can improve it and things like that. Now, the Manjaro Data Donor is, first of all, kind of a weird name, because we’ll get to it in a second, but just wanted to put that out there.
[0:51] I’d suggest, you know, this be a placeholder for now. So they say it is a way for us to gather a few usage statistics about Manjaro. The motivation for that at the start was to improve our user counting. Until now, what has been done was counting users via ping.manjaro.org, and these pings are sent from Manjaro systems via NetworkManager.
[1:15] So, let’s go to ping.manjaro.org, and you will see that it is an ad for the Slimbook laptop, the Manjaro edition of the Slimbook laptop, and that’s about it. It’s pretty much the whole thing. There’s not that much information about what they’re doing, how they’re sending the pings, or much at all, but they do go on to say that there are some problems with the approach. They’re saying that individual systems were only distinguished on the basis of IP address. This doesn’t allow statistics over time, and systems behind the same NAT are counted as one, which is very true. That’s the value of machine ID versus having just an IP address.
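The counting problem described here can be sketched in a few lines of Python. This is only an illustration, not how Manjaro's ping service or MDD actually works: the ping list, the IP addresses, and the machine identifiers below are all made up, and the sketch simply assumes each system could report some stable per-machine identifier.

```python
# Hypothetical pings as (source IP seen by the server, per-machine identifier).
# All values are invented for illustration.
pings = [
    ("198.51.100.7", "machine-a"),
    ("198.51.100.7", "machine-b"),  # same household, behind the same NAT
    ("198.51.100.7", "machine-d"),  # a third system behind that NAT
    ("198.51.100.7", "machine-a"),  # machine-a pinging again
    ("203.0.113.20", "machine-c"),
    ("203.0.113.99", "machine-c"),  # machine-c after its IP address changed
]


def count_by_ip(pings):
    """Count systems the way an IP-only approach would: one per distinct IP."""
    return len({ip for ip, _ in pings})


def count_by_machine(pings):
    """Count systems by distinct machine identifier instead."""
    return len({machine for _, machine in pings})


print("distinct IPs:", count_by_ip(pings))
print("distinct machines:", count_by_machine(pings))
```

In this made-up data the IP-based count merges the three systems behind one NAT into one and splits the roaming system into two, so it lands on three where there are really four machines.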
[1:57] Also, one needs to store the IP address at least for a short period of time. And the analyst software, or the analysis software, that was used was Matomo. And they promise IP addresses are masked, but we still had to rely on this promise.
[2:11] I mean, technically you don’t necessarily, because it is an open source project and I have used it and I have tested it and they do mask the IP addresses. In fact, they have a system where they hash it, and they don’t ever actually store the real IP address. They store a hash of the IP address, and then they continue the process and hash it again and see if they get the same result when they’re compared to multiple visits and that sort of stuff. So there is that, and you also control how much of an IP address is even used in the first place. So you can knock it down from being the entire IP to just being country-based as well as many other things. So you can do a lot with Matomo. So I do think Matomo is pretty good, but it is not for this kind of thing at all. So Matomo is a rather bulky tool, but wasn’t really made for system telemetry.
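To make the masking-and-hashing idea concrete, here is a minimal Python sketch. It is not Matomo's code and makes no claim about how Matomo implements this internally; the function names, the salt, and the choice of SHA-256 are assumptions made purely for illustration. It only shows the general technique of zeroing out part of an IPv4 address and then storing a salted hash of what is left.

```python
import hashlib
import ipaddress


def mask_ipv4(ip: str, masked_bytes: int) -> str:
    """Zero out the last `masked_bytes` bytes of an IPv4 address (0 to 4)."""
    if not 0 <= masked_bytes <= 4:
        raise ValueError("masked_bytes must be between 0 and 4")
    prefix = 32 - 8 * masked_bytes
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.IPv4Network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def visitor_hash(ip: str, salt: str, masked_bytes: int = 2) -> str:
    """Hash the masked address with a salt (hypothetical scheme)."""
    masked = mask_ipv4(ip, masked_bytes)
    return hashlib.sha256((salt + masked).encode("utf-8")).hexdigest()


print(mask_ipv4("203.0.113.77", 1))
print(mask_ipv4("203.0.113.77", 2))
print(visitor_hash("203.0.113.77", salt="example-salt"))
```

The more bytes are masked, the coarser the stored value becomes: with two bytes masked, every address in the same /16 block produces the same hash, which is the trade-off between privacy and being able to tell visitors apart.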
[3:02] Right, exactly. It is meant for website analysis. The setup, therefore, was kind of hacky, while the results were rather meager and the data was
[3:09] only available to a few people. Using NetworkManager pings to check the online status for user counting is acceptable, but also not what it is meant for, nor was it communicated as such. I think it’s better to be explicit and transparent about these kinds of things. That’s true. And I didn’t know that they were doing this ping thing, but also it’s interesting because there was a forum post asking about this. The forum post says, Did Manjaro track me? It’s saying that they’re finding stuff in their blocking list for Pi-hole for matomo.manjaro.org.
[3:42] And kind of, like, depending on how this was done, this could have been the ping. This could have been them just going to a website that is doing it; the forum, for example, has Matomo Analytics on it and that sort of thing. So it could be, but we’re not really sure how they got this information. We do know that at least they’ve been doing it before. So Roman says that he wanted to improve upon the data collection and stuff for quite some time. And the MDD, or the Manjaro Data Donor, is now the tool for that, and some more, as it will also provide interesting hardware and environment statistics about the systems that Manjaro is being used on. Right now it’s for testing, but they are planning to do something that a lot of people are not fans of, and that is opt-out versus opt-in. So if you go down to here, you’ll see that with this systemd service later in place, sending the hardware data with MDD will be opt-out, because I believe if you do opt-in, the data you gather will be so heavily skewed that you can just leave it be. So he says that he wanted to improve the data that they are collecting so that it is more useful. And he says that MDD is now the tool for that, and some more, as it will also provide interesting hardware and environment statistics about the systems that Manjaro is being used on.
[5:02] Now, I do agree that this would be much more useful about having information about the desktop environments or the hardware more so than just how many people. How many people is very important, but also there’s a lot more to it that could be useful.
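As a rough illustration of why this is more useful than a head count, here is a short Python sketch that tallies a handful of reports into aggregate counts. The reports and the field names (`desktop`, `form_factor`, `country`) are invented for the example and are not MDD's actual data format.

```python
from collections import Counter

# Hypothetical reports; field names and values are made up for illustration.
reports = [
    {"desktop": "KDE", "form_factor": "laptop", "country": "FR"},
    {"desktop": "KDE", "form_factor": "desktop", "country": "DE"},
    {"desktop": "GNOME", "form_factor": "laptop", "country": "FR"},
    {"desktop": "Xfce", "form_factor": "laptop", "country": "US"},
]


def aggregate(reports, field):
    """Count how many reports share each value of the given field."""
    return Counter(report[field] for report in reports)


print("total systems:", len(reports))
print("by desktop:", dict(aggregate(reports, "desktop")))
print("by form factor:", dict(aggregate(reports, "form_factor")))
```

Only the totals per category come out the other end, which is the kind of aggregate view that can be shared without exposing any single system's report.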
[5:17] Now, before you send the data, there is an ability to do a dry run so you can see what it is prior to sending it. Then also you can just run it and send the data. So let’s take a look at the kind of data that is being sent. As you can see, there’s a lot of data that is related to system information, such as the kernel that’s being used, the form factor, which is a laptop or a desktop, and what version of Manjaro they’re using. You get a lot more information than you would normally get with just an IP ping. You get the CPU information and whether it’s using Wayland or not, and what DE it’s using, because KWin is actually telling you what DE, but it also says somewhere what DE it is. It tells you PipeWire and all that. So it gives you a lot of information.
[6:06] Clevo, so you even know what kind of hardware it is. Now, it also tells you how much RAM, how much swap is available and all that sort of stuff. So there’s quite a bit of information in here. But the most important thing is that most of it has no ability to be personal data whatsoever. Now, you could take some of this information and combine it with other things and do some kind of fingerprinting. And that’s another issue that’s problematic. But the data that is being sent out is not going to be provided publicly in the sense of, like, here’s all the data of everything. What they’re going to be doing is giving you information in, like, an aggregate, which I think is okay. I think aggregate information is okay. There is location information in this data, which some people are going to be bothered by, but it does not go past the country, except for this time zone thing. It does say Paris. I don’t know if France has more than one time zone. Maybe it doesn’t. I don’t know that. But it does say Paris. So that might be more specific.
[7:07] But around here, it says the country config is France. So that information could be useful in terms of how many users are in different countries, what kind of hardware people have in different countries, and what kind of preferences for desktop environments there are based on country. There’s a lot of cool information that could be gathered by the country rather than anything beyond that. Like, I would not be comfortable with anything beyond the country level. I think the country level is plenty of information that would be able to be used in an aggregate form. So I’m okay with that personally; that’s up to the user whether or not they’re okay with it. But you know, maybe they can make a flag where when you send it, you could say don’t send the country data or something like that. I don’t know. But that’s the only thing I feel like could even be remotely related to personal data.
[7:54] And still not even then, because you’re not the only person in that country. Now, of course, not everyone is happy with this information. So let’s take a look at one of them says that hands off my data, opt out is a disgrace and an embarrassment and that it’s a rude, you’re using a rude method.
[8:13] And if you need data, then ask. So this is an interesting take because I understand that a lot of people are going to be anti opt out.
[8:24] It’s definitely going to happen because people prefer opt-in, because then you’re asking for the data. And there you go. I get it. It depends on how the opt-out is done. Now, I think the way that Ubuntu does it is actually fine, because Ubuntu does a thing where you are asked the question immediately when you do the installation and you can just say no. And if you say no, the only thing that it sends is that you said no. And I think that that is powerful in two ways. One, it gives the user the control to say whether or not they can have the data. Two, it also gives a total amount of installs, because saying no is still information that says it at least has been installed, because otherwise you wouldn’t get the no. So I think that the no is actually also useful to send. Some people would say that, well, if I say opt out, I shouldn’t send anything. And that’s somewhat true, but you’ve also already downloaded the ISO. So you still sent the data. This is more of a similar type of thing of saying, hey,
[9:31] this person has installed it, but that’s all we know. And I think that that’s totally fine. Now, there are people talking about how this is spyware. And this is very important to clarify.
[9:44] This is not spyware. Spyware is something that is being done without your knowledge of it happening. If they tell you that it’s happening, then it’s not spyware. You could still disagree with it in general, but that’s the definition of the term.
[9:56] Now, here is the data that they are sharing. And this is real-time data, I’m pretty sure, that they are providing with the aggregate information that has already been shared. And you can see the number of countries that are participating and which countries are participating, and the level of resolution that they have, and all sorts of stuff. There are a lot of details, like which DE they are using, what kind of device class they have, like how much desktop, how much laptop, and how much, like, a system on a chip, like a Pi, or, you know, just all sorts of stuff. And, you know, ARM versus Intel and AMD, and NVIDIA versus AMD and Intel. I think this information is very useful and very interesting, especially with giving it like this in an aggregate form. So you have some data, but you don’t have a huge amount of data and you don’t have any specific details of anyone. This can be very useful, not only for the individual users; this could be useful in a lot of ways just for Linux entirely, like Manjaro and Linux. Now I would prefer.
[11:02] Every distro have something like this.
[11:05] Let’s talk about that. I know there’s a lot of people who are going to be annoyed by this entire thing. So let’s talk about the pros and cons of telemetry. Yes, telemetry can be used as a form of violation for security and privacy and that sort of thing. So there are definitely some pros and cons. And there’s also the aspects of data collection slash telemetry can be bad if it’s used badly, but it also can be done ethically and therefore the data can be useful to everyone. So the fact that they’re also sharing this data, I feel like is probably the most powerful aspect of it. It’s great that they’re, you know, giving the information, giving the option for the users, but also it’s, they’re giving the data that when people who do submit it, everybody can see it. And that’s good because we have some more information about how many users there are in Manjaro, the configurations of their hardware and all that sort of stuff. Now, I would prefer that every distro have something like this, not necessarily this, because I haven’t really dug into it to say for sure,
[12:08] but something like it, because we have a big problem with Linux and usage and convincing people. Now, previously, I was talking about how I’m in marketing and public relations and advertising and that sort of stuff. And the amount of times I’ve had conversations with people about, so whether it’s a product, whether it’s a service or insert whatever.
[12:29] The amount of times I’ve had conversations where they would say, show me the data.
[12:33] What kind of data do you have? Like, even for this YouTube channel, even for my podcast, when I try to go to a company and say, hey, would you like to sponsor it? You can get benefits from helping the show and blah, blah. That’s not what they’re interested in. How much value can you bring to me for me to pay for your sponsorship? That’s what they want to know. If I don’t know how many people are watching the show or downloading the podcast or subscribing to the channel, how would I be able to convince them? I would have no idea. And they would have no way to know that it’s verifiably useful because they wouldn’t be able to see how many views the video got, how many views the podcast got, or how many subscribers that things have. Like, if they don’t have that information, then it’s not helpful.
[13:22] And also pretty much impossible to convince them. Now apply that same process to convincing a company to make support of their software on Linux. And you go to them and say, hey, you should make Linux ports of your software. And they go, how many users are on Linux? And our answer for decades has been, we have no idea. We have no ability to know. In fact, we have this huge process and problem of when people want to implement something like this, they’re ostracized and completely ridiculed and attacked as if they’re the enemy. If the data is ethically gathered and ethically provided in the sense that it is anonymized and it is having zero information about the individuals and it doesn’t try to take any data from you that is about you, it’s just basic hardware information, or even just a machine ID to prove how many people use it. Now, some people could argue that there is too much information in this. I’m not taking that stance, but some people could. And other people could say, what do you need more than just a device ID to prove that someone has installed it, or whatever. And those are all fair points. I think that the nuance of how this is done, that’s important. But I feel like the instant outrage of telemetry, and it can never be okay.
[14:47] That is problematic for the future of the platform. As far as the desktop of Linux, we all know that Linux is already dominant in everything else. The only thing that we’re slow to have adoption is the desktop. So how do we get it there?
[15:05] We need applications. We need the applications to be on the platform to get the users who want those applications to be able to use them. How do we get the people who make that software to put it on Linux? Well, we get them to do it by convincing them that there is enough market to justify the process of doing it. If we don’t know what the data is, how do we justify the process of doing it? That is the main fundamental problem I have with the anti-data collection position. Yes, there are tons of examples of companies who have done horrible things with information and taking people’s personal data to the point where governments are now putting out stuff to stop people and giving you the option to get rid of it. And yeah, fair enough. Companies have done it. However, just because one company has done it doesn’t mean that everyone will do it. And also, the open source system, the open source…
[16:12] philosophy is more likely to do it the right way. So I feel like we should be more open-minded to this kind of thing, provided that they do it in a way that is respectful of the user, anonymizes the data, and provides no personal information ever to the servers. So, like, don’t take any personal data at all. And then yes, there’s fingerprinting as a thing, but you can anonymize that too, so I feel like that’s not necessarily a problem either, depending on how you do it. So yes, this could be done improperly, but at the same time it also could be fantastic. And I feel like if every major distro, and preferably every distro, did something like this, it would make it possible for us to say, hey, insert giant company who wants to justify spending thousands and thousands of dollars to port their software to Linux: we have millions of users.
[17:15] We do, by the way, we do actually have millions of users, but we can’t prove that. We can only just estimate that. So that’s why I wanted to cover this. Data collection can be done properly. If it is done properly,
[17:29] it can be very useful to the Linux desktop and the Linux ecosystem overall. So let’s get to the opt-out versus opt-in part of it. Now, depending on how this is done, if opt-out is done properly, which I would hope that it will be, where when you install the system it just says, hey, do you want to participate in this, just like how Ubuntu does it, where when you first install Ubuntu it says, do you want to send the information, and you can choose to do so or not. Manjaro would hopefully do that kind of thing, where you get the option up front, and I would be okay with that personally, and then there you go. Now, if they make it any harder than that, I feel like that would be a problem. As long as they don’t make you have to use the terminal in order to stop it or whatever, then we’re fine. Because if you have average users or everyday users trying out a distro, and you could argue whether or not Manjaro is a distro for them, people are still going to do it. But you need to make sure that there’s not a barrier to choosing whether to participate or not.
[18:36] Because if you create a barrier, then you start getting into the sketchy world. But what about opt-out itself? Is that a rude method? Is that an egregious thing to do? Well, in my opinion, no. Because opt-out is, first of all, the only option really. If you think about it, opt-in is telling people that you would like them to give you the data.
[19:02] What percentage of people do you think are going to say yes to that? Some people would be like, well, if I got to put in extra effort to do that, even if it’s just three letters, I don’t want to. Some people would be, if I have to even… run a terminal command, guaranteed not going to. Like, there’s people who are averse to even using the terminal if they don’t have to. And then there’s people who aren’t even going to know that this is a thing because they didn’t watch this video, they didn’t see the announcement on the forum or any of those sorts of things. If it’s opt-in, you have to do a level of marketing to let them know first that it’s even a possibility to do it, then convince them to do it. And if they don’t care at all, they might not even bother to look up information about it. So the only way you could do opt-in is if you do the exact same thing in the beginning of the installation, and make them choose and not have it set up either way; you just have to choose yes or no. Now you could argue that that would be better. And I would say that there is merit to that. But also, you have to force them to choose. And forcing them to choose is going to create a burden for some people who are annoyed by it and just don’t care.
[20:25] And yes, some people react negatively to that and the fact that they will just stop using the system and move to something else because it’s not going to bother them with something. Is that an outlandish reaction? Sure, but it’s possible.
[20:38] So I feel like if you’re going to do this kind of thing, opt out is the right choice. As long as you’re doing it ethically, making sure that everyone who installs the system is asked the question and can participate or not. If you have the checkbox already set to send, that’s OK, because if people don’t care and just click through anyway, then there it is. And if they say no, all they have to do is check a box, then that’s also totally fine. I feel like the aversion to it is feeling like opt-out is only possible to be done in a way that is malicious. Like how all of these different EULAs that exist and you sign a contract with some subscription thing and then you have to scroll down to the very bottom to know that you can opt out for some collection you didn’t even know they were doing and all that sort of stuff. In that sense, yes, that can be done poorly. And that can be done unethically and horrifically. And there’s a lot of examples for that. I’m not going to give you a list because…
[21:43] It’s sad, but it doesn’t have to be done that way. And I feel like Manjaro is probably going to handle it the right way. Based on the fact that they’re an open source Linux distro, they’ll probably handle it the right way. I hope they do. And if they do, then great. If they don’t, well, they will hear a different message from me. Let me know what you think in the comments below. If you disagree with me, let me know why. I understand that not everybody’s going to agree with me, and that’s okay. So just let me know in the comments. And if you want some more controversial information and more controversial news that you can dig your teeth into, then check out this video where I talk about how WordPress is going after WP Engine and they’re doing it in kind of a weird way.
Start the discussion at forum.tuxdigital.com