The END of RETENTION LIMITATIONS?



Beck38
01-29-2010, 03:55 PM
The past month or so, I've been going back 'rummaging through' some old nzb's, going further and further back, particularly on x.264 and Blu-ray postings from some 500 days (or more) ago.

What seems to be happening is that, even with usenet adding some 5TB or so per day, retention now reaches back some 500 or so days (I use Astraweb; Giganews appears not to go quite as far back, maybe 350+ days or so), and with the major providers able to keep adding that 5TB of disc space per day, we may be entering an era where anything posted to usenet (text OR binary) is there 'forever'.

'Keeping up', again, with that 5TB+ per day costs around $300/day or thereabouts in added discs. Simple arithmetic yields that a provider would need around 800+ subs just to pay for the hardware upgrades; say double that to fund an ongoing staff for 'hands on' maintenance. I'm a bit out of date on current costing for the large-scale internet connections (say OC48 or OC192, but that's easily looked up) that provide the user interconnections, but...

The upshot is that everything posted to usenet since somewhere back around, say, August 2008 or so may well be there 'forever' (again, on the major server plants).

All a plant may need, again, is some 3000 or so subscribers to maintain 'equilibrium', so to speak (pay all the bills, keep adding hardware to the plant, maintain interconnections).

Just some musings, early in the morning for me, that have kinda become super-obvious as I sit here d/l'ing yet another posting from around 500 days back (from today) or thereabouts. I know that the text groups have been 'virtually complete' back several years for quite a while, but I'm talking about the binary groups as well.

One of the things I've noticed is that (some? most?) of the 'major' indexing sites are finding it hard to 'keep up' with the retention of the majors (GN and Astra). Certainly Newzbin is, but with them 'taking down' stuff (listings) it's hard to tell.

One of (the?) biggest pluses of the P2P folks has always been that there are/were so many individual folks 'on the network', that anything 'on' the network was there virtually forever (as long as someone, somewhere, retained the file).

Usenet may be on the verge, or already past, that point as well.

2000TB of retention per year (at the 5TB/day) is simply not that big of a deal: $200K per year hardware-wise (a simple $100/TB), or thereabouts.
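
To put rough numbers on it, here's a quick back-of-the-envelope in Python using the figures above; the ~$11/month average subscription price is just an assumed figure of mine, not any provider's actual rate:

daily_feed_tb   = 5      # TB of new articles per day (figure from above)
daily_disc_cost = 300    # rough $/day in added discs (figure from above)
cost_per_tb     = 100    # rough $/TB (figure from above)
sub_per_month   = 11     # ASSUMED average subscription price, $/month

annual_tb   = daily_feed_tb * 365          # ~1825 TB of new retention per year
annual_hw   = annual_tb * cost_per_tb      # ~$182K/year, i.e. the "$200K or thereabouts"
monthly_hw  = daily_disc_cost * 30         # ~$9,000/month in disc additions
subs_for_hw = monthly_hw / sub_per_month   # ~800 subs just to cover the hardware
subs_total  = subs_for_hw * 2              # double it for staff and overhead

print(f"{annual_tb} TB/yr, ~${annual_hw:,}/yr in discs")
print(f"~{subs_for_hw:.0f} subs cover hardware, ~{subs_total:.0f} with staff")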

:w00t: Something to think about, remembering back to when a couple weeks to a couple months of retention was a 'big deal'.

Rart
01-29-2010, 07:26 PM
This was a great point you brought up.

Whenever I checked the retention data, it always seemed like GN and Astraweb were increasing retention at exactly the same rate (1 day for every day), and it never stopped (and as such Astraweb is always just a tiny bit behind, probably due to starting later or something).

It would appear that the growth and development of storage space (and its feasibility/cost) is outpacing the rate at which content is added to usenet.

The wonders of modern technology :).

Beck38
01-29-2010, 08:38 PM
It would appear that the growth and development of storage space (and its feasibility/cost) is outpacing the rate at which content is added to usenet.


I think most folks are wondering what the next 'step' is in magnetic data storage after perpendicular drives (that being the last 'great leap forward'). Unless I've missed some news release somewhere, 2TB has been the 'plateau' for a couple years now, but the pricing continues to fall even on those (with or without 'green' types).

I kinda wonder exactly how many subscribers the 'big boys' have, though. Certainly greater (MUCH greater) than the number needed in my small attempt at cost analysis.

I think it's over. Maybe 1 Jan 2009 was the last day anything anywhere 'rolled off' usenet. Anywhere.

tesco
01-29-2010, 09:00 PM
...
Usenet may be on the verge, or already past, that point as well.

They store data in more than one place.
The big guys have separate server farms in different cities, and they can't run off of one source per article. The articles (especially the newest ~7 days?) would have to be on many machines for optimization/load balancing. Not to mention backups...

edit: Maybe backup is unneeded; if content goes missing it can be looked up in some sort of master 'index' and then redownloaded from other servers (the same way it got the data in the first place).

Beck38
01-29-2010, 10:40 PM
They store data in more than one place.
The big guys have separate server farms in different cities, and they can't run off of one source per article. The articles (especially the newest ~7 days?) would have to be on many machines for optimization/load balancing. Not to mention backups...


Actually, no they don't. The trend over the last few years has been to consolidate (costing is the driver), and I know that both Astraweb and Giganews each run a single plant (other than their European plants and, with GN, an Asian one). I've been to where GN is (Hampton Roads, VA), and I used to live a few blocks from Astraweb (Santa Clara, CA). When GN was hq'd in Phoenix several years ago, I was there as well. WAY back when, when GN started out in Austin, TX, I used to live there as well (small world, or I've lived all over the place; did TONS of traveling while building the 'fiber planet' for more clients than I can remember).

The 'machines' are multiple, to an extent, for the load balancing you mention. But short of a close nuclear weapons detonation, they are pretty solid.

iLOVENZB
01-29-2010, 11:30 PM
...
edit: Maybe backup is unneeded; if content goes missing it can be looked up in some sort of master 'index' and then redownloaded from other servers (the same way it got the data in the first place).

Imagine if it was a RAID backup :O.

I remember reading that Usenet isn't hosted on just one server; if it were, it would be very easy to shut down, not to mention what would happen if a server died, etc.

This is why Usenet is near impossible to bring down.

tesco
01-30-2010, 02:09 AM
...
edit: Maybe backup is unneeded; if content goes missing it can be looked up in some sort of master 'index' and then redownloaded from other servers (the same way it got the data in the first place).

Imagine if it was a RAID backup :O.

I remember reading that Usenet isn't hosted on just one server; if it were, it would be very easy to shut down, not to mention what would happen if a server died, etc.

This is why Usenet is near impossible to bring down.
Yes I know.
The question is whether within one host/serverfarm there are multiple copies of the same article or just a single one.



They store data in more than one place.
The big guys have separate server farms in different cities, and they can't run off of one source per article. The articles (especially the newest ~7 days?) would have to be on many machines for optimization/load balancing. Not to mention backups...


Actually, no they don't. The trend over the last few years has been to consolidate (costing is the driver), and I know that both Astraweb and Giganews each run a single plant (other than their European plants and, with GN, an Asian one).

The 'machines' are multiple, to an extent, for the load balancing you mention. But short of a close nuclear weapons detonation, they are pretty solid.
Elaborate a bit, I'm really interested in this. What's the server setup like?

Beck38
01-30-2010, 03:12 AM
This is a pretty good picture:

http://usenet-news.net/index1.php?url=home

although most colo (co-location) setups are basically chain-link fencing with the rack mounts within each 'separate' area. Lots of keys and high-security locks abound.

I've got several 'bankers boxes' full of pictures, from 'back in the day' when everything was 'film'. Just about the time things started changing to digital (2002-3) was when I retired from the 'rat race'.

Most everyone rents/leases space in a colo facility, and did even back then. Even folks as big as GN or Astra only have 2-3 rows of racks. Only the VERY big corporations run enough to have their 'own' facilities, like Microsoft.

The last big facility I was involved in designing/building was for QWest in south Seattle, about 15,000sqft. Everybody worth anything in town had a big chunk of space there, including MS, Boeing, Comcast, you name it.

But usenet suppliers? Small operations. Toss the numbers, one of those 19" racks in the usenet-news pictures can hold 300+ HD's, so that's 600TB right there. Just replicate that a few times, and the amount becomes mind-numbing.

Oh, a couple of added items, since a lot (uh, all?) of what the pictures show might be a bit unfamiliar.

The largest drive assembly I think I've seen ('tripped over' while skimming mr. internet) is 100+ drive boxes, 19" x maybe 48 RU's (rack units). That would be about 3' high or thereabouts, so one could fit 3 of them in a 'standard' 7-9' rack. The 600TB would be about 1/3rd of a year of the total usenet (5TB a day, 600/5 = 120 days), so it would take 3 racks to hold about a year (120x3 = 360). Six racks, two years. VERY small space needed.
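
A quick sanity check on those rack numbers, in Python (the drive size and box counts are the rough figures above, not any vendor's spec sheet):

drive_tb       = 2      # 2TB drives (figure from above)
drives_per_box = 100    # ~100-drive enclosure (figure from above)
boxes_per_rack = 3      # three enclosures per standard rack (figure from above)
daily_feed_tb  = 5      # TB added to usenet per day

rack_tb        = drive_tb * drives_per_box * boxes_per_rack  # 600 TB per rack
days_per_rack  = rack_tb / daily_feed_tb                     # 120 days of retention
racks_per_year = 365 / days_per_rack                         # ~3 racks per year

print(rack_tb, days_per_rack, racks_per_year)  # 600, 120.0, ~3.04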

Now, back 'in the day' when I was 'gainfully employed' and drives were at best, say, 500GB, it would take 4 times as much space. STILL pretty small. Heck, the backup battery plant would be larger, floor space wise!

Schrutastic
01-30-2010, 05:07 AM
Even though I'm pretty sure I've downloaded all I want (and more) in the last two years, it is nice to think that it will always be there, just in case.

As a mere user from a non-technical background, it's very interesting to read about the infrastructure.

Beck38
01-30-2010, 06:21 AM
The last month or so, I've been going back and 'reviewing' some of the first HiDef nzb's I collected right after I had built my HTPC. Then I found that the player s/w (both commercial and PD) had various limitations, and I kinda put things on hold (which is why the 'old' nzb's), until I finally got a Popcorn Hour STB.

Meanwhile, I've moved over to Astra, and in going back, have gotten many things upwards of 450 to almost 500 days back. I'm d/l'ing something now that's from 1 Nov 2008, and I've done other 'stuff' back to around Aug08.

Then it kinda hits a wall, 'file not found', etc.

I wonder how far back GN is, though, again these are binary groups (x.264 and the like). Would be interesting to find out. But I think we've hit a 'paradigm shift', maybe even bigger than most/many have realized.

ericab
01-30-2010, 09:19 AM
Can you tell us a bit about the hard drives used? Brand? How can these drives be accessed by so many people at the same time with no slowdowns?

Beck38
01-30-2010, 10:13 AM
All depends; I've been 'out of the loop' on pretty much everything for several years, but a 'real' industrial-strength array today would probably be built on SAS (Serial Attached SCSI), whereas back in my 'day' the best (and most expensive) were SCSI arrays, and there were (circa 2000) 2-3 different types of SCSI interfaces. 'Single-ended' had ruled the roost for quite a while, but 'double-ended' was coming into its own (better compatibility, higher speed, longer and more stable cable lengths). If I remember correctly (!), HP had a type that never really caught on.

Today, SAS is 'top-line', but SATA isn't any slouch. But things are moving fast (don't get run over), both in the SATA/SAS world and in the USB world. We might be having this conversation a year from now, and the 'talk of the town' may be USB3 killing off everything else.

The high-grade drives today aren't much, if any, different than what you'd put in a desktop PC. Seagate just started shipping their 'XT' line of drives, in the $3-400 price range, still 2TB (perpendicular), 7200rpm, 64MB cache; but performance-wise (like for databases and such) the SAS drives beat them hands down.

But we're talking mass, super-mass, storage here. We don't need 'fast' drives, just super-stability and super-capacity. The OS is all specialized UNIX, with super-nested RAID. Each array has several 'hot stand-by' drives, and the SAS cluster has redundant drivers/interfaces. Take a look at the top of the line Adaptec (literally hundreds of drives supported on the interface card).
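
I don't know exactly what layout the big plants actually run, but purely as an illustration, here's a minimal Python sketch assuming something like RAID 60 (striped RAID 6 groups) plus a couple of hot spares per enclosure; the group size, parity count and spare count are placeholder guesses, just to show how much of the raw capacity survives the redundancy:

def usable_tb(drives=100, drive_tb=2, group_size=12, parity_per_group=2, hot_spares=2):
    # Usable capacity of one enclosure under an ASSUMED RAID-60-style layout.
    # group_size, parity_per_group and hot_spares are illustrative guesses,
    # not anything the providers have published.
    data_drives = drives - hot_spares
    groups = data_drives // group_size
    return groups * (group_size - parity_per_group) * drive_tb

print(usable_tb())  # 8 groups x 10 data drives x 2TB = 160 TB usable out of 200 TB raw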

In fact, it's really enough to blow your mind. I was specifying/building/installing systems for companies 10 years ago that were in the $20-200K range, that today could be done for 1/10th of that with 100-1000 times as much storage, and CPU speed 15-20 times as fast.

Where will it end? I've read several articles in the past few months that postulate that 'Moore's Law' is running out of steam... but I've heard that more than a couple of times in the last 30+ years.

Really, the last vestige of mechanical computing IS the rotating hard drive.... but one can see that SSD's are making inroads, 'netbooks' are really taking off, and it's only a matter of time before 'boot' drives will be standard on SSD's.

But, what's the next 'step'? There are lots of 'tricks' in the lab to take perpendicular magnetic surfaces (and their read/write heads) up a number of notches, but when they'll make it to market is unknown.

Like I said previously, I think most of the manufacturers are taking a 'pause' right now, working on yields and cost containment. 2TB drives (SATA) that were $3-400 3 years ago when they were first introduced, are hitting as low as $130 ('green' or not-green) today.

On the performance, remember that 'you' are not 'directly' attached, but go through what is, for the server, a VERY SLOW connection. How fast is your line? 3Mb, 10Mb, 50Mb? The LANs in an industrial system now run at 1Gb, with the main lines at 10Gb.

So, say you've got 1000 customers leeching at 3Mb/s. That's 3Gb/s. Hardly breaking a sweat. Take a look at the specs on the SAS cards; 10,000 times that much is within their capability. I look at it from a telecom perspective (that's my field), but the amount of bits these systems can move around is way more than the communications links they attach to can carry, even up at the OC192 (10Gb/s) level (and multiples of those and faster). Shoveling bits is easy for computers; spewing them out across the planet is hard for the telecom infrastructure.

For 'fun', I've run the numbers on what would happen if, say, every internet user in the US had a FIOS 50Mb line. Things get REALLY FUNNY/INTERESTING, to me! Just think, you could (and those with it do so now, that's a thought!) transfer a Blu-Ray disc (50GB, 400Gb/400000Mb) in just over some 2 hours or thereabouts.
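
For anyone who wants to check the arithmetic, here's the quick version in Python (the 1000-customer count and the 50Mb FIOS line are the same hypotheticals as above, not real subscriber figures):

customers = 1000
line_mbps = 3                                   # per-customer line speed
aggregate_gbps = customers * line_mbps / 1000   # 3.0 Gb/s of total server-side load

bluray_gb = 50                                  # Blu-ray disc size in gigabytes
fios_mbps = 50                                  # hypothetical FIOS line speed
transfer_hours = bluray_gb * 8 * 1000 / fios_mbps / 3600  # bits over line rate

print(aggregate_gbps, transfer_hours)           # 3.0 Gb/s, ~2.2 hours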

Oh well. H*ll would break loose! It'll take 20 years though.... if we really put our minds to it.

ericab
01-30-2010, 09:24 PM
thanks for taking the time for that as well as all the info beck38 !

Beck38
01-31-2010, 12:34 AM
Most of the time I get pretty d*m* long-winded, but what the hey.

BTW, I sent off a 'complaint' to one of the indexing 'services' I use (and pay for, based at Giganews, BTW), and come to find out, they 'index' only back 300 days. What a crock! I'll see what they have to say when they read my return message, probably come Monday.

tesco
01-31-2010, 12:37 AM
Thanks, really useful information there.

I'm wondering, is anything stored compressed and then uncompressed on-the-fly when someone requests it? Maybe the older, less-downloaded articles, in the less popular groups?

Beck38
01-31-2010, 04:08 AM
I'm wondering, is anything stored compressed and then uncompressed on-the-fly when someone requests it? Maybe the older, less-downloaded articles, in the less popular groups?

Virtually all the binary groups, like anything video (mpeg2/4/x.264/etc), are HIGHLY compressed to begin with; you can do some tests of your own and try to rar (at max compression) such things, but you'll quickly find out that, at best, you'll get .001% compression or something equally pathetic. And it takes tons of time/CPU cycles to do so.

Which is why basically no one uses rar compression, but only 'store'.

Now, things like jpeg (a compressed picture format) can be a bit more effective (maybe 5% or so), and of course text can be.

But such material is a sliver of a percent of usenet. Simply not worth the effort.
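
If you want to see it for yourself without firing up rar, a quick sketch using Python's built-in zlib makes the same point (random bytes stand in here for already-compressed video; exact numbers will vary a bit):

import os
import zlib

video_like = os.urandom(10_000_000)   # stand-in for already-compressed video data
text_like = b"the quick brown fox jumps over the lazy dog\n" * 200_000

for name, data in (("video-like", video_like), ("text", text_like)):
    packed = zlib.compress(data, 9)
    saved = 100 * (1 - len(packed) / len(data))
    print(f"{name}: {len(data)} -> {len(packed)} bytes ({saved:.2f}% saved)")

# The 'video-like' data typically comes out slightly LARGER (negative savings),
# while the repetitive text compresses to a tiny fraction of its size.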

Cabalo
01-31-2010, 05:36 AM
What a great insight into how server farms work.
It was a joy to read. I've learned more in the minutes it took me to read this than in years of using usenet.

Beck38
01-31-2010, 04:13 PM
As I've gone 'way back', I'm unfortunately reminded of a period of time (late 2008) where a lot of 'newbies' (or perhaps 'oldies') forgot (or didn't know in the first place) basic usenet posting routines.

Anyone who's been around for any time remembers back when 20% PARS was the standard. Even the largest server plants (like Giganews) had retention levels of only a few months, and 'propagation' between servers was fragile at best.

Slowly, folks 'forgot'. The % of pars became less and less. Rar parts slowly became bloated (50MB was the 'standard'; I've seen postings where the parts exceed 10GB!). Added to that was many folks' access to 'hyper-speed' lines, where a 50GB BD disc could be posted in a couple of hours max (50Mb/s).

The servers really couldn't handle it. Many (if not all) of the parts contained 'skips', where a single RAR part had 5000+ sub-parts, and a good percentage of those simply weren't stored by the server they were posted to (even Giganews).

Then the number of PARs was dropped to truly insanely low levels. Like .001%, where it wasn't even enough to recover a single part. People thought all they had to do was 'throw' the data at the servers, never taking a look to see if it got there, or whether it propagated from that server to any others.
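
Rough numbers on why that's useless, in Python (the 50GB post and 50MB part size are the figures from above; par2 recovery sizing is simplified here to whole rar parts):

post_gb = 50
part_mb = 50
post_mb = post_gb * 1000
parts = post_mb / part_mb            # 1000 rar parts in the post

for par_pct in (20, 0.001):
    par_mb = post_mb * par_pct / 100
    recoverable = par_mb / part_mb   # how many whole parts the par set could rebuild
    print(f"{par_pct}% pars = {par_mb:g} MB, enough to rebuild ~{recoverable:g} parts")

# 20% pars = 10000 MB, ~200 parts; 0.001% pars = 0.5 MB, ~0.01 parts --
# not even a single 50MB part can be rebuilt.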

In short, the thinking was that everything worked perfectly, all the time, because a hand didn't come out of their screen, slap them upside the face, and yell 'wake up!'. :w00t:

The server staffs have upgraded their plants, and things seem to have calmed down a bit lately, even as more and more folks get access to 'super-speed' lines. The 'insanely low' levels of pars seem to have gone away quite a bit, or the folks/groups using that 'technique' have faded away (probably due to so many complaints at the time).

Things were 'creaking' a bit, hopefully we're entering a 'calm' period.

mesaman
01-31-2010, 09:30 PM
Astraweb is using duplicate Message-ID's which is pissing off Giganews.

Beck38
01-31-2010, 09:56 PM
Astraweb is using duplicate Message-ID's which is pissing off Giganews.

I seem to 'dimly' remember that. It's all a lack of software in their plant, I'd suspect. But s/w folks are, IMHO, generally allocated to the 10th level of the hot place.

My big bitch right now is the 'search' servers; it's not like they all (from newzbin to nzbmatrix to whoever) didn't KNOW a long time ago that retention would be.... 'forever' at some point. Yet most/all seem to hit a wall back xxx hundreds of days, far short of the actual retention of the servers they are 'indexing'.

Actually, it's kinda like IPv4 addresses. Hey, I was 'almost there' back when all this was set in stone (federal gov't worker) when DARPA came up with the 'internet' (late 70's, early 80's). I'm sure nobody thought we'd need addresses for refrigerators, for god's sake. But it was fairly foreseeable that if the system indeed 'took off', the amount was woefully short.

This was a hot topic as far back as the late 80's! I remember talking at length about it at meetings of a computer club in Tulsa, OK, with the SysAdmins of both Tulsa University and Oral Roberts. But I'm a telecom engineer; the handwriting was 'on the wall' in big fat letters well before then.

Message ID's for Usenet? The RFC for this is so old, it creaks. Giganews, I think, has been bitching about an expansion of this for a LONG time. They saw, again, the writing on the wall. Look, back in the early-mid 80's, when the fastest fiber networks across North America were running 140Mb/s (yes, MEGA bits!), 2000 voice channels, even then it was foreseeable that links faster than 40Gb (GIGA bits, 80 million voice channels) were within possibility, and with frequency (color) multiplexing, 10's to 100 times that, on a single fiber.

Done today. Across Oceans. Remember, AT&T/Bell Labs/Concordia or whatever they're calling themselves today, SAT on this technology from around 1960 through 1980, so when it got started in earnest (due to Supreme Court decisions like AT&T v. MCI and others), it was already 25 years behind. I think it's 'caught up' now....

So, the changeover from IPv4 to IPv6 is going to be.... interesting. I know that there are draft committees working on updates to RFC 1036 (usenet), but how much real progress they'll make is up for grabs. Progress (?!) will run over them, flatten them like a pancake. Road Kill.

Like the telcos. Talking around the globe was $$$'s per minute. Now, a couple bucks a month UNLIMITED from anywhere TO anywhere. Anyone feeding on that old paradigm is... Road Kill.

Beck38
02-02-2010, 05:04 AM
Tripped across this today:

http://www.slyck.com/forums/viewtopic.php?t=50333#p542147

I hadn't read that, but lots of press on the newzbin trial.

Two quotes:

"Conclusion: usenet is now a mode of permanent file hosting."

"Pretty much all the major providers have for the last 6-12 months been retaining everything and increasing retention 1:1."

And one person who doesn't realize that usenet traffic has been increasing by several megs per day over the last couple years, to the current 5TB/day+ (easily verified at several sites that count such things):

"Large retention has also been brought about by less people using this format"

Cabalo
02-03-2010, 06:53 PM
Bigger retention = bigger attention brought on this protocol.
Not good, not good.

Beck38
02-03-2010, 08:47 PM
Bigger retention = bigger attention brought on this protocol.
Not good, not good.

In the US, the law pertaining to newsgroups and news servers/providers has been in place literally for 30+ years.

Irrespective of 'new' laws like the DMCA, news providers/servers are protected by the same laws that protect the telcos from getting sued every time someone uses a telephone to phone in a bomb threat (or anything else for that matter).

Where the FCC/Government went 'off the tracks' a bit was in classifying the internet as a 'message service', not part of the 'common carrier' regime, some 20+ years ago; of course, at the time it appeared that doing so was 'correct', and the corporate entities of the time (AOL et al.) liked it because it 'solved' their problems quickly and easily. At the time.

The FCC, now reviewing that decision in light of 'network neutrality', has pretty much come to the conclusion that originally doing so was a BIG mistake; the question now is how they correct it and put the entire ISP/Internet infrastructure under common carrier status (like voice traffic).

This is all pretty esoteric to those without some fairly deep knowledge of telco regulations, but right now it's where the hammer meets the road.

But as far as the actual providers go, they're completely protected, barring some wacko Supreme Court decision overturning 100+ years of law (which we've already had this year from the 'activist' right-wing court, so....).

The part left 'dangling' is the part between those providers and the customers. The situation we have today is that the infrastructure (like the telcos) is completely protected (you can't sue or arrest them for 'carrying' that bomb threat), but the entity carrying it the last 10 yards can mangle, delay, halt, or simply lose the 'message', AT THEIR DISCRETION.

Pretty wild. Then again, I'd have never thought that the telcos would go along with wholesale warrantless wiretapping, or that the Government would (through the backdoor) give them bennies for doing so (Patriot Act). I knew people back during Nixon who stared down (armed) Federal Marshals to keep them out of phone plants; now they welcome them in with open arms to do what they please.

Anyway, things now are in flux, but as I said, the law (as it is today, as I'm typing this) totally protects the servers (Giganews, Astraweb, et al.). Everything else, not so much.