Current Version: 1.28, updated 6/19/04 [Change Log]
Calculating Efficiency, or "The Evolution of a Good .flux"

Skip to the interesting part (the example), or keep reading if you like math.

Hello all. I'm not sure if you should take this article on efficiency seriously... I'm not sure I take it seriously myself. It's a little lengthy, but there are lots of small words to keep you interested. This is important, however, because one thing that has become more and more noticeable as flux takes hold is the steady new flow of really poor flux. It takes someone who knows URLToys really well to properly gauge the "quality" of a flux, instead of judging it by its final content. This article will discuss a standardized way of explaining how efficient a flux file is, along with some examples.

In true programming, people tend to gauge the efficiency of an algorithm by its best case, worst case, and average case scenarios, and use what they call the "Big O" notation to describe it mathematically. For those that already know the Big O notation, this article will almost read as a parody, and is somewhat intended as such. For those that don't know it, either Google it (if you're curious), or don't worry about it. Since I'm already going to bastardize the hell out of the idea in the first place, knowing it really well won't buy you much.

Flux files can deal with changing content (such as pulling the first ten files off some updating site), but realistically they don't have a "best case" or a "worst case". They tend to be quite static. On the other hand, URLToys is so (annoyingly) flexible that there are many, many ways to flux the same site. That notion is the inspiration for this article. There's a handful of good ways to flux a site, one or two great ways, and an infinite number of poor ways. By creating a way to describe how 'good' a flux setup is, we basically create a way to brag about how much more efficient our flux file is than someone else's, thereby making us infinitely nerdy, but helping the flux community at the same time.

Instead of worrying about the Big-O number ( O() ), we are going to talk about the Big-E value. E represents the efficiency of a flux. Our goal here isn't to come up with a single number that stands on its own, but instead to be able to take two fluxes and compare them using all of the factors. Specifically, we are trying to pin down exactly what an efficient flux is, and a set of rules that decide where someone needs to reshape their flux. Here's some math to get us started:

E = aL - ( bH * cS )

E                = Overall efficiency of the flux
L (lines)        = [finished URL count   ] / ["addition line" count]
H (hits)         = [number of server hits] / [finished URL count   ]
S (server speed) = [some value representing a common speed comparison between servers]
a,b,c            = nonzero coefficients that help define a reasonable comparison value
You              = a nerd for reading this article, but not as nerdy as me for writing it
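As a rough illustration, the definitions above can be turned into a few lines of Python. This is only a sketch: the function name and parameter names are made up for this example, and the default of 1.0 for S and for the coefficients is an assumption, since the article deliberately leaves those values open.

```python
def efficiency(final_urls, addition_lines, hits, server_speed=1.0,
               a=1.0, b=1.0, c=1.0):
    """Return E = a*L - (b*H * c*S) for one flux file.

    server_speed (S) and the coefficients a, b, c default to 1.0.
    That default is an assumption of this sketch, not a value from URLToys.
    """
    if final_urls <= 0 or addition_lines <= 0:
        raise ValueError("need at least one final URL and one addition line")
    L = final_urls / addition_lines   # finished URLs per addition line
    H = hits / final_urls             # server hits per finished URL
    return a * L - (b * H) * (c * server_speed)


# One addition line, four final URLs, no server hits: E equals L.
print(efficiency(final_urls=4, addition_lines=1, hits=0))  # 4.0
```

With every coefficient left at 1, this is the reduced form E = L - HS, so two fluxes for the same site can be compared by plugging in their counts.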

In its reduced form (with a, b, and c all set to 1), the formula is simple (E=L-HS). The higher the final E value, the better the efficiency. The coefficients (a,b,c) are there for use if someone were to take this article seriously enough to turn flux files into an actual efficiency number. I'm using the math here as a stepping stone to discuss some of the finer points of good flux creation. Let's break it down.

The variable L represents a ratio of the final URL count to the "addition line" count. I consider addition lines in a flux file to be all of the commands in the (add,fusk,fusker,seq,zeq) category: the commands that add the seeding content to a flux file. The start. For example, if I had a single line in a flux file that looked like:

zeq http://www.example.com/04.jpg

My "addition line" count would be 1 (it's the only line in the flux, even), and my total URL count would be 4 (this generates 4 URLs). In this scenario, my L value would be 4/1, or 4. A list of "add" lines that each add a single URL would set L to 1, since the ratio of finished URLs to add lines would be 1/1. This brings me to Rule #1:

Rule #1: Use zeq, seq, and fusker whenever possible.

You should never have a list of numbered URLs in a flux file. We want our flux file to be nice and lean, and allow URLToys to fill in the blanks. Since L is the only positive part of the equation, you want this value to be as high as possible. This is why fusk, zeq, and seq are so good. They cost one "addition" line, and net at least two final URLs. The add command should be used solely for HTML pages that are about to be read. This brings us to our next variable, H.
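To make the "fill in the blanks" idea concrete, here is a small Python sketch that mimics the expansions shown in this article: a zeq-style line counting from 01 up to the last number in the URL, and a fusker-style line expanding a [start-end] range. The function names are invented for this example, and the behavior is inferred only from the examples in this article; it is not URLToys' actual implementation and ignores whatever other syntax the real commands accept.

```python
import re


def zeq_like(url):
    """Expand the last number in url into 1..N, zero-padded to its width.

    Assumption: the count starts at 1 and keeps the original digit width,
    which matches the 04.jpg example in the text (4 URLs, 01 through 04).
    """
    numbers = list(re.finditer(r"\d+", url))
    if not numbers:
        return [url]
    last = numbers[-1]
    width = len(last.group())
    prefix, suffix = url[:last.start()], url[last.end():]
    return [f"{prefix}{n:0{width}d}{suffix}"
            for n in range(1, int(last.group()) + 1)]


def fusker_like(pattern):
    """Expand the first [start-end] range in pattern, keeping zero padding."""
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if not m:
        return [pattern]
    width = len(m.group(1))
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    return [f"{prefix}{n:0{width}d}{suffix}"
            for n in range(int(m.group(1)), int(m.group(2)) + 1)]


urls = zeq_like("http://www.example.com/04.jpg")
print(len(urls))   # 4
print(urls[0])     # http://www.example.com/01.jpg
```

Either way, one addition line turns into N finished URLs without touching the server, which is exactly what pushes L up.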

H is short for "Hits", which is actually the total number of HTTP conversations URLToys needs to have to make the final list, divided by the final URL count. Seeing as this (multiplied by S) represents the negative part of the equation, your natural instinct is to completely remove all hits and just give a huge list of URLs to download. If URLToys doesn't need to make any hits, the only variable weighing on your efficiency is the value of L... the problem is that sometimes a single hit can bring in a few hundred distinct URLs that would otherwise cost 100 addition lines to achieve, giving you a negligible H value and an enormous L value. But what if that single hit takes a minute to complete? This is where S comes in.

S is server speed, which is vague sounding on purpose. There are many factors that weigh into S... client bandwidth, server bandwidth, network utilization, server load, etc. This is the variable that might sway the value of E the most. You might have come up with a brilliant flux that generates thousands of links in 10 hits or so, but some person on the other side of the world hates it because of the time it takes to complete those 10 hits. The relaxing part of this is that the files you want tend to be on the same server as the HTML file, so if someone is agonizing about 10 hits, they won't want to get the files in the first place. These variables bring us to rule number two:

Rule #2: Keep server hits to a minimum by using make, href and img sparingly, but not at the cost of wasting flux size.

Since flux is just text, it's helpful for others to learn from your flux files as well. If you just generate a huge list of "add" lines, not only is it horribly inefficient (L will be really close to 1.0, which is lame), but you won't be accidentally educating the many others that might decide to read your flux and understand the magic behind it.

When you look back at the original formula, you'll see that you want L to be ridiculously large, and H and/or S to either be zero or really small. A list of zeq and fusker strings in a flux file is about as efficient as you can get, unless you can cut it down to even fewer zeq and fusker strings, followed by a single "make" command that might only cost a handful of hits. Most of this type of thing comes with understanding the way the page you are trying to flux holds its links.

Rule #3: Try to recognize common practices on how sites are laid out and adjust your method accordingly, instead of attempting the same few commands on every page.

Let's try to understand this all by example. Here's a great web site, full of magical pictures:

http://urltoys.com/arles/

Ok, so the pictures aren't very magical, but they help illustrate a point. There's a million ways to get these pictures from this gallery, but many are stupid. Let's start with the most stupid.

add http://urltoys.com/arles/
make
keep \.jpg

You run these commands, only to find out that you get zero JPEG files. "What gives?!", you say. "I thought 'make' did all of the work for me!", you say. Look over the index again. If you wave over the links and look at the status bar of your browser, you'll see that all of the links on this first page take you either to another index page (index2.html), or to an image page (image1.html). It seems that image1.html contains the actual meat and potatoes that we are looking for.

"NO PROBLEM!", you shriek in delight, armed with this new knowledge of the page's links. You attempt:

add http://urltoys.com/arles/
make
nodupes
keep imagepages
img
keep \.jpg
nsort

Sure, it looks good at first glance. You're proud of yourself for removing the duplicates before running the "img" command. You even remembered to sort the JPEG files you found at the end. The problem with this approach is that you only got the images from the first page, and it required 16 hits on the server (one for the index, plus one for each of the 15 image pages) before you had the list! 16 hits to net a list of 15 images is horrible efficiency ... you end up with an incredibly high H value, and depending on the server speed (S), you might have killed the small advantage you had by only having 1 "addition line".
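The hit count for this kind of crawl is simple arithmetic, sketched here in Python. The function names are made up for the example, and the model assumes every image lives on its own image page, as it does in this gallery.

```python
def crawl_hits(index_pages, image_pages):
    """Hits for a make + img crawl: one per index page, one per image page."""
    return index_pages + image_pages


def hits_ratio(index_pages, images):
    """H for the crawl, assuming exactly one image per image page."""
    return crawl_hits(index_pages, images) / images


print(crawl_hits(1, 15))   # 16
print(hits_ratio(1, 15))   # a little over 1 hit per finished URL
```

The takeaway is that a make + img crawl can never get H below 1, no matter how many index pages you feed it.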

To correct the first oversight of not getting every pic, you look over the gallery and realize that there are 3 indexes, so you do this:

add http://urltoys.com/arles/
add http://urltoys.com/arles/index2.html
add http://urltoys.com/arles/index3.html
make
nodupes
keep imagepages
img
keep \.jpg
nsort

... you finally have all of the images in your list. However, you had to complete 46 hits (3 index pages plus 43 image pages) to get 43 final images, plus your "addition lines" tripled (sure, it's dramatic, but I am illustrating a point). If you had followed Rule #3 and just looked at the site a little more, you'd notice that the images you are trying to get all sit in the exact same place, and the only difference is that the last number increments from 01 to 43. So you try this instead:

zeq http://urltoys.com/arles/images/picture43.jpg

What a huge efficiency difference! A single addition line and zero hits nets you the exact same list that required 46 hits and 3 lines earlier. The difference? You gave yourself a minute to look for the pattern of numbers, and drew on your knowledge of the zeq, seq, and fusker commands. You could have just as easily done this:

fusker http://urltoys.com/arles/images/picture[01-43].jpg

Of course, that's a few bytes longer than the first way, but that might be crossing the line a bit. Not that writing an article like this isn't. Here's a doozy for you ... what if the file names were random? Check out this site:

http://urltoys.com/arlesrandom/

This one is a little tougher, since we don't have the simplicity of the numbering scheme to help us out. We could always mimic what we did a few steps ago...

add http://urltoys.com/arlesrandom/
add http://urltoys.com/arlesrandom/index2.html
add http://urltoys.com/arlesrandom/index3.html
add http://urltoys.com/arlesrandom/index4.html
make
nodupes
keep imagepages
img
keep \.jpg
nsort

It works, as expected. But it took 54 hits (4 index pages plus 50 image pages) to get there... ugly. We'd love to lean on the power of zeq and fusk to get us out of this one, but there's no real help, not a number in a URL in sight. This is a little tricky, but since the article is emphasizing efficiency and not simplicity, it's appropriate. Click on the first image in that set. Notice anything funny? There's a pulldown menu there ... and guess what? The names of EVERY PICTURE are in the list! Wow, if only there was a command that let us get URLs from any portion of text we please...

add http://urltoys.com/arlesrandom/imagepages/image1.html
make >\d+\s+(\S+)<\/option>
replace imagepages images
nodupes
nsort

One addition line, one hit, 50 images? It sounds too good to be true. This is only one step worse than the zeq line we had earlier, and this would handle any Arles gallery with a pulldown menu. How does it work? If you look at that scary looking make line, you'll see portions of HTML in it, including the OPTION tag. OPTION tags are what make up pulldown menus, and make uses a "regular expression" to tear the names out of them. After that make line, you have a list of URLs to images, but since make doesn't know better (since it's not taking them out of standard A tags), it thinks that the images are in the "/imagepages" directory. The replace command fixes that one up instantly. This brings us to Rule #4:

Rule #4: Know your URLToys commands.
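If the make line with the regular expression still looks like magic, here is roughly the same idea in Python. Treat it as a sketch: the HTML snippet and file names are invented stand-ins for the real pulldown menu, the function name is made up, and a plain sort stands in for nsort. It is not how URLToys implements make, just the same regex applied by hand.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://urltoys.com/arlesrandom/imagepages/image1.html"

# Hypothetical pulldown markup; the real page's option text may differ.
SAMPLE_HTML = """
<select>
<option value="image1.html">1 sunset.jpg</option>
<option value="image2.html">2 harbor.jpg</option>
<option value="image3.html">3 cafe.jpg</option>
</select>
"""

# Same pattern as the make line: a number, whitespace, then the file name.
OPTION_NAME = re.compile(r">\d+\s+(\S+)</option>")


def extract_image_urls(page_url, html):
    """Mimic make <regex>, replace imagepages images, nodupes, then a sort."""
    names = OPTION_NAME.findall(html)
    # Resolve each name relative to the page, so it lands in /imagepages/.
    urls = [urljoin(page_url, name) for name in names]
    # The images actually live in /images/, so swap the directory name.
    urls = [u.replace("imagepages", "images") for u in urls]
    return sorted(set(urls))


for url in extract_image_urls(PAGE_URL, SAMPLE_HTML):
    print(url)
```

The part in parentheses in the pattern is the only piece that gets kept, which is why one fetched page is enough to name every picture in the pulldown.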

There's nothing that helps a flux file achieve efficiency more than a well written regex with make, or a well placed replace or strip command. Be creative! Keep your L's large, and your H's small. We don't have a lot of leverage with S, but no one wants flux files that point to bad servers in the first place. And don't forget:

E = L - ( H * S )

Ok, so the math was a bit corny... shoot me.
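For anyone who wants to see the corny math applied, the sketch below plugs the counts from the examples above into the reduced formula. It assumes S = 1 and a = b = c = 1, which the article never fixes, so only the ordering within each gallery means anything; the two galleries have different image counts and should not be compared against each other.

```python
def efficiency(final_urls, addition_lines, hits, server_speed=1.0):
    """Reduced form E = L - (H * S); S = 1.0 is an assumption of this sketch."""
    L = final_urls / addition_lines
    H = hits / final_urls
    return L - H * server_speed


# (description, final URLs, addition lines, server hits), from the examples.
approaches = [
    ("arles: first index, make + img", 15, 1, 16),
    ("arles: three indexes, make + img", 43, 3, 46),
    ("arles: single zeq line", 43, 1, 0),
    ("arlesrandom: four indexes, make + img", 50, 4, 54),
    ("arlesrandom: make with regex", 50, 1, 1),
]

results = {name: efficiency(urls, lines, hits)
           for name, urls, lines, hits in approaches}

for name, value in results.items():
    print(f"{name:40s} E = {value:.2f}")
```

In both galleries the one-line flux comes out far ahead of the crawl, which is the whole point of Rules #1 through #4.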

Written by Joe Drago, Copyright (C) 2004, Under the BSD License