RRR Sorority Division presents

PIMPPA v0.6.0 / 17.oct.2006
http://pimppa.sourceforge.net/


1. Legal junk
-------------

See file COPYING


2. What is this anyway
----------------------

Imagine the unfortunate situation where much more stuff (and even more
crap, spam, etc) comes from the Net than you can be bothered to handle
manually. Shall we say, several hundred files daily, totalling hundreds
of thousands of files in the long run. A direct net connection and
access to an adequate news server easily provide such conditions.

PIMPPA is designed to help in such a case, to minimize the required
human interaction, striving for complete "hands free" operation, where
you could once a week (for example) drop by and check out the already
filtered and polished catch. The system is suited for all types of
files, but it was first designed with pictures in mind. :)

PIMPPA is able to automatically fetch and process files from newsgroups
and "web image gallery indexes", while performing the necessary
decoding, duplicate discarding and further file processing tasks, like
simple classifying: files downloaded by pimppa tools can be
automatically deposited to appropriate directories based on their
filenames.

PIMPPA also provides nifty command line interfaces for file access and
backup operations. In addition, there's a Gnome GUI.


3. Requirements
---------------

See file SETUP.

From the file you see that PIMPPA depends on a lot of external stuff.
That's because there is no reason to reinvent the wheel. I think it is
better to use the best applications available for specific tasks
(read: I'm lazy). When the external programs get better, PIMPPA gets
better. If completely new programs surface (performing similar tasks),
they could easily be integrated into the system.


4. Install & Upgrade
--------------------

See file SETUP.

"Third floor, second door on the left". Isn't bureaucracy such a nice
thing? In this case the standard GNU file INSTALL is not of much use.
:(

4.1 Uninstall
-------------

There's a script "uninstall_db" in "extras/" to destroy all pimppa
related stuff from MySQL databases. Use that, and then remove the
pimppa installation, e.g. use "make uninstall" in the main directory of
the source distribution. Distributions other than the source might
require distribution specific uninstallation.


5. How to use
-------------

Here are some examples and hints.

5.1 Getting started on newsgroup sucking
----------------------------------------

(If you dislike hot command line action, you might want to use the GUI
"bowser" instead of following these instructions. You'll still have to
do things in the same order: Settings -> Preferences -> Add some
fileareas, add/modify news server, add some newsgroups. After that you
can probably leech.)

To leech stuff from the news, you should first tell pimppa the hostname
of your primary newsserver.

shell> pnewsrv -s <hostname> -i 1

Then we need to set up some destination (incoming!) fileareas for the
downloaded files to be added to. Use the command:

shell> pnewarea -i -n <name> -p <path>

As you now have a couple of fileareas, assign some newsgroups to them.
Any number of newsgroups may be assigned to a single area.

shell> pnewgrp -n <newsgroup> -d <areaname>

Ok, after this tedious configuration phase we can finally leech.

shell> pleech -v

If the leech succeeds, you should have several files decoded and added
to the destination fileareas. Most of them will likely be crap, but
let's pretend that there's some famous Goatlord-stuff (pictures) called
"blah*.jpg" there you wish to keep. There's not yet any location for
Goatlord's material, so you would want to create a new filearea for
it, a "non-incoming" area which is meant only for quality stuff.

shell> pnewarea -n goatlord -p /stuff/goatlord/

Now you have a new filearea called "goatlord". Enter the directory
where you got the "blah*" files and execute "pmv blah* goatlord".
Files will be moved, and in the future all files fetched by "pleech"
matching a special assign pattern (see 9.2) created for "blah*" will be
automatically moved to the "goatlord" filearea, and not to the
newsgroup's default destination.

If the rest of the files you got were junk, you can delete them (with
"prm", or to be really sure that you'll never see them again in the
future, pmv them to 0DOOM). See chapter 7 for details.

Now it might be a good idea to make "crond" execute "pleech -q"
nightly. :) In that case, remember to set the crontab variable HOME
to point to a directory containing a .my.cnf that gives you access to
the MySQL database.

5.2 Downloading through web gallery index sites
-----------------------------------------------

Suppose there's www.smutpeddler.org that daily lists links to image
galleries. The links must point directly to image galleries, so that
image thumbnails are seen on the page the link points to.

"pleechwww" can crawl through such index sites, given a regex keyword
to match. For example,

shell> pleechwww -u http://www.smutpeddler.org/ -d 1 -k ".*cute.*"

would crawl all links with the word "cute" in them. The jpeg files
found would be inserted into filearea number 1, with the usual pimppa
duplicate checks etc applying. The files will also be tagged with
keywords: all the words that were found in the link or its
description.

5.3 Files from other sources
----------------------------

You can just normally "mv" the files to some directory assigned as a
PIMPPA filearea. Then use "padopt" to add them to the PIMPPA database.
After that, those files are treated just like any that were sucked
from the news or www. Note that adoption will do some duplicate
checking and may discard files.

But suppose you download a lot of unsorted crap to a single directory,
which is not a PIMPPA filearea. Then you can use

shell> padopt -m <area_id>

where <area_id> is the default destination area for these files.
WARNING: all unsuitable files in the current directory will be deleted
(dupes, invalidly named, etc) and the rest are moved to the area
already known for them, or to the default.

5.4 Viewing & backing up files
------------------------------

See chapter "Scripts" for quick descriptions of some viewing
interfaces. When your harddisk starts to burst with all the stuff you
have downloaded, it might be time to store it elsewhere. See chapter
"Utilities" for the util called "pbackup".

5.5 Manual duplicate checking
-----------------------------

While surfing the web, you encounter some tasty files, named
"blood_*.jpg". Unfortunately, you can't remember if you already have
them. (It might be hard sometimes). Easy, the util "pf" is just for
that purpose. Use

shell> pf -p blood%

and you get the result rather quickly, even if you have over 100000
files on your PIMPPA system. If the files are online, you could view
them just as speedily by using "pv_name", e.g.

shell> pv_name blood%

5.6 Changing preferences
------------------------

PIMPPA stores most of its preferences in table p_misc, reminiscent of
the infamous registry on windoze. Utilities look up (key,value) pairs
in the database and use them if found. Otherwise compile time defaults
are used. Commandline values override both.

You can set preferences using "pcfg -k <key> -d <value>" and view them
with just "pcfg". You can also use the GUI, see the next chapter.

Keys suitable for modification are:

CFG_BOWSERSINCE
    Value: Number. How many days back bowser displays files on
    startup.
    Default: 1

CFG_FILETYPES
    Value: String. Extended RegExp pattern of allowed filetype
    extensions. When downloading newsgroup articles, we try to match
    any of these extensions in the Subject: line, and find out the
    filename that way.
    Example: \.jpg|\.png|\.gif|\.mpg|\.avi|\.asf|\.rar|\.r[0-9][0-9]|\.s[0-9][0-9]

CFG_MINASSIGNNAMELEN
    Value: Number. Filename must have this many characters before the
    extension for the assign pattern (see 9.2) to be created/noted at
    all.
    Short patterns tend to go wrong often.
    Default: 4 (means: don't create assign pattern for e.g. file
    a10.jpg)

CFG_NEEDNUMBERS
    Value: Number. Require this many numbers (digits) in all
    filenames, before the extension. Files having fewer will be
    skipped. (example: blah123.r01 == 3 numbers)
    Default: 0. My suggestion: 2

CFG_NEEDOTHERS
    Value: Number. Require this many non-number characters in all
    filenames, before the extension. Files having fewer will be
    skipped. (example: blah123.r01 == 4 others)
    Default: 0. My suggestion: 2

CFG_NEWSWAITAFTER
    Value: Number. After leeching this many articles, wait
    CFG_NEWSWAITSECS seconds (to allow the server to catch its
    breath).
    Default: 10

CFG_NEWSWAITSECS
    Value: Number. After leeching CFG_NEWSWAITAFTER messages, wait
    this many seconds.
    Default: 2

CFG_NOSPACE
    Value: Boolean. Do NOT accept any filenames containing spaces.
    1 == TRUE, 0 == FALSE.
    Default: 0. My suggestion: 1.

CFG_STRICTMD5
    Value: Boolean. In case of MD5 collision, delete the newer file.
    1 == TRUE, 0 == FALSE.
    Default: 1.

CFG_TMPDIR
    Value: String. The directory where pimppa keeps its temporary
    files, like uuencoded material downloaded from the news. This
    directory should usually have at least several hundred megabytes
    of free space.
    Default: "/tmp/"

CFG_TOLOWERCASE
    Value: Boolean. Convert incoming filenames to lowercase.
    1 == TRUE, 0 == FALSE.
    Default: 1

CFG_VIEWER
    Value: String. Use this viewer executable/script to view files in
    bowser and from scripts. It must view all files in the current
    directory. To use e.g. feh, set it to
    "feh -F -Z -d -S filename -r ."
    Default: gqview

5.7 Advanced: Examples for a collector
--------------------------------------

In the wide world there might be just a few series of files you want,
but you can't be bothered to personally seek them out. There are two
styles of REGEXP rules that can be used for filtering (they live in
the p_rules table and can be defined with "bowser" from the
Preferences menu).
One type defines patterns that must be matched in filenames in order
to accept/reject the files, and the other type defines patterns to be
matched in message subjects, for similar purposes. For usage
instructions, see section 9.1.

5.8 Advanced: Using suckkillfile
--------------------------------

You can use a suckkillfile with the pimppa news leech. By default,
pimppa will try to read the file "/usr/local/share/pimppa/suckkillfile".
This can be changed with bowser or pcfg (CFG_KILLFILE).

When leeching article headers, suck will download only those headers
which pass the killfile. Others are discarded, and pimppa won't ever
see them. To put it another way, the suckkillfile puts a lot of
potential spam-prevention power at your disposal. See "man suck" for
details of how to use the killfile.


6. GUI
------

PIMPPA has a Graphical User Interface called "bowser", which lets you
download newsgroups and perform various operations on your fileareas,
files, assign patterns, file types, newsgroups and other settings.

The GUI usage should be self-explanatory. Most operations (like
View/Delete/Move) are accessed by pressing the right mouse button on a
list window.

NOTES:

1) The GUI does not support WWW leeching or keywords at the moment.
2) The GUI uses "gqview" to view files. This can be changed from
   Settings/Preferences/Miscellaneous.
3) Warning: "View", "Delete" and "Kill" operations automatically affect
   all selected items! They do not ask for confirmation.
4) When downloading newsgroups from multiple servers, a selected group
   is always leeched from all servers it's defined for.

Most of the really powerful operations of PIMPPA are done using plain
shell utilities and scripts - without fancy user interfaces. :) So
after you get bored with the GUI (it's a bad hack), take a closer look
at the commandline stuff.

6.1. Setting preferences from GUI
----------------------------------

All fields in prefs reflect their respective SQL tables and table
contents.
With a little thinking and reading this file you should be able to get
'em right. Remember that you have to restart the GUI for some of the
"Miscellaneous" preferences to take effect.

NOTES:

1) When adding fileareas with the GUI, remember to include "/" at the
   end of the directory path.
2) Remember to set all areas which newsgroups are directed to as
   INCOMING.


7. Utilities
------------

There's plenty of these little bastards. The most important are
probably pleech, pleechwww, pmv, prm and padopt. Some of these
utilities have been integrated into the GUI, but unfortunately with no
ability to adjust their parameters.

padddir (obsolete at the moment, see Note)
-------

This adds the contents of the current directory to filearea 0. It can
be used to prevent the stuff there from bothering you ever again (the
files will be skipped if re-encountered in the future).

Usage: padddir [-v]

    -v    Verbose execution

Note: Perhaps a better way to do this would be to move the files to
some 'dump' filearea, add them there with 'padopt' and then kill them
with 'prm'. That would produce a couple of assign patterns instead of
keeping all the filenames in the database. The result is more general:
all similarly named files will be deleted in the future. You can also
move unwanted files to 0DOOM. This keeps the MD5 checksums of the files
as well (preventing renamed versions of the same files from surfacing
again).

padopt
------

This goes through all the PIMPPA fileareas and adds the files that are
found in the dirs, but not in the database, to the database.

Usage: padopt [-v]

    -a    SQL area name pattern to operate on, def=all
    -m    Move files from current directory (unless dest. for file is
          known, move to <area_id>)
    -v    Verbose execution

The -m switch moves files matching assign patterns from the current
directory (which shouldn't be any filearea!) to their fileareas, and
the rest to area_id.

WARNING: Unsuitable files and duplicates are deleted!

Note: padopt updates assign patterns.
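
For the curious, the plain adoption pass (without -m) boils down to
comparing what is on disk with what the database already knows. Below
is a minimal Python sketch of that idea only, not pimppa's actual code:
the function name find_unadopted is made up for illustration, and a
plain Python set stands in for the filenames recorded in the database.

```python
import os
import tempfile


def find_unadopted(area_dir, known_names):
    """Return the regular files in area_dir that are not in known_names.

    known_names is a hypothetical stand-in for the filenames the
    database already lists for this filearea.
    """
    return sorted(
        name
        for name in os.listdir(area_dir)
        if os.path.isfile(os.path.join(area_dir, name))
        and name not in known_names
    )


# Tiny demonstration with a throwaway directory acting as a filearea.
with tempfile.TemporaryDirectory() as area:
    for name in ("goat001.jpg", "goat002.jpg"):
        open(os.path.join(area, name), "w").close()
    missing = find_unadopted(area, {"goat001.jpg"})

print(missing)
```

The real utility additionally runs its duplicate checks on the files it
finds and updates assign patterns, which this sketch leaves out.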

passign
-------

This browses through all your files and automatically recreates the
"assign patterns" for them. See section 9.2 for explanation.

Usage: passign [-c]

    -c    Delete all previous patterns. Does not affect patterns with
          destinations 0 or -1.
    -f    Show current assign for <filename>
    -r    Refresh (recreate) all assign patterns
    -0    Delete all assign patterns with target area 0.
    -n    Delete all assign patterns with target area -1.
    -t    With -f, set a new target area id instead

Note: Assign patterns are not created or used for contents of
AREA_INCOMING or AREA_NOASSIGN areas.

pbackup
-------

The backup util. Files which have "file_backup==0" in the database [or
the value specified with "-r" on the commandline] will be included in
the backup. Backed up files will be given a new file_backup ID [or the
one specified with "-r"] ...

Files will be included until the media size is reached. Media size can
be specified on the commandline (in Megabytes). In normal operation the
next backup will start from the area where the last backup operation
ended.

A shadow directory of the backed up files is created in a specified
location ("-p path") along with a filelist. Default is "/tmp/test/".
You can then create an ISO-image out of the created directory
structure, stream it to tape or whatever.

e.g. after pbackup

shell> cd /tmp/test
shell> mkisofs -J -a -d -D -L -R -r -f -o /tmp/cd.img -V JUNK03 .
shell> cdrecord -data /tmp/cd.img

And to get rid of the backed up files from HD, use

shell> pclean -i <backup_id>

Note1: Without "-n", this utility tries to fill the backup media to
the brim. In any case, it can't know which files should go together.
For example, disk archives belonging to the same product may end up on
different backups.

Note2: After using this utility, it's suggested to do at least "du -L"
in the destination directory to see that everything went ok.

Note3: Contents of INCOMING areas will not be backed up.

Usage: pbackup [options]

    -n    Be nice, stop at first file not fitting on media
    -p    Path to make the shadow directory into
    -r    Specify an explicit backup ID number
    -R    Randomize area order
    -c    Skip this file area, can be entered many times
    -s    Media size in megabytes (1MB==1024*1024 bytes)

pcfg
----

A simple command line tool to configure pimppa, that is, to modify the
p_misc table. You can also use "bowser" or "mysql" directly.

Usage: pcfg [options]

    -k    MUST, the configuration key
    -d    MUST, the value associated with this key

Without options it prints the current setup. See 5.6 for known keys.

pchkfn
------

Checks if a filename can be accepted into the system (including rules,
duplicate & assign pattern check).

Usage: pchkfn [options] <filename>

    -a    An INCOMING area whose destination regexp patterns are used
    -s    Operate as kill program for suck
    -n    Nazi mode, returns nonzero if no positive destination area
          for the file is already known

NOTE: -s is obsolete. PIMPPA won't call "pchkfn -s" anymore, though
you could use it if you run suck by hand with your own killfile, with
PROGRAM=pchkfn -s.

Returns 0 if the filename is ok.

pclean
------

Performs various delete operations. Use with caution!

Usage: pclean [options]

    -a    Purge duplicates from <area>
    -d    Delete files found in db from the current dir
    -f    Delete files which failed integrity check
    -i    Delete files associated with backup id
    -n    Delete files from <area> which have no numbers in filenames.
    -o    Delete all offline files without backup id
    -p    'Purify' current dir
    -s    Simulate only, don't delete anything
    -t    Remove trash (files with file_area=-1)
    -z    Zuper slaughter, immense
    -v    Verbose execution

-a : Deletes those files from <area> which exist on other areas.
     Works on filename basis only.
-d : Deletes ALL files from the current dir which are found in the
     database. Don't use this on an actual pimppa filearea dir.
-f : Physically deletes all files which have failed the integrity
     check.
     Files will be marked as FLAG_OFFLINE and replaced later if the
     file is re-encountered e.g. by "pleech".
-i : To be used after a successful backup operation. It deletes all
     files associated with a given backup_id, and gives them
     FLAG_OFFLINE status in the database.
-n : Deletes files from <area> which have no numbers in filenames.
-p : Checks the current directory (which shouldn't be a filearea!)
     against your current file validity rules and assign patterns.
     Files which have negative assign patterns or violate the rules
     (e.g. NEEDNUMBER, see "src/pimppa.h") will be deleted.
-s : Simulate execution only, don't change anything. Most useful with
     '-v'.
-t : Deletes file-entries from the database having file_area=-1. Such
     files can occur if you move files to 0DOOM instead of deleting
     them.
-z : Deletes offline files from areas marked AREA_INCOMING, creates
     negative assign patterns for ALL of those files, and deletes the
     files from the database.

pdesc
-----

Used to give text descriptions to files.

Usage: pdesc <files> <"Desc">

    <files>     The files you wish to describe in current dir
    <"Desc">    The description you wish to give them

e.g.

shell> pdesc goat*.jpg "Happy Goats"

pf
--

"Pimppa Find". Searches for files in the database. By default it looks
for filename-based patterns. Requires SQL style wildcards (use '%' for
'*' and '_' for '?').

Usage: pf [-dpv] <pattern>

    <pattern>   Filename pattern
    -d          Try to match file descriptions instead
    -p          Print pathnames as well
    -v          Verbose, print file descriptions

The printing format is '[path/][filename] [backup_id if nonzero]'.

pkw
---

Appends more keywords to files on the filearea matching the current
dir.

Usage: pkw [FILES]

    -k key    The keyword to tag with (can be used multiple times)
    -v        Verbose execution

Currently there is no easy way to remove keyword tags. Removing the
file with "prm" will remove the keywords for that file from the
database.

prm
-----

Deletes files from the filearea matching the current dir. Both the
physical files and the records in the p_files table will be deleted.
A new assign pattern will be created for each file, having -1 as the
destination area, so that all similar files will be killed when
encountered in the future.

Usage: prm [-v] <files>

    -n    Nice kill. Don't create negative assign patterns or delete
          the file entry from the database, just mark the file as
          offline and delete the actual file.
    -v    Verbose execution

Use prm with caution. :---)

Note: If you wish to keep the MD5 checksums of the files, preventing
renamed copies of the same files from entering your system again, you
should use "prm -n" or move the files to area 0DOOM, e.g.
"pmv <files> 0DOOM".

pleech
------

This is the command line downloading tool. It fetches files from the
newsgroups or ftp sites you have defined. In case of news, promising
articles are downloaded, then decoded, duplicates are re-checked, and
suitable files are assigned and added to their appropriate fileareas
if they match an assign pattern. Otherwise they will be entered into
the default destination area. FTP-downloading is similar.

By default, pleech tries all the newsgroups you have defined, on all
the servers you've defined them on.

Usage: pleech [options]

    -a    Wait After n messages, def=10
    -f    Read (ftp-url, target area name) pairs from file, leech the
          url recursively
    -g    Newsgroup name SQL pattern to match, def=all
    -H    NNTP server name SQL pattern, def=all
    -i    Insert msg subjects into database as file descs
    -k    Keyword hunt. Accept only messages whose subject line
          matches a keyword pattern in p_rules.
    -l    Lenient article selection while leeching news (leech
          everything, select files after decoding)
    -n    Nazi behaviour. Try to skip all dl'ed files which don't
          match an already defined assign pattern.
    -q    Quiet operation (don't print newsleech BPS etc)
    -r    Restart newsgroup(s)
    -s    Sloppy download mode (take & accept everything)
    -v    Verbose output
    -w    Wait secs after [-a n] messages, def=2

pleech depends on "suck" and "uudeview" for news, and "fget" for ftp.
They must be on the command execution path.
NOTE: fget *NEEDS* to be patched (patch in "patches/" dir)! [The ftp
support is most likely deprecated anyway.]

Newsleech example:

shell> pleech -v -g %supreme.moose%

Verbosely downloads all active newsgroups containing the string
"supreme.moose" in the group name.

FTP use: With -f, the urlfile is a text file containing one
"URL AREA_NAME" pair per line, e.g.

ftp://ftp.gigaturd.org/pub/pics incoming
ftp://user:pass@grindcore.com/core incoming2

pleechwww
---------

A tool to crawl through web image gallery index sites and fetch the
images.

Usage: ./pleechwww

    -a types          comma separated list of filetypes accepted
                      (default: "jpg,jpeg")
    -u url            URL to fetch (required)
    -k regex          regex to match in link text (required)
    -d destarea_id    area id to move the files to (required)
    -v                be verbose (default=off)
    -V                print version info

pleechwww works by taking as input an url, keywords to match (as
regexp) and a PIMPPA destination area number. The page the url points
to is assumed to contain links to image galleries, and the link
description must match the regex keyword if it's to be leeched.

All the words in the link desc ('bob sent images of these lovely
bananas') will be added as keywords for the received files in the
database (i.e. you can do e.g. 'pv_key bob' to see the files).

Browsed links will be marked in the database and won't be rebrowsed on
further runs of pleechwww.

For WWW leeching, the standard pimppa md5sum checks etc apply. The
fetched files will be renamed according to the link description (since
the web-gallery images are usually named just "01.jpg" etc ...).

WWW leech requires wget to be installed.

Note: to download all links, regex ".*" may be used.

pmarkoff
--------

Goes through your fileareas and marks offline files (not available in
their directories). Such files are skipped by most operations, like
viewing and backing up.

pmd5sum
-------

This is used to create/print/lookup/scan MD5 checksums of the files in
the system.
Some of the operations below affect only online files.

Usage: pmd5sum

    -b    Create md5sums for backup id
    -c    Create md5sums for online files
    -l    Look up files matching a given hex MD5 sum
    -p    Print sums of files matching an SQL filename pattern
    -s    Scan whole database for collisions
    -v    Verbose operation
    -w    Wipe DB from MD5-colliding files (in the case of collision,
          the oldest file is kept).

The usage is self-explanatory except for -b. To use it, you must
either move the files back online, or (the easiest way) if you have
backed up the stuff using pbackup and all your fileareas reside under
"/stuff", do something like "mv stuff old ; ln -s /cdrom /stuff", then
do "pmd5sum -b <backup_id>", and after that replace the softlink
"/stuff" with the original directory.

pmv
---

PIMPPA mv. Moves files from the filearea matching the current
directory to some destination area. The matching area is found by
using getcwd() and the p_areas table.

Usage: pmv <files> <areaname>

Example:

shell> pmv fish*.jpg aquarium

would move fish*.jpg from the current filearea to a filearea called
'aquarium'. The actual destination directory location is fetched from
the database.

NOTE: Moving files around with just "mv" will mess up the database,
e.g. pimppa won't know the files have changed their location.

pnewarea
--------

Adds a new filearea to the database. Note that you must create the
actual directory yourself with "mkdir".

Usage: pnewarea -n <name> -p <path> [options]

    -a    Mark area as AREA_NOASSIGN
    -c    Context number for this area, default: 0==global
    -i    Mark area as AREA_INCOMING
    -I    Suggest id number for the new area
    -n    Area name, MUST
    -t    Mark area as AREA_NOTRANS
    -T    RegExp dupecheck pattern, default $
    -p    Area directory path, MUST

Example:

shell> pnewarea -n pasture -p /stuff/pasture/

would add a new filearea with the name "pasture" and the directory
path "/stuff/pasture/".

pnewgrp
-------

Adds a new newsgroup to the system. Every group needs some filearea as
a default destination for the decoded files.

Usage: pnewgrp -d <areaname> -n <newsgroup>

    -d    Destination area name, MUST
    -n    Newsgroup name, MUST
    -i    Newsserver, default=1

-i is used to tell which server this group is downloaded from. If you
wish to download the same group from different servers, just re-add it
with all the server id's you want. Server 1 is the primary (default)
server.

pnewsrv
-------

Tool to add/modify newsservers known to the system.

Usage: ./pnewsrv -s <hostname> [options]

    -s    Server name, MUST
    -u    Optional username
    -p    Optional password
    -i    Suggest newsserver id

The id is the unique ID of this server. To set a newsgroup to be
leeched from a certain server, use this number. Server 1 is the
primary (default) server.

ptest
-----

File integrity checker. Searches for untested files and tests them.
Sets the database flag "file_integ" accordingly. The utils used for
integrity checking of different filetypes are defined in the PIMPPA
SQL table "p_types" and can be modified from the GUI "bowser". If
column "type_testokstr" is an empty string, the test command return
value will be used instead. 0 == FILE OK, anything else == FAILED.

Usage: ptest [options]

    -n    Nazi behaviour. Delete files which fail the integrity check.
          Use with CAUTION.
    -v    Verbose execution

Note: the file "sql/example_types.sql" has example configs for some
integrity checkers and file transformers.

ptrans
------

File transformer. This can perform tasks like file optimization,
conversion or perhaps add/remove spam (yuk). It searches for files
which have passed the integrity test but have not yet been
transformed, and tries to transform them. The utils for transformation
are defined in the PIMPPA SQL table "p_types" and can be modified from
the GUI "bowser".

If some filearea has "area_flags" set to AREA_NOTRANS or
AREA_INCOMING, the contents of the area will be skipped.

Usage: ptrans [-v]

    -v    Verbose execution

Notes:

1) Successful transformation will usually modify the md5-checksum of
   the file.
2) The file "sql/example_types.sql" has example configs for some
   integrity checkers and file transformers.


8.
Scripts
----------

These are just shell scripts, so you can easily edit them with a text
editor to suit your needs.

Note: the viewing scripts are currently for picture material, and
default to "gqview" as the picture viewer. To change it, modify the
p_misc key CFG_VIEWER (see instructions in 5.6). The viewing scripts
operate on all files anywhere on the PIMPPA system (unless they are
offline or on incoming areas!).

p_areas
-------

Lists all your fileareas, their paths and their area ID numbers.

p_con
-----

Prints out the contents of a given backup volume id in "Area | Megs"
format.

p_contexts
----------

Prints out the contexts you have defined along with their id's and
descs.

p_groups
--------

Prints all newsgroups and their target areas.

Usage: p_groups

    -a    Print only active groups
    -d    Print only disabled groups

p_gtog
------

Toggle active/disabled status of newsgroups matching the given SQL
newsgroup name pattern. Disabled newsgroups are not leeched by pleech
or bowser.

Examples:

shell> p_gtog %humbug%    (toggles all groups with humbug in the name)
shell> p_gtog alt.test    (toggles just group "alt.test")

p_leechctx
----------

Downloads all newsgroups whose target areas belong to the given
context.

Usage: p_leechctx <context>

p_loc
-----

Finds out which backup id's contain files from a given filearea,
specified by area name (or a pattern containing SQL wildcards).

p_maint
-------

Useful to run daily from crond after "pleech". It just performs
"padopt", "ptest" and "ptrans".

p_prunejpeg
-----------

This script can be activated to run automatically after news decoding
to delete jpegs that are too small. See Section 9.7 for details.

p_rename
--------

Renames files in the database and on the disk.

Usage: p_rename

Script submitted by A.E.

pv_desc
-------

Views files matching a given SQL format file description. Verbal
descriptions can be specified for files with "pdesc".

pv_key
-------

Views files matching ALL given keywords. Keywords can be added with
"pkw".
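
The "ALL" above means a file is shown only when it carries every
keyword you list, not just one of them. A small Python sketch of that
selection logic, purely for illustration: the dictionary below is a
made-up stand-in for the keyword tags stored in the database, and
files_with_all_keywords is a hypothetical helper, not something
shipped with pimppa.

```python
def files_with_all_keywords(file_keywords, wanted):
    """Return the files whose keyword set contains every wanted keyword."""
    wanted = set(wanted)
    return sorted(
        name
        for name, keywords in file_keywords.items()
        if wanted <= set(keywords)
    )


# Hypothetical tags, in the spirit of the pleechwww example.
tags = {
    "bananas_01.jpg": {"bob", "lovely", "bananas"},
    "goat_01.jpg": {"bob", "goat"},
}

both = files_with_all_keywords(tags, ["bob", "bananas"])
only_bob = files_with_all_keywords(tags, ["bob"])
print(both)
print(only_bob)
```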

pv_last
-------

Views files that arrived since the last run of this script. (Bowser's
"Extras/View since last" just executes this script.)

pv_name
-------

Views files matching a given SQL format filename pattern. E.g. to
display your kitten collection:

shell> pv_name pussy%

pv_since
--------

Views files which are newer than the given number of days.

pv_sql
------

Views files by any suitable SQL WHERE statement. The MySQL manual is a
suitable starting point if you don't know SQL.

rc2sql
------

Converts a suck .rc file (newsgroup list) and inserts it into the
'p_groups' table.

viewdeep
--------

Actually not much to do with the PIMPPA system. If you have used
'wget' to mirror some website, but can't be bothered to click around
the zillion directories, you can for example use 'viewdeep "*.jpg"' to
get quick access to all jpg files in the current dir and all its
subdirs.


9. Some behaviour notes
-----------------------

How does it work? What does it eat?

9.1 RegExp file classification rules
------------------------------------

A) Filename based rules, r_type == 0

In table "p_rules" you can specify (with "bowser") filename matching
RegExp patterns, and tell pimppa to move the matching files to areas
of your choice (-1 in r_target meaning skip/discard). The rules are
checked before the assign patterns (as the rules are human generated
and assign patterns are usually made by pimppa, and I trust the user
more). With this mechanism, you can for example always redirect jpegs
to one area and gifs to another, or delete particularly annoying and
re-occurring file series.

Note that the pleech option "-n" can be used to specify operation
where only those files are downloaded/kept that match positive, known
rules.

Examples:

1) Rule (r_rule="goat.*jpg",r_context=0,r_target=1) would accept all
   files named "goat*jpg" globally (r_context=0) and send them to the
   filearea having area_id 1.
2) Rule (r_rule=".*hairy.*",r_context=1,r_target=-1) would
   skip/discard (target=-1) all files having the string "hairy" in the
   filename, if they were meant for any area belonging to context 1.

B) Keyword rules (r_type == 1)

The other rule type is the subject keyword rule. These rules define
patterns that must be matched on the message's subject line for the
message to be accepted. This rule type can also be used to
automatically discard matching messages; in that case set a negative
value for r_target (0DOOM).

The pleech option "-k" can be used to specify operation where only
those messages are accepted that match some specified *positive*
target rule.

Examples:

3) Rule (r_rule=".*bear.*",r_type=1,r_target=1) would accept all
   messages that have "bear" in the subject line, provided that the
   rest of the checks pimppa does pass as well.
4) Rule (r_rule=".*llama.*",r_context=0,r_type=1,r_target=-1) would
   globally reject all messages having "llama" on the subject line.

Both rule types follow contexts. Context 0 is the global context (all
areas). Otherwise the rule applies only to newsgroups connected to a
filearea having the same context. See also section 12: "context".

9.2 Assign patterns
-------------------

If pimppa didn't find a matching rule among those you've set, "pleech"
(and bowser Leech, which uses the same routines) tries to decide where
to store each file based on assign patterns: if the filename matches
some pattern in the database, the file goes to that pattern's
destination.

For each filename in the system there can be a (pattern, context,
dest_area) triplet in the "p_assign" table, telling where similarly
named files should be moved in the future. (See chapter 12:
"context"). Naturally all files belonging together should map to the
same pattern for this scheme to do any good.

The pattern is constructed from the filename as follows:

1) All numbers are converted to character '0'.
2) Letters after the last number and before the last dot ('.') are
   considered indexes, and are converted to '1' IF there are no more
   than two of them.
3) After the last '.', all alphabetical letters stay intact.

E.g.

    filename       =>  pattern
    --------           -------
    "ab-103-h.jpg" =>  "ab-000-1.jpg"
    "ab-115-z.jpg" =>  "ab-000-1.jpg"
    "ab-ccc-1.jpg" =>  "ab-ccc-0.jpg"
    "ab-1-1ab.jpg" =>  "ab-0-011.jpg"
    "ab-1-def.jpg" =>  "ab-0-def.jpg"
    "ab-01a-2.jpg" =>  "ab-00a-0.jpg"

The assign patterns are by no means foolproof. One reason is that
different files are created around the world with the same names.
However, the scheme is fairly good with really big series having some
uncommon filename prefix like "gwo-bah-???.zip". But it fails with
files named imaginatively like "image001.jpg", which surface on every
corner.

Example: If you have a pattern "bozo_000.png" pointing to area 5,
"pleech" would send files named "bozo_123.png" and "bozo_124.png" to
area 5, but files "bozo_abc123.png" and "bozo_100.jpg" would end up on
the default destination area.

PIMPPA utils like "pmv", "padopt" and Bowser automatically update and
create assign patterns, and the whole pattern database can be
reconstructed with "passign".

Assign patterns are not created or used for areas marked as
AREA_NOASSIGN or AREA_INCOMING.

NOTE: Special destination area (a_dest) values:

  -1 : A negative assign pattern destination area id will cause all
       matching files to be quietly discarded by "pleech" and Bowser
       in the future. In English: KILL the matching files.
   0 : Destination 0 means that the particular pattern is disabled and
       won't be used. The pattern won't be replaced by pimppa when
       matching files are moved or adopted. Sorting those files will
       be left to the user.
  >0 : Some normal filearea.

The value 0 must be set by hand, e.g. in case you notice some
particular pattern causing incorrect classifications all the time.
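
The pattern construction steps listed earlier in 9.2 can be sketched in
Python roughly as follows. This is only an illustration reconstructed
from the three steps and the example table, not the code pimppa
actually runs. The function name "assign_pattern" is made up, and the
handling of names with no digits, names with no dot, and digits inside
the extension is an assumption.

```python
def assign_pattern(filename):
    """Hypothetical sketch of the assign pattern construction in 9.2."""
    # Split at the last dot; a name without a dot is treated as all stem
    # (assumption, the text does not cover this case).
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        stem, ext = ext, ""

    # Position of the last digit in the stem, or -1 if there is none.
    last_digit = max(
        (i for i, c in enumerate(stem) if c.isdigit()), default=-1
    )

    # Step 2: letters after the last number count as an index only if
    # there are one or two of them.
    tail = stem[last_digit + 1:] if last_digit >= 0 else ""
    letters_in_tail = sum(1 for c in tail if c.isalpha())
    convert_index = last_digit >= 0 and 1 <= letters_in_tail <= 2

    out = []
    for i, c in enumerate(stem):
        if c.isdigit():
            out.append("0")          # step 1: every digit becomes '0'
        elif convert_index and i > last_digit and c.isalpha():
            out.append("1")          # step 2: index letters become '1'
        else:
            out.append(c)

    # Step 3: letters in the extension stay intact. Digits there are
    # still turned into '0' (assumption based on step 1).
    ext_out = "".join("0" if c.isdigit() else c for c in ext)
    return "".join(out) + dot + ext_out


if __name__ == "__main__":
    for name in ("ab-103-h.jpg", "ab-1-1ab.jpg", "ab-1-def.jpg"):
        print(name, "=>", assign_pattern(name))
```

With this sketch, "bozo_123.png" and "bozo_124.png" both map to
"bozo_000.png", while "bozo_abc123.png" and "bozo_100.jpg" map to
different patterns, which agrees with the bozo example.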

9.3 Miscellaneous
-----------------

By default, "pleech" and bowser leech convert all filenames to
lowercase, and discard all

1) duplicate files (see 9.4, RegExp dupecheck)
2) MD5 checksum colliding files (see 9.5)

For better spam avoidance, you should probably configure pimppa to
discard

3) files that have no numbers in the filenames (CFG_NEEDNUMBERS)
4) files that have only numbers before the extension (CFG_NEEDOTHERS)
5) files that have whitespace in the filenames (CFG_NOSPACE)

Usually such files are renames, spam or just a plain nuisance. Use
"pcfg" or "bowser" to change the settings to your liking. See also 5.6,
Changing Preferences.

Some additional behaviour options may be added in the future, if some
useful ones come to mind. I'll happily receive all suggestions and
ideas!

9.4 Restricting dupechecking with RegExps
-----------------------------------------

By default, pimppa leech checks, based on filenames, that incoming
files do not already exist on any filearea. If they do, the incoming
counterparts are called duplicates and deleted (unless the already
existing files have failed the integrity check, in which case they are
replaced with the new ones).

You may wish to restrict the duplicate checking to particular areas
only.

Example case: You get "goat100.jpg" from
"alt.binaries.pictures.animals", and later a file with the same name
from "alt.worship.goatlord". Now there's a good chance these are not
the same files. You might prepare for cases like this by restricting
the dupechecking as follows:

Group: "alt.binaries.pictures.animals"
    => default destination area "0animaltmp"
    => set "0animaltmp" to dupecheck from areas "cats", "dogs", "sheep"

Group: "alt.worship.goatlord"
    => default destination area "0occulttmp"
    => set "0occulttmp" to dupecheck from area "weirdstuff" only

The dupechecking is set for the destination areas, not for the sources
themselves. (E.g. many newsgroups may map to the same destination area
and follow the same dupecheck patterns.)

To set RegExp duplicate checking, just set a proper area_id RegExp
pattern for any incoming area (modify the 'area_targets' column).

Examples:

    $                Default, check from all areas
    ^1$              Dupecheck only from area with ID 1
    ^7$|^10$|^15$    Dupecheck from areas 7, 10 and 15

You can set these from "bowser" or directly with "mysql".

NOTE: RegExp duplicate checking also affects assigning by leech
operations: only those assign target areas which are matched by the
RegExp are seen as valid. If there is no match, the default destination
area is used. If the assign target is negative (0DOOM), the file will
be deleted (is this behaviour wise?), no matter what the RegExp says.

9.5 MD5-based dupecheck
-----------------------

For each file entered into the pimppa system, a 128-bit MD5sum will be
calculated. The sum is compatible with RFC 1321. If STRICT_MD5 is used
(the default), pimppa utilities will delete all incoming files which
have an md5sum colliding with some existing md5sum.

This gives a really good duplicate discarding system, though some
innocent files might be deleted because of false checksum collisions.
After going through my database I didn't find such a case, but I did
find over a hundred "valid" collisions (which were renames: exactly the
same file, but with a different filename).

9.6 Downloading & file skipping
-------------------------------

How does the download work in normal, non-lenient operation? This might
be useful knowledge if you wish to understand how (and why) pimppa
discards incoming stuff.

For efficiency, newsgroups are downloaded in two phases. First only the
headers are taken, then the articles themselves.

Header phase: suck will output only those headers which pass the
user-specified suckkillfile (default: accept all). Then each header is
considered by pimppa. If no filename matching P_FILETYPES is found, we
skip this article. Otherwise we have parsed a filename. First we check
that this filename passes the current requirements for filenames
(example: NOSPACE) and that it's not a duplicate.
Only after that do we check the regexp rules for subject lines. After
that, we check that there is no filename regexp rule or assign pattern
pointing to a negative filearea (=> kill this file). If the header
passes this mechanism and all parts exist, the respective full articles
are downloaded in the next phase.

Article phase: Accepted articles are downloaded.

Decode phase: Articles are decoded. After decoding, pimppa re-checks
that the filenames are not duplicates or on their way to oblivion
(assign pattern or rule pointing to a negative area). This is done here
again because the subject line does not always have the same filename
as was included in the uuencoded data of the message itself. Next, a
prune script is run (if it's defined for the file type, see
type_prunecmd) to decide if the file should be deleted or not. Finally,
the MD5 checksum of each file is matched globally and the file is
discarded if a duplicate is found.

9.7 Prune scripts
-----------------

The prune scripts are a heavy, possibly content-based way of spam
avoidance. They are defined per file type and can remain undefined.
These shell scripts are executed after news decoding, before adding the
file to the database. The script is given the full path of the decoded
file as input, and the script can use whatever measures it pleases to
decide if the file is acceptable or not. It can look at the image
statistics, dimensions, content, and so on, depending on what you
want -- and are able to code.

There is an example prune script provided for jpegs, 'p_prunejpg',
which deletes too small images (params at the start of the script). It
needs to be manually entered into the "p_types" table to be used.

The prune scripts can be set from Preferences in "bowser"; the field is
"Filetypes/type_prunecmd". The script must be on the command execution
path.


10. MySQL table explanation
---------------------------

The main database is "pimppa" and it's owned by user "pimppa".
Some of these can be modified from "bowser" preferences.

"p_areas" is the table containing all your fileareas.

        area_id         Unique area ID number
        area_name       Area name. Should be logical and quick to type.
        area_path       The directory path for the files of this area.
        area_flags      Properties of this area (hints for utils)
        area_context    The group of fileareas this area belongs to
        area_targets    RegExp pattern for dupechecking incoming (see 9.4)

"p_assign" contains the assign patterns - where "pleech", "bowser" and
"padopt -m" should deposit certain files. A negative destination area
makes the utils delete the incoming file. A zero destination means that
this pattern is disabled.

        a_pattern       The filename pattern to match
        a_context       Context where this pattern is valid
        a_dest          Destination area_id for all matching files

"p_contexts" contains information about defined contexts, i.e.
groupings of the areas.

        a_name          A short name identifier for the context, unique
        a_id            A numeric context identifier, primary key
        a_desc          A free form description of this context

"p_crawled" contains the urls of the image galleries (not the index
sites) already crawled by "pleechwww" so that they're not crawled
again. This table is never pruned of old entries at the moment.

        crawled_url     The url crawled
        crawled_data    The date when the url was crawled

"p_files" contains all the files you have.

        file_id         Unique file ID number
        file_name       File name, unique per filearea.
        file_size       File size in bytes
        file_area       Area ID where this file should be
        file_integ      File integrity check status
        file_trans      File transformation status
        file_date       The date when you got this file
        file_backup     The ID of the backup, 0 if none.
        file_desc       Optional ASCII text description of this file
        file_flags      Is there something special with this file
                        (offline?)
        file_md5sum     128 bit MD5 checksum for this file

"p_groups" contains information about newsgroups.

        g_name          Unique name of the newsgroup, e.g.
                        "alt.binaries.test"
        g_last          Last msg read -pointer
        g_flags         Newsgroup flags (can be |= GROUP_DISABLED)
        g_dest          Destination filearea id

"p_keywords" stores optional keywords for kept files. Multiple keywords
can be specified for the same file (key_target) and the same keyword
can exist for multiple files.

        key_name        Text name of the keyword, e.g. 'green'
        key_target      The file_id that this keyword matches in the
                        p_files table

"p_misc" is a really general table for various PIMPPA-utils to store
their status information and configuration.

        misc_key        Identifying unique key for the info
        misc_data       The actual data associated with the key

"p_rules" contains user-specifiable RegExp file/message classification
rules.

        r_rule          The regexp rule itself (e.g. ".*jpg") matching
                        some filenames
        r_context       Context where this rule is valid
        r_type          0 == FILENAME PATTERN,
                        1 == SUBJECT KEYWORD PATTERN
        r_target        Area_id where the matching files should go

"p_servers" has all the newsservers you want to use.

        s_id            Unique server id
        s_name          The server hostname (e.g. news.hypermecha.com)
        s_user          Username on the server (empty=none=default)
        s_pass          Password on the server (empty=none=default)
        s_flags         Server flags

"p_types" contains the information on how to handle various filetypes.
If type_testokstr is an empty string, the test command return value
will be used: 0 == FILE OK, anything else == FAILED. For the prune
command, a return value of 0 means that the file should be kept, and 1
that it should be deleted. Other values will be interpreted as an error
(operationally as 0). Be careful. PIMPPA passes the file path as input
to both the test and prune commands.

        type_ext        File extension of this type, e.g. "JPG".
        type_testcmd    Command to use to test a file of this type
        type_transcmd   Command to use to transform a file of this type
        type_testpos    Position of the success string for testcmd
                        (deprecated)
        type_testokstr  The actual "all correct" string given by testcmd.
        type_transto    Destination file type when transforming.
        type_prunecmd   Command to use to decide if a file should be
                        deleted

Trick: set "type_prunecmd" to "false" to delete all files of this
type. :P


11. Feedback & discussion
-------------------------

PIMPPA has been maintained by me with the help of various contributors
over the net.

For discussion related to this software or general file-hoarding,
please use the users mailing list freely. And no need to write like I
do. Relax. :D

For anything related to development, please send your messages to the
developer mailing list. Any bug-reports, patches, comments or
suggestions are welcome. Especially if you have an idea about some new
functionality/tool/script that would benefit PIMPPA, don't hesitate.
And if you know how to actually do it, all the better! .

If you like, you can also contact the head honcho directly, . PGP key
at the end of this file for the paranoid.


12. Glossary
------------

"context"
---------
A group of fileareas. Assign patterns and classifying rules can be
specified to function only in specific contexts. This is done by
setting a nonzero context number for a filearea and the same number for
your rule(s). Patterns and rules specified for context 0 (global) are
always matched, unless a >0 context rule exists. Note that if you use
contexts (default is "all are global"), you probably shouldn't have any
filearea belonging to context 0, because if you kill files there, you
also invalidate those filenames for the other contexts.

"0DOOM"
-------
This is a point-of-no-return filearea which should exist on all
systems. Its area_id is "-1" and area_path "/dev/null". All files moved
or assigned to 0DOOM will never be seen again. MD5 checksums of the
moved files will be kept, but files assigned there in the future won't
leave a trace.

"assign pattern"
----------------
When downloading, pimppa uses assign patterns to decide where to store
files whose filenames match a pattern in the database. Section 9.2
tells how the patterns currently operate.

"filearea"
----------
PIMPPA is structured so that every file is on a certain filearea.
Fileareas are created with "pnewarea". All files of similar content
should be on the same filearea, so you can easily find them. It's quite
like a normal directory, except that some additional info about the
filearea contents is kept in the p_files table to make the backup,
duplicate-check, lookup, etc. operations possible and fast.

"Incoming" (AREA_INCOMING)
--------------------------
Fileareas to which "raw" material from newsgroups is decoded. Incoming
area contents will not be backed up or transformed.

In ideal operation pimppa is like a sorting network; the files can be
seen travelling like this:

  Newsgroup    Filter 1      Filearea       Filter 2       Filearea

  group1-|
         |->-[autofilter]->- Incoming1 ->-[humanfilter]->- Quality1
  group2-|      |   |                           |              |
                |   `------------>--------------->-------------'
                |                               |
             Discard                         Discard
  etc.

Due to the assign patterns, recognized files can be moved to the
correct fileareas without human interaction. Also the duplicate
checking (filename and md5check) and certain requirements for filenames
allow some incoming files to be discarded automatically. Filter 1
mentioned above is described in 9.6.

"transform"
-----------
An operation which can be performed (once!) for a file of a certain
filetype. This operation can be a conversion of ".GIF" to ".PNG", an
optimizing of a JPEG, or whatever. You can specify the transformation
command and result filetype (which can be the same as the source type)
in the 'p_types' table. Some example transformations are in
"sql/example_types.sql".

"offline"
---------
Files which exist on some filearea but not in its respective directory
are called offline. The files may have been backed up and deleted, or
just lost. Most PIMPPA utilities skip offline files. Files which are
present are called "online".

"PIMPPA"
--------
PIMPPA is a fabulous content seeking/devouring creature or monster in
the forgotten Scandinavian mythologies.


13.
PGP ------- This is my PGP public key. Do you trust it hasn't been tampered with by MAN IN THE MIDDLE? Hell no. Just send normal mail like everyone else. ;) -----BEGIN PGP PUBLIC KEY BLOCK----- Version: GnuPG v1.0.1 (GNU/Linux) Comment: For info see http://www.gnupg.org mQGiBDawXNERBADWkpWZoDuB//oUsvlmCKNufobFKdTX1SGfqDffoHgOO+01LVU9 QZl+snoSoAz7TkRnEep34mdRcx8IRe/xuLi88jEiI2zVLCHoYFBDB6lpxfSGYjJq IfqCxv1GKMTigykLI5oxyLgYyrONL00uZzPRuTnSrwZqWuqqoHANlxed1QCg/3Xy FLgvmPRZkBVZ7cjAV+ZlNEEEAKNFmfPGTwKd5X0MBSI4aIUXKO3G+Narjn9vVDoi LsU41LR3VO50UeWvnbSEi93YTDwXpFuKK4nMKG1Z5T8pZPQ6NqcOu6NxBm6GHMoR L6ZI+oRgbdAG2shTlfbnZ10qyPhTg0VDf+zfD7sx/vK7jo5uywrmiTUvQvEli4ps 6KCFBADGSC6Uf/5lNB7+4VAko0j1G6m3cnOekvl6/XX+FihPob9NECZbiJR3lvzR ayuJaOSoX/zNJEHDzfOu1qzjOvbWBWY+JBkJ1QYvpB1A+Y84QFcBJRlrvTcXxgfv urMkptlO3yUY7UPKIt7+52XTAesfWZkmkgC9cPQ5gD2l1FjyZbQhSWdvciBXcm9u c2t5IDxpd3JvbnNreUB5YWhvby5jb20+iEsEEBECAAsFAjd9HuMECwMBAgAKCRAg /QqD5XhjksYfAKD8QRg9EPsiLacwNPTGppusFCCvpQCg6k0tB7FVJaLjunbWKItZ LuPHvFm5Ag0ENrBc0hAIAPZCV7cIfwgXcqK61qlC8wXo+VMROU+28W65Szgg2gGn VqMU6Y9AVfPQB8bLQ6mUrfdMZIZJ+AyDvWXpF9Sh01D49Vlf3HZSTz09jdvOmeFX klnN/biudE/F/Ha8g8VHMGHOfMlm/xX5u/2RXscBqtNbno2gpXI61Brwv0YAWCvl 9Ij9WE5J280gtJ3kkQc2azNsOA1FHQ98iLMcfFstjvbzySPAQ/ClWxiNjrtVjLhd ONM0/XwXV0OjHRhs3jMhLLUq/zzhsSlAGBGNfISnCnLWhsQDGcgHKXrKlQzZlp+r 0ApQmwJG0wg9ZqRdQZ+cfL2JSyIZJrqrol7DVekyCzsAAgIH/AzQXkgMRpwsbEVh XSEH/5kbN4Ls9LbFMkPelmaODl2W2wjmWa+7loBFnKn+9WHh77/GLMzHGPYoTzZv wp6bAYbcq4cu20qdW2tTIfUXJz+ey3r5rwFR5y5qkiBqfFczepY0biUcUI7dWt/Q LUyN6oVyVAjclmfvA/JWi7LmMRl6Jo1doKXLYhHOuFXkoGqIExrO9EUKTMGsa0Lm uJVv6kb0v9EAyiJU/zzMvKotPtdzzPqz2m+0mt/XsMhfbT6xl2XkmESvQhgev6Yh DYpzVSZOZeZ7Etzpp2eDwfP4AU23ge6KFO7g33cSEJilBe7x3ZkiTb5Hgqs3FnWi t9EqwDmIPwMFGDawXNIg/QqD5XhjkhECewUAn3P1gtt1Y3DZRRWvJ9TgNCtc+qcp AJ9p6oceyAzcCw87KNm3kW7u6gBK6g== =Hq/M -----END PGP PUBLIC KEY BLOCK-----