2008-08-23 14:38 |
dm.lua
David Manura <dm.lua at math2.org>
A few comments...
(1)
When Sputnik raises an unexpected exception, a stack traceback is
displayed on the web page:
<snip>
There was an error in the specified application. The full error message follows:
...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:183: cannot
obtain information from file `redirect:/cgi/sputnik.cgi'
stack traceback:
[C]: in function 'assert'
...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:183: in
function <...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:182>
(tail call): ?
...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:16: in
function <...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:14>
(tail call): ?
[C]: in function 'xpcall'
...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:135: in
function 'run_app'
...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:159: in
function 'run'
...k/kepler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/cgi.lua:16: in
function 'run'
...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:26: in main chunk
[C]: ?
</snip>
It could be argued that the end user of the web site shouldn't see a
stack traceback. First, there may be security implications in
allowing the end user to know how the web site is implemented and
installed. Second, the stack traceback is more useful to the
administrator of a web site, so perhaps it should be recorded in a
log file on the server instead, and the end user should only see a
ticket number that the administrator can cross-reference against the
log file. I did some searching on this concern just now:
[1] http://www.jankoatwarpspeed.com/post/2008/06/02/Exception-handling-best-practices-in-ASPNET-web-applications.aspx
[2] http://www.securitypark.co.uk/article.asp?articleid=26905
[3] http://www.infosecwriters.com/text_resources/pdf/Top_10_Configuration_Security_Vulnerabilities_Part_One_BSullivan.pdf
The stack traceback in Sputnik is triggered by error_html in
rocks/wsapi/1.0-2/lua/wsapi/common.lua, so this might instead be a
WSAPI/Kepler concern.
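
A minimal sketch of the ticket idea from (1), written in Python purely for illustration (Sputnik and WSAPI are Lua, and none of these names come from either project). The hypothetical `handle_request` wrapper logs the full traceback on the server and hands the visitor only a short reference ID:

```python
import logging
import traceback
import uuid


def handle_request(app, request, logger):
    """Run app(request); on failure, log the traceback and return a ticket.

    `app`, `request` and the (status, body) return shape are assumptions
    made for this sketch, not part of any real WSAPI interface.
    """
    try:
        return 200, app(request)
    except Exception:
        # Short random ID the administrator can search for in the log.
        ticket = uuid.uuid4().hex[:8]
        logger.error("ticket %s\n%s", ticket, traceback.format_exc())
        # The visitor sees no paths, versions, or stack frames.
        return 500, f"An internal error occurred. Reference ticket: {ticket}"
```

The administrator would then search the server log for the ticket string to find the matching traceback.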
(2)
After installing Sputnik, I had difficulty finding a complete list of
all the configuration pages. Only some were on the start page. I
later discovered they were listed on the "sputnik" page--e.g.
http://sputnik.freewisdom.org/en/sputnik . (BTW, the "_navigation"
link on this page is broken.) I think the "sputnik" configuration
page should be linked from the start page on the initial installation.
(3)
More generally, is there a way to obtain a complete list of all pages
that exist (without indexing them on Google)? Perhaps I'm setting up
a new wiki and want to remove unnecessary pages. On lua-users wiki, I
just enter an empty search in http://lua-users.org/wiki/FindPage .
(4)
I'm quite in favor of adding a built-in full-text search engine that
works out of the box, at least as a fallback, even if that may be
inferior in some ways to Google. A discussion about this was here:
http://lua-users.org/lists/lua-l/2008-02/msg00950.html
A potentially common use case is to use Sputnik internally on a small
wiki by an individual or small group. In that case, simple linear
search through the pages (much like grep) would be sufficient and
trivial to implement. More generally, you'd want to maintain an
inverted index, possibly using an existing production-grade search
engine (e.g. http://swish-e.org and others) or Google, but if you want
something trivial to implement now, here's the code used by the usemod
wiki ( http://www.usemod.com/cgi-bin/wiki.pl ), which is the wiki upon
which lua-users.org is based:
sub SearchTitleAndBody {
  my ($string) = @_;
  my ($name, $freeName, @found);

  foreach $name (&AllPagesList()) {
    &OpenPage($name);
    &OpenDefaultText();
    if (($Text{'text'} =~ /$string/i) || ($name =~ /$string/i)) {
      push(@found, $name);
    } elsif ($FreeLinks) {
      if ($name =~ m/_/) {
        $freeName = $name;
        $freeName =~ s/_/ /g;
        if ($freeName =~ /$string/i) {
          push(@found, $name);
        }
      } elsif ($string =~ m/ /) {
        $freeName = $string;
        $freeName =~ s/ /_/g;
        if ($Text{'text'} =~ /$freeName/i) {
          push(@found, $name);
        }
      }
    }
  }
  return @found;
}
Boolean AND/NOT logic and phrase searching would be a simple extension
to that (e.g. ' "hello world" -goodbye '). You do not need word
tokenization (since there is no inverted index of words) nor stemming,
synonyms, etc., which would complicate the otherwise simple logic.
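
To make the AND/NOT and phrase extension concrete, here is a rough Python sketch (illustrative only, not usemod or Sputnik code). It assumes pages are available as a plain dict of name to text, uses case-insensitive substring matching rather than the regex matching in the Perl above, and skips the FreeLinks underscore handling. The names `parse_query` and `search` are made up for this example:

```python
import shlex


def parse_query(query):
    """Split a query such as '"hello world" -goodbye' into two term lists.

    Quoted phrases stay together; a leading '-' marks a term to exclude.
    Note: shlex raises ValueError on unbalanced quotes, which a real
    search box would need to catch.
    """
    required, excluded = [], []
    for term in shlex.split(query):
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:].lower())
        else:
            required.append(term.lower())
    return required, excluded


def search(pages, query):
    """Linear scan: every required term present, no excluded term present."""
    required, excluded = parse_query(query)
    found = []
    for name, text in pages.items():
        # Search the title and the body together, ignoring case.
        haystack = (name + "\n" + text).lower()
        if all(t in haystack for t in required) and not any(
            t in haystack for t in excluded
        ):
            found.append(name)
    return sorted(found)
```

With this shape an empty query matches every page, which happens to give the "list all pages" behaviour mentioned in (3).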
(5)
When previewing edits to template/config pages, it would be useful for
Sputnik to apply the templates being edited in the preview. This is
especially true since edits to these pages can break the wiki, so it
would be desirable to preview them first.
2008-08-23 14:54 |
jnwhiteh
Jim Whitehead II <jnwhiteh at gmail.com>
On Sat, Aug 23, 2008 at 5:38 PM, David Manura <dm.lua@math2.org> wrote:
> A few comments...
>
> (1)
>
> When Sputnik raises an unexpected exception, a stack traceback is
> displayed on the web page:
>
> <snip>
> There was an error in the specified application. The full error message follows:
>
> ...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:183: cannot
> obtain information from file `redirect:/cgi/sputnik.cgi'
> stack traceback:
> [C]: in function 'assert'
> ...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:183: in
> function <...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:182>
> (tail call): ?
> ...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:16: in
> function <...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:14>
> (tail call): ?
> [C]: in function 'xpcall'
> ...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:135: in
> function 'run_app'
> ...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:159: in
> function 'run'
> ...k/kepler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/cgi.lua:16: in
> function 'run'
> ...utnik/kepler-1.1/rocks//wsapi/1.0rc1-1/bin/wsapi.cgi:26: in main chunk
> [C]: ?
> </snip>
>
> It could be argued that the end user of the web site shouldn't see a
> stack traceback. First, there may be security implications in
> allowing the end user to know how the web site is implemented and
> installed. Second, the stack traceback is more useful rather to the
> administrator of a web site, so perhaps it should be recorded instead
> to a log file on the server, and end user should only see a ticket
> number that the administrator can cross reference against the log
> file. I did some searching on this concern just now:
>
> [1] http://www.jankoatwarpspeed.com/post/2008/06/02/Exception-handling-best-practices-in-ASPNET-web-applications.aspx
> [2] http://www.securitypark.co.uk/article.asp?articleid=26905
> [3] http://www.infosecwriters.com/text_resources/pdf/Top_10_Configuration_Security_Vulnerabilities_Part_One_BSullivan.pdf
>
> The stack traceback in Sputnik is triggered by error_html in
> rocks/wsapi/1.0-2/lua/wsapi/common.lua, so this might instead be a
> WSAPI/Kepler concern.

I agree 100% and my thoughts were to write the following:

1. A configuration page in the core that is displayed whenever a
   sputnik-layer error occurs. This is somewhat difficult to manage
   considering that the sputnik error may be preventing nodes from
   being displayed in the first place. It could be a static page, so
   I'm open to ideas on this one.

2. A module that sends all errors that occur in the system to some
   endpoint so they can be viewed after the fact. There are again
   issues with this (how to ensure that the filesystem/db isn't
   consumed with these errors, etc).

> After installing Sputnik, I had difficulty finding a complete list of
> all the configuration pages. Only some were on the start page. I
> later discovered they were listed on the "sputnik" page--e.g.
> http://sputnik.freewisdom.org/en/sputnik . (BTW, the "_navigation"
> link on this page is broken.) I think the "sputnik" configuration
> page should be linked from the start page on the initial installation.

All configuration pages, or at least the major sputnik configuration
page, should be linked from the main page post-installation, so this
just needs to be updated in the node template.

> More generally, is there a way to obtain a complete list of all pages
> that exist (without indexing them on Google)? Perhaps I'm setting up
> a new wiki and want to remove unnecessary pages. On lua-users wiki, I
> just enter an empty search in http://lua-users.org/wiki/FindPage .

Well, there are a few issues with this. Some pages only exist at the
point they are queried and have no transient state, while the rest can
be viewed directly on the file system or whatever backend the
repository is using. I'm not sure if we have a way using LuaRocks to
figure out what modules are possibly provided, but that would be the
primary issue there.

> I'm quite in favor of adding a built-in full-text search engine that
> works out-of-the box, at least as a fallback, even if that may be
> inferior in some ways to Google. A discussion about this was here:
>
> http://lua-users.org/lists/lua-l/2008-02/msg00950.html
>
> A potentially common use case is to use Sputnik internally on a small
> wiki by an individual or small group. In that case, simple linear
> search through the pages (much like grep) would be sufficient and
> trivial to implement. More generally, you'd want to maintain an
> inverted index, possibly using an existing production-grade search
> engine (e.g. http://swish-e.org and others) or Google, but if you want
> something trivial to implement now, here's the code used by the usemod
> wiki ( http://www.usemod.com/cgi-bin/wiki.pl ), which is the wiki upon
> which lua-users.org is based:
>
> sub SearchTitleAndBody {
>   my ($string) = @_;
>   my ($name, $freeName, @found);
>
>   foreach $name (&AllPagesList()) {
>     &OpenPage($name);
>     &OpenDefaultText();
>     if (($Text{'text'} =~ /$string/i) || ($name =~ /$string/i)) {
>       push(@found, $name);
>     } elsif ($FreeLinks) {
>       if ($name =~ m/_/) {
>         $freeName = $name;
>         $freeName =~ s/_/ /g;
>         if ($freeName =~ /$string/i) {
>           push(@found, $name);
>         }
>       } elsif ($string =~ m/ /) {
>         $freeName = $string;
>         $freeName =~ s/ /_/g;
>         if ($Text{'text'} =~ /$freeName/i) {
>           push(@found, $name);
>         }
>       }
>     }
>   }
>   return @found;
> }
>
> Boolean AND/NOT logic and phrase searching would be a simple extension
> to that (e.g. ' "hello world" -goodbye '). You do not need word
> tokenization (since there is no inverted index of words) nor stemming,
> synonyms, etc., which would complicate the otherwise simple logic.

I'm sure I speak for both Yuri and myself when I say we would welcome
a contributed module that provides out-of-the-box search for Sputnik.

> When previewing edits to template/config pages, it would be useful for
> Sputnik to apply the templates being edited in the preview. This is
> especially true since edits to these pages can break the wiki, so it
> would be desirable to preview them first.

I'm not sure how feasible this is with the prototyped/inheritance
system that Sputnik operates under, but I agree this would be useful.
I can definitely see the use case for this.

- Jim
2008-08-23 16:18 |
petite.abeille
Petite Abeille <petite.abeille at gmail.com>
On Aug 23, 2008, at 6:54 PM, Jim Whitehead II wrote:
> I'm sure I speak for both Yuri and myself when I say we would welcome
> a contribute module that provides out of the box search for Sputnik.

Perhaps SQLite's FTS module would be of interest:

http://www.sqlite.org/cvstrac/wiki?p=FtsTwo
http://www.sqlite.org/cvstrac/wiki?p=FullTextIndex

FWIW, Nanoki provides its own full-text search by implementing an
inverted index in SQLite:

http://dev.alt.textdrive.com/browser/HTTP/Finder.ddl
http://dev.alt.textdrive.com/browser/HTTP/Finder.dml
http://dev.alt.textdrive.com/browser/HTTP/Finder.lua

Cheers,
--
PA.
http://alt.textdrive.com/nanoki/
2008-08-23 16:51 |
yuri
Yuri Takhteyev <yuri at sims.berkeley.edu>
This may repeat Jim's answers a bit.

> When Sputnik raises an unexpected exception, a stack traceback is
> displayed on the web page:

That's simply a bug. We used to have a nicer page and then something
changed in WSAPI and we haven't updated. That said, we should also
focus on avoiding any error messages. :) So...

> ...epler-1.1/rocks//wsapi/1.0rc1-1/lua/wsapi/common.lua:183: cannot
> obtain information from file `redirect:/cgi/sputnik.cgi'

...this specific one (which David reported to me separately) has been
forwarded to Fabio and fixed.

> It could be argued that the end user of the web site shouldn't see a
> stack traceback. First, there may be security implications in
> allowing the end user to know how the web site is implemented and
> installed. Second, the stack traceback is more useful rather to the
> administrator of a web site, so perhaps it should be recorded instead
> to a log file on the server, and end user should only see a ticket
> number that the administrator can cross reference against the log
> file.

Agreed. Having the traceback displayed in the browser speeds up
development tremendously, but perhaps the best thing to do is add a
config variable ("DISPLAY_TRACEBACK") that turns this on, keeping it
off by default.

Jim: for the logging module you suggest, wouldn't lualogging work?

> The stack traceback in Sputnik is triggered by error_html in
> rocks/wsapi/1.0-2/lua/wsapi/common.lua, so this might instead be a
> WSAPI/Kepler concern.

Again, this specific stack traceback was a WSAPI issue that has been
fixed. However, the question of whether WSAPI should be displaying
stack traces may be worth bringing up on the Kepler list. Though,
perhaps this is an application-level issue.

> After installing Sputnik, I had difficulty finding a complete list of
> all the configuration pages.

The links on the sputnik wiki didn't help? My assumption was that
people start with the "Installation" page, which links to "Basic
Configuration" (which is also the next item in the nav bar). "Basic
Configuration" links to configuration (the next tab), which tells you
all of your configuration options. Should something like this sentence
go into the default homepage? :)

> http://sputnik.freewisdom.org/en/sputnik . (BTW, the "_navigation"
> link on this page is broken.)

Oops. The node name changed, the page did not. This is part of the
reason why I've been thinking of just moving all the documentation to
the wiki, so it's all in one place.

> More generally, is there a way to obtain a complete list of all pages
> that exist (without indexing them on Google)?

Yes: http://sputnik.freewisdom.org/en/sitemap

And even in the "Sitemap" XML format:
http://sputnik.freewisdom.org/en/sitemap.xml (you can give the latter
URL to Google via "Webmaster Tools" so that Google will know when new
pages are added.)

The only limitation (or feature, depending on how you look at it) is
that this only displays pages that were edited at some point, skipping
the default pages. (That is, your "sputnik" page won't be in this
list, unless you edit it.) This could be changed.

> Perhaps I'm setting up
> a new wiki and want to remove unnecessary pages. On lua-users wiki, I
> just enter an empty search in http://lua-users.org/wiki/FindPage .

Makes sense. Though, I've never had to do it this way: Sputnik's
default is to store the data in a very transparent way, each node
being a directory inside wiki-data. So, when I did cleanup of that
sort in the past, I just cd'ed into wiki-data, did an "ls" and then
an "rm -rf" on the nodes I wanted to delete.

Though, for full transparency of data you would want the Git plugin.
With that, each node is a lua file, revisions are git revisions to
that file, and subdirectories are subdirectories. (That is,
"Tickets/000001" would be stored in wiki-data/Tickets/000001.lua,
and you could see the revision history by just running "git log
Tickets/000001.lua")

> I'm quite in favor of adding a built-in full-text search engine that
> works out-of-the box, at least as a fallback, even if that may be
> inferior in some ways to Google. A discussion about this was here:
>
> http://lua-users.org/lists/lua-l/2008-02/msg00950.html

As Jim said, we are all in favor of this, for reasons that you
mentioned and more! If there was a good search system that was easy
to install and had decent Lua bindings, I swear I would write a plugin
for it the next day. (I've discussed this issue a bit with Jim and
Carregal, basically thinking of it in terms of an API that would
subscribe to modifications and then be able to answer queries.)

> something trivial to implement now, here's the code used by the usemod
> wiki ( http://www.usemod.com/cgi-bin/wiki.pl ), which is the wiki upon
> which lua-users.org is based:

I've tried this before and it was very slow. However, I have since
made a simple fix to Saci (commit beda1d7) which improved the
performance dramatically, making this a viable, though still somewhat
slow approach.

Additionally, I've been working on a Sputnik-based photoblog
application that is supposed to allow you to browse blog posts and
albums by tag, which is to say that a version of "search" is meant to
be used heavily. I ended up adding an experimental "application
cache" option, which gives Sputnik or Sputnik-based applications an
option to cache stuff until there is a change to the main storage
system. So, my application would then cache the list of items for
each tag from the moment the tag is first queried and until some new
photos or posts are added. However, this feature is not supported by
all versium implementations at this point. (In fact, by none that are
checked in!)

> sub SearchTitleAndBody {
>   my ($string) = @_;
>   my ($name, $freeName, @found);
>
>   foreach $name (&AllPagesList()) {
>     &OpenPage($name);
>     &OpenDefaultText();
>     if (($Text{'text'} =~ /$string/i) || ($name =~ /$string/i)) {
>       push(@found, $name);
>     } elsif ($FreeLinks) {
>       if ($name =~ m/_/) {
>         $freeName = $name;
>         $freeName =~ s/_/ /g;
>         if ($freeName =~ /$string/i) {
>           push(@found, $name);
>         }

And it would be prettier in Lua. :)

In fact, I will try to make an alternative to "sputnik-search" that
does that.

> Boolean AND/NOT logic and phrase searching would be a simple extension
> to that (e.g. ' "hello world" -goodbye '). You do not need word
> tokenization (since there is no inverted index of words) nor stemming,
> synonyms, etc., which would complicate the otherwise simple logic.

Part of me wants to go this route of adding first one little feature
then another, and eventually implementing a great search engine in
Lua. Another part of me wants to finish my dissertation. :)

> When previewing edits to template/config pages, it would be useful for
> Sputnik to apply the templates being edited in the preview. This is
> especially true since edits to these pages can break the wiki, so it
> would be desirable to preview them first.

It does work this way for CGI, but not currently with Xavante.
(Because in case of Xavante, the same Sputnik instance is re-used as
for the previous call.) I'll look into this, though.

Thanks for all the comments!

- yuri

--
http://sputnik.freewisdom.org/
2008-08-23 18:29 |
yuri
Yuri Takhteyev <yuri at sims.berkeley.edu>
>> Jim: for the logging module you suggest, wouldn't lualogging work?
>
> Yes, it shouldn't be too difficult to provide a sputnik-lualogging
> that can be configured to the various loggers and would be able to
> provide the standard DEBUG, INFO, WARNING, ERROR logging levels.
> Actually, this would be a huge win in the Sputnik core as well for
> troubleshooting things like why my user tokens are timing out so
> quickly ;-)

We have that!

Install lualogging with luarocks and then set:

LOGGER = "file"
LOGGER_PARAMS = {"/tmp/sputnik.log"}

The only thing missing is that there is no way to set the logging
level. So, I just added a "LOGGER_LEVEL" parameter, so if you also set

LOGGER_LEVEL = "WARN"

then you will only get warn and error messages, but not the info and
debug ones. (See the lualogging website for more configurations.
LOGGER_PARAMS are just passed to the logger constructor, so they are
logger-specific. I've never used loggers other than logging.file, but
there are many options, including an email one.)

The commit is http://gitorious.org/projects/sputnik/repos/mainline/commits/5e6cdcdb

In the same commit, I turned off stack trace display by default,
replacing it with a message saying that you can turn stack traces on
by setting SHOW_STACK_TRACE to true.

One small issue with all this: it all works quite well if Sputnik
initializes successfully and then runs into a problem when responding
to a request. If it fails _before_ WSAPI even sends it any requests,
then we just get the default WSAPI message. The reason is that WSAPI
works in two steps:

1. create an application function
2. call it for each request

In step 2 we generate a response that goes to the user. This gives us
an option of handling errors in a smart way. In step 1, we just
return a function that handles requests. We can't "say" anything to
the user directly at this point. I am guessing that the thing to do
is to catch errors happening during initialization and return a
function that just responds with a formatted error message for any
request. I'll look into this later.

>> In fact, I will try to make an alternative to "sputnik-search" that does that.
>
> This could also be extended by a simple script that generates the
> index once, and uses the post-action hooks I added to Sputnik in order
> to update the index file when a page is changed. You would run into
> concurrency issues but its interesting to think about.

That's an option. It depends on what kind of data you have, how much
and how often it is updated, and whether you want to update it from
outside Sputnik. For my own use of the photoblog, I've been wanting
to edit the content via git, but expect to do so at most once a week,
so simply caching searches until the main storage is touched ends up
being easier. We'll have to think what makes most sense as the
default.

- yuri

--
http://sputnik.freewisdom.org/
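
The "catch errors during initialization and return a function that responds with an error" idea from the message above can be sketched as follows. This is Python for illustration only, not WSAPI or Sputnik code; `make_app`, `init`, `log` and the (status, body) return shape are all assumptions of the sketch:

```python
import traceback


def make_app(init, log):
    """Step 1: build the request handler, surviving a failed initialization.

    `init` is a hypothetical callable that returns the real handler.
    If it raises, the traceback is passed to `log` once and a stand-in
    handler is returned that answers every request with a plain message.
    """
    try:
        return init()
    except Exception:
        log(traceback.format_exc())

        def error_app(request):
            # Step 2 still works: every request gets a formatted error.
            return 500, "The site failed to start. Please contact the administrator."

        return error_app
```

The server would call `make_app` once at startup and then call whatever it returned for each request, so the two-step structure stays the same whether or not startup succeeded.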
2008-08-23 18:34 |
jnwhiteh
Jim Whitehead II <jnwhiteh at gmail.com>
On Sat, Aug 23, 2008 at 9:29 PM, Yuri Takhteyev <yuri@sims.berkeley.edu> wrote:
>>> Jim: for the logging module you suggest, wouldn't lualogging work?
>>
>> Yes, it shouldn't be too difficult to provide a sputnik-lualogging
>> that can be configured to the various loggers and would be able to
>> provide the standard DEBUG, INFO, WARNING, ERROR logging levels.
>> Actually, this would be a huge win in the Sputnik core as well for
>> troubleshooting things like why my user tokens are timing out so
>> quickly ;-)
>
> We have that!
>
> install lualogging with luarocks and then set:
>
> LOGGER = "file"
> LOGGER_PARAMS = {"/tmp/sputnik.log"}
>
> The only thing missing is that there is no way to set logging level.
> So, I just added a "LOGGER_LEVEL" parameter, so if you also set
>
> LOGGER_LEVEL = "WARN"
>
> then you will only get warn and error messages, but not the info and
> debug ones. (See lualogging website for more configurations.
> LOGGER_PARAMS are just passed to the logger constructor, so they are
> logger-specific. I've never used loggers other than logging.file, but
> there are many options, including an email one.)
>
> The commit is http://gitorious.org/projects/sputnik/repos/mainline/commits/5e6cdcdb

Aye, I forgot about that but in truth I was more referring to the
quality of the messages. The current debug messages from sputnik are
pretty much useless unless you're the one who wrote them *nudge*.

> In the same commit, I turned off stack trace display by default,
> replacing it with a message saying that you can turn stack traces on
> by setting SHOW_STACK_TRACE to true.
>
> One small issue with all this: it all works quite well if Sputnik
> initializes successfully and then runs into a problem when responding
> to a request. If it fails _before_ WSAPI even sents it any requests,
> then we just get the default WSAPI message. The reason is that WSAPI
> works in two steps:
>
> 1. create an application function
> 2. call it for each request
>
> In step 2 we generate a response that goes to the user. This gives us
> an option of handling errors in a smart way. In step 1, we just
> return a function that handles requests. We can't "say" anything to
> the user directly at this point. I am guessing that the thing to do
> is to catch errors happening during initialization and return a
> function that just responds with a formatted error message for any
> request. I'll look into this later.
>
>>> In fact, I will try to make an alternative to "sputnik-search" that does that.
>>
>> This could also be extended by a simple script that generates the
>> index once, and uses the post-action hooks I added to Sputnik in order
>> to update the index file when a page is changed. You would run into
>> concurrency issues but its interesting to think about.
>
> That's an option. It depends on what kind of data you have, how much
> and how often it is updated, and whether you want to update it from
> outside Sputnik. For my own use of the photoblog, I've been wanting
> to edit the content via git, but expect to do so at most once a week,
> so simply caching searches until the main storage is touched ends up
> being easier. We'll have to think what makes most sense as the
> default.
2008-08-23 20:05 |
dm.lua
David Manura <dm.lua at math2.org>
On Sat, Aug 23, 2008 at 12:54 PM, Jim Whitehead II wrote:
> 1. A configuration page in the core that is displayed whenever a
> sputnik-layer error occurs. This is somewhat difficult to manage
> considering that the sputnik error may be preventing nodes from being
> displayed in the first place.

Full error info is usually reportable though, yes, sometimes not.
This is like the "error in error handling" message in Lua.

> how to ensure that the filesystem/db isn't consumed with
> these errors, etc)

This may or may not be a concern of Lua. For example, if a CGI writes
to stderr, Apache just appends to the Apache error log file. It's
assumed the user is running some cron job to roll and archive the
error logs, or at least maybe the logs are on a separate partition.

> Some pages only exist at the
> point they are queried and have no transient state, while the rest can
> be viewed directly on the file system or whatever backend the
> repository is using. I'm not sure if we have a way using LuaRocks to
> figure out what modules are possible provided, but that would be the
> primary issue there.

Yes.

On Sat, Aug 23, 2008 at 2:51 PM, Yuri Takhteyev wrote:
> Having the traceback displayed in the browser speeds up the
> development tremendously, but perhaps the best thing to do is add a
> config variable ("DISPLAY_TRACEBACK") that turns this on, keeping it
> off by default.

That would do.

>> After installing Sputnik, I had difficulty finding a complete list of
>> all the configuration pages.
> The links on the sputnik wiki didn't help?...
> My assumption was that people start with the "Installation" page,
> which links to "Basic Configuration"

That helped, but I was looking more for just a simple (and complete)
listing on my local install rather than a tutorial (e.g. like the
sitemap below).

>> More generally, is there a way to obtain a complete list of all pages
>> that exist (without indexing them on Google)?
> Yes: http://sputnik.freewisdom.org/en/sitemap

That will do.

> The only limitation (or feature, depending on how you look at it) is
> that this only displays pages that were edited at some point, skipping
> the default pages...This could be changed.

I consider it a limitation--whether a config page is edited shouldn't
affect whether it gets displayed in this list.

Note there are two purposes for this list, and it depends whether the
user is logged in as administrator and the permission settings on
those pages. First, an administrator may want to see a complete list
of pages (edited or not), including configuration pages. Perhaps the
administrator is securing the web site. Each page, and in particular
the configuration pages, is an interface that needs to be reviewed.
The second purpose is for indexing by Google. Google should only see
a subset of these pages (i.e. the public ones), and in particular that
set should by default not contain configuration pages.

> ...Sputnik's default is to store the data in a very transparent way, each node
> being a directory inside wiki-data. So, when I did cleanup of that
> sort in the past, I've just cded into wiki-data, did an "ls" and then
> a "rm -rf" on the nodes I wanted to delete.
>
> Though, for full transparency of data you would want the Git plugin.
> With that, each node is a lua file, revisions are git revisions to
> that file, and subdirectories are subdirectories. (That is,
> "Tickets/000001" would map be stored in wiki-data/Tickets/000001.lua,
> and you could see the revision history by just running "git log
> Tickets/000001.lua")

Yes, that file system transparency would be a nice feature, as it is
in Dokuwiki[1]. The structure you describe using Git is similar in
structure to the plain-text file storage structure used by Dokuwiki,
and perhaps the plain-text file storage used by Versium could benefit
by taking an approach more like this as well. For example, here's
roughly what the Dokuwiki plain-text file storage structure looks
like:

/data/pages/mypage.txt
/data/media/wiki/dokuwiki-128.png
/data/attic/mypage.1218929386.txt.gz
/data/attic/mypage.1218167994.txt.gz
/data/attic/mypage.1218308427.txt.gz
/data/attic/mypage.1218332876.txt.gz
/data/attic/mypage.1217553719.txt.gz
/data/index/page.idx
/data/index/pageword.idx
/data/index/i[0-9]+.idx
/data/index/w[0-9]+.idx
/data/cache/[0-9a-f]/<md5sum>.(xhtml|js|css|i)
/data/conf/*
/data/locks/*
/data/tmp/*

The "pages" directory contains the latest versions of all the wiki
pages. These are human readable -and- editable markup text. The
"media" directory contains resources used by those pages (e.g.
images). The "attic" directory contains compressed/timestamped copies
of previous versions of the pages. The "index" directory contains the
index files used by the search engine (more on this later). The
"cache" directory contains cached objects to improve performance. You
can safely delete the attic/index/cache files if you no longer want
them (or omit them from a backup process).

Concerning the index directory, these are all text files. page.idx is
a new-line delimited list of page names in the index. Each
w[0-9]+.idx file is an unsorted, new-line delimited list of words of
length [0-9]+ (note: this nicely makes all records fixed-width).
pageword.idx maps each page number (index in page.idx) to a list of
words identified in the form of (word_length, word_index) pairs. (The
text of a word can be obtained by looking up the pair in the
w[0-9]+.idx files.) The i[0-9]+.idx files correspond to the
w[0-9]+.idx files, and the lines in the files correspond as well.
These files represent the inverted index and likely constrain query
performance, as they map word index to a list of (page_number,
word_count) pairs. There's some further documentation in
( http://www.dokuwiki.org/indexer ). You can check out the source code
of the indexer.php file ( http://www.splitbrain.org/projects/dokuwiki
)--it's only about 700 lines. fulltext.php implements the search
routine.

Obviously, less constrained approaches could give better performance,
but, this being one of the major wikis, it's interesting to see the
approach they took given those constraints (file system and text
files) and the performance/scalability they got. However, this is
PHP, and I don't know if they keep the index around in memory between
queries--lacking that, a different approach might be used.

[1] http://www.dokuwiki.org/dokuwiki

>> sub SearchTitleAndBody {
> And it would be prettier in Lua. :)
> In fact, I will try to make an alternative to "sputnik-search" that does that.
>...
> Part of me wants to go this route of adding first one little feature
> then another, and eventually implementing a great search engine in
> Lua. Another part of me wants to finish my dissertation. :)

Just a basic case-insensitive substring search would go a long way.
With the basics in place, others may improve upon it. (I'm not sure
what a great search engine implemented in Lua would really offer, as
opposed to a Lua binding to a great search engine implemented in C.)

>> When previewing edits to template/config pages, it would be useful for
>> Sputnik to apply the templates being edited in the preview...
> It does work this way for CGI, but not currently with Xavante.

Didn't do so in Apache/CGI when I tested it (e.g. previewing edits to
the MAIN template) in the latest release version.
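
For readers who want to see the shape of the Dokuwiki-style index described above, here is a rough in-memory Python sketch. It is not a port of indexer.php: the function names are invented, the tokenizer is a simple assumption (lowercase ASCII letters and digits), and pageword.idx is left out. The comments note which structure stands in for which file:

```python
import re
from collections import Counter, defaultdict


def build_index(pages):
    """Build (page_names, words, inverted) from a dict of name -> text."""
    page_names = sorted(pages)        # stands in for page.idx
    words = defaultdict(list)         # word length -> word list (w<N>.idx)
    inverted = defaultdict(list)      # (length, word index) -> [(page number, count)] (i<N>.idx)
    positions = {}                    # word -> (length, word index), build-time helper
    for page_number, name in enumerate(page_names):
        counts = Counter(re.findall(r"[a-z0-9]+", pages[name].lower()))
        for word, count in sorted(counts.items()):
            if word not in positions:
                words[len(word)].append(word)
                positions[word] = (len(word), len(words[len(word)]) - 1)
            inverted[positions[word]].append((page_number, count))
    return page_names, words, inverted


def lookup(index, word):
    """Return [(page name, occurrence count)] for a single word."""
    page_names, words, inverted = index
    word = word.lower()
    bucket = words.get(len(word), [])
    if word not in bucket:
        return []
    key = (len(word), bucket.index(word))
    return [(page_names[p], c) for p, c in inverted[key]]
```

Grouping words by length means a lookup only scans the one list whose words have the right length, which is the same property the fixed-width files give on disk.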
2008-08-27 05:46 |
yuri
Yuri Takhteyev <yuri at sims.berkeley.edu>
>> Having the traceback displayed in the browser speeds up the
>> development tremendously, but perhaps the best thing to do is add a
>> config variable ("DISPLAY_TRACEBACK") that turns this on, keeping it
>> off by default.
>
> That would do.

This is committed now. (See my message to the sputnik list the other day.)

>> My assumption was that people start with the "Installation" page,
>> which links to "Basic Configuration"
>
> That helped, but I more looking for just a simple (and complete)
> listing on my local install rather than a tutorial (e.g. like the
> sitemap below).

Done in git.

> Note there are two purposes for this list, and it depends whether the
> user is logged in as administrator and the permission settings on
> those pages. First, an administrator may want to see a complete list
> of pages (edited or not), including configuration pages. Perhaps the
> administrator is securing the web site. Each page, and in particular
> the configuration pages, is an interface that needs to be reviewed.
> The second purpose is for indexing by Google. Google should only see
> a subset of these pages (i.e. the public ones), and in particular that
> set should by default not contain configuration pages.

It's not edited vs. non-edited pages. It's "real" nodes vs. defaults.
Real nodes are actual chunks of data that we have in our storage
system. "Defaults" are things that we fall back onto, based on name
patterns and other things. Note that the built-in pages, like the
ones now listed in the "sputnik" node, are just the simplest kind of
defaults, but there are others. For example, if we get a request for
"foo/bar", we'll check with the node "foo" whether it wants to tell us
what to do with "foo/bar". So, a request for "foo/bar" may produce a
proper response even if we don't have a node called "foo/bar". This
could be used, for example, to map children of a node to some
different data source.
For instance, we could configure our "Source" node to treat its
children as git commit IDs, so that a request for "Source/3327741"
would return the information about commit 3327741. What this all
means is that there isn't a clear boundary between nodes that "exist"
and those that "do not exist".

That said, I'll try to think more about this and see if I come up with
a way to offer a more complete listing.

> Yes, that file system transparency would be a nice feature, as it is
> in Dokuwiki[1]. The structure you describe using Git is similar in
> structure to the plain-text file storage structure used by Dokuwiki,
> and perhaps the plain-text file storage used by Versium could benefit
> by taking an approach more like this as well. For example, here's
> roughly what the Dokuwiki plain-text file storage structure looks
> like:

I see the attraction of this system, but I am wondering if the
advantages it would offer justify the change. My approach has been to
keep the default storage method as simple as I could make it (in terms
of code), leaving it to other implementations to offer additional
features.

> The "page" directory contains the latest versions of all the wiki
> pages. These are human readable -and- editable markup text.

I assume, though, that editing them by hand does not affect the
history. The nice thing about using git is that you can actually make
changes and record history from the command line. (Or view history of
edits made through the web interface.)

> "media" directory contains resources used by those pages (e.g.
> images).

We've been trying to implement this in a generic way, so images are
just nodes like any other.

> The "attic" directory contains compressed/timestamped copies
> of previous versions of the pages.

I've thought of adding compression, but I am not sure if that would
give much benefit. At least in my experience, most revisions of most
nodes are under 4K.
This means that for a "typical" page such as
http://sputnik.freewisdom.org/en/Installation, gzipping each version
individually reduces the total size by only about 20%. On the other
hand, concatenating the versions and _then_ gzipping them reduces it
by 96.4% and may well be worth doing. Of course, this is also more
complicated. One possible compromise is to concatenate and zip files
in groups of ten, after reaching the 10th, 20th, 30th, etc. revision.
In this case the directory ends up looking like this:

00000.txt.gz  00003.txt.gz  00006.txt.gz  000081  000084  000087
00001.txt.gz  00004.txt.gz  00007.txt.gz  000082  000085
00002.txt.gz  00005.txt.gz  000080        000083  000086

("00002.txt.gz" has versions 000020 to 000029.) This gives a
reduction of 83%. This would perhaps be worth doing. But then again,
if space is an issue, I am thinking that git would offer more compact
storage.

> "cache" directory contains cached objects to improve performance. You
> can safely delete the attic/index/cache files if you no longer want
> them (or omit them from a backup process).

Why would you want to delete attic? I would tend to think of a wiki's
history as being as important as (if not more than) the latest
revision.

> Concerning the index directory, these are all text files. page.idx is
> a new-line delimited list of page names in the index. Each
> w[0-9]+.idx file is an unsorted, new-line delimited list of words of
> length [0-9]+ (note: this nicely makes all records fixed-width).
> pageword.idx maps each page number (index in page.idx) to a list of
> words identified in the form of (word_length, word_index) pairs. (The
> text of a word can be obtained by looking up the pair in the
> w[0-9]+.idx files.) The i[0-9]+.idx files correspond to the
> w[0-9]+.idx files, and the lines in the files correspond as well.
> These files represent the inverted index and likely constrain query
> performance, as they map word index to a list of (page_number,
> word_count) pairs.
> There's some further documentation in (
> http://www.dokuwiki.org/indexer ). You can checkout the source code
> of the indexer.php file ( http://www.splitbrain.org/projects/dokuwiki
> )--it's only about 700 lines.

Thanks for the links. If we try to do an indexer in Lua, though, we
should try storing the index as a Lua file! :)

My main issue with this approach, though, is that indexing is expensive
and happens only occasionally. Considering that wikis are likely to
be updated more often than they are searched, I am wondering if, with
some clever caching, indexing on demand (at the time of the query) may
actually work better.

> (I'm not sure
> what a great search engine implemented in Lua would really offer, as
> opposed to a Lua binding to a great search engine implemented in C.)

Flexibility and ease of experimentation.

>>> When previewing edits to template/config pages, it would be useful for
>>> Sputnik to apply the templates being edited in the preview...
>> It does work this way for CGI, but not currently with Xavante.
>
> Didn't do so in Apache/CGI when I tested it (e.g. previewing edits to
> the MAIN template) in the latest release version.

Can you email me specific steps? Otherwise I can't seem to reproduce it.

- yuri

--
http://sputnik.freewisdom.org/
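The trade-off discussed in the message above (one gzip per revision versus one gzip per group of ten revisions) is easy to try out. Below is a rough Python sketch under stated assumptions: the revisions are synthetic, the helper names are invented, and a trailing incomplete group is counted uncompressed, as in the directory listing above. The percentages it prints depend entirely on the sample data and will not match the figures quoted for the Installation page.

```python
import gzip

GROUP_SIZE = 10  # revisions per archive, as in the scheme described above


def group_file_name(version):
    """Name of the archive holding a version, e.g. version 23 -> '00002.txt.gz'."""
    return f"{version // GROUP_SIZE:05d}.txt.gz"


def make_revisions(count):
    """Build synthetic revisions, each changing one line of the previous one."""
    lines = [f"Line {i}: some wiki text about installation step {i}." for i in range(40)]
    revisions = []
    for rev in range(count):
        lines[rev % len(lines)] = f"Line edited in revision {rev}."
        revisions.append("\n".join(lines).encode("utf-8"))
    return revisions


def individually_gzipped(revisions):
    """Total size when every revision is gzipped on its own."""
    return sum(len(gzip.compress(r)) for r in revisions)


def gzipped_in_groups(revisions):
    """Total size when each full group of ten is concatenated and gzipped once."""
    total = 0
    for start in range(0, len(revisions), GROUP_SIZE):
        group = revisions[start:start + GROUP_SIZE]
        if len(group) == GROUP_SIZE:
            total += len(gzip.compress(b"".join(group)))
        else:
            total += sum(len(r) for r in group)  # incomplete group stays as plain files
    return total


revisions = make_revisions(30)
raw_size = sum(len(r) for r in revisions)
for label, size in [
    ("one gzip per revision", individually_gzipped(revisions)),
    ("one gzip per group of ten", gzipped_in_groups(revisions)),
]:
    print(f"{label}: {100 * (1 - size / raw_size):.1f}% smaller than raw")
```

Grouping helps because consecutive revisions are nearly identical, so the compressor can refer back to the previous revision inside the same archive instead of starting from scratch each time.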
2008-08-27 12:43 |
carregal
Andre Carregal <carregal at fabricadigital.com.br>
On Wed, Aug 27, 2008 at 4:46 AM, Yuri Takhteyev
<yuri@sims.berkeley.edu> wrote:
> (...)
> It's not edited vs. non-edited pages. It's "real" nodes vs defaults.
> Real nodes are actual chunks of data that we have in our storage
> system. "Defaults" are things that we fall back onto, based on name
> patterns and other things. Note that the built-in pages like the
> one's now listed in the "sputnik" node, are just the simplest kind of
> defaults, but there are others. For example, if we get a request for
> "foo/bar", we'll check with the node "foo" whether it wants to tell us
> what to do with "foo/bar". So, a request for "foo/bar" may produce a
> proper response even if we don't have a node called "foo/bar". This
> could be used, for example, to map children of a node to some
> different data source. For instance, we could configure our "Source"
> node to treat its children as git IDs, so git commits, so that a
> request for "Source/3327741" would return the information about commit
> 3327741. What this all means is that there isn't a clear boundary
> between nodes that "exist" and those that "do not exist".
>
> That said, I'll try to think more about this and see if I come up with
> a way to offer a more complete listing.

What about asking the "real" nodes about the "virtual" ones? You could
call node:getchildren() for example and then use the resulting list as
part of the listing. This would also allow you to show this listing as
a hierarchy.

> (...)
> Thanks for the links. If we try to do an indexer in Lua, though, we
> should try storing the index as a Lua file! :)
>
> My main issue with this approach though, is that indexing is expensive
> and happens only occasionally. Considering that wikis are likely to
> be updated more often than they are searched, I am wondering if with
> some clever caching indexing on demand (at the time of the query) may
> actually work better.

I think you are overestimating the updating frequency.
I'd say the typical wiki is searched more often. :o)

>> (I'm not sure
>> what a great search engine implemented in Lua would really offer, as
>> opposed to a Lua binding to a great search engine implemented in C.)
>
> Flexibility and ease of experimentation.

And, depending on the API you choose, this Lua engine could be
replaced by a more powerful one when needed.

André
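The suggestion above of asking the "real" nodes about their "virtual" children could look roughly like the following. This is a Python sketch of the idea only: the `Node` class, `get_children` and `full_listing` are hypothetical names and do not correspond to an existing Sputnik API.

```python
class Node:
    """A stored ("real") node that may also answer for virtual children."""

    def __init__(self, name, virtual_children=()):
        self.name = name
        self._virtual_children = list(virtual_children)

    def get_children(self):
        # Hypothetical hook: full names of the children this node can serve on request.
        return [f"{self.name}/{child}" for child in self._virtual_children]


def full_listing(real_nodes):
    """Return (depth, name) pairs: each real node followed by the children it reports."""
    listing = []
    for node in sorted(real_nodes, key=lambda n: n.name):
        listing.append((node.name.count("/"), node.name))
        for child in sorted(node.get_children()):
            listing.append((child.count("/"), child))
    return listing


nodes = [Node("Source", ["3327741"]), Node("Installation")]
for depth, name in full_listing(nodes):
    print("  " * depth + name)
```

A node backed by something like a git repository would probably need to limit what it reports (for example, only recent commits), since the set of possible children may be very large.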
2008-08-28 03:59 |
dm.lua
David Manura <dm.lua at math2.org>
On Wed, Aug 27, 2008 at 3:46 AM, Yuri Takhteyev wrote:
>> The "page" directory contains the latest versions of all the wiki
>> pages. These are human readable -and- editable markup text.
>
> I assume, though, that editing them by hand does not affect the
> history. The nice thing about using git is that you can actually make
> changes and record history from the command line. (Or view history of
> edits made through the web interface.)

True. However, there is a command-line utility that can check in
versions: http://www.dokuwiki.org/cli

>> "cache" directory contains cached objects to improve performance. You
>> can safely delete the attic/index/cache files if you no longer want
>> them (or omit them from a backup process).
>
> Why would you want to delete attic? I would tend to think of wiki's
> history being as important (if not more) than the latest revision.

Maybe a more desirable property is that a set of files outside of
revision control has basically the same structure as the set of files
inside revision control but with no older versions stored. There is
therefore little need for an "svnadmin create" / "svn import" / "svn
export" type of command.

>> Didn't do so in Apache/CGI when I tested it (e.g. previewing edits to
>> the MAIN template) in the latest release version.
>
> Can you email me specific steps? Otherwise I can't seem to reproduce it.

Will try again later.