
Labs News

Issue Tracking at code.creativecommons.org

Nathan Yergler, September 5th, 2008

Continuing the trend started when we moved our source repository, today we’re rolling out issue tracking on code.creativecommons.org. Our goals are twofold:

  1. Increase transparency regarding what we are doing and plan to do. If you find a bug or suggest an idea, we’d like to make sure it’s tracked in a publicly accessible location where everyone can follow along.
  2. Tangentially, we’d like to make it easier for people to contribute to the work we’re doing. We [semi-]frequently hear people say they’d like to help but don’t know where to start. We’ve had Developer Challenges forever, but they’re not easy to find and are poorly maintained. I’m personally hoping that keeping the ideas in the same system we [developers] use every day will keep them at the forefront of our minds.

With respect to the first, we’re initially tracking bugs here for three projects: the license engine, Herder (a translation tool we’ll be rolling out real soon now), and CC Learn’s Universal Education Search project. Feel free to create bugs, wishes, features for any CC project; we’ll create the Project identifiers as we go.

With respect to challenges, I’ve created a community keyword we’ll assign to projects that it’s unlikely we’ll tackle, but which might be appropriate for someone in the community who wants to contribute. Luis’ idea from earlier this week is the first. I hope we have a giant pile of ideas (and a corresponding giant pile of completed ideas) by next year’s Summer of Code.


Loggy: some results

Ankit Guglani, September 2nd, 2008

So after setting up EC2 and S3, grabbing the files from S3, SCP-ing the Python scripts and running them, one would expect to see some results. Upon the polite request of Asheesh, here is a sampler.

The first script (dealing with urls that change their license, named licChange.py) results in an output which lists the URLs (that change their license [type, version or jurisdiction]), the license info and the date(s) of change:

http://blog.aikawa.com.ar/ [['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] ['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']

The line above shows that the license for the URL ‘http://blog.aikawa.com.ar/’ was changed from ‘by-nc-sa 2.5 Argentina’ to ‘by-nc-nd 2.5 Argentina’ some time between 11:38:56 GMT on the 21st of September 2007 and 05:40:22 GMT on the 22nd of September 2007. The format may seem a bit awkward, but you can expect a facelift for the results file. I was previously planning to re-read the file to generate statistics, but we can have a separate file for storing data and another one for the stats.
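
For anyone who wants to read such a results line programmatically, here is a minimal sketch. It is not part of licChange.py: the name parse_result_line is made up for illustration, and it assumes the layout stays exactly as shown above (URL, then the list of licenses, then the list of timestamps).

```python
import ast
from datetime import datetime

# The sample line quoted above, copied verbatim.
SAMPLE = ("http://blog.aikawa.com.ar/ "
          "[['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] "
          "['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']")


def parse_result_line(line):
    """Split one result line into (url, licenses, timestamps)."""
    url, rest = line.split(" ", 1)
    # The license list ends with "]]"; the timestamp list starts right after it.
    split_at = rest.index("]] [") + 2
    licenses = ast.literal_eval(rest[:split_at])
    stamps = ast.literal_eval(rest[split_at:].strip())
    # Apache-style timestamps, e.g. 21/Sep/2007:11:38:56 +0000
    times = [datetime.strptime(s, "%d/%b/%Y:%H:%M:%S %z") for s in stamps]
    return url, licenses, times
```

Calling parse_result_line(SAMPLE) gives back the URL, the two [type, version, jurisdiction] triples, and two timezone-aware datetime objects that can be subtracted to get the window in which the change happened.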

Similarly, the following lines out of the results file for licChange.py from 2007-09 show license changes for ‘http://0.0.0.0:3000/’ and ‘http://127.0.0.1/actibands/castellano/licencias.htm’ (there are many other internal URLs like these):

http://0.0.0.0:3000/ [['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-nd', '3.0', 'nl'], ['by-nc-nd', '3.0', 'nl']] ['17/Sep/2007:08:10:28 +0000', '17/Sep/2007:17:50:28 +0000', '18/Sep/2007:16:25:47 +0000', '19/Sep/2007:13:03:23 +0000', '19/Sep/2007:13:11:16 +0000', '20/Sep/2007:22:16:09 +0000', '20/Sep/2007:22:16:39 +0000']

http://127.0.0.1/actibands/castellano/licencias.htm [['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es']] ['27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:23 +0000', '27/Sep/2007:20:51:23 +0000']

The last two entries for http://0.0.0.0:3000/ are ported to the Netherlands (nl), and the ones for http://127.0.0.1/actibands/castellano/licencias.htm are ported to Spain (es). Note that at present every occurrence of a URL that changes its license is output; this will be changed in the next nightly build, which will include a better-formatted results file with stats on the total number of URLs changing licenses, and even stats distinguishing license changes from version changes.
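
One possible way to reduce the repeated occurrences to actual changes, and to tell a license change from a version change, is sketched below. This is an assumption about how it could be done, not the code of the nightly build; license_transitions and kind_of_change are hypothetical names.

```python
def license_transitions(licenses, timestamps):
    """Keep only the points where the license differs from the previous entry.

    Returns a list of (old_license, new_license, timestamp_of_new) tuples.
    """
    changes = []
    previous = None
    for lic, stamp in zip(licenses, timestamps):
        if previous is not None and lic != previous:
            changes.append((previous, lic, stamp))
        previous = lic
    return changes


def kind_of_change(old, new):
    """Name the fields that differ between two [type, version, jurisdiction] lists."""
    names = ("type", "version", "jurisdiction")
    return [name for name, a, b in zip(names, old, new) if a != b]
```

Applied to the seven entries for http://0.0.0.0:3000/ above, this would leave a single transition, from by-nc-sa 3.0 (unported) to by-nc-nd 3.0 nl, which differs in both type and jurisdiction.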

Akin to this (licChange.py) there are 3 more scripts, licChooser.py, licSearch.py and deedLogs.py.

licChooser.py grabs metadata usage information and generates stats in absolute numbers and as a percentage of all entries, e.g.: “16 out of 100 items are tagged as Audio [16%] of total entries and 29% of items with Metadata”
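
The arithmetic behind such a line is simple; a rough sketch follows. The function names are hypothetical, and the count of 55 items with metadata in the usage note is a made-up figure chosen only so that the percentages resemble the example.

```python
def percent(part, whole):
    """Rounded percentage, or 0 when there is nothing to divide by."""
    return round(100.0 * part / whole) if whole else 0


def usage_line(label, tagged, total, with_metadata):
    """Format one stats line in absolute numbers and percentages."""
    return ("%d out of %d items are tagged as %s "
            "[%d%% of total entries, %d%% of items with metadata]"
            % (tagged, total, label,
               percent(tagged, total), percent(tagged, with_metadata)))
```

For instance, usage_line("Audio", 16, 100, 55) reports 16% of total entries and 29% of items with metadata.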

licSearch.py grabs information from the logs for search.creativecommons.org like the query, the engine and the search options (commercial use and derivatives).
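
As an illustration of pulling those fields out of a logged request, here is a sketch using the standard library. The parameter names q, engine, commercial and derivatives are assumptions made for this example; the real query string of search.creativecommons.org may use different names, and licSearch.py itself uses regular expressions.

```python
from urllib.parse import urlsplit, parse_qs


def parse_search_request(path):
    """Extract the query, engine and search options from a request path."""
    params = parse_qs(urlsplit(path).query)

    def first(name, default=""):
        # parse_qs maps each name to a list of values; take the first one.
        return params.get(name, [default])[0]

    return {
        "query": first("q"),                          # assumed parameter name
        "engine": first("engine"),                    # assumed parameter name
        "commercial": first("commercial") == "on",    # assumed checkbox value
        "derivatives": first("derivatives") == "on",  # assumed checkbox value
    }
```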

deedLogs.py looks at the logs for the deed pages, employs MaxMind GeoIP to do a location lookup and grabs the deed page being looked at.
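
A sketch of that idea is below. The country lookup is passed in as a plain function so that the example does not depend on, or guess at, the GeoIP module's API; country_lookup, DEED_PATH and parse_deed_request are hypothetical names, and the path pattern is an assumption based on the /licenses/* URLs mentioned in these posts.

```python
import re

# Assumed deed URL shape: /licenses/<code>/<version>/[<jurisdiction>/]
DEED_PATH = re.compile(
    r"^/licenses/(?P<code>[a-z+-]+)/(?P<version>\d+\.\d+)/"
    r"(?:(?P<jurisdiction>[a-z]{2})/)?"
)


def parse_deed_request(ip, path, country_lookup):
    """Combine a location lookup with the deed page that was requested."""
    match = DEED_PATH.match(path)
    if match is None:
        return None  # not a deed page
    return {
        "country": country_lookup(ip),  # e.g. a wrapper around GeoIP
        "license": match.group("code"),
        "version": match.group("version"),
        "jurisdiction": match.group("jurisdiction") or "",
    }
```

Keeping the lookup as a parameter also makes the parsing easy to test without a GeoIP database at hand.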

So this is what we have so far.


>>> py >> file … Also if __name__ == ‘__main__’:

Ankit Guglani, September 1st, 2008

Some major updates: we have the scripts running. Thanks to Asheesh for the redirection idea. It works, but I couldn’t get it to give me a progress bar, since everything was being redirected to the file. I tried using two different functions, but they needed a shared variable, so that failed. Still, it was nice, since I ended up with “real” Python files with a main().

The journey was interesting: we went from trying >> inside Python, to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out the __main__ idiom.
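
For readers hitting the same wall, one common pattern is to write the results to a file object and the progress counter to stderr, so that redirecting or appending the results does not swallow the progress display. The sketch below is not the Loggy code; process and main are illustrative names, and the "processing" is a stand-in that simply copies lines.

```python
import os
import sys


def process(lines, out, progress=sys.stderr):
    """Write results to `out` and a running counter to `progress`."""
    total = len(lines)
    for count, line in enumerate(lines, 1):
        out.write(line.rstrip("\n") + "\n")  # stand-in for real result output
        progress.write("\r%d/%d lines" % (count, total))
    progress.write("\n")
    return total


def main(argv):
    """Usage: script.py LOGFILE RESULTFILE"""
    if len(argv) != 3 or not os.path.isfile(argv[1]):
        sys.stderr.write("usage: %s LOGFILE RESULTFILE\n" % argv[0])
        return 2
    with open(argv[1]) as log:
        lines = log.readlines()
    with open(argv[2], "a") as out:  # append, like >> in the shell
        process(lines, out)
    return 0


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when imported.
    main(sys.argv)
```

Because the counter goes to stderr, running the script with its results redirected or appended to a file still shows progress on the console.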

I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).

This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free and voila! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB) and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.

That’s all for now … class at 0830 Hrs in the morning (it’s criminal, I know).

I guess, I’ll just have to keep at it.


EC2, S3Sync and back to Python.

Ankit Guglani, August 31st, 2008

So this is where we are.

Now we have EC2, we have S3Sync Ruby scripts on the EC2 AMI to pull the data from S3, and we have updated Python scripts that read one line at a time and use Geo-IP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is at full throttle; one final bug fix for generating the final results and we are done.

So, now back to the python code. Now we have 4 scripts:

  • License Change (Logs for i.creativecommons.org) [Version 7]
  • License Chooser (Logs for creativecommons.org) [Version 5]
  • CC Search (Logs for search.creativecommons.org) [Version 4]
  • Deeds (Logs for creativecommons.org/licenses/*) [Version 2]

Each of these polls a directory for new logs, reads each new log in the stated directory line by line, and uses regular expressions to parse the information into usable statistics. Throughout the development phase, the results were passed on to stdout / console. With deployment, they now need to be written to a file, which, interestingly, is still to be resolved. (TypeError: 'str' object is not callable sound familiar to anyone?)
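
The general shape of such a script might look like the sketch below. It is an illustration rather than the actual Loggy code: new_logs, parse_log and LOG_LINE are hypothetical names, and the regular expression assumes an Apache-style access log prefix, which may not match the real log format.

```python
import os
import re

# Assumed log prefix: IP, two ignored fields, [timestamp], "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def new_logs(directory, seen):
    """Return paths of log files in `directory` whose names are not in `seen`."""
    fresh = sorted(name for name in os.listdir(directory) if name not in seen)
    return [os.path.join(directory, name) for name in fresh]


def parse_log(path):
    """Yield (ip, time, request_path) for each matching line, one line at a time."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:  # iterating the file avoids loading it whole
            match = LOG_LINE.match(line)
            if match:
                yield match.group("ip"), match.group("time"), match.group("path")
```

Reading the file by iteration keeps memory use flat even for multi-gigabyte logs, and keeping the set of already-seen file names is enough to make repeated polling cheap.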

I am grateful to Asheesh (whom I should have totally bugged more). I should’ve put more work into the project when vacationing back home; having less to do at school would’ve helped too (studies + 3 research projects is not a recommended workload), but if it were easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project, and with a bit more troubleshooting we’d be good to go … for now!


License-oriented metadata validator and viewer: summertime is winding up

Hugo Dworak, August 16th, 2008

Google Summer of Code 2008 approaches its end, as less than forty-eight hours are left to submit the code that will then be evaluated by mentors. It is therefore fitting to pause for a moment, sum up the work that has been done on the license-oriented metadata validator and viewer, and compare it with the original proposal for the project.

A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata, RDFa, RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements, and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been proven by unit testing. The source code of a Web page can be uploaded or pasted by a user; it is also possible to provide a URI for the Web application to analyse. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.
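
To give a feel for the simplest of these cases, the sketch below picks rel="license" links out of a possibly ill-formed page. It is only an illustration and deliberately swaps in Python's standard html.parser module; it is not how the validator works internally, and it ignores Dublin Core and RDF/XML entirely. LicenseLinkFinder and find_license_links are hypothetical names.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> elements with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be missing.
        rel = (attributes.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel and attributes.get("href"):
            self.licenses.append(attributes["href"])


def find_license_links(html):
    """Return every license URI found, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses
```

The standard parser is lenient about unclosed elements, so a fragment such as an unclosed paragraph around the link is still handled.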

The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates the RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools was not obvious, and the library has undergone several redesigns, which included removing the code that employed encutils, XML canonicalization, µTidylib, and BeautifulSoup. The idea of using librdf, librdfa, and rdfadict has been abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.

In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feeds parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.

After the test period, the validator will be available under http://validator.creativecommons.org/.


Flickr Image Re-Use for OpenOffice.org new updates

Mihai Husleag, August 12th, 2008

Since my last article, new functionality has been implemented:

- more results per page (16, to be exact)

- an image is inserted when you double-click on it (previously it was on a single click)

- I added the functionality for Impress and Calc

- fixed some bugs related to search

Unfortunately, I have a problem with the popup menu shown on right click. It seems that if I set the location of the popup to the place where the right click happens, the popup does appear, but only for a moment. This does not happen for all 16 results, but for, let’s say, more than half of them.

I have now found some settings with which the popup appears for each result; unfortunately, the popup does not appear exactly on the result (it is slightly above). I have to work more on this.

Some screenshots: Results, Writer, Impress, Calc.

Download extension (right click and save as)


Asheesh’s liblicense interview

Steren Giannini, August 12th, 2008

Following up on the liblicense 0.8 announcement, I recently recorded a video interview with Asheesh about his work on liblicense.

Watch it here.

In his demo, Asheesh uses liblicense twice:

  • in the online photo gallery to read and write metadata
  • in the Eye Of Gnome plug-in to read license metadata

This shows that liblicense is now mature enough to be used by your applications.


GeoIP Hates Me … phail.

Ankit Guglani, August 6th, 2008

Not that I am expecting much trouble coding with the Geo-IP module, but trying to get it onto the system itself has me believing that this module is out to get me! First, Mac OS X (Leopard) doesn’t come with GCC installed (shocker!) and this module needs building, so I go to get it. GCC is packaged with the developer tools, which is about a 2 GB install, and I can’t hand-pick the components … fail. So I go get myself DarwinPorts and try that route. It installs, gives me the sweet *ding* install-complete sound, and when I go to the terminal … fail … no such file or directory. So I give in to its terrorist demands and make room for the developer pack, thinking I’ll make up for it by actually using these tools. So I wait 19 minutes for it to finish installing, I check that I have GCC [i686-apple-darwin9-gcc-4.0.1] … happily I go and run python setup.py build … and what followed was not nice … a screen full of Warnings and Errors and No Build. =(

I am going to find another source and try again till it finally works!

In other news, I am changing all my code into methods and appending results to a file, and I am looking to add file-list comparison as a feature. Coming soon to a GIT repository near you!


64 bit woes (almost) cleared up

Nathan Kinkade, August 2nd, 2008

As I mentioned in a recent post, we have upgraded our servers to 64 bit. All of them are now running Debian amd64. The first three servers were upgraded remotely, but we noticed that a few applications were constantly dying due to segmentation faults. There was some speculation that this was a strange consequence of the remote upgrade process, so we upgraded the 4th server by reprovisioning it with Server Beach as a 64 bit system, cleanly installed from scratch.

Well, it turned out that even the cleanly installed 64 bit system was having problems. So I installed the GNU Debugger, which I had never actually used before. I attached it to one of the processes that was having a problem, and what should immediately reveal itself but:

(gdb) c
Continuing.
[New Thread 1090525536 (LWP 16948)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1082132832 (LWP 16865)]
0x00002aaaaacfcd91 in tidySetErrorSink () from /usr/lib/libtidy-0.99.so.0

Nathan Yergler made a few changes to cc.engine, the Zope-based application that was having a problem, to remove any dependencies on libtidy, and the segfaults ceased. We haven’t had the time to debug libtidy itself, but it would seem that there was some incompatibility between the version we had installed and a 64 bit system.

We are still having a problem with cgit segfaulting, and that is the next thing to look into … 1 down, 1 to go.


liblicense 0.8 (important) fixes RDF predicate error

asheesh, July 30th, 2008

Brown paper bag release: liblicense claims that the RDF predicate for a file’s license is http://creativecommons.org/ns#License rather than http://creativecommons.org/ns#license. Only the latter is correct.

Any code compiled with liblicense between 0.6 and 0.7.1 (inclusive) contains this mistake.
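
Why a single capital letter matters: RDF predicates are URIs and are compared as exact, case-sensitive strings, so metadata written with the wrong predicate is invisible to anything looking for the right one. The toy sketch below shows this with plain Python tuples; it does not use liblicense or any RDF library, and objects_for is a hypothetical helper.

```python
CORRECT = "http://creativecommons.org/ns#license"
WRONG = "http://creativecommons.org/ns#License"


def objects_for(triples, subject, predicate):
    """Return the objects of all triples matching subject and predicate exactly."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# A license statement written with the mistaken, capitalised predicate.
written_by_old_version = [
    ("file:///photo.jpg", WRONG, "http://creativecommons.org/licenses/by/3.0/"),
]
```

Here objects_for(written_by_old_version, "file:///photo.jpg", CORRECT) comes back empty, which is the practical effect of the bug for consumers that look for the correct predicate.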

This time I have audited the library for other insanities like the one fixed here, and there are none. Great thanks to Nathan Yergler for spotting this. I took this chance to change ll_write() and ll_read() to *NOT* take NULL as a valid predicate; this makes the implementation simpler (and more correct).

Sadly, I have bumped the API and ABI numbers accordingly. It’s available in SourceForge at http://sf.net/projects/cctools, and will be uploaded to Debian and Fedora shortly (and will follow from Debian to Ubuntu).

I’m going to head to Argentina for a vacation and Debconf shortly, so there’ll be no activity from me on liblicense for a few weeks. I would love help with liblicense in the form of further unit tests. Let’s squash those bugs by just demonstrating all the cases the license should work in.
