Sunday, January 20, 2013

Setting a triple store up on the cloud


Solution 1: using rdflib on the App Engine
The first solution consists of running a Python script on the Google App Engine (GAE) to manage the RDF store. The rdflib library can be used to parse RDF triples (as Turtle, N-Triples, RDF/XML and RDFa) residing in local or remote files.
In addition, a SPARQL endpoint can be instantiated directly, as a query engine is included as a plugin. A graph can be stored in a caching system, such as the GAE memcache, and retrieved with reduced overhead when needed. In many cases, however, persistent storage is required. The Google Datastore is modeled as a key-value table and can store structured objects. For an application dealing with multiple repositories, the repository name can be used as the key to retrieve the graph structure. The GAE also provides a transactional interface to the Datastore, ensuring the correct resolution of race conditions when multiple instances try to access the same object structure.
In conclusion, this solution is very easy to set up and works fine for small triple stores. The GAE makes writing scalable applications for the cloud very easy, but it is definitely less flexible than other counterparts (e.g. Amazon Web Services). In fact, the GAE limits potential improvements to this solution in several ways:
• an application has only 60 seconds to respond to a user request
• developers cannot rely on code residing outside the Python or Java interpreter sandbox, which means for example that it is not possible to upload C libraries
• a file in the application cannot be bigger than 32 MB. This applies both to resources (i.e. code, configuration files) and static files. Moreover, the total number of files cannot exceed 10,000 and their total size cannot exceed 150 MB. The only way to mitigate the file-count limit is to bundle source files in a zip archive and import the zip from the application
• in the free version, the application has a free quota, which is a maximum daily resource budget of 28 free instance hours
• HTTP over SSL (HTTPS) is handled transparently to the application code. However, certificates for the secure connection are charged and not available in the free quota.
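The zip-archive workaround mentioned above works because Python's built-in zipimport mechanism lets the interpreter import modules directly from a zip file placed on `sys.path`. A minimal sketch (the module name `mylib` is made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip archive containing one module, then import from it:
# the same trick used on the App Engine to stay under the file-count
# limit, with source files bundled into a single archive.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "libs.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("mylib.py", "ANSWER = 42\n")

# Putting the archive on sys.path is enough; zipimport handles the rest.
sys.path.insert(0, zip_path)
import mylib
```

After the `sys.path` insertion, `import mylib` resolves inside the archive exactly as it would against a plain `.py` file on disk.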

Solution 2: using dydra
Setting up a data repository in Dydra is very fast. After naming the repository and defining basic permissions and visibility properties on the data, the repository is ready to go. In the next step a data set can be imported from a local file or a remote URI. In our trial we imported the classic Tim Berners-Lee FOAF description and performed a simple query on the knowledge base.
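A query like the one in our trial can also be issued programmatically over HTTP. The sketch below builds (but does not send) an authenticated request against a Dydra SPARQL endpoint; the account name, repository name, auth token, and the `http://dydra.com/<user>/<repo>/sparql` endpoint pattern are all assumptions standing in for your own values.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical account/repository names and token -- replace with your own.
ACCOUNT, REPO, TOKEN = "demo", "foaf", "YOUR_KEY"
ENDPOINT = f"http://dydra.com/{ACCOUNT}/{REPO}/sparql"

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?p foaf:name ?name }
"""

# Encode the query and the authentication key as URL parameters and
# ask for JSON result bindings.
params = urlencode({"query": QUERY, "auth_token": TOKEN})
req = urllib.request.Request(
    ENDPOINT + "?" + params,
    headers={"Accept": "application/sparql-results+json"},
)
# urllib.request.urlopen(req) would return the JSON result bindings.
```

The response would follow the standard SPARQL 1.1 JSON results format, so it can be consumed with `json.load` without any Dydra-specific parsing.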

Pros and cons with respect to the Google App Engine
Learning curve: Users do not have to worry about managing the repository or its security aspects. Scalability is provided by the Amazon platform on which Dydra runs.
Authentication API: Applications can use an authentication key to gain access to data.
Security: The guide does not explain how to use HTTPS to access data.
Cost: It is not yet clear how Dydra will charge for this service.

Comments, opinions, advice
Cost: Dydra should provide a basic account that is free of charge.
Authentication API: Applications may want to create logical views of the repository for different users, and should use some sort of authentication protocol (e.g. OAuth) to access data.
Linked data from the real world: Due to the large amount of self-describing linked data produced by the increasing number of smart devices and appliances, a neutral and reliable data repository that can flexibly scale to varying demand is required.
– How does Dydra intend to face this need?
– Would SPARQL be enough to query such a volatile data space? (see
C-SPARQL, SPARQLstream, EP-SPARQL, and CQELS)
– How does Dydra intend to integrate with graph analytics solutions (e.g. Hadoop) to enable data analytics over event streams?
