What If You Could SPARQL Anything?
How SPARQL Anything may be a game changer for knowledge graphs
While RDF is a powerful tool for both modeling and inferencing, there has long been an impediment to widespread adoption of RDF for the simple reason that most data is not in that form. This means that most solutions for knowledge graph ingestion are push oriented, which is not always ideal. It would be nicer to be able to set up a manifest for loading resources in and then transform them automatically depending upon their structural form.
SPARQL-Anything as Format Bridge
The open-source SPARQL-Anything package promises to flip that script, and as a consequence, change the way that we both write and query knowledge graph applications. Given the growing role that knowledge graphs have in the world of AI systems, this could go a long way towards using such knowledge graphs as the staging platforms for large language models, as well as for prompt enrichment when querying the same.
Ingestion is the process of loading data into a database, in this case a knowledge graph. Traditionally, such data came in as some form of RDF: N3 notation, Turtle, RDF/XML, JSON-LD, or a similar format. While these provided access, they also generally required that the data be transformed from some other format, usually via an ingestion pipeline driven by Kafka or similar external toolsets. As the vast majority of data on the web is not in these formats, this placed a strong dependency upon a priori conversion, which slowed throughput dramatically and added complexity.
SPARQL-Anything works by taking advantage of SPARQL's federated query mechanism, specifically overloading the SERVICE operator to point to a specific URL or file endpoint. For instance, suppose that you have a JSON file (hosted at https://sparql-anything.cc/example1.json) that contains the following data:
[
  {
    "name": "Friends",
    "genres": [
      "Comedy",
      "Romance"
    ],
    "language": "English",
    "status": "Ended",
    "premiered": "1994-09-22",
    "summary": "Follows the personal and professional lives of six twenty to thirty-something-year-old friends living in Manhattan.",
    "stars": [
      "Jennifer Aniston",
      "Courteney Cox",
      "Lisa Kudrow",
      "Matt LeBlanc",
      "Matthew Perry",
      "David Schwimmer"
    ]
  },
  {
    "name": "Cougar Town",
    "genres": [
      "Comedy",
      "Romance"
    ],
    "language": "English",
    "status": "Ended",
    "premiered": "2009-09-23",
    "summary": "Jules is a recently divorced mother who has to face the unkind realities of dating in a world obsessed with beauty and youth. As she becomes older, she starts discovering herself.",
    "stars": [
      "Courteney Cox",
      "David Arquette",
      "Bill Lawrence",
      "Linda Videtti Figueiredo",
      "Blake McCormick"
    ]
  }
]
This is vanilla JSON, rather than JSON-LD. The SPARQL-Anything interface can then be invoked with the following SPARQL query:
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {
  SERVICE <x-sparql-anything:https://sparql-anything.cc/example1.json> {
    ?tvSeries xyz:name  ?seriesName ;
              xyz:stars ?star .
    ?star fx:anySlot "Courteney Cox" .
  }
}
with the x-sparql-anything: URI scheme identifying the resource in question. The output of this particular call is a standard SELECT result set. For instance, if the requested output format is an HTML table, it renders as:
seriesName
"Cougar Town"
"Friends"
To facilitate this, SPARQL-Anything makes use of an internal RDF meta-model called Facade-X for managing lists, bags, and collections, built primarily around rdf: and rdfs: primitives. Once in this form, the output can be passed via an internal pipeline to secondary CONSTRUCT blocks in order to map it to a more appropriate representation. This happens automatically for a number of formats, including JSON, XML, CSV, HTML, Markdown, and YAML.
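As an illustration of that pipeline, the raw Facade-X triples from the example above could be lifted into a richer vocabulary with a CONSTRUCT query. The schema.org mapping here is my own choice for the sketch, not something the library prescribes:

```sparql
PREFIX xyz:    <http://sparql.xyz/facade-x/data/>
PREFIX fx:     <http://sparql.xyz/facade-x/ns/>
PREFIX schema: <https://schema.org/>

# Lift the raw Facade-X view of the JSON into schema.org terms.
CONSTRUCT {
  ?tvSeries a schema:TVSeries ;
            schema:name  ?seriesName ;
            schema:actor ?starName .
}
WHERE {
  SERVICE <x-sparql-anything:https://sparql-anything.cc/example1.json> {
    ?tvSeries xyz:name  ?seriesName ;
              xyz:stars ?stars .
    # fx:anySlot matches any container-membership slot (rdf:_1, rdf:_2, ...)
    ?stars fx:anySlot ?starName .
  }
}
```

The result is ordinary RDF that can be loaded straight into a triple store, with the JSON array structure dissolved into one schema:actor triple per star.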
The SERVICE options include the ability to specify HTTP parameters, meaning that both configuration information and payloads can be sent via GET, POST, PUT, and other methods. Because these are invoked from within SPARQL, they are all synchronous operations, but as such calls would likely be made in the context of an asynchronous environment such as Node.js, this is not the limitation it may once have been.
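As a sketch of how such options are passed: SPARQL-Anything accepts configuration as triples on the fx:properties subject inside the SERVICE block. The option names below (fx:location, fx:media-type, fx:http.method) are drawn from the project's documentation as I understand it; verify them against the current option list:

```sparql
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>

SELECT ?seriesName
WHERE {
  # An empty x-sparql-anything: URI; the source and the HTTP behavior
  # are configured through fx:properties triples instead.
  SERVICE <x-sparql-anything:> {
    fx:properties fx:location    "https://sparql-anything.cc/example1.json" ;
                  fx:media-type  "application/json" ;
                  fx:http.method "GET" .
    ?tvSeries xyz:name ?seriesName .
  }
}
```

This style keeps the endpoint URI clean and makes the configuration visible, and versionable, alongside the query itself.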
The application is built around Jena/ARQ, and includes support for invoking SPARQL-Anything (SA) via Jena Fuseki. Additionally, it can be run as a CLI (command-line interface), making it possible to use SPARQL as a transformation language on local files. Beyond this, SA can also work on ZIP and TAR files, something I consider especially powerful because it opens up the possibility of working with "packages" consisting of mixed file types, such as JSON, XML, Turtle, binary media files, Excel and Word documents, and more.
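To illustrate the CLI mode, an invocation might look like the following. The jar name and flags follow the project README as I recall it, so treat this as a sketch and check them against your installed version:

```shell
# Run a SPARQL-Anything query from the command line, transforming local or
# remote files without a running Fuseki endpoint.
#   -q : the SPARQL query file (contains the SERVICE <x-sparql-anything:...> clause)
#   -f : output format, e.g. TTL, JSON, or CSV
#   -o : output file (otherwise results go to stdout)
java -jar sparql-anything.jar -q query.sparql -f TTL -o result.ttl
```

Used this way, SPARQL becomes a general-purpose transformation language: the same query file drives both the server-side SERVICE call and a batch conversion on disk.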
I consider this a huge win because the ability to both read and persist packaged content makes it possible to define and upload ontologies and configuration files as single units, a potent capability if you are looking at using a SPARQL endpoint as a web application server along the lines of WordPress, Drupal, or similar systems. The fact that binary data such as images and media files can also be persisted in Base64 through SA makes this even more attractive, as does the ability to read EXIF data from media files. With SA incorporating tools for more efficient regex and pattern matching, generalized text is supported as well.
SPARQL-Anything is still in beta. The one format it doesn’t support (yet) is PDF, but I figure this will likely come before SA moves into a 1.0 version.
Conclusion
I believe we are entering the phase where machine learning (AI) solutions are moving into domains where they are less efficient than existing solutions, and momentum is slowing as a consequence. I expect that existing knowledge graph solutions will likely continue to outperform large language models (LLMs) for at least another year or more.
To that end, solutions like SPARQL-Anything should be seen as critical: ways of ingesting content regardless of format. I look forward to exploring it in greater depth.
Kurt Cagle is the Editor of The Ontologist. He lives in Bellevue, Washington. Sign up for free Ontology Office Hours at Calendly.