GFQL Language Specification#

Introduction#

GFQL (Graph Frame Query Language) is a DataFrame-native graph query language designed for expressing graph patterns and traversals on tabular data. It operates on node and edge DataFrames, providing a functional, composable approach to graph querying with native GPU acceleration support.

Design Principles#

Dataframe-native: Type-safe, functional bulk operations over dataframe libraries such as pandas and cuDF
Declarative: Focus on what to retrieve, leaving the engine free to optimize how
Accessible: Designed for both human readability and machine generation, building on intuitions from popular tabular and graph systems
Performance-oriented: Vectorized operations by default, including GPU acceleration
Embeddable: Like DuckDB, can be embedded in different host languages; initially focused on the Python data ecosystem
Compute-tier: Decoupling from storage enables flexible execution, either embedded locally or on remote acceleration servers

Language Forms#

GFQL exists in three complementary forms:

Core Language: Abstract graph pattern matching language defined by this specification
Embedded DSL: Host language implementations (currently Python with pandas/cuDF)
Wire Protocol: JSON serialization for client-server communication (see Wire Protocol spec)

This specification focuses on the core language concepts. Examples use Python syntax for concreteness, but the patterns apply to any embedding.

Language Overview#

Core Concepts#

Graph Model#

Graphs consist of node and edge dataframes:

Edges: DataFrame with source and destination columns
Nodes: DataFrame with unique identifier column
Column names are user-defined globals for the graph:
- Node ID attribute: g._node (e.g., "node_id", "id")
- Edge source attribute: g._source (e.g., "source", "from")
- Edge destination attribute: g._destination (e.g., "destination", "to")
GFQL infers nodes from edge references when only edges are provided
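
As a rough illustration of the graph model, the sketch below builds an edge dataframe in pandas and derives a node table from the IDs the edges reference. The helper infer_nodes and the column names are hypothetical, chosen for this example only; this is not the GFQL implementation, just one way to picture "nodes inferred from edge references".

```python
import pandas as pd

# Edge table: one row per edge, with user-chosen source/destination column names.
edges = pd.DataFrame({
    "source": ["a", "b", "c"],
    "destination": ["b", "c", "a"],
    "type": ["follows", "follows", "blocks"],
})


def infer_nodes(edges_df, source="source", destination="destination", node="node_id"):
    """Build a node table from the unique IDs referenced by the edges (illustrative helper)."""
    ids = pd.concat([edges_df[source], edges_df[destination]], ignore_index=True)
    return pd.DataFrame({node: ids.drop_duplicates().reset_index(drop=True)})


nodes = infer_nodes(edges)
```

Here nodes has a single node_id column holding each referenced ID once, in order of first appearance.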

GFQL Programs#

GFQL programs are declarative graph-to-graph transformations:

Enable use cases like search, filter, enrich, and traverse
Express what to find (as in Cypher), not how to find it (as in Gremlin)

Chains#

Path pattern expressions for matching graph structures:

Express graph patterns as sequences of node and edge matching operations
Similar to Cypher patterns but decomposed into composable steps
Define paths through the graph: start nodes → edges → end nodes
Each operation refines the pattern match based on previous results
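
To make the "start nodes → edges → end nodes" idea concrete, here is a minimal pandas-only sketch of what a three-step chain (a node filter, a forward edge filter, a node filter) asks for. The data and column names are invented for illustration, and the step-by-step filtering shown is an assumption about the semantics, not a description of how a GFQL engine executes chains.

```python
import pandas as pd

nodes = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "type": ["person", "person", "org", "person"],
    "active": [True, False, False, True],
})
edges = pd.DataFrame({
    "src": ["a", "a", "b", "c"],
    "dst": ["b", "d", "c", "d"],
    "type": ["follows", "follows", "follows", "owns"],
})

# Step 1: start nodes (type == "person").
start_ids = nodes.loc[nodes["type"] == "person", "id"]

# Step 2: forward "follows" edges that leave a start node.
step_edges = edges[(edges["type"] == "follows") & edges["src"].isin(start_ids)]

# Step 3: end nodes must be active; drop edges whose destination fails the filter.
end_ids = nodes.loc[nodes["active"], "id"]
matched_edges = step_edges[step_edges["dst"].isin(end_ids)]

# Keep only nodes that take part in a surviving path.
matched_ids = pd.concat([matched_edges["src"], matched_edges["dst"]]).unique()
matched_nodes = nodes[nodes["id"].isin(matched_ids)]
```

With this data, only the edge a → d survives, so the matched nodes are a and d. Each step narrows the result of the previous one, which is the refinement the list above describes.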

Operations#

Act on graph entities (nodes and edges):

Node matchers: Filter and select nodes
Edge matchers: Traverse relationships
Operations work on the graph structure itself

Predicates#

Act on attributes of nodes and edges:

Filter based on property values
Comparison, membership, string matching, temporal checks
Composable within operations to build complex conditions

Values#

Type system matching modern data formats:

Scalars: numbers, strings, booleans, null
Temporal: ISO datetimes, dates, times with timezone support
Collections: lists for membership tests
Compatible with JSON, Arrow, and DataFrame type systems

Formal Grammar#

GFQL Grammar in Extended Backus-Naur Form#

(* Entry point *)
query ::= chain

(* Chain - path pattern expression *)
chain ::= "[" operation ("," operation)* "]"

(* Operations *)
operation ::= node_matcher | edge_matcher

(* Node Matcher *)
node_matcher ::= "n(" node_params? ")"
node_params ::= filter_dict ("," name_param)? ("," query_param)?
              | name_param ("," query_param)?
              | query_param

(* Edge Matchers *)
edge_matcher ::= edge_forward | edge_reverse | edge_undirected
edge_forward ::= "e_forward(" edge_params? ")"
edge_reverse ::= "e_reverse(" edge_params? ")"  
edge_undirected ::= ("e" | "e_undirected") "(" edge_params? ")"

(* Parameters *)
edge_params ::= edge_match_params ("," hop_params)? ("," node_filter_params)? ("," name_param)?

filter_dict ::= "{" (property_filter ("," property_filter)*)? "}"
property_filter ::= string ":" (value | predicate)

hop_params ::= hop_bound_params | hop_slice_params | hop_label_params | "hops=" integer | "to_fixed_point=True"
hop_bound_params ::= "min_hops=" integer | "max_hops=" integer
hop_slice_params ::= "output_min_hops=" integer | "output_max_hops=" integer
hop_label_params ::= "label_node_hops=" string | "label_edge_hops=" string | "label_seeds=True"
node_filter_params ::= source_filter ("," dest_filter)?
source_filter ::= "source_node_match=" filter_dict | "source_node_query=" string
dest_filter ::= "destination_node_match=" filter_dict | "destination_node_query=" string

name_param ::= "name=" string
query_param ::= "query=" string
edge_query_param ::= "edge_query=" string
edge_match_params ::= filter_dict | edge_query_param

(* Predicates *)
predicate ::= comparison | membership | range | null_check | string_pred | temporal_pred

comparison ::= ("gt" | "lt" | "ge" | "le" | "eq" | "ne") "(" value ")"
membership ::= "is_in(" "[" value ("," value)* "]" ")"
range ::= "between(" value "," value ("," "inclusive=" boolean)? ")"
null_check ::= "is_null()" | "not_null()" | "is_na()" | "not_na()"
string_pred ::= string_match | string_check
string_match ::= "contains(" string ("," "case=" boolean)? ("," "regex=" boolean)? ")"
              | "match(" string ("," "case=" boolean)? ("," "flags=" integer)? ")"
              | "fullmatch(" string ("," "case=" boolean)? ("," "flags=" integer)? ")"
              | ("startswith" | "endswith") "(" string ("," "case=" boolean)? ")"
string_check ::= ("isalpha" | "isnumeric" | "isdigit" | "isalnum"
               | "isupper" | "islower") "()"
temporal_pred ::= temporal_check "()"
temporal_check ::= "is_month_start" | "is_month_end" | "is_quarter_start" 
                 | "is_quarter_end" | "is_year_start" | "is_year_end" | "is_leap_year"

(* Values *)
value ::= scalar | temporal_value | collection
scalar ::= number | string | boolean | null
temporal_value ::= datetime_value | date_value | time_value
datetime_value ::= "pd.Timestamp(" string ("," "tz=" string)? ")"
                 | "datetime(" datetime_args ")"
date_value ::= "date(" date_args ")"
time_value ::= "time(" time_args ")"
collection ::= "[" (value ("," value)*)? "]"

(* Primitives *)
string ::= '"' [^"]* '"' | "'" [^']* "'"
number ::= integer | float
integer ::= "-"? [0-9]+
float ::= "-"? [0-9]+ "." [0-9]+
boolean ::= "True" | "False"
null ::= "None"
datetime_args ::= integer ("," integer)*
date_args ::= integer "," integer "," integer
time_args ::= integer "," integer ("," integer)?
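
As a small worked example of reading the grammar, the following sketch recognizes the comparison production when its value is a number, using regular expressions that mirror the integer and float primitives. It is an illustrative recognizer only: the function parse_comparison is hypothetical, and it deliberately ignores string, temporal, and collection values.

```python
import re

# Regex counterparts of the integer and float primitives.
INTEGER = r"-?[0-9]+"
FLOAT = r"-?[0-9]+\.[0-9]+"
NUMBER = rf"(?:{FLOAT}|{INTEGER})"

# comparison ::= ("gt" | "lt" | "ge" | "le" | "eq" | "ne") "(" value ")", numeric values only.
COMPARISON = re.compile(rf"(gt|lt|ge|le|eq|ne)\(({NUMBER})\)")


def parse_comparison(text):
    """Return (operator, number) for a numeric comparison predicate, or None if it does not match."""
    m = COMPARISON.fullmatch(text)
    if m is None:
        return None
    op, raw = m.groups()
    value = float(raw) if "." in raw else int(raw)
    return op, value
```

For instance, parse_comparison("gt(30)") yields ("gt", 30), while "gt(3.)" is rejected because the float production requires digits after the decimal point.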

Operations#

Node Matcher: `n()`#

Filters nodes based on attributes.

Syntax: n(filter_dict?, name?, query?)

Parameters:

filter_dict: Dictionary of attribute filters
name: Optional string label for results
query: pandas query string expression (DataFrame.query syntax)

Examples:

n()                                    # All nodes
n({"type": "person"})                 # Nodes where type='person'
n({"age": gt(30)})                    # Nodes where age > 30
n(name="important")                   # Label matching nodes
n(query="age > 30 and status == 'active'")  # Query string
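
The sketch below shows, in plain pandas, roughly what the three n() parameters ask for: equality or predicate filters from filter_dict, a query string, and a name that tags matches. The function match_nodes is a hypothetical stand-in written for this example (a callable plays the role of a predicate such as gt(30)); it is not the library's implementation.

```python
import pandas as pd

nodes = pd.DataFrame({
    "id": [1, 2, 3],
    "type": ["person", "person", "org"],
    "age": [25, 40, 12],
    "status": ["active", "active", "inactive"],
})


def match_nodes(df, filter_dict=None, query=None, name=None):
    """Illustrative stand-in for n(): combine filter_dict, query, and name on a node table."""
    mask = pd.Series(True, index=df.index)
    for column, expected in (filter_dict or {}).items():
        if callable(expected):
            # A callable stands in for a predicate such as gt(30).
            mask &= expected(df[column])
        else:
            mask &= df[column] == expected
    if query is not None:
        mask &= df.index.isin(df.query(query).index)
    matched = df[mask]
    # name adds a boolean column marking the matched rows.
    return matched.assign(**{name: True}) if name else matched


people = match_nodes(nodes, {"type": "person"})
over_30 = match_nodes(nodes, {"age": lambda s: s > 30})
active_over_30 = match_nodes(nodes, query="age > 30 and status == 'active'")
```

In this data, people contains IDs 1 and 2, while both over_30 and active_over_30 contain only ID 2.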

Edge Matchers#

Forward Traversal: `e_forward()`#

Traverses edges in forward direction (source → destination).

Syntax: e_forward(edge_match?, hops?, min_hops?, max_hops?, output_min_hops?, output_max_hops?, label_node_hops?, label_edge_hops?, label_seeds?, to_fixed_point?, source_node_match?, destination_node_match?, name?)

Parameters:

edge_match: Edge attribute filters
hops: Number of hops (default: 1; shorthand for max_hops)
min_hops/max_hops: Inclusive traversal bounds (default min=1 unless max=0; max defaults to hops)
output_min_hops/output_max_hops: Optional post-filter slice; defaults keep all traversed hops up to max_hops
label_node_hops/label_edge_hops: Optional hop-number columns; label_seeds=True writes hop 0 for seeds when labeling
to_fixed_point: Continue until no new nodes (default: False)
source_node_match: Filters for source nodes
destination_node_match: Filters for destination nodes
name: Optional label

Examples:

e_forward()                           # One hop forward
e_forward(hops=2)                     # Two hops forward
e_forward(min_hops=2, max_hops=4, output_min_hops=3, label_edge_hops="edge_hop")  # bounded + sliced + labeled
e_forward(to_fixed_point=True)        # All reachable nodes
e_forward({"type": "follows"})        # Only 'follows' edges
e_forward(source_node_match={"active": True})  # From active nodes
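
To illustrate how hops, to_fixed_point, and hop labeling relate, here is a simplified frontier-expansion sketch in pandas. The function hop_forward is hypothetical and makes simplifying assumptions: it records only the first hop at which each node is reached (similar in spirit to a label_node_hops column with seeds at hop 0), and it omits min_hops, output slicing, and edge or node filters. It should be read as a mental model rather than the engine's algorithm.

```python
import pandas as pd


def hop_forward(edges, seeds, max_hops=1, to_fixed_point=False, src="src", dst="dst"):
    """Map each reached node ID to the first hop at which it was reached (seeds are hop 0)."""
    hop_of = {seed: 0 for seed in seeds}
    frontier = pd.Series(list(seeds), dtype="object")
    hop = 0
    while len(frontier) > 0 and (to_fixed_point or hop < max_hops):
        hop += 1
        # Bulk step: follow every edge leaving the current frontier in one vectorized filter.
        reached = edges.loc[edges[src].isin(frontier), dst].drop_duplicates()
        # Only nodes not seen before form the next frontier, so cycles terminate.
        frontier = reached[~reached.isin(list(hop_of))]
        for node in frontier:
            hop_of[node] = hop
    return hop_of


# A four-node cycle: a -> b -> c -> d -> a
edges = pd.DataFrame({"src": ["a", "b", "c", "d"], "dst": ["b", "c", "d", "a"]})

one_hop = hop_forward(edges, ["a"])
two_hops = hop_forward(edges, ["a"], max_hops=2)
everything = hop_forward(edges, ["a"], to_fixed_point=True)
```

On this cycle, one hop reaches b, two hops reach b and c, and the fixed-point run stops after reaching d because the only remaining edge leads back to the already-visited seed.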

Reverse Traversal: `e_reverse()`#

Traverses edges in reverse direction (destination → source).

Syntax: Same as e_forward()

Undirected Traversal: `e()` or `e_undirected()`#

Traverses edges in both directions.

Syntax: Same as e_forward()
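
The three edge matchers differ only in which endpoint is treated as the starting side. A minimal sketch of that idea for a single hop, using a hypothetical neighbors helper over an edge table with assumed column names src and dst:

```python
import pandas as pd

edges = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"]})


def neighbors(edges_df, seeds, direction="forward"):
    """Return the sorted one-hop neighbors of seeds in the given direction (illustrative)."""
    outgoing = edges_df.loc[edges_df["src"].isin(seeds), "dst"]  # source -> destination
    incoming = edges_df.loc[edges_df["dst"].isin(seeds), "src"]  # destination -> source
    if direction == "forward":
        result = outgoing
    elif direction == "reverse":
        result = incoming
    elif direction == "undirected":
        result = pd.concat([outgoing, incoming])
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return sorted(set(result))
```

Starting from node b, the forward direction reaches c, the reverse direction reaches a, and the undirected form reaches both.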

Predicates#

Comparison Predicates#

gt(value)    # Greater than
lt(value)    # Less than
ge(value)    # Greater than or equal
le(value)    # Less than or equal
eq(value)    # Equal
ne(value)    # Not equal

Membership Predicate#

is_in([value1, value2, ...])  # Value in list

Range Predicate#

between(lower, upper, inclusive=True)  # Value in range
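
For readers coming from pandas, the comparison, membership, and range predicates correspond closely to familiar vectorized Series operations. The mapping below is an informal analogy rather than a statement of how GFQL evaluates predicates; in particular, treating inclusive=False as pandas' inclusive="neither" is an assumption.

```python
import pandas as pd

age = pd.Series([17, 30, 45, 65])

gt_30 = age > 30             # analogous to gt(30)
ge_30 = age >= 30            # analogous to ge(30)
ne_45 = age != 45            # analogous to ne(45)
in_set = age.isin([17, 65])  # analogous to is_in([17, 65])

# analogous to between(30, 65, inclusive=True)
in_range = age.between(30, 65, inclusive="both")
# one reading of between(30, 65, inclusive=False)
open_range = age.between(30, 65, inclusive="neither")
```

Each result is a boolean Series aligned with the input column, which is what allows predicates to be combined inside a single filter.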

String Predicates#

Pattern matching predicates:

contains(pat, case=True, regex=True)     # Contains pattern (substring or regex)
startswith(prefix, case=True)            # Starts with prefix
endswith(suffix, case=True)              # Ends with suffix
match(pat, case=True, flags=0)           # Matches regex from start of string
fullmatch(pat, case=True, flags=0)       # Matches regex against entire string

String type checking predicates:

isalpha()    # Alphabetic characters only
isnumeric()  # Numeric characters only
isdigit()    # Digits only
isalnum()    # Alphanumeric
isupper()    # All uppercase
islower()    # All lowercase
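
The string predicates share names and parameters with the pandas Series.str accessor, so pandas offers a convenient way to preview what each one selects. The sketch below uses only pandas calls; treating them as equivalent to the GFQL predicates is an assumption. Note that pandas' startswith and endswith take no case argument, so a case-insensitive check there would need the column lowercased first.

```python
import pandas as pd

names = pd.Series(["Alice", "bob42", "CAROL", "dave"])

has_a = names.str.contains("a", case=False, regex=False)  # substring, case-insensitive
starts_upper = names.str.match(r"[A-Z]")                  # regex anchored at the start
all_caps = names.str.fullmatch(r"[A-Z]+")                 # regex must cover the whole string
starts_da = names.str.startswith("da")                    # literal prefix

is_alpha = names.str.isalpha()  # letters only
is_upper = names.str.isupper()  # all cased characters are uppercase
```

The difference between match and fullmatch shows up on "Alice": it begins with an uppercase letter, so match succeeds, but it is not entirely uppercase, so fullmatch fails.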

Null Predicates#

is_null()     # Is null/None
not_null()    # Is not null/None
is_na()       # Is NaN (numeric)
not_na()      # Is not NaN

Temporal Predicates#

is_month_start()    # First day of month
is_month_end()      # Last day of month
is_quarter_start()  # First day of quarter
is_quarter_end()    # Last day of quarter
is_year_start()     # First day of year
is_year_end()       # Last day of year
is_leap_year()      # Is leap year
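
The temporal predicates have the same names as properties on the pandas Series.dt accessor. The following sketch evaluates those pandas properties on a few dates to show what each check selects; that the GFQL predicates behave identically is assumed here, not confirmed by this specification.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-02-29", "2023-12-31", "2024-06-30"]))

month_start = dates.dt.is_month_start
month_end = dates.dt.is_month_end
quarter_end = dates.dt.is_quarter_end
year_start = dates.dt.is_year_start
year_end = dates.dt.is_year_end
leap_year = dates.dt.is_leap_year
```

For example, 2024-02-29 counts as a month end and falls in a leap year, while 2023-12-31 is simultaneously a month end, quarter end, and year end.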

Call Operations and Security#

Call Operations#

GFQL supports calling Plottable methods through the call() operation, providing controlled access to graph transformation and analysis capabilities:

call(function: str, params: dict) -> ASTCall

Call operations enable:

Graph algorithms (PageRank, community detection)
Layout computations (ForceAtlas2, Graphviz)
Data transformations (filtering, collapsing)
Visual encodings (color, size, icons)

Safelist Architecture#

For security and stability, Call operations are restricted to a predefined safelist of methods. This prevents:

Arbitrary code execution
Access to filesystem or network operations
Modification of global state
Unsafe graph operations

Safelist Categories#

Graph Analysis

get_degrees, get_indegrees, get_outdegrees: Calculate node degrees
compute_cugraph: Run GPU algorithms (pagerank, louvain, etc.)
compute_igraph: Run CPU algorithms
get_topological_levels: Analyze DAG structure

Filtering & Transformation

filter_nodes_by_dict, filter_edges_by_dict: Filter by attributes
hop: Traverse graph with conditions
drop_nodes, keep_nodes: Node selection
collapse: Merge nodes by attribute
prune_self_edges: Remove self-loops
materialize_nodes: Generate node table

Layout

layout_cugraph: GPU-accelerated layouts
layout_igraph: CPU-based layouts
layout_graphviz: Graphviz layouts
fa2_layout: ForceAtlas2 layout
ring_continuous_layout: Radial layout driven by numeric attributes
ring_categorical_layout: Radial layout grouping by categories
time_ring_layout: Time-series radial layout (accepts ISO timestamp bounds)

Note

time_ring_layout accepts ISO-8601 strings for time_start / time_end when sent over the wire. GFQL converts them to numpy.datetime64 before use so the behavior matches direct Plotter calls.
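
A minimal sketch of the kind of conversion the note describes, assuming timezone-naive ISO-8601 strings. The helper to_datetime64 is hypothetical and only illustrates the string-to-numpy.datetime64 step; it is not the library's code.

```python
import numpy as np


def to_datetime64(value):
    """Convert an ISO-8601 string to numpy.datetime64; pass existing datetime64 values through."""
    if isinstance(value, np.datetime64):
        return value
    if isinstance(value, str):
        return np.datetime64(value)
    raise TypeError(f"expected ISO-8601 string or numpy.datetime64, got {type(value).__name__}")


time_start = to_datetime64("2024-01-01T00:00:00")
time_end = to_datetime64("2024-12-31T23:59:59")
```

Once both bounds are numpy.datetime64 values they compare directly, for example time_start < time_end.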

Visual Encoding

encode_point_color: Color nodes/edges
encode_point_size: Size nodes
encode_point_icon: Set icons
bind: Attach visual attributes

Embeddings & Dimensionality Reduction

umap: UMAP dimensionality reduction for graph embeddings

Validation#

Call operations undergo multiple validation stages:

Safelist Check: Function name must be in the safelist
Parameter Validation: Parameters validated against method signature
Type Checking: Runtime type validation
Schema Validation: Compatibility with graph schema

Error Codes#

E104: Function not in safelist
E105: Missing required parameter
E201: Parameter type mismatch
E303: Unknown parameter
E301: Required column not found (runtime)
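
The validation stages and error codes can be pictured with a small safelist checker. Everything in this sketch is hypothetical: the SAFELIST table, its parameter specifications, the CallValidationError class, and validate_call are invented for illustration and do not reflect the real safelist contents or implementation. Schema validation (E301) is omitted because it depends on the graph's columns at runtime.

```python
# Hypothetical safelist: function name -> {parameter name: (expected type, required)}.
SAFELIST = {
    "get_degrees": {"col": (str, False)},
    "hop": {"hops": (int, False), "to_fixed_point": (bool, False)},
    "filter_nodes_by_dict": {"filter_dict": (dict, True)},
}


class CallValidationError(ValueError):
    """Validation failure carrying one of the error codes listed above."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def validate_call(function, params):
    """Check a call against the safelist; raise CallValidationError on the first problem."""
    if function not in SAFELIST:
        raise CallValidationError("E104", f"function {function!r} is not in the safelist")
    spec = SAFELIST[function]
    for name in params:
        if name not in spec:
            raise CallValidationError("E303", f"unknown parameter {name!r} for {function}")
    for name, (expected_type, required) in spec.items():
        if name not in params:
            if required:
                raise CallValidationError("E105", f"missing required parameter {name!r}")
            continue
        if not isinstance(params[name], expected_type):
            raise CallValidationError("E201", f"parameter {name!r} must be {expected_type.__name__}")
```

Checking the function name first means an unlisted function such as "eval" is rejected with E104 before any of its parameters are inspected.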

Type System#

Value Types#

Scalars
- number: int, float
- string: Text values
- boolean: True/False
- null: None
Temporal Types
- datetime: Timestamp with optional timezone
- date: Calendar date
- time: Time of day
Collections
- list: Ordered sequence of values

Type Coercion#

GFQL performs automatic type coercion:

Python datetime → pandas Timestamp
Numeric types → appropriate precision
Collections → lists for is_in()
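
A minimal sketch of two of these coercions, assuming they happen before a value reaches a predicate. The function coerce_value is hypothetical; numeric precision handling is left out because the specification does not spell out its rules.

```python
from datetime import datetime

import pandas as pd


def coerce_value(value):
    """Illustrative coercion: datetime -> pandas Timestamp, set/tuple -> list."""
    if isinstance(value, datetime):
        return pd.Timestamp(value)
    if isinstance(value, (set, tuple)):
        # Sets have no defined order, so the resulting list order is arbitrary for them.
        return list(value)
    return value


ts = coerce_value(datetime(2024, 1, 1, 12, 30))
members = coerce_value(("a", "b"))
```

Here ts is a pandas Timestamp for 2024-01-01 12:30 and members is the list ["a", "b"], ready to pass to a membership test.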

Execution Model#

Declarative Pattern Matching#

GFQL follows a declarative execution model similar to Neo4j’s Cypher:

Pattern Declaration: Chains express path patterns in the graph
- Users declare graph patterns as sequences of node and edge constraints
- Patterns specify what paths to match, not how to find them
- The engine optimizes pattern matching based on data characteristics
Set-Based Operations: All operations work on sets of entities
- No explicit iteration or traversal order
- Results include all matching patterns in the graph
- Current GFQL engines use a novel bulk-oriented execution model that is asymptotically faster than traditional iterative approaches used for Cypher, but this is not a requirement of the language itself
Lazy Evaluation: Chains define pattern transformations without immediate execution
- Allows engines to optimize path finding and pattern matching strategies

Result Access#

Query execution returns filtered node and edge datasets. In the Python embedding:

result = g.gfql([...])
nodes_df = result._nodes  # Filtered nodes
edges_df = result._edges  # Filtered edges

Named Results#

Operations with a name parameter add a boolean column of that name, marking which entities matched:

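from graphistry import n, e_forward  # GFQL matchers from the Python embedding

# g is assumed to be a graph with node and edge dataframes already bound
result = g.gfql([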
    n({"type": "person"}, name="people"),
    e_forward(name="connections"),
    n({"active": True}, name="active_targets")
])

# Access all matched nodes and edges:
all_nodes = result._nodes
all_edges = result._edges

# Access specific matched nodes/edges using pandas filtering:
people_nodes = result._nodes[result._nodes["people"]]
connection_edges = result._edges[result._edges["connections"]]
active_nodes = result._nodes[result._nodes["active_targets"]]

# Or using standard pandas query syntax:
people_nodes = result._nodes.query("people == True")

This pattern is essential for extracting specific subsets from complex graph traversals.

Best Practices#

Use specific filters early: Filter nodes before traversing edges
Limit hops: Use reasonable hop limits to avoid explosion
Name important results: Use name parameter for analysis
Prefer filter_dict: More efficient than query strings
Use appropriate predicates: Match predicate to column type
