The typical data layout in a data warehouse is to have fact data rolled up along time, with reduced dimensions at each level. Fact data carries dimension keys so that it can be joined with the actual dimension tables to get more information on dimension attributes.
The typical data layout is depicted in the following diagram.
Lens provides an abstraction to represent the above layout, and allows the user to define the schema of the data at a conceptual level and also query it, without knowing the physical storages and rollups, as described in the sections below.
The metastore model introduces the constructs Storage, Cube, Dimension, Fact table, Dimtable and Partition. Below we provide a brief introduction to these constructs. You're welcome to check out the javadoc; you'll find a corresponding class for each construct. The entities can be defined either by creating objects of these classes, or by writing xmls according to their schema. The schema is also available in the javadoc.
We have followed a convention in naming classes for constructs: the class for a Storage is called XStorage and the xml root tag is x_storage. If a storage is part of a bigger xml whose root tag is some other construct, then the tag is storage. So in all xmls for lens, only the outermost tag is prefixed with x_; inner tags are not.
Storage represents a physical storage. It can be a Hadoop file system or a database. It is defined by a name, an endpoint and the properties associated with it.
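For illustration, a minimal storage definition xml might look like the sketch below. The classname and the storage.url property follow the bundled HDFS storage example; verify the exact schema against the javadoc.

```xml
<!-- Sketch of a storage definition; names and properties are illustrative -->
<x_storage name="local" classname="org.apache.lens.cube.metadata.HDFSStorage"
           xmlns="uri:lens:cube:0.1">
  <properties>
    <!-- endpoint of the storage -->
    <property name="storage.url" value="file:///"/>
  </properties>
</x_storage>
```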
A Field has a name, a display string and a description. Field has the following sub types:
A measure is a quantity that you are interested in measuring. A measure is a field with a default aggregator, a format string, a unit, and a start time and end time. It can also have a min and max value.
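Inside a cube's xml, a measure could be declared roughly as in the sketch below; the measure name and attribute values are assumed for illustration and should be checked against the cube schema in the javadoc.

```xml
<measures>
  <!-- a measure with a default aggregator and a display format -->
  <measure name="unit_sales" type="BIGINT" default_aggr="SUM"
           display_string="Unit Sales" format_string="#,###"/>
</measures>
```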
Dim attributes are not measured; they are more like properties of your data, e.g. location, user name etc.
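A dim attribute declaration is similar; a sketch (with a hypothetical attribute name) could be:

```xml
<dim_attributes>
  <!-- a plain attribute describing the data, not something that is aggregated -->
  <dim_attribute name="customer_city" type="STRING" display_string="Customer City"/>
</dim_attributes>
```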
An expression column has one or many expression specs. So you can declare that an expression field is specified by one expression for some time period, and by another expression for other time periods.
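For instance, an expression with two time-bound specs could be sketched as below; the field names and cut-over date are assumptions made for illustration only.

```xml
<expressions>
  <expression name="profit" type="DOUBLE" display_string="Profit">
    <!-- one definition valid up to a point in time, another after it -->
    <expr_spec expr="store_sales - store_cost" end_time="2015-03-01T00:00:00"/>
    <expr_spec expr="store_sales - store_cost - shipping_cost" start_time="2015-03-01T00:00:00"/>
  </expression>
</expressions>
```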
A join chain is a directional path between two conceptual tables. So if a conceptual table t1 has a chain jc to conceptual table t2, t1 can access t2's fields by saying jc.<t2_field_name>. A join chain consists of one or more join paths. A path is defined by a sequence of edges where an edge is of the form table1.some_field = table2.some_field. In a path, the end table of one edge should be the same as the start table of the next edge.
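A join chain with a single path of one edge could be sketched as follows; the table and column names are hypothetical:

```xml
<join_chains>
  <join_chain name="customer_details">
    <paths>
      <path>
        <edges>
          <!-- sales.customer_id = customer.id -->
          <edge>
            <from table="sales" column="customer_id"/>
            <to table="customer" column="id"/>
          </edge>
        </edges>
      </path>
    </paths>
  </join_chain>
</join_chains>
```

With such a chain defined, the cube can refer to the destination table's fields as customer_details.<field_name>.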
Conceptual tables are a set of fields. Two types of conceptual tables are defined:
A Dimension is a conceptual table which only contains dim attributes, expressions and join chains.
A cube is a conceptual table which contains dim attributes, measures, expressions and join chains.
Cubes are of two types:
Base cubes contain the full description of all their fields.
A derived cube has a subset of the measures and dimensions of a base cube. The user can query a derived cube as well, very much like a base cube. For a derived cube, the user specifies only the set of measure names and dimension names; the definition of each measure/dimension is derived from the base cube. All the measures and dimensions of a derived cube can always be queried together, whereas all measures and dimensions of the parent cube may not be allowed to be queried together.
Derived cubes can thus act as a constraint over which fields can be queried together.
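As a sketch, a derived cube only lists the names of the measures and dim attributes it exposes from its parent; the tag names below are indicative and should be confirmed against the schema in the javadoc.

```xml
<x_derived_cube name="sales_performance" parent="sales" xmlns="uri:lens:cube:0.1">
  <!-- only names are listed; the definitions come from the parent base cube -->
  <measure_names>
    <measure_name>unit_sales</measure_name>
  </measure_names>
  <dim_attr_names>
    <attr_name>customer_city</attr_name>
  </dim_attr_names>
</x_derived_cube>
```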
Cubes and dimensions are just collections of fields; they are the highest level of abstraction over the underlying data. Logical tables are one level down in the hierarchy of abstraction. A logical table belongs to a conceptual table and can have a subset of the fields of the conceptual table. There are logical tables for both types of conceptual tables. Conceptual tables have fields; at the logical table level we call them columns. A column is not a measure, dim attribute or expression; a column just has a name and a data type. At this level, the distinction between dim attribute, measure and expression goes away: a logical table can declare any of these as a column. Logical tables drop the concern of join chains fully; those are taken care of at the conceptual table level. Logical tables also drop the concern of expressions partially: an expression field can be present on a logical table as a column, or the sub-fields of the expression can be present as columns and the expression field can be derived from them.
A logical table can be present on multiple storages. A logical table present on a storage is called a physical table or a storage table. The two types of logical table corresponding to the conceptual tables are as below:
Dimension tables are associated with Dimensions. They can be available on multiple storages.
The fact table is associated with a cube, specified by name. A fact can also be available on multiple storages. The fact will be used to answer queries on derived cubes as well. Typically facts will belong only to base cubes, and derived cubes will inherit all the facts of the base cube.
The logical table present on a storage is called a storage table. It will have the same schema as the fact/dimension table definition. Each storage table can have its own storage descriptor. As mentioned below, each storage table can have its own choice of update periods. A storage table can be partitioned by columns. Usually partition columns are dim attributes; they can be timed dim attributes or non-time dim attributes. Other properties can be found in the javadoc for the storage descriptor.
Physical tables are not defined separately; they are part of the schema of logical tables as storage_tables.
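Putting the above together, a fact table xml with one storage table could be sketched as follows. The table name, cube name, location and delimiter are illustrative; the authoritative structure is the schema shipped with the javadoc.

```xml
<x_fact_table name="sales_raw_fact" cube_name="sales" xmlns="uri:lens:cube:0.1">
  <columns>
    <!-- at this level everything is just a named, typed column -->
    <column name="customer_id" type="INT"/>
    <column name="unit_sales" type="BIGINT"/>
  </columns>
  <storage_tables>
    <storage_table>
      <update_periods>
        <update_period>HOURLY</update_period>
        <update_period>DAILY</update_period>
      </update_periods>
      <storage_name>local</storage_name>
      <table_desc external="true" field_delimiter="," table_location="/tmp/examples/rawfact">
        <part_cols>
          <!-- time partition column -->
          <column name="dt" type="STRING" comment="Time column"/>
        </part_cols>
        <time_part_cols>dt</time_part_cols>
      </table_desc>
    </storage_table>
  </storage_tables>
</x_fact_table>
```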
The name of the storage table is the storage name followed by the fact/dimension table name, separated by '_'. E.g. for fact table FACT1 on storage S1, the storage table name is S1_FACT1; for dimension table DIM1 on storage S1, the storage table name is S1_DIM1.
Fact or dimension tables are available on some storages; on each storage, the physical table can be updated at regular intervals. SECONDLY, MINUTELY, HOURLY, DAILY, WEEKLY, MONTHLY, QUARTERLY and YEARLY update periods are supported. Support for the CONTINUOUS update period has also been added but might be incomplete till the 2.4 release.
So given a storage table and one of its update periods, data is supposed to be registered at a fixed interval. The construct for this is called a partition. You can register a single partition or multiple partitions together. Once registered, the partition(s) can be updated as well.
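As a rough sketch, registering a single partition via xml could look like the snippet below. The table name, location and timestamp are hypothetical, and the tag names should be verified against the partition schema in the javadoc.

```xml
<x_partition fact_or_dimension_table_name="sales_raw_fact" update_period="DAILY"
             location="/data/sales_raw_fact/2015-04-13" xmlns="uri:lens:cube:0.1">
  <!-- value of the time partition column that this partition covers -->
  <time_partition_spec>
    <part_spec_element key="dt" value="2015-04-13T00:00:00"/>
  </time_partition_spec>
</x_partition>
```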
Implementation-wise, the partitions are stored as partitions in the hive metastore. For optimization purposes, lens also keeps the most crucial info cached. Here the difference between fact storage tables and dim storage tables becomes significant.
The corresponding physical tables for the logical tables defined above are:
A dimension storage table is the physical dimension table for the associated storage. A dimension storage table can have snapshot dumps at a specified regular interval, or be a table with no dumps, so the storage table can have zero or one update period.
If the dimension storage table is being updated regularly, older partitions are expected to have less data than the latest partitions. An example could be country id to country name mappings: newer partitions are supposed to contain at least as many mappings as older partitions, possibly more. Once a partition is registered, all the older partitions become obsolete.
In accordance with this, while registering a partition, lens registers an additional partition with value latest which has the same path as the actual latest partition. The convention is that dim storage tables are always supposed to be queried with the latest partition, and this is reflected in lens's query translation logic, where only the latest partition is queried.
Since only one partition is relevant for dim storage tables, lens maintains a hash map for quicker lookup of the latest partition.
Unlike dim storage tables, all partitions in fact storage tables are relevant and queryable, so there is no latest partition. Instead, lens maintains something called a Partition Timeline. Timelines are better explained in this wiki page.
Here we'll explore some of the things that you need to be aware of to interact with timelines as a lens user.
Timelines are stored in the storage table's properties, which are again cached in memory. Since one fact storage table can have multiple update periods and the partitions registered for them can differ, timelines need to be kept for all update periods. Also, one storage table can have multiple partition columns, so timelines need to be present for all partition columns too. So for one fact storage table, if x is the number of update periods and y is the number of partition columns, there will be x*y timelines for it.
You can see the current timeline of a fact via this rest api
Alternatively, on the cli you can view it like this:
```
lens-shell>fact timelines --fact_name sales_aggr_fact2
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=mydb_sales_aggr_fact2, updatePeriod=DAILY, partCol=dt, all=null), first=2015-04-12, holes=[], latest=2015-04-12)
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=mydb_sales_aggr_fact2, updatePeriod=DAILY, partCol=ot, all=null), first=2015-04-12, holes=[], latest=2015-04-12)
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=mydb_sales_aggr_fact2, updatePeriod=DAILY, partCol=pt, all=null), first=2015-04-13, holes=[], latest=2015-04-13)
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=HOURLY, partCol=dt, all=null), first=2015-04-13-04, holes=[], latest=2015-04-13-05)
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=DAILY, partCol=dt, all=null), first=2015-04-11, holes=[], latest=2015-04-12)
lens-shell>fact timelines --fact_name sales_aggr_fact2 --storage_name local
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=HOURLY, partCol=dt, all=null), first=2015-04-13-04, holes=[], latest=2015-04-13-05)
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=DAILY, partCol=dt, all=null), first=2015-04-11, holes=[], latest=2015-04-12)
lens-shell>fact timelines --fact_name sales_aggr_fact2 --storage_name local --update_period HOURLY
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=HOURLY, partCol=dt, all=null), first=2015-04-13-04, holes=[], latest=2015-04-13-05)
lens-shell>fact timelines --fact_name sales_aggr_fact2 --storage_name local --update_period HOURLY --time_dimension delivery_time
EndsAndHolesPartitionTimeline(super=PartitionTimeline(storageTableName=local_sales_aggr_fact2, updatePeriod=HOURLY, partCol=dt, all=null), first=2015-04-13-04, holes=[], latest=2015-04-13-05)
lens-shell>
```
Any time you feel that the timeline is out of sync with the actual registered partitions, just set cube.storagetable.partition.timeline.cache.present = false in the storage table's properties and restart the lens server. The server will then read all partitions registered for the storage table and re-create the timeline. After creation, it will update the table properties to reflect the correct value.
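As a sketch, the flag is supplied like any other storage table property; the enclosing table_parameters element below is how properties typically appear in a storage table descriptor, but the exact placement should be confirmed against the storage descriptor schema.

```xml
<table_parameters>
  <!-- force lens to rebuild the timeline from the registered partitions on the next restart -->
  <property name="cube.storagetable.partition.timeline.cache.present" value="false"/>
</table_parameters>
```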
Lens provides a REST api, Java client api and CLI for doing CRUD on the metastore.
Users can query lens through OLAP Cube QL, which is a subset of HiveQL.
Here is the grammar:
[CUBE] SELECT [DISTINCT] select_expr, select_expr, ...
FROM cube_table_reference
[WHERE [where_condition AND] [TIME_RANGE_IN(colName, from, to)]]
[GROUP BY col_list]
[HAVING having_expr]
[ORDER BY colList]
[LIMIT number]
cube_table_reference:
cube_table_factor
| join_table
join_table:
cube_table_reference JOIN cube_table_factor [join_condition]
| cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]
cube_table_factor:
cube_name or dimension_name [alias]
| ( cube_table_reference )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
colOrder: ( ASC | DESC )
colList : colName colOrder? (',' colName colOrder?)*
TIME_RANGE_IN(colName, from, to) : The time range inclusive of 'from' and exclusive of 'to'.
The time range clause is applicable only if cube_table_reference has a cube_name.
Format of the time range is <yyyy-MM-dd-HH:mm:ss,SSS>
OLAP Cube QL supports all the functions that hive supports, as documented in Hive Functions.
The query engine provides the following features:
Various configurations available for running an OLAP query are documented at OLAP query configurations.
Users can specify the time range as absolute or relative time in a lens cube query. The lens cube query language allows passing a time range at different granularities like secondly, minutely, hourly, daily, weekly, monthly and yearly. The time range is passed in the query with the syntax time_range_in(time_dim_name, start_time, end_time). The range is half open: the start time is inclusive and the end time is exclusive.
time_range_in(time_dim_name, start_time, end_time) === start_time <= time_dim_name < end_time
Here is a link to a discussion on time range behaviour.
Relative time ranges are helpful to users in scheduling their queries. We'll explain with an example: the user can specify HOURLY granularity with 'now.hour'.
The following table lists the available unit granularities and how to specify those granularities for a relative time range.
| UNIT | Specification | Relative time |
|---|---|---|
| Secondly | now.second | now.second +/- 30seconds |
| Minutely | now.minute | now.minute +/- 30minutes |
| Hourly | now.hour | now.hour +/- 3hours |
| Daily | now.day | now.day +/- 3days |
| Weekly | now.week | now.week +/- 3weeks |
| Monthly | now.month | now.month +/- 3months |
| Yearly | now.year | now.year +/- 2years |
query execute cube select col1 from cube where TIME_RANGE_IN(col2, "now.hour-4hours", "now.hour")
The above query fetches data for the last 4 hours.
Users can query the data with an absolute time range at different granularities. The following table describes how to specify an absolute time range at each granularity.
| UNIT | Absolute time specification |
|---|---|
| Secondly | yyyy-MM-dd-HH:mm:ss |
| Minutely | yyyy-MM-dd-HH:mm |
| Hourly | yyyy-MM-dd-HH |
| Daily | yyyy-MM-dd |
| Monthly | yyyy-MM |
| Yearly | yyyy |
query execute cube select col1 from cube where TIME_RANGE_IN(it, "2014-12-29-07", "2014-12-29-11")
query execute cube select col1 from cube where TIME_RANGE_IN(it, "2014-12-29", "2014-12-30")
The latter queries the data between 29th Dec 2014 (inclusive) and 30th Dec 2014 (exclusive).
A bridge table sits between a cube and a dimension, or between two dimensions, and is used to resolve many-to-many relationships. Refer to the following for more details:
The user can specify whether any destination link in a join chain maps to a many-to-many relationship during the creation of the cube/dimension.
Let's look at the following example:
User:
| ID | Name | Gender |
|---|---|---|
| 1 | A | M |
| 2 | B | M |
| 3 | C | F |
User interests:
| UserID | Sports ID |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
Sports:
| SportsID | Description |
|---|---|
| 1 | Football |
| 2 | Cricket |
| 3 | Basketball |
User Interests is the bridge table capturing the many-to-many relationship between Users and Sports. And if we have a fact as follows:
| UserId | Revenue |
|---|---|
| 1 | 100 |
| 2 | 50 |
If an analyst is interested in analyzing revenue with respect to the users' sports of interest, then the report would look like the following:
| User's sport | Revenue |
|---|---|
| Football | 150 |
| Cricket | 150 |
| Basketball | 50 |
Though the individual rows are correct and the overall revenue is actually 150, looking at the above report makes people assume that the overall revenue is 350. Lens provides a flattening feature to optionally flatten the selected fields, if the fields involved come from bridge tables in the join path. If flattening is enabled, the report would be the following:
| User Interest | Revenue |
|---|---|
| Football, Cricket | 100 |
| Football, Cricket, Basketball | 50 |
When there is an expression around the bridge table fields, the user might be interested in doing field aggregations on top of the expression defined. Also, simple filters on the fields should be applied to the array generated. This feature provides the capability for the same.
For example, "select user.sport, revenue from sales where user.sport in ('CRICKET')" would convert the filter user.sport in ('CRICKET') to a contains check on the aggregated user sports.
See the configuration params available at OLAP query configurations and look for the config related to bridge tables, for turning this on.
Lens provides a REST api, Java client api, JDBC client and CLI for submitting queries, checking status and fetching results.