The following file extensions are supported for creating a Data Set (each format has unique configuration options based on its specific data structure and properties):
*.csv - comma-separated values files. This format typically uses a comma to separate data fields, though other delimiters can also be used.
*.tsv - tab-separated values files. This format typically uses tabs to separate data fields, though other delimiters can also be used.
*.tbl - generic table file. These files store tabular data and are often used in database systems or other software that handles structured data.
*.txt - plain text file. Notepad files, usually contain unformatted text.
*.xls - Excel binary file format. An older Microsoft Excel format that stores spreadsheet data in a binary format.
*.xlsx - Microsoft Excel spreadsheet. Regular and modern Excel file format.
*.sas7bdat - SAS Database storage files. Created by SAS (Statistical Analysis System) software.
Add a new Data Set based on a flat file
Fields common to all supported flat file extensions include:
Name - enter a suitable name for the Data Set.
Data Source Name - choose the required Data Source from the list of available Data Sources assigned to this specific Project. Based on your selection, list of available entities for Data Set creation will be shown.
Description - provide a short, informative description of the Data Set.
Tags - Data Set can be related to appropriate Tag(s).
Data Entity area:
The Data Entity area contains common fields and configuration options that may vary based on the file type.
Begin by selecting the file you will use as the base for the Data Set.
Fields common to all supported flat file extensions:
Dropdown List of Files - select the file to use for the Data Set.
Friendly name - the User can choose an alias that will be used instead of the file name in the "Selected columns" area in the column "Table". For example:
Advanced - allows to use a Variable instead of file name. Variable name should be in format @variable_name;. For example:
Reload Metadata button - refreshes the value of the specified Variable.
Search Pattern checkbox (*) – available only for Data Sets of type “Flat File”, allows to retrieve values from the latest/newest file(s). Please see the detailed explanation in the (*) note.
Files has header line checkbox – mark if the file includes a header line.
Header row number – specify the row number containing the header.
Properties:
Each file type includes specific Properties in the Data Entity area, with some fields applicable only to certain formats.
CSV Files properties
Delimited/Fixed Width - controls how data fields are separated:
If Delimited is selected, you can choose the delimiter for separating fields in the data (e.g., comma, tab, pipe, or specify a custom separator under "Other").
If Fixed Width is selected, you can manually specify the number of characters allocated to each column. This option is useful when data fields have consistent widths without a specific delimiter:
Add Column button - each time you click it, a new column is being added to the list. Click this button according to the number of required columns. If all existing columns have been already added to the table and the button is being clicked again, new fictive columns will be added in an alphabetical order: A, B, C... as seen in the screenshot below:
These columns will be empty.
Remove Last column button - clicking this button removes the last column from the list. If you want to remove a specific column that is not the last one, click the small 'trash' icon on the right-hand side of the column.
Excel Files (*.xlsx, *.xls) properties
Sheet Name – allows to choose which sheet to use for the Data Set from the selected file:
TSV, TBL, TXT and SAS files properties
If Delimited is chosen:
If Fixed Width is chosen:
Delimited/Fixed Width - same as in CSV files.
Sheet Name - same as in Excel Files.
Refresh columns button – after setting all the necessary properties, click the "Refresh Columns" button to get/update the columns, that will be displayed in the "Selected Columns" area on the right side.
Selected Columns area:
- select the checkbox on the left side of each column to choose the column to be displayed in the Data Set.
- use "All" button to show all the columns available in the Data Set.
- "Selected" button - show all the selected columns.
For example:
When clicking the "All" button:
When clicking the "Selected" button:
- The user can define a "Friendly" name for the column(s) buy clicking and changing current value.
- Data Type – can be selected from the list of available values:
- Format - depending on type chosen in "Data Type" column, available format options will appear. This is a template for displaying data that allows users to select and specify their preferred display format:
Text Functions - allows to convert String columns' values to upper case or to lower case, ensuring consistent text formatting for easy comparison.
Add Group By:
Users can choose Aggregate Functions and "Group by" for each column.
You must select at least one column and then click the "Add Group By" button as shown in the screenshot below:
The next window will open up:
You may select a certain function or to group the results by a certain combination of columns.
After choosing specific functions/ Group By, you can view the new query by clicking the "View Code" button:
Filter area:
A filter(s) can be added by clicking “Add Condition” button. Filters stand for “where” operation applied to the query, for example:
AND/OR operators:
The AND and OR operators are used to filter records based on more than one condition:
- The AND operator displays a record if all the conditions separated by AND are TRUE.
- The OR operator displays a record if any of the conditions separated by OR is TRUE.
(*) Search Pattern checkbox:
To retrieve values from the latest/newest file in a folder containing files in the same format:
- Select desired Entity.
- Mark the “(*) Search Pattern” checkbox.
- Insert the file name pattern, use "*" to indicate text before/after the pattern. The '*' char is optional.
Note: all files matching the pattern must have the same structure, including:
- Identical column names.
- The same number of columns.
For example: If the folder contains files cars_1.xlsx, cars_2.xlsx, and cars_3.xlsx, use the pattern cars*.xlsx. The Data Set will always display data from the most recent file.
Example of Data Set based on Excel file (*.xlsx):
Example of Data Set based on TSV file:
Selected Columns area:
Example of Group By:
Example of Data Set based on TXT file:
Example of Data Set based on CSV file:
Note: in order to switch between CSV and TSV file types, switch the type of the delimiter. A CSV file requires a delimiter of type 'Comma' and a TSV file requires a delimiter of type 'Tab'. See the picture below for different delimiter types.
Example of Data Set based on Sas7bdat file:
Comments
0 comments
Please sign in to leave a comment.