# Get the Model Metadata When Generating Prediction Workflow Codes

## Background

Currently, SQLFlow saves the model structure and weights after training. However, there are also cases in which we have to save the model metadata.

Let us take the following `TO TRAIN` and `TO PREDICT` statements as an example:

```sql
SELECT * FROM my_db.train_table
TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
...
LABEL class
INTO my_model;

SELECT * FROM my_db.test_table
TO PREDICT my_db.test_table_prediction.class
USING my_model;
```

- We should save the Docker image name used in the `TO TRAIN` statement so that we can use the same Docker image in the `TO PREDICT` statement.

  The `TO PREDICT` statement should use the same Docker image as the `TO TRAIN` statement, i.e., `my_docker_registry/my_docker_image:latest` in the above example. Therefore, we should save the Docker image name used in the `TO TRAIN` statement at the end of training.

- When running the `TO PREDICT` statement, we should know whether the trained model is a TensorFlow or an XGBoost model, so that we can generate the Python code accordingly.

  Code generation may differ considerably between TensorFlow and XGBoost models. When running the `TO TRAIN` statement, we can use the estimator name (i.e., `MyDNNClassifier` in the above example) to tell whether the model to train is a TensorFlow or an XGBoost model. But when running the `TO PREDICT` statement, we only get the trained model name (i.e., `my_model` in the above example), not the estimator name, so we cannot tell what kind of model it is or how to generate the prediction Python code. Therefore, we should also save the estimator name at the end of training.

## What Data Should Be Saved As the Model Metadata

We propose to save all fields of `ir.TrainStmt`, so that we have all the necessary metadata of the `TO TRAIN` statement.
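
As an illustration, the metadata derived from the example statements above might serialize to JSON like this (a sketch; the field names are hypothetical, and the actual schema is whatever `ir.TrainStmt` contains):

```python
import json

# Hypothetical metadata content; the real schema mirrors the fields of
# ir.TrainStmt. Values come from the example TO TRAIN statement above.
metadata = {
    "model_image": "my_docker_registry/my_docker_image:latest",
    "estimator": "MyDNNClassifier",
    "label": "class",
}
serialized = json.dumps(metadata)
```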

## How to Save the Model Metadata

### Open Source Version

In the open source version, we can save the metadata in the DBMS along with the model structure and weights. Suppose that a user writes the following SQL statement:

```sql
SELECT * FROM my_db.train_table
TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
...
LABEL class
INTO my_db.my_trained_dnn_model;
```

We should save the metadata, model structure, and weights together in the DBMS table `my_db.my_trained_dnn_model`.

In the current implementation, the model structure and weights are saved in the following format:

```
+-----+-----------------------------+
| id  | block                       |
+-----+-----------------------------+
| 0   |                             |
| 1   | model structure and weights |
| ... |                             |
+-----+-----------------------------+
```

The model structure and weights are serialized into a byte stream and saved across multiple rows of the DBMS table.

In the new design, we propose saving the data in the DBMS table in the following format:

```
+-----+----------------------------------------------------------+
| id  | block                                                    |
+-----+----------------------------------------------------------+
| 0   |                                                          |
| 1   | (metadata_length, metadata, model structure and weights) |
| ... |                                                          |
+-----+----------------------------------------------------------+
```

The first 64 bits store the metadata's length; the metadata itself follows, and the model structure and weights come last. This design is almost the same as the current implementation except for the leading metadata fields, so we can read the metadata without loading the entire model.

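
The length-prefixed layout can be sketched as follows (a minimal illustration assuming a big-endian 64-bit length prefix; the function names are hypothetical):

```python
import struct

def pack_model_blob(metadata: bytes, model_bytes: bytes) -> bytes:
    """Prefix the blob with a big-endian 64-bit metadata length."""
    return struct.pack(">Q", len(metadata)) + metadata + model_bytes

def read_metadata(blob: bytes) -> bytes:
    """Read only the metadata, without touching the model payload."""
    (length,) = struct.unpack(">Q", blob[:8])
    return blob[8:8 + length]
```

Because the length prefix has a fixed size, a reader can fetch the metadata from the first rows of the table and stop early instead of reassembling the whole model.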
### PAI Platform Version

In the PAI platform version, we can save the metadata in the OSS bucket along with the model structure and weights. Suppose that a user writes the following SQL statement:

```sql
SELECT * FROM my_db.train_table
TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
...
LABEL class
INTO my_pai_trained_dnn_model;
```

We propose to save the model metadata at `oss://sqlflow-models/user_id/my_pai_trained_dnn_model/metadata.json`, and the model structure and weights at `oss://sqlflow-models/user_id/my_pai_trained_dnn_model/model_save`.
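
The proposed OSS layout can be captured by two small path helpers (a sketch; the helper names are illustrative, not actual SQLFlow functions):

```python
OSS_MODEL_ROOT = "oss://sqlflow-models"

def metadata_path(user_id: str, model_name: str) -> str:
    # JSON metadata object saved next to the model
    return f"{OSS_MODEL_ROOT}/{user_id}/{model_name}/metadata.json"

def model_save_path(user_id: str, model_name: str) -> str:
    # model structure and weights
    return f"{OSS_MODEL_ROOT}/{user_id}/{model_name}/model_save"
```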

## How to Get the Model Metadata in Prediction Workflow Codegen

There are two situations when we generate the prediction workflow code:

- Case 1: we have trained a model beforehand, and we run only one `TO PREDICT` statement. That is to say, the SQL to run contains a single statement:

  ```sql
  SELECT * FROM my_db.test_table
  TO PREDICT my_db.test_table_prediction.class
  USING my_model;
  ```

  Since we have trained the model `my_model` beforehand, we can get its metadata from the DBMS table or the OSS bucket when generating the code for this statement.

- Case 2: the `TO PREDICT` statement uses the model trained in a previous step of the same workflow. That is to say, the SQL to run contains two statements:

  ```sql
  SELECT * FROM my_db.train_table
  TO TRAIN my_docker_registry/my_docker_image:latest/MyDNNClassifier
  ...
  LABEL class
  INTO my_model;

  SELECT * FROM my_db.test_table
  TO PREDICT my_db.test_table_prediction.class
  USING my_model;
  ```

  Since the trained model `my_model` is only generated after the first workflow step runs, we cannot get the model metadata from the DBMS or OSS when we generate the workflow code for the `TO PREDICT` statement. In this case, we should perform dependency analysis on the SQL statements: check whether any preceding `TO TRAIN` statement generates the trained model, and take the model metadata from that `TO TRAIN` statement.

In conclusion, when generating the workflow code for a `TO PREDICT` statement, we get the model metadata as follows:

- Check whether there is a `TO TRAIN` statement that generates the trained model used by the `TO PREDICT` statement.
- If yes, use the metadata from that `TO TRAIN` statement directly.
- If no, try to get the model metadata from the DBMS table or the OSS bucket.
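
The decision procedure above can be sketched as follows (a minimal illustration; the dict-based statement representation and function names are hypothetical, not the actual SQLFlow IR):

```python
def find_train_stmt(statements, model_name):
    """Return the latest TO TRAIN statement whose INTO clause writes model_name."""
    found = None
    for stmt in statements:
        if stmt["type"] == "train" and stmt["into"] == model_name:
            found = stmt  # a later TO TRAIN into the same name shadows earlier ones
    return found

def resolve_metadata(prior_statements, model_name, load_saved_metadata):
    """Prefer metadata from an in-workflow TO TRAIN; otherwise read saved storage."""
    train = find_train_stmt(prior_statements, model_name)
    if train is not None:
        return train["metadata"]
    # Fall back to the DBMS table or OSS bucket (the callback stands in for either).
    return load_saved_metadata(model_name)
```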
