timelogger数据清洗

发表于 2024-12-08 更新于 2025-05-11 分类于计算机，实践阅读次数：评论数：

更多的属于timelogger软件的数据存储结构的介绍
目前这个数据的一条龙转换清洗已经建立完成。后续只需要定期替换原始库database.sqlite，再运行kellte中一条龙工作文件即可完成。
整个工作分为抽取数据到mysql的ods库、跨天记录按天拆分到new_ods库、数据重分类并写入到dwd表。

数据抽取

这部分又分为聚合、格式化日期、写入三部分。

聚合

由于需要的数据分散于源数据库的不同表中，所以需要使用先聚合在一起。具体的连接关系如下：


graph TD
A[interval] 
A -->|activity_id| B[activity]
A -->|is_deleted!=1| V[Filter: interval.is_deleted != 1]

B -->|type_id| C[type]
B -->|is_deleted!=1| W[Filter: activity.is_deleted != 1]
B -->|status=0| X[Filter: activity.status = 0]

C -->|parent_id| D[grouptab]
C -->|is_deleted!=1| Y[Filter: type.is_deleted != 1]

B -->|activity_id| E[activity_tag]

%% 只显示主要的连接关系
style A fill:#f9f,stroke:#333,stroke-width:2px
style B fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#bbf,stroke:#333,stroke-width:2px
style D fill:#bbf,stroke:#333,stroke-width:2px
style E fill:#bbf,stroke:#333,stroke-width:2px
style V fill:#f3f3f3,stroke:#aaa,stroke-width:1px, stroke-dasharray: 5 5
style W fill:#f3f3f3,stroke:#aaa,stroke-width:1px, stroke-dasharray: 5 5
style X fill:#f3f3f3,stroke:#aaa,stroke-width:1px, stroke-dasharray: 5 5
style Y fill:#f3f3f3,stroke:#aaa,stroke-width:1px, stroke-dasharray: 5 5

结果数据对应的来源表对应如下

graph TD
    A[interval] -->|activity_id| B[activity]
    
    B -->|type_id| C[type/grouptab]
    
    C -->|parent_id| C
    
    B -->|activity_id| E[activity_tag]

    %% 结果字段
    A -->|interval.id| F[结果字段: interval.id]
    C -->|grouptab.name| G[结果字段: type_group]
    C -->|type.name| H[结果字段: activity_type]
    A -->|interval.from| I[结果字段: from]
    A -->|interval.to| J[结果字段: to]
    A -->|interval.updated| K[结果字段: interval_updated]
    A -->|interval.guid| L[结果字段: guid]
    B -->|activity.comment| M[结果字段: remarks]
    E -->|activity_tag.tag| N[结果字段: tags]

    %% 连接线样式
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333,stroke-width:2px
	style E fill:#bbf,stroke:#333,stroke-width:2px
	style F fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style G fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style H fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style I fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style J fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style K fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style L fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style M fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5
    style N fill:#fff,stroke:#888,stroke-width:1.5px,stroke-dasharray: 5 5

日期格式化

由于sqlite的数据为integer，而MySQL的对应字段为date，直接写入好像有问题，所以需要将日期进行格式化。使用了java脚本与字段筛选两个组件，我怀疑指需要字段筛选组件就可以了，不过没测试。

写入MySQL

使用更新与插入组件完成，这个组件根据条件进行更新或者插入，还是比较有用。

拆分

这部分是将跨天的数据按天进行拆分，为了方便后面分析。并且重新计算了活动的持续时间。具体方式为先清空表，然后调用shell脚本，通过shell脚本执行python脚本进行数据拆分与插入new_ods表。

写入dwd

这个部分全是通过sql脚本进行，某些为什么要写的原因也不清楚了。大致逻辑先清空dwd表，然后插入数据，然后按照规则进行编辑(update)