|
Modélisation de séries temporelles multiples et multidimensionnelles. Mireille Gettler-Summa, Bernard Goldfarb, Laurent Schwartz, Jean Marc Steyaert, Frédérique Lefaudeux. La revue MODULAD, no. 42, 2010.
Summary We present research on modeling multiple, multidimensional time series extracted from data published on official websites. The difficulty lies, on the one hand, in building the databases, because of the differing initial formats, inconsistencies, and missing data, and, on the other hand, in the large number of endogenous and exogenous variables and in the multiplicity of admissible inputs for the problem. The exogenous time series additionally come with an a priori partition. We present an approach to variable reduction, together with modeling solutions for these complex data, built by adapting classical solutions to the multidimensional temporal context.
Keywords
multiple time series, coding, dimension reduction, modeling, cancer epidemiology, latent variables and time series
Abstract
The most relevant elements of this paper are the automatic extraction of temporal data from official databases and an attempt to model multiple time series by means of other, exogenous multiple time series. The results are applied to an epidemiological problem: modeling cancer incidence rates over twenty years for different countries around the world. Many issues arise when collecting the data: most of the databases are not available in the same format, some limit the number of lines allowed for a single query, and, once the data are imported, each variable must be coherent and continuous over time. The variables may cover various domains and their definitions may have changed over time, so expert knowledge is needed to produce the final attribute coding and to validate the retained data. A preprocessing phase is then carried out: spline functions are used to smooth atypical values and to fill the remaining missing data by interpolation, and temporal transformations are applied, such as 5th-order sums over past years of lagged variables in the cancer database. At that point the epidemiological data form a complex data set: multiple (25 countries in the example), multidimensional (socio-economy, nutrition, health care, environment, standardized cancer rates, etc.) time series (twenty-one years). To reduce the dimension of the data, an exploratory phase builds and discovers the factor blocks that will be introduced into the models. Factors are computed with the Varimax rotation method because most of the variables are highly correlated. Grouping is also performed through clustering approaches for complex time series, and the resulting partition is one of the exogenous variables for the modeling phase.
Finally, a generalized LISREL approach for multidimensional time series is applied: as an example, ecology, socio-economy, nutrition, health care, lifestyle, and environment are the latent variables of the epidemiological study, whereas cancer death rates are the endogenous variables.
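The preprocessing step described in the abstract (gap filling followed by a lagged sum) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the paper: linear interpolation is swapped in for the spline interpolation the authors mention, the "5th-order sum" is read as the sum of the five values preceding each year, and the function names `fill_gaps_linear` and `lagged_sum` as well as the sample values are purely hypothetical.

```python
from typing import List, Optional


def fill_gaps_linear(series: List[Optional[float]]) -> List[float]:
    """Fill missing values (None) by linear interpolation.

    Assumption: linear interpolation stands in for the splines used in the
    paper. Leading or trailing gaps take the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled: List[float] = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(float(v))
            continue
        # Nearest observed positions on each side of the gap, if any.
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(series[right]))
        elif right is None:
            filled.append(float(series[left]))
        else:
            w = (i - left) / (right - left)
            filled.append(series[left] + w * (series[right] - series[left]))
    return filled


def lagged_sum(series: List[float], window: int = 5) -> List[Optional[float]]:
    """Sum of the `window` values strictly before each position.

    Assumption: this is one possible reading of the "5th-order sum over
    past years". Positions without enough history are returned as None.
    """
    out: List[Optional[float]] = []
    for t in range(len(series)):
        out.append(None if t < window else sum(series[t - window:t]))
    return out


# Hypothetical yearly values for one variable in one country.
raw = [10.0, None, 14.0, 15.0, None, None, 21.0, 22.0]
filled = fill_gaps_linear(raw)
exposure = lagged_sum(filled, window=5)
```

In this reading, each year is associated with the cumulated level of the variable over the five preceding years, so the first five positions have no value. A spline-based version would replace only `fill_gaps_linear`; the lagged sum would be unchanged.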
Key words
multiple time series, coding, dimension reduction, modeling, epidemiology of cancer, latent variables and time series
Article
|