A comprehensive list of R packages for functional data is available here. We outline a few packages that members of our Lab have been involved with.
face
- Based on Xiao et al. (2017) “Fast covariance estimation for sparse functional data”.
- Contains functions for estimating covariance from sparse data and conducting functional principal components regression.
refund
- Extensive set of regression methods for functional data, with scalar, functional, or longitudinal responses.
- Contains functions to conduct function-on-scalar regression, penalized functional regression, functional principal components analysis (sparse or dense data), functional principal components regression, functional generalized additive models, longitudinal functional data analysis, and testing of functional predictors.
refund.shiny
- Interactive plotting of functional data analyses from the refund package.
- Contains functions to make lasagna plots and to plot output from analyses including functional principal components analysis, functional linear concurrent regression, and function-on-scalar regression.
Below is a partial list of functional data datasets available through R.
dataset (package): description (from documentation)
- aemet (fda.usc): Series of daily summaries of 73 Spanish weather stations selected for the period 1980-2009. The dataset contains geographic information of each station and the average for the period 1980-2009 of daily temperature, daily precipitation and daily wind speed.
- Ausmortality (fds): Age-specific mortality rates for Australia and Australian states.
- Australiafertility (rainbow): Age-specific fertility rates between ages 15 and 49 in Australia from 1921 to 2006. The age-specific fertility rates can be smoothed using weighted median smoothing B-splines, constrained to be concave.
- beta (fdasrvf): Contains the MPEG7 curve data set which is 20 curves in 65 classes.
- Biscuit (fds): The experiment involved varying the composition of biscuit dough pieces. Two sets of dough pieces were measured, a calibration set and a prediction set. They were created and measured as two distinct sets, on separate occasions, and do not result from a random (or any other) split of a larger set.
- Cancerrate (fds): Age-specific breast cancer rates for Australian females with 9 age groups (45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85+) from 1921 to 2001.
- CanadianWeather (fda): Daily temperature and precipitation at 35 different locations in Canada averaged over 1960 to 1994.
- cd4 (refund): CD4 cell counts for 366 subjects between months -18 and 42 since seroconversion. Each subject’s observations are contained in a single row.
- DTI (refund): Fractional anisotropy (FA) tract profiles for the corpus callosum (cca) and the right corticospinal tract (rcst). Accompanying the tract profiles are the subject ID numbers, visit number, total number of scans, multiple sclerosis case status and Paced Auditory Serial Addition Test (pasat) score.
- DTI2 (refund): A diffusion tensor imaging dataset used in Swihart et al. (2012). Mean diffusivity profiles for the corpus callosum (cca) and parallel diffusivity for the right corticospinal tract (rcst). Accompanying the profiles are the subject ID numbers, visit number, and Paced Auditory Serial Addition Test (pasat) score.
- ECBYieldcurve (fds): Provided by the European Central Bank, this data set contains daily yield curve spot rates from 29/12/2006 to 24/07/2009 for government bond, nominal, all triple AAA issued companies, with maturity term at 3, 6 months and 1 to 30 years.
- Electricityconsumption (fds): This set of time series focuses on the US monthly electricity consumed by the residential and commercial sectors from January 1973 up to February 2001 (336 months). This data set is a part of the original one which can be found at http://www.economagic.com.
- Electricitydemand (fds): These data sets consist of half-hourly electricity demands from Sunday to Saturday in Adelaide between 6/7/1997 and 31/3/2007.
- ElNino (rainbow): Original monthly sea surface temperatures have been restricted from January 1950 to December 2006. The monthly sea surface temperatures can be smoothed using a smoothing spline with the smoothing parameter determined by generalized cross validation.
- ElNino2011 (rainbow): Original monthly sea surface temperatures have been restricted from January 1950 to December 2011. The monthly sea surface temperatures can be smoothed using a smoothing spline with the smoothing parameter determined by generalized cross validation.
- Fat (fds): This data set is a part of the original one which can be found at http://lib.stat.cmu.edu/datasets/tecator.
- gasoline (refund): Near-infrared reflectance spectra and octane numbers of 60 gasoline samples. Each NIR spectrum consists of log(1/reflectance) measurements at 401 wavelengths, in 2-nm intervals from 900 nm to 1700 nm.
- FedYieldcurve (fds): This data set contains monthly interest rate of the Federal Reserve from January 1982 to June 2009.
- growth_vel (fdasrvf): Combination of both boys' and girls' growth velocity from the Berkeley dataset.
- handwrit (fda): 20 cursive samples of 1401 (x, y) coordinates for writing “fda”.
- hmdcountry (fds): This function returns a list of relevant demographic data currently available in the HMD, related to a specified country.
- infantGrowth (fda): Measurement of the length of the tibia for the first 40 days of life for one infant.
- lip (fda): 51 measurements of the position of the lower lip every 7 milliseconds for 20 repetitions of the syllable ’bob’.
- MCO (fda.usc): The mitochondrial calcium overload (MCO) was measured in two groups (control and treatment) every 10 seconds during an hour in isolated mouse cardiac cells. In fact, due to technical reasons, the original experiment [see Ruiz-Meana et al. (2000)] was performed twice, using both the “intact”, original cells and “permeabilized” cells (a condition related to the mitochondrial membrane).
- medfly25 (fdapace): A dataset containing the eggs laid from 789 medflies (Mediterranean fruit flies, Ceratitis capitata) during the first 25 days of their lives. This is a subset of the dataset used by Carey et al. (1998); only flies having lived at least 25 days are shown. At the end of the recording period all flies were still alive.
- melanoma (fda): These data from the Connecticut Tumor Registry present age-adjusted numbers of melanoma skin cancer incidences per 100,000 people in Connecticut for the years from 1936 to 1972.
- Moisture (fds): This data set consists of near-infrared reflectance spectra of 100 wheat samples, measured in 2 nm intervals from 1100 to 2500 nm, and an associated response variable, the samples’ moisture content.
- MontrealTemp (fda): Temperature in degrees Celsius in Montreal each day from 1961 through 1994.
- nondurables (fda): US nondurable goods index time series, January 1919 to January 2000.
- Octane (fds): This data set comprises spectra from 60 gasoline samples, measured in 2 nm intervals from 900 to 1700 nm. The response variable is the octane numbers of the samples.
- onechild (fda): Heights of a boy of age approximately 10 collected during one school year. The data were collected “over one school year, with gaps corresponding to the school vacations” (AFDA, p. 84).
- phoneme (fda.usc): Phoneme curves.
- Phoneme (fds): This data set was formed by selecting five phonemes for classification based on digitized speech. There are n = 2000 pairs (x_i, y_i), i = 1, …, n, where x_i corresponds to the discretized log-periodograms whereas y_i gives the class membership (five phonemes: aa, ao, dcl, iy, sh).
- Pigweight (fds): The pig weight data set has 9 repeated weight measures on 48 pigs.
- pinch (fda): 151 measurements of pinch force during 20 replications with time from start of measurement.
- poblenou (fda.usc): NOx levels measured every hour by a control station in Poblenou in Barcelona (Spain).
- refinery (fda): 194 observations on reflux and “tray 47 level” in a distillation column in an oil refinery.
- read.hmd (fds): This function allows users to read any data set from the Human Mortality Database (HMD).
- ReginaPrecip (fda): Precipitation in millimeters in June in Regina, Saskatchewan, Canada, 1960 – 1993, omitting 16 missing values.
- seabird (fda): Numbers of sightings of different species of seabirds by year 1986 – 2005 at E. Sitkalidak, Uganik, Uyak, and W. Sitkalidak by people affiliated with the Kodiak National Wildlife Refuge, Alaska.
- sofa (refund): A dataset containing the SOFA (Sequential Organ Failure Assessment) scores (Vincent et al., 1996) for 520 patients hospitalized in the intensive care unit (ICU) with Acute Lung Injury. Daily measurements are available for as long as each one remains in the ICU. This is an example of variable-domain functional data, as described by Gellar et al. (2014).
- Satellite (fds): The data were registered by the satellite TOPEX/Poseidon around an area of 25 kilometers upon the Amazon River. Each row of the data matrix is represented by its wave (i.e. curve) on the range (0, 70), and the satellite is registering 10 curves each second.
- SAtemp (fds): These data sets consist of half-hourly temperatures measured at Kent Town and Adelaide airport from Sunday to Saturday in Adelaide between 6/7/1997 and 31/3/2007.
- SOI (fds): Annual measures on Southern Oscillation Index (SOI): observed annual cycles in period 1900-2004.
- Spanishmigration (fds): This data set consists of migration number (in thousands) in Spain from 1999 to 2003. This data set contains the migration rates of 9 age groups, namely 0-9, 10-15, 16-19, 20-29, 30-39, 40-49, 50-59, 60-65, and 65+ for both females and males.
- StatSciChinese (fda): (x, y, z) coordinates of the location of the tip of a pen during fifty replications of writing ’Statistical Science’ in simplified Chinese at 10 millisecond intervals.
- tecator (fda.usc): Water, Fat and Protein content of meat samples.
- toy_data (fdasrvf): A functional dataset where the individual functions are given by a Gaussian peak with locations along the x-axis. The variables are as follows: f containing the 29 functions of 101 samples and time which describes the sampling.
- toy_warp (fdasrvf): An (aligned) functional dataset where the individual functions are given by a Gaussian peak with locations along the x-axis. The variables are as follows: f containing the 29 functions of 101 samples and time which describes the sampling which has been aligned.
- Yieldcurve (fds): This data set contains monthly US Treasury bonds from January 1970 through December 2002. Based on the bid-ask midpoint average, the data consist of end of the month price quotes.
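As a minimal sketch of how the datasets listed above might be loaded, the snippet below wraps base R's `data()` in a small helper. The helper name `load_fd_dataset` is hypothetical (it is not part of any package mentioned here), and the sketch assumes the relevant package has been installed from CRAN; when the package or the dataset is missing, it returns `NULL` instead of failing. Entries in the list that are functions rather than datasets (read.hmd, hmdcountry) are not covered by this approach.

```r
# Hypothetical helper, not part of any package listed above.
# Loads one dataset from an installed package into a private environment
# and returns it, or returns NULL if the package or dataset is unavailable.
load_fd_dataset <- function(dataset, package) {
  if (!requireNamespace(package, quietly = TRUE)) {
    message("Package '", package, "' is not installed; skipping '", dataset, "'.")
    return(NULL)
  }
  env <- new.env()
  # data() issues a warning (not an error) when the dataset is not found,
  # so the result is checked explicitly below.
  suppressWarnings(utils::data(list = dataset, package = package, envir = env))
  if (!exists(dataset, envir = env, inherits = FALSE)) {
    return(NULL)
  }
  get(dataset, envir = env, inherits = FALSE)
}

# Example: the cd4 data from refund (inspected only if refund is installed).
cd4 <- load_fd_dataset("cd4", "refund")
if (!is.null(cd4)) print(dim(cd4))

# Example: the CanadianWeather data from fda.
weather <- load_fd_dataset("CanadianWeather", "fda")
if (!is.null(weather)) str(weather, max.level = 1)
```

Loading into a separate environment keeps the global workspace clean, which can help when several of these packages ship datasets with similar names (for example, phoneme in fda.usc and Phoneme in fds).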
Functional Data Analysis Working Group @ Columbia University
PennSIVE @ University of Pennsylvania
SMART @ Johns Hopkins University