getting_and_cleaning_data__project

Knowing that joining tables and grouping would be needed, I first loaded the dplyr library.

library(dplyr)

Then the data was downloaded and unzipped.

download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "HAR_dataset.zip")
unzip("HAR_dataset.zip", exdir = "HAR_dataset")
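An optional check confirms that the archive was extracted into the folder layout the read.table calls below assume (on Windows, adding mode = "wb" to download.file helps avoid a corrupted zip).

list.files("HAR_dataset/UCI HAR Dataset")
# should include "activity_labels.txt", "features.txt" and the "train" and "test" folders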

Both the train and test data sets were then read into R. Because the values of the 561 measured variables are stored separately (in X_train/test.txt) from the information about subjects (in subject_train/test.txt) and activities (in y_train/test.txt), it was necessary to combine these three sources to build a complete train/test data set. To name the columns appropriately, "subject", "activity" and the variable names found in features.txt were used.

variable_names <- read.table("./HAR_dataset/UCI HAR Dataset/features.txt")

train_set <- read.table("./HAR_dataset/UCI HAR Dataset/train/X_train.txt")
names(train_set) <- variable_names$V2
train_labels <- read.table("./HAR_dataset/UCI HAR Dataset/train/y_train.txt", col.names = "activity")
train_subjects <- read.table("./HAR_dataset/UCI HAR Dataset/train/subject_train.txt", col.names = "subject")
train_data <- bind_cols(train_subjects, train_labels, train_set)

test_set <- read.table("./HAR_dataset/UCI HAR Dataset/test/X_test.txt")
names(test_set) <- variable_names$V2
test_labels <- read.table("./HAR_dataset/UCI HAR Dataset/test/y_test.txt", col.names = "activity")
test_subjects <- read.table("./HAR_dataset/UCI HAR Dataset/test/subject_test.txt", col.names = "subject")
test_data <- bind_cols(test_subjects, test_labels, test_set)

The next step was to merge the train and test data sets into a single table.

data <- bind_rows(train_data, test_data)
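As a quick sanity check (not needed by the rest of the script), the combined table should have as many rows as the two sets together and 563 columns (561 features plus subject and activity); with the original UCI HAR download that is 7352 + 2947 = 10299 rows.

dim(train_data)
dim(test_data)
dim(data)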

Because the columns of the complete table were already named appropriately, the grep function made it easy to select only the columns holding means ("...mean()...") and standard deviations ("...std()...") and to place them in a new table, a subset of the previous one.

mean_data <- data[,grep("mean\\(\\)",names(data))]
std_data <- data[,grep("std\\(\\)",names(data))]
data_subset <- bind_cols(data[,1:2],mean_data,std_data)
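With the full feature list this keeps 33 mean() columns and 33 std() columns, so data_subset should have 68 columns including subject and activity; a quick count verifies the pattern match.

length(grep("mean\\(\\)", names(data)))
length(grep("std\\(\\)", names(data)))
dim(data_subset)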

Afterwards, the numbers referring to each activity were replaced with the activity names from activity_labels.txt. The variable names were also slightly adjusted to make them more readable.

activity_labels <- read.table("./HAR_dataset/UCI HAR Dataset/activity_labels.txt")
data_subset$activity <- activity_labels$V2[data_subset$activity]
names(data_subset) <- sapply(names(data_subset), function(name){gsub("\\(\\)", "", gsub("-", "_", name))})
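For example, the first feature name in features.txt, "tBodyAcc-mean()-X", becomes "tBodyAcc_mean_X" after these substitutions; the single-string call below just illustrates the same gsub chain.

gsub("\\(\\)", "", gsub("-", "_", "tBodyAcc-mean()-X"))
# returns "tBodyAcc_mean_X"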

The last step was to create a data set with the average of each variable for each activity and each subject. This was done by grouping with group_by from the dplyr library and then computing the column means with summarise_each.

tidy_data <- group_by(data_subset, subject, activity) %>% summarise_each(funs(mean))
write.table(tidy_data, "tidy_data.txt", row.names = FALSE)
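Two optional notes on this last step. First, the saved file can be read back with header = TRUE to check the result; with the original data it should contain 180 rows (30 subjects x 6 activities) and 68 columns. Second, summarise_each() and funs() are deprecated in newer versions of dplyr; a sketch of an equivalent call using across() (available from dplyr 1.0.0 on) is shown below.

tidy_check <- read.table("tidy_data.txt", header = TRUE)
dim(tidy_check)

# equivalent aggregation with current dplyr, where summarise_each()/funs() are deprecated
tidy_data <- data_subset %>%
  group_by(subject, activity) %>%
  summarise(across(everything(), mean), .groups = "drop")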
