Chapter 7 Manipulating data in R

7.1 Selecting columns from a table

Display the column names

colnames(exprs)

## [1] "id"  "WT1" "WT2" "KO1" "KO2"

Values stored in the column named “WT1”

exprs$WT1

##  [1] 235960    116    118    450   4736   9002   1295   3353   2044   7022
## [11]  15783   3133   1380  12089   1744    122    635     83  16013    552
## [21]  62324   1225   1201     31    695  26866    273    202   3515   1988
## [31]   2238   1236   3415    209  14741   1216   4044   1405    158     90
## [41]    518    261     94     77   3025  15470   3801   1488    424     55

Alternative notation, using square brackets (rows before the comma, columns after it)

exprs[ , "WT1"]

##  [1] 235960    116    118    450   4736   9002   1295   3353   2044   7022
## [11]  15783   3133   1380  12089   1744    122    635     83  16013    552
## [21]  62324   1225   1201     31    695  26866    273    202   3515   1988
## [31]   2238   1236   3415    209  14741   1216   4044   1405    158     90
## [41]    518    261     94     77   3025  15470   3801   1488    424     55
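
The two notations above return the same vector. As a minimal sketch, the comparison can be checked on a small stand-in table. The data frame `toy` below is hypothetical and only reuses three rows shown elsewhere in this chapter. The `[[ ]]` form and the `drop = FALSE` argument are standard base R but are not used in the original text.

```r
# Hypothetical 3-row stand-in for the exprs table (values copied from rows 4, 11 and 44)
toy <- data.frame(
  id  = c("ENSG00000099958", "ENSG00000119285", "ENSG00000253991"),
  WT1 = c(450, 15783, 77),
  WT2 = c(655, 17359, 78),
  KO1 = c(301, 18591, 134),
  KO2 = c(472, 20077, 92)
)

# Three ways of extracting one column as a plain vector
a <- toy$WT1
b <- toy[ , "WT1"]
c3 <- toy[["WT1"]]

identical(a, b)   # do the first two notations agree?
identical(a, c3)  # does the double-bracket form agree too?

# drop = FALSE keeps the result as a one-column data frame instead of a vector
one_col <- toy[ , "WT1", drop = FALSE]
class(one_col)
dim(one_col)
```

Keeping the data frame structure with `drop = FALSE` can be useful when later code expects a table rather than a vector.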

Selecting several columns by name

exprs[ , c("WT1", "WT2")]

##       WT1   WT2
## 1  235960 94264
## 2     116    71
## 3     118   174
## 4     450   655
## 5    4736  5019
## 6    9002  8623
## 7    1295  2744
## 8    3353  7449
## 9    2044  4525
## 10   7022  2526
## 11  15783 17359
## 12   3133  2775
## 13   1380  3079
## 14  12089  7958
## 15   1744  2247
## 16    122    66
## 17    635   427
## 18     83   246
## 19  16013 17642
## 20    552  1062
## 21  62324 33973
## 22   1225  1475
## 23   1201  1034
## 24     31   788
## 25    695  1825
## 26  26866 23111
## 27    273   112
## 28    202   181
## 29   3515  1981
## 30   1988  4788
## 31   2238   974
## 32   1236  2163
## 33   3415  1703
## 34    209   189
## 35  14741 36309
## 36   1216  4545
## 37   4044  2575
## 38   1405  8135
## 39    158    94
## 40     90    43
## 41    518   718
## 42    261   163
## 43     94   114
## 44     77    78
## 45   3025  3707
## 46  15470 11450
## 47   3801  2465
## 48   1488  1086
## 49    424   162
## 50     55    76

Selecting columns by index

exprs[ , 2]

##  [1] 235960    116    118    450   4736   9002   1295   3353   2044   7022
## [11]  15783   3133   1380  12089   1744    122    635     83  16013    552
## [21]  62324   1225   1201     31    695  26866    273    202   3515   1988
## [31]   2238   1236   3415    209  14741   1216   4044   1405    158     90
## [41]    518    261     94     77   3025  15470   3801   1488    424     55

Selecting several columns by index; the columns are returned in the order given

exprs[ , c(3, 2)]

##      WT2    WT1
## 1  94264 235960
## 2     71    116
## 3    174    118
## 4    655    450
## 5   5019   4736
## 6   8623   9002
## 7   2744   1295
## 8   7449   3353
## 9   4525   2044
## 10  2526   7022
## 11 17359  15783
## 12  2775   3133
## 13  3079   1380
## 14  7958  12089
## 15  2247   1744
## 16    66    122
## 17   427    635
## 18   246     83
## 19 17642  16013
## 20  1062    552
## 21 33973  62324
## 22  1475   1225
## 23  1034   1201
## 24   788     31
## 25  1825    695
## 26 23111  26866
## 27   112    273
## 28   181    202
## 29  1981   3515
## 30  4788   1988
## 31   974   2238
## 32  2163   1236
## 33  1703   3415
## 34   189    209
## 35 36309  14741
## 36  4545   1216
## 37  2575   4044
## 38  8135   1405
## 39    94    158
## 40    43     90
## 41   718    518
## 42   163    261
## 43   114     94
## 44    78     77
## 45  3707   3025
## 46 11450  15470
## 47  2465   3801
## 48  1086   1488
## 49   162    424
## 50    76     55
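
Numeric indices depend on the column order, so it may help to derive them from the column names. The sketch below assumes a hypothetical stand-in table `toy` (three rows copied from this chapter) and uses `match()` and a negative index, two base R features that the original text does not show.

```r
# Hypothetical 3-row stand-in for the exprs table
toy <- data.frame(
  id  = c("ENSG00000099958", "ENSG00000119285", "ENSG00000253991"),
  WT1 = c(450, 15783, 77),
  WT2 = c(655, 17359, 78),
  KO1 = c(301, 18591, 134),
  KO2 = c(472, 20077, 92)
)

# Look up the positions of named columns instead of hard-coding them
idx <- match(c("WT2", "WT1"), colnames(toy))
swapped <- toy[ , idx]
colnames(swapped)

# A negative index drops a column: here, everything except the first one (id)
no_id <- toy[ , -1]
colnames(no_id)
```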

7.2 Selecting rows from a table

Select rows 4 and 11 of the expression table

exprs[c(4, 11), ]

##                 id   WT1   WT2   KO1   KO2
## 4  ENSG00000099958   450   655   301   472
## 11 ENSG00000119285 15783 17359 18591 20077

Store the identifiers of two genes of interest

my_genes <- c("ENSG00000253991", "ENSG00000099958")

Logical vector indicating whether each ID in the table is one of the genes of interest

exprs$id %in% my_genes

##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

Indices of the rows whose IDs match the genes of interest

which(exprs$id %in% my_genes)

## [1]  4 44

Display the corresponding rows (they come back in table order, not in the order of my_genes)

exprs[which(exprs$id %in% my_genes), ]

##                 id WT1 WT2 KO1 KO2
## 4  ENSG00000099958 450 655 301 472
## 44 ENSG00000253991  77  78 134  92
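
The logical vector can also be used directly as a row index, without `which()`. The sketch below illustrates this on a hypothetical stand-in table `toy`, and adds one variant not covered above: `match()`, which returns the rows in the order of `my_genes` rather than in table order. Note that `match()` yields `NA` for an identifier absent from the table, which would produce a row of `NA` values.

```r
# Hypothetical 3-row stand-in for the exprs table
toy <- data.frame(
  id  = c("ENSG00000099958", "ENSG00000119285", "ENSG00000253991"),
  WT1 = c(450, 15783, 77),
  WT2 = c(655, 17359, 78),
  KO1 = c(301, 18591, 134),
  KO2 = c(472, 20077, 92)
)
my_genes <- c("ENSG00000253991", "ENSG00000099958")

# Logical indexing and which()-based indexing select the same rows
keep <- toy$id %in% my_genes
by_logical <- toy[keep, ]
by_index <- toy[which(keep), ]

# match() gives, for each gene of interest, its row position in the table,
# so the result follows the order of my_genes
by_order <- toy[match(my_genes, toy$id), ]
by_order$id
```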

7.3 A more intuitive formulation

subset(x = exprs, id %in% my_genes)

##                 id WT1 WT2 KO1 KO2
## 4  ENSG00000099958 450 655 301 472
## 44 ENSG00000253991  77  78 134  92
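
`subset()` can also restrict the columns in the same call through its `select` argument. This is a sketch on a hypothetical stand-in table `toy`; the choice of columns is only for illustration.

```r
# Hypothetical 3-row stand-in for the exprs table
toy <- data.frame(
  id  = c("ENSG00000099958", "ENSG00000119285", "ENSG00000253991"),
  WT1 = c(450, 15783, 77),
  WT2 = c(655, 17359, 78),
  KO1 = c(301, 18591, 134),
  KO2 = c(472, 20077, 92)
)
my_genes <- c("ENSG00000253991", "ENSG00000099958")

# Filter rows on the gene IDs and keep only the id and wild-type columns
picked <- subset(toy, id %in% my_genes, select = c(id, WT1, WT2))
picked
```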

A more modern approach, with the dplyr package

## load the dplyr package
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## pipe the exprs table into the filter() command
exprs %>% filter(id %in% my_genes)

##                id WT1 WT2 KO1 KO2
## 1 ENSG00000099958 450 655 301 472
## 2 ENSG00000253991  77  78 134  92

## more advanced: chain several commands
exprs %>% 
  filter(id %in% my_genes) %>% 
  mutate(mean_KO = (KO1 + KO2)/2)

##                id WT1 WT2 KO1 KO2 mean_KO
## 1 ENSG00000099958 450 655 301 472   386.5
## 2 ENSG00000253991  77  78 134  92   113.0
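
For comparison, the same two steps can be written in base R without any extra package. This is a sketch on a hypothetical stand-in table `toy`, using `subset()` and `transform()`. It is offered as an illustration, not as a replacement for the dplyr version.

```r
# Hypothetical 3-row stand-in for the exprs table
toy <- data.frame(
  id  = c("ENSG00000099958", "ENSG00000119285", "ENSG00000253991"),
  WT1 = c(450, 15783, 77),
  WT2 = c(655, 17359, 78),
  KO1 = c(301, 18591, 134),
  KO2 = c(472, 20077, 92)
)
my_genes <- c("ENSG00000253991", "ENSG00000099958")

# Step 1: keep only the rows for the genes of interest
selected <- subset(toy, id %in% my_genes)

# Step 2: add a column holding the mean of the two KO replicates
result <- transform(selected, mean_KO = (KO1 + KO2) / 2)
result
```

Unlike the `filter()` output above, where rows are renumbered from 1, base R subsetting keeps the original row names.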